GPU COMPUTING


         Presented By

            Rajiv Kumar V
                   No -34
                     S7C
Graphics Processing Units (GPUs):

Powerful
Programmable and
Highly Parallel



Jen-Hsun Huang:
"GPU power is set to increase 570x, whereas CPU
power would increase a mere 3x over the same
six-year time frame."
INTRODUCTION:

• GPUs have traditionally powered the display of computers
• Designed for real-time, high-resolution 3D graphics tasks
• Commercial GPU-based systems are becoming common
• NVIDIA and AMD are expanding processor sophistication and
  software development tools
• Higher accuracy through increased floating-point precision
• GPUs are now on a development cycle much closer to that of CPUs
• GPUs are not constrained by socket compatibility
• Very little backward compatibility is needed in firmware; the
  rest is delivered through the driver implementation
GPU-based software requirements
• Computational requirements are large
• Parallelism is substantial
• Throughput is more important than latency



Application requirements to target GPGPU
programming:
•   Large data sets
•   High parallelism
•   Minimal dependencies between data elements
•   High arithmetic intensity
•   Lots of work to do without CPU intervention
Task vs. Data Parallelism

Task parallelism:
• Independent processes with little
  communication

Data parallelism:
• Lots of data on which the same computation is
  being executed
• No dependencies between data elements in
  each step in the computation
• Can saturate many ALUs
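Data parallelism can be sketched in plain Python (illustrative only; on a real GPU the per-element function would run on thousands of hardware threads):

```python
# Data parallelism: the same computation applied independently to every
# element -- a pattern that maps directly onto a GPU's many ALUs.

def scale_and_offset(x):
    """Per-element computation; depends on no other element."""
    return 2.0 * x + 1.0

data = [0.0, 1.0, 2.0, 3.0]

# Conceptually, every iteration of this map could run at the same time,
# because no element depends on any other element's result.
result = list(map(scale_and_offset, data))
print(result)  # [1.0, 3.0, 5.0, 7.0]
```

With no dependencies between elements, every ALU can be kept busy on a different element of `data`.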
GPU vs. CPU

• A CPU is designed to process a single task as fast as
  possible, while a GPU is designed to process a maximum
  number of tasks over a large scale of data
• The CPU divides work in time, while the GPU divides
  work in space
Graphics Pipeline:
• Input to the GPU is a list of geometric primitives
• Vertex Operations: primitives are transformed into screen
  space and each vertex is shaded, computing its interaction
  with the lights in the scene
• Primitive Assembly: vertices are assembled into triangles
• Rasterization: determines which screen-space pixels are
  covered by each triangle
• Fragment Operations: using color information, each
  fragment is shaded to determine its final color
• Each pixel's color value may be computed from several
  fragments
• Composition: fragments are assembled into a final image
  with one color per pixel
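The stages above can be sketched as ordinary functions chained together (a toy model, not real graphics code: 2D points, flat colors, and a placeholder rasterizer that emits one fragment per vertex):

```python
# Toy sketch of the graphics pipeline stages as plain functions.

def vertex_op(v):
    """Transform a vertex into 'screen space' (hypothetical scale)."""
    x, y = v
    return (x * 100, y * 100)

def assemble(vertices):
    """Primitive assembly: group vertices into triangles of three."""
    return [tuple(vertices[i:i + 3]) for i in range(0, len(vertices), 3)]

def rasterize(tri):
    """Placeholder rasterizer: emit one fragment per vertex."""
    return [{"pos": p, "color": (255, 0, 0)} for p in tri]

def fragment_op(frag):
    """Fragment operation: shade the fragment to its final color."""
    return {**frag, "color": tuple(c // 2 for c in frag["color"])}

verts = [(0, 0), (1, 0), (0, 1)]
frags = [fragment_op(f)
         for tri in assemble([vertex_op(v) for v in verts])
         for f in rasterize(tri)]
print(len(frags))  # 3 fragments from one triangle
```

Each list comprehension stage mirrors one pipeline stage; on hardware, every stage runs on many primitives, vertices, and fragments in parallel.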
Graphics Pipeline:
Evolution of GPU Architecture:
 • Fixed function pipeline lacked generality for complex
   effects
 • Replacement of fixed function per vertex and per
   fragment operations by vertex and fragment programs
 • Increased complexity of vertex and fragment program
   as Shader Model evolved
 • Support for Unified Shader Models


 Shader Models:
  • A shader provides a user-defined, programmable
    alternative to the hard-coded approach, e.g. in GLSL
  • A vertex shader describes the traits (position, color,
    depth value, etc.) of a vertex
  • A geometry shader adds volumetric detail; its output is
    then sent to the rasterizer
  • A pixel/fragment shader describes the traits (color,
    z-depth and alpha value) of a pixel
GPU Programming Model
 • Follows an SPMD programming model
 • In the base programming model, each element is
   independent of the other elements
 • Many parallel elements are processed by a single
   program
 • Each element can operate on integer or float data
   with a reasonably complete instruction set
 • Reads and writes shared memory via gather and
   scatter operations
 • Code executes in a SIMD manner
 • A different execution path is allowed for each
   element
 • If elements branch in different directions, both
   branches are computed
 • Computation proceeds in blocks on the order of 16
   elements
 • Finally, branches are permitted for programmers,
   but they are not free
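The "both branches are computed" behavior can be sketched as predicated execution in Python (a simplification; real hardware masks results per lane within a SIMD block):

```python
# Branch divergence sketch: when elements of one SIMD block take
# different paths, the hardware effectively executes BOTH paths and
# keeps, per lane, only the result selected by the condition mask.

def simd_select(cond, then_vals, else_vals):
    """Predicated selection: both sides were computed for all lanes;
    the mask picks each lane's result."""
    return [t if c else e for c, t, e in zip(cond, then_vals, else_vals)]

data = [1, -2, 3, -4]
mask = [x > 0 for x in data]           # per-lane condition

then_branch = [x * 10 for x in data]   # executed for ALL lanes
else_branch = [-x for x in data]       # ALSO executed for ALL lanes

result = simd_select(mask, then_branch, else_branch)
print(result)  # [10, 2, 30, 4]
```

This is why divergent branches are "permitted but not free": a block that splits both ways pays for both paths.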
GPU Architecture:NVIDIA




NVIDIA 8800 GTX architecture (top)
A pair of SMs (right)
Memory Architecture




• Capable of reading and writing anywhere in local (GPU)
  memory or elsewhere
• These non-cached memories have large read/write
  latencies, which can be masked by the extremely long
  pipeline as long as instructions do not wait on a read
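How long latencies get masked can be sketched with generator "threads" and a hypothetical round-robin scheduler (a cartoon of the mechanism, not real hardware behavior):

```python
# Latency-hiding sketch: a GPU keeps many threads in flight so that one
# thread's long memory latency overlaps with other threads' work.

def gpu_thread(i, mem):
    # The first yield marks a long-latency memory read; while this
    # thread "waits", the scheduler runs other threads instead of
    # stalling the ALUs.
    yield                      # issue the read, relinquish the ALU
    yield mem[i] * 2           # data has arrived; compute and finish

mem = [1, 2, 3, 4]
threads = [gpu_thread(i, mem) for i in range(len(mem))]

# Round-robin: every thread issues its read first, then each completes,
# so no single read's latency idles the machine.
for t in threads:
    next(t)                    # all reads now in flight simultaneously
results = [next(t) for t in threads]
print(results)  # [2, 4, 6, 8]
```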
GPGPU Programming
Stream processing is a paradigm for maximizing the
efficiency of parallel computing. It can be decomposed into
two parts:

• Stream: a collection of objects which can be operated on
  in parallel and which require the same computation

• Kernel: a function applied to the entire stream, much like
  a "for each" loop
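The stream/kernel split can be sketched in Python; the kernel here is the classic SAXPY operation (a * x + y), chosen purely as an example:

```python
# Stream/kernel sketch: a kernel is a function applied to every element
# of a stream -- a "for each" loop with no cross-element dependencies.

def saxpy_kernel(a, x, y):
    """Per-element kernel: a*x + y (the classic SAXPY operation)."""
    return a * x + y

x_stream = [1.0, 2.0, 3.0]
y_stream = [10.0, 20.0, 30.0]

# Apply the kernel across the entire stream; every element is
# independent, so all applications could run in parallel.
out_stream = [saxpy_kernel(2.0, x, y) for x, y in zip(x_stream, y_stream)]
print(out_stream)  # [12.0, 24.0, 36.0]
```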
Terminology:
Streams
– Collection of records requiring similar computation,
  e.g. vertex positions, voxels, etc.
– Provide data parallelism

Kernels
– Functions (transforms) applied to each element in the
  stream
– No dependencies between stream elements, which
  encourages high arithmetic intensity

Gather
–Indirect read from memory ( x = a[i] )
–Naturally maps to a texture fetch
–Used to access data structures and data streams

Scatter
–Indirect write to memory ( a[i] = x )
–Needed for building many data structures
–Usually done on the CPU
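Gather and scatter reduce to the two indirect-addressing patterns named above; the indices and values here are arbitrary illustrations:

```python
# Gather vs. scatter sketch.

a = [10, 20, 30, 40]
indices = [3, 0, 2]

# Gather: indirect READ from memory, x = a[i]
gathered = [a[i] for i in indices]      # -> [40, 10, 30]

# Scatter: indirect WRITE to memory, a[i] = x
out = [0, 0, 0, 0]
values = [7, 8, 9]
for i, x in zip(indices, values):
    out[i] = x                          # out becomes [8, 0, 9, 7]

print(gathered, out)
```

On early GPUs, gather mapped naturally to a texture fetch, while scatter had no direct hardware analogue, which is why it was usually done on the CPU.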
What can you do on GPUs other than
graphics?

• Large matrix/vector operations (BLAS)
• Protein Folding (Molecular Dynamics)
• FFT (SETI, signal processing)
• Ray Tracing
• Physics Simulation [cloth, fluid, collision]
• Sequence Matching (Hidden Markov Models)
• Speech Recognition (Hidden Markov
  Models, Neural nets)
• Databases
• Sort/Search
• Medical Imaging (image
  segmentation, processing)
And many, many more…
Future of GPU Computing:
• Higher Bandwidth PCI-E bus path between
  CPU and GPU
• AMD’s Fusion and Intel’s Ivy Bridge place
  both CPU and GPU elements on a single chip
• Addition of AVX instructions in CPU
  architectures
• Fully programmable pipelines, beyond the current few
  programmable shading stages in the otherwise fixed
  graphics pipeline
• Flexibility across a variety of rendering approaches
  along with general-purpose processing
Looking Ahead:
Problems in GPGPU Computing
• A killer App...???...??
• Programming models and Tools…Proprietary
  nature…??
• GPU in tomorrow’s Computer…Will it get
  dissolved…or absorbed???
• Relationship to other parallel H/W and S/W
• Managing Rapid Change…
• Performance Evaluation and Cliffs
• Broader Toolbox for computation and Data
  Structures…”Vertical” model for app development
• Faults and Lack of Precision…
Drawbacks:
• Power consumption
• Increasing die size
• Multi-die solutions requiring inter-die connections
  increase packaging and wafer cost
• An increasing amount of die space goes to control
  logic, registers and cache as the GPU becomes more
  flexible and programmable
• Comparing CPUs to GPUs is more like comparing
  apples to oranges
• Still lots of fixed-function hardware
• Integration of multimedia fixed functions within CPUs
References:
• GPU Computing Gems: Emerald Edition, by Wen-mei W. Hwu
• CUDA by Example: An Introduction to General-Purpose
  GPU Computing, by J. Sanders and E. Kandrot (July 2010)
• http://www.oxford-man.ox.ac.uk/gpuss/simd.html
• http://idlastro.gsfc.nasa.gov/idl_html_help/About_Shader_Programs.html
• "GPU Computing", Proceedings of the IEEE, May 2008
• "Evolution of GPU", by Chris Sietz
Thank You All…



  Any Questions…



           ???
