Stream processing is a computer programming paradigm, related to SIMD.



It allows some applications to more easily exploit a limited form of parallel processing.
A stream is simply a set of records that require similar computation. Streams provide data parallelism.




Kernels are the functions that are applied to each element in the stream.


For each element we can only read from the input, perform operations on it, and write to the output.
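As an illustrative sketch that is not part of the original slides, the CUDA fragment below applies a hypothetical kernel, scale_offset, to every record of an input stream; each invocation reads only its own input element, operates on it, and writes only its own output element.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: applied once per stream element.
// Each thread reads one input record, computes, and writes one output record.
__global__ void scale_offset(const float* in, float* out,
                             float scale, float offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * scale + offset;   // read input, operate, write output
    }
}

int main() {
    const int n = 1 << 20;                 // one million records in the stream
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    // Launch the kernel over the whole stream: one thread per record.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale_offset<<<blocks, threads>>>(in, out, 2.0f, 1.0f, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);     // expect 2*42 + 1 = 85
    cudaFree(in);
    cudaFree(out);
    return 0;
}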
Stream processing is especially suitable for applications that exhibit three characteristics: compute intensity, data parallelism, and data locality.
Flynn’s Taxonomy: SISD

Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle.

Single Data: Only one data stream is being used as input during any one clock cycle.
Flynn’s Taxonomy: SIMD

Single Instruction: All processing units execute the same instruction at any given clock cycle.

Multiple Data: Each processing unit can operate on a different data element.
Flynn’s Taxonomy: MISD

Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.

Single Data: A single data stream is fed into multiple processing units.
Flynn’s Taxonomy: MIMD

Multiple Instruction: Every processor may be executing a different instruction stream.

Multiple Data: Every processor may be working with a different data stream.
Stream Processors




Stream processing makes use of locality of reference by explicitly grouping related code and data together for easy fetching into the cache.
StreamIt: a stream processing language for programs based on streams of data.

Examples: audio, video, DSP, networking, and cryptographic processing kernels; HDTV editing, radar tracking, microphone arrays, cellphone base stations, graphics.

[Thies 2002]
A high-level, architecture-independent language
for streaming applications

1. Improves programmer productivity (vs.
   Java, C)

2. Offers scalable performance on multicores



                                           [Thies 2002]
GPU

A GPU is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. It is used primarily for 3D applications.

A GPU can be present on a video card, on the motherboard, or, in certain CPUs, on the CPU die.
World’s First GPU

Nvidia in 1999 marketed the GeForce 256 as "the world's first GPU, a single-chip processor that is capable of processing a minimum of 10 million polygons per second".


Rival ATI Technologies coined the term visual processing
unit or VPU with the release of the Radeon 9700 in 2002.
GPUs have a very high compute capacity.

To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers.

To the software, the accelerator is another computer to which your program sends data and routines to execute.
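A hedged sketch of that software view in CUDA (the kernel name square and the sizes are illustrative assumptions, not from the slides): the host allocates memory on the accelerator, pushes the data across (a DMA transfer underneath), launches a routine on it, and pulls the result back.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// The "routine" the host sends to the accelerator.
__global__ void square(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main() {
    const int n = 1024;
    std::vector<float> host(n, 3.0f);

    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));                   // space on the accelerator

    // Send the data: under the hood this is a DMA transfer over the IO bus.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Send the routine to execute.
    square<<<(n + 255) / 256, 256>>>(dev, n);

    // Pull the results back (again a DMA transfer).
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host[0] = %f\n", host[0]);                     // expect 9.0
    cudaFree(dev);
    return 0;
}

The explicit cudaMemcpy calls are what make the IO/DMA nature of the transfers visible; managed memory would hide the same traffic behind page migration.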
GPGPU

This concept turns the massive floating-point computational power of a modern graphics accelerator into general-purpose computing power.

GPUs are stream processors: processors that can operate in parallel by running a single kernel on many records in a stream at once.

Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
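To make the "minimal dependency" point concrete, here is an illustrative contrast of my own (not from the slides): the saxpy kernel is a good fit because every output element depends only on its own inputs, while the running-sum loop is a poor fit as written because each step depends on the previous one, so it is left sequential on the host.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Good fit: large data set, no dependency between elements -> one thread each.
__global__ void saxpy(const float* x, const float* y, float* out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(x, y, out, 3.0f, n);
    cudaDeviceSynchronize();
    printf("saxpy out[0] = %f\n", out[0]);          // expect 5.0

    // Poor fit (as written): a running sum where element i depends on element
    // i-1. Expressing this on a GPU needs a parallel scan; naively it stays
    // sequential, so it is computed on the host here.
    std::vector<float> prefix(n);
    float running = 0.0f;
    for (int i = 0; i < n; ++i) { running += x[i]; prefix[i] = running; }
    printf("prefix[n-1] = %f\n", prefix[n - 1]);    // expect n * 1.0

    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}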
In certain circumstances the GPU calculates forty times faster than conventional CPUs.

Transistor counts (millions):

  AMD Athlon 64 X2 (CPU):    154 M        ATI X1950 XTX (GPU):      384 M
  Intel Core 2 Quad (CPU):   582 M        NVIDIA G8800 GTX (GPU):   680 M
“The processing power of just 5,000 ATI processors is also enough to rival that of the existing 200,000 computers currently involved in the Folding@home project.” [Ref 1]

“...it is estimated that if a mere 10,000 computers were to each use an ATI processor to conduct folding research, that the Folding@home program would effectively perform faster than the fastest supercomputer in existence today, surpassing the 1 petaFLOP level.” (2007)

By November 10, 2011, Folding@home had reached 6.0 petaFLOPS, against 8.162 petaFLOPS for the K computer. [Ref 1]
Comparing GPUs to CPUs isn't an apples-to-apples comparison:

1. The clock rates are lower

2. The architectures are radically different

3. The problems they're trying to solve are almost completely unrelated
Imagine Stream Processor (Stanford)

Application Processor:

Executes application code like MPEG decoding.

Sequences the instructions and issues them to the stream clients, e.g. the KEU and the DRAM interface.

[Kapasi 2003]
Two Stream Clients:

KEU: the programmable Kernel Execution Unit.

DRAM interface: provides access to global data storage.

[Kapasi 2003]
KEU:

It has two stream-level instructions:

1. load_kernel – loads a compiled kernel function into the local instruction storage inside the KEU

2. run_kernel – executes the kernel

[Kapasi 2003]
DRAM interface:

It also has two stream-level instructions:

1. load_stream – loads an entire stream from memory into the SRF

2. store_stream – stores a stream from the SRF back to memory

[Kapasi 2003]
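To make the division of labour concrete, the following is a hypothetical host-side sketch, written as plain C++ stubs rather than any real Imagine toolchain API, of how the application processor might sequence these four stream-level instructions for one kernel invocation (the kernel name convolve_7x7 and the SRF slot numbers are invented for illustration).

#include <cstdio>

// Hypothetical stubs standing in for the four stream-level instructions
// described above; this is illustrative pseudocode, not a real Imagine API.
// Each stub just prints the step it represents.
void load_stream(int srf_slot, int n)  { printf("DRAM -> SRF slot %d: load %d records\n", srf_slot, n); }
void store_stream(int srf_slot, int n) { printf("SRF slot %d -> DRAM: store %d records\n", srf_slot, n); }
void load_kernel(const char* name)     { printf("KEU: load compiled kernel '%s'\n", name); }
void run_kernel(int in_slot, int out_slot) {
    printf("KEU: run kernel on SRF slot %d, writing SRF slot %d\n", in_slot, out_slot);
}

int main() {
    // One kernel invocation as the application processor might sequence it.
    load_stream(/*srf_slot=*/0, /*n=*/4096);    // DRAM interface client
    load_kernel("convolve_7x7");                // KEU client
    run_kernel(/*in_slot=*/0, /*out_slot=*/1);  // KEU client
    store_stream(/*srf_slot=*/1, /*n=*/4096);   // DRAM interface client
    return 0;
}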
Local register files (LRFs)

1. Used to hold operands for arithmetic operations (similar to caches on CPUs)

2. Exploit fine-grain locality

[Kapasi 2003]
Stream register files (SRFs)

1. Capture coarse-grain locality

2. Efficiently transfer data to and from the LRFs

[Kapasi 2003]
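As a loose analogy of my own rather than anything from [Kapasi 2003]: on a modern GPU, on-chip shared memory plays roughly the SRF role (staging a block of the stream close to the ALUs) and per-thread registers play roughly the LRF role (holding the operands of each arithmetic operation). The CUDA sketch below stages a tile in shared memory before computing a 3-point average from registers.

#include <cstdio>
#include <cuda_runtime.h>

// Analogy sketch: shared memory ~ SRF (coarse-grain staging of a tile),
// registers ~ LRF (operands held right next to the ALUs).
__global__ void blur3(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];               // staged block of the stream
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    if (i < n) tile[t] = in[i];                   // bulk transfer into the "SRF"
    if (threadIdx.x == 0)              tile[0]     = (i > 0)     ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();

    if (i < n) {
        float a = tile[t - 1], b = tile[t], c = tile[t + 1];  // operands in registers ("LRF")
        out[i] = (a + b + c) / 3.0f;
    }
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 3.0f;

    blur3<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[10] = %f\n", out[10]);            // interior elements -> 3.0
    cudaFree(in); cudaFree(out);
    return 0;
}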
Topics learnt today:

1. Stream Processing
2. StreamIt language from MIT
3. How modern GPUs use stream processing
4. Imagine Stream Processor from Stanford