Stream processing is a programming paradigm that enables parallel processing of data streams by applying the same kernel function to each element of a stream. It is well suited to applications with large datasets in which each data element can be processed independently, such as audio, video, and signal processing. Modern GPUs take a stream processing approach, achieving high performance by running kernels on many data elements simultaneously.
3. Stream processing is a computer programming paradigm, related to SIMD. It allows some applications to more easily exploit a limited form of parallel processing.
4. A stream is simply a set of records that require similar computation. Streams provide data parallelism. Kernels are the functions that are applied to each element in the stream. For each element we can only read from the input, perform operations on it, and write to the output.
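The read-kernel-write model above can be sketched in a few lines of Python; the function names here are illustrative, not from any real stream framework:

```python
# Minimal sketch of the stream/kernel model described above.
# A kernel reads one input element, computes on it, and writes one
# output element; it never touches other elements, which is what
# makes the stream data-parallel.

def scale_and_offset(x):
    """Kernel: a pure function of a single stream element."""
    return 2 * x + 1

def run_kernel(kernel, stream):
    """Apply the kernel independently to every record in the stream."""
    return [kernel(record) for record in stream]

out = run_kernel(scale_and_offset, [0, 1, 2, 3])  # [1, 3, 5, 7]
```

Because each element is handled independently, the loop inside `run_kernel` is exactly the part a stream processor would execute in parallel.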
7. Stream processing is especially suitable for applications that exhibit three characteristics: compute intensity (many arithmetic operations per memory access), data parallelism (the same kernel can be applied to every record independently), and data locality (values are produced once and consumed once or twice soon afterward).
11. Flynn’s Taxonomy: SISD
Single Instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single Data: only one data stream is being used as input during any one clock cycle.
12. Flynn’s Taxonomy: SIMD
Single Instruction: all processing units execute the same instruction at any given clock cycle.
Multiple Data: each processing unit can operate on a different data element.
13. Flynn’s Taxonomy: MISD
Multiple Instruction: each processing unit operates on the data independently via separate instruction streams.
Single Data: a single data stream is fed into multiple processing units.
14. Flynn’s Taxonomy: MIMD
Multiple Instruction: every processor may be executing a different instruction stream.
Multiple Data: every processor may be working with a different data stream.
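The SIMD/MIMD distinction can be illustrated with a small sketch, with plain Python lists standing in for hardware lanes; all names here are hypothetical:

```python
# Illustrative contrast of SIMD and MIMD semantics (not real hardware).
# SIMD: one instruction, applied in lock-step to many data elements.
# MIMD: each "processor" may run a different instruction on its own data.

def simd(op, data):
    """One operation, broadcast across all lanes."""
    return [op(x) for x in data]

def mimd(ops, data):
    """A distinct operation per processor/data pair."""
    return [op(x) for op, x in zip(ops, data)]

simd_result = simd(lambda x: x * x, [1, 2, 3, 4])        # same op everywhere
mimd_result = mimd([abs, lambda x: -x, str], [-5, 7, 9])  # a different op per lane
```

Stream processing maps naturally onto the SIMD case: the kernel is the single instruction stream, and the stream elements are the multiple data.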
15. Stream Processors
Stream processing makes use of locality of reference by explicitly grouping related code and data together for easy fetching into the cache.
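As a rough sketch of that idea, a stream can be walked in contiguous blocks so that each batch of related records is fetched together, much as a cache line would be; the block size and names are purely illustrative:

```python
# Sketch of cache-friendly stream traversal: records are processed in
# contiguous blocks, so each block of related data is fetched together.
# Plain Python cannot show real cache behavior; this only models the
# access pattern the slide describes.

def process_in_blocks(stream, kernel, block=4):
    out = []
    for i in range(0, len(stream), block):
        chunk = stream[i:i + block]  # contiguous records, fetched together
        out.extend(kernel(x) for x in chunk)
    return out

result = process_in_blocks([1, 2, 3, 4, 5], lambda x: x + 1, block=2)
```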
16. A stream processing language for programs based on streams of data: e.g. audio, video, DSP, networking, and cryptographic processing kernels; HDTV editing, radar tracking, microphone arrays, cellphone base stations, graphics. [Thies 2002]
17. StreamIt: a high-level, architecture-independent language for streaming applications.
1. Improves programmer productivity (vs. Java, C)
2. Offers scalable performance on multicores
[Thies 2002]
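StreamIt programs compose filters into pipelines. As a rough model of that idea (this is ordinary Python, not StreamIt syntax):

```python
# Python model of StreamIt-style composition: independent filters
# chained so that each one consumes the previous filter's output
# stream. StreamIt has its own syntax for this; only the structure
# is mirrored here.

def pipeline(*filters):
    """Compose filters so each consumes the previous one's output."""
    def run(stream):
        for f in filters:
            stream = f(stream)
        return stream
    return run

def doubler(stream):
    return [x * 2 for x in stream]

def clamp(stream):
    return [min(x, 10) for x in stream]

app = pipeline(doubler, clamp)
result = app([1, 4, 6])  # [2, 8, 10]
```

Making the filter graph explicit like this is what lets a StreamIt compiler schedule filters across cores, which is where the scalable multicore performance comes from.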
72. GPU
A GPU is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. It is used primarily for 3D applications. A GPU can be present on a video card, on the motherboard, or, in certain CPUs, on the CPU die.
73. World’s First GPU
In 1999 Nvidia marketed the GeForce 256 as "the world's first GPU", a single-chip processor capable of processing a minimum of 10 million polygons per second. Rival ATI Technologies coined the term visual processing unit, or VPU, with the release of the Radeon 9700 in 2002.
76. GPUs have a very high compute capacity.
To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers.
To the software, the accelerator is another computer to which your program sends data and routines to execute.
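That host-side view can be sketched as follows; the `Device` class is purely illustrative and only simulates the copy-in, execute, copy-out pattern described above, not any real driver API:

```python
# Sketch of the host's view of an accelerator: data is copied over
# (as with DMA), a routine is sent for execution, and results are
# copied back. The Device class is a stand-in, not a real GPU API.

class Device:
    def __init__(self):
        self.memory = {}  # simulated device memory

    def copy_to_device(self, name, host_data):
        """Stand-in for a host-to-device DMA transfer."""
        self.memory[name] = list(host_data)

    def launch(self, kernel, name):
        """Stand-in for sending a routine to run on device data."""
        self.memory[name] = [kernel(x) for x in self.memory[name]]

    def copy_to_host(self, name):
        """Stand-in for a device-to-host DMA transfer."""
        return list(self.memory[name])

dev = Device()
dev.copy_to_device("a", [1, 2, 3])
dev.launch(lambda x: x + 10, "a")
result = dev.copy_to_host("a")  # [11, 12, 13]
```

Real GPU programming models (e.g. CUDA or OpenCL) follow this same three-step shape: explicit transfers in, a kernel launch, explicit transfers out.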
78. GPGPU
This concept turns the massive floating-point computational power of a modern graphics accelerator into general-purpose computing power.
GPUs are stream processors: processors that can operate in parallel by running a single kernel on many records in a stream at once.
Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements.
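A SAXPY-style operation is a textbook example of such a workload, since every output element depends only on the matching input elements; a sketch in plain Python, standing in for the parallel hardware:

```python
# SAXPY (y = a*x + y) is a classic GPGPU-friendly workload: each output
# element depends only on the corresponding inputs, so all elements
# could be computed simultaneously on a GPU. The sequential loop here
# just models the per-element kernel.

def saxpy(a, x, y):
    """Element-wise a*x + y with no cross-element dependencies."""
    return [a * xi + yi for xi, yi in zip(x, y)]

out = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0])  # [12.0, 14.0, 16.0]
```

By contrast, a loop where element i depends on element i-1 (e.g. a running sum) lacks this independence and does not map onto the stream model as directly.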
84. In certain circumstances the GPU calculates forty times faster than conventional CPUs. Transistor counts illustrate the gap:
AMD Athlon 64 X2 (CPU): 154 million
ATI X1950 XTX (GPU): 384 million
Intel Core 2 Quad (CPU): 582 million
NVIDIA G8800 GTX (GPU): 680 million
88. “The processing power of just 5,000 ATI processors is also enough to rival that of the existing 200,000 computers currently involved in the Folding@home project.” [Ref 1]
“..it is estimated that if a mere 10,000 computers were to each use an ATI processor to conduct folding research, that the Folding@home program would effectively perform faster than the fastest supercomputer in existence today, surpassing the 1 petaFLOP level” (2007).
By November 10, 2011, Folding@home had reached 6.0 petaFLOPS, while the fastest supercomputer, the K computer, stood at 8.162 petaFLOPS. [Ref 1]
90. Comparing GPUs to CPUs isn't an apples-to-apples comparison: the clock rates are lower, the architectures are radically different, and the problems they're trying to solve are almost completely unrelated.
94. KEU:
It has two stream-level instructions:
1. load_kernel: loads a compiled kernel function into the local instruction storage inside the KEU.
2. run_kernel: executes the kernel.
[Kapasi 2003]
95. DRAM interface:
It also has two stream-level instructions:
1. load_stream: loads an entire stream from memory into the SRF.
2. store_stream: stores a stream from the SRF into memory.
[Kapasi 2003]
96. Local register files (LRFs)
1. Used for operands for arithmetic operations (similar to caches on CPUs).
2. Exploit fine-grain locality.
[Kapasi 2003]
97. Stream register files (SRFs)
1. Capture coarse-grain locality.
2. Efficiently transfer data to and from the LRFs.
[Kapasi 2003]
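The four stream-level instructions named in these slides can be modeled with a toy simulator; this is only an illustration of how they fit together, not the real Imagine ISA:

```python
# Toy model of the Imagine stream-level instructions described above:
# load_stream / store_stream on the DRAM interface, and load_kernel /
# run_kernel on the KEU. Purely a simulation of the control flow.

class ImagineSim:
    def __init__(self, dram):
        self.dram = dict(dram)  # off-chip memory
        self.srf = {}           # stream register file
        self.kernel = None      # KEU local instruction storage

    def load_stream(self, name):       # memory -> SRF
        self.srf[name] = list(self.dram[name])

    def store_stream(self, name):      # SRF -> memory
        self.dram[name] = list(self.srf[name])

    def load_kernel(self, fn):         # fill the KEU instruction store
        self.kernel = fn

    def run_kernel(self, src, dst):    # apply the kernel over a stream
        self.srf[dst] = [self.kernel(x) for x in self.srf[src]]

sim = ImagineSim({"in": [1, 2, 3]})
sim.load_stream("in")
sim.load_kernel(lambda x: x * x)
sim.run_kernel("in", "out")
sim.store_stream("out")
result = sim.dram["out"]  # [1, 4, 9]
```

Note how the SRF sits between memory and the kernel, which is exactly the coarse-grain locality role slide 97 assigns to it.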
99. Topics learnt today:
1. Stream processing
2. StreamIt language from MIT
3. How modern GPUs use stream processing
4. Imagine Stream Processor from Stanford