L09-1
6.5930/1
Hardware Architectures for Deep Learning
GPU Computation
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 6, 2023
L09-2
Sze and Emer
Why are GPUs interesting?
March 6, 2023
• Very successful commodity accelerator/co-processor
• GPUs combine two strategies to increase efficiency
– Massive parallelism
– Specialization
• Illustrates tension between performance and
programmability in accelerators
• Pervasively applied in deep learning applications
L09-3
Sze and Emer
Background Reading
• GPUs
– Computer Architecture: A Quantitative Approach,
by Hennessy and Patterson
• Edition 6: Ch 4: p310-336
• Edition 5: Ch 4: p288-315
The book and its online/e-book version are available through
MIT Libraries.
March 6, 2023
L09-4
Sze and Emer
Programmability vs Efficiency
March 6, 2023
[Spectrum: CPU → GPU → FPGA → CGRA → ASIC, from more programmable to more efficient]
FPGA => Field programmable gate array
CGRA => Coarse-grained reconfigurable array
ASIC => Application-specific integrated circuit
L09-5
Sze and Emer
CPU vs GPU Attribute Summary
March 6, 2023
Source: Stanford CS231n
L09-6
Sze and Emer
CPU vs. GPU Performance
March 6, 2023
Source: Stanford CS231n
Ratio of (partially-optimized) CPU vs. CUDA library (cuDNN)
L09-7
Sze and Emer
CPU vs. GPU Performance
March 6, 2023
Source: Stanford CS231n
Ratio of unoptimized CUDA vs. CUDA library (cuDNN)
L09-8
Sze and Emer
Single Instruction Multiple Thread
March 6, 2023
SIMT
– Many threads each with
private architectural state
or context, e.g., registers.
– Group of threads that
issue together called a
warp.
– All threads that issue
together execute same
instruction.
– Entire pipeline is an SM or
streaming multiprocessor
[Figure: SIMT pipeline — a shared PC, I$, and IR feed two lanes, each with its own register file (GPR), + and * function units, and memory port; green = Nvidia terminology (e.g., "Lane")]
L09-9
Sze and Emer
Multiple Thread – Single Instruction Multiple Thread
March 6, 2023
[Figure: the SIMT pipeline extended with multithreading — per-thread PCs (advanced by +1) and per-thread register files (GPR1, …) sit behind the shared I$/IR and feed the lanes' + and * function units and memory]
L09-10
Sze and Emer
Function unit optimization
March 6, 2023
[Figure: two-lane pipeline in which each lane has its own + and * function units; next step is to specialize the function units]
Specialize Function Units
L09-11
Sze and Emer
Function unit optimization
March 6, 2023
Restriction: Can't issue the same operation twice in a row
[Figure: the lanes now share a single adder and a single multiplier, reached through crossbars between the register files (GPR) and the function units (two IRs shown)]
L09-12
Sze and Emer
Function unit optimization
March 6, 2023
Pipeline activity by unit and cycle (Key: Opcode<instruction #>,<thread(s)>):

Unit \ Cycle |    0      |    1      |    2      |    3      |    4      |  5
Fetch        | Add0,0-1  | Mul1,0-1  | Add2,0-1  | Mul3,0-1  | …         |
GPR0         |           | Add0,0    | Mul1,0    | Add2,0    | Mul3,0    | …
GPR1         |           |           | Add0,1    | Mul1,1    | Add2,1    | …
Adder        |           |           | Add0,0    | Add0,1    | Add2,0    | …
Mul          |           |           |           | Mul1,0    | Mul1,1    | …

[Figure: the same shared-adder/multiplier pipeline with crossbars, annotated with this schedule]
L09-13
Sze and Emer
Streaming Multiprocessor Overview
March 6, 2023
• Each SM supports 10s of
warps (e.g., 64 in Kepler) with
32 threads/warp
• Fetch 1 instr/cycle
• Issue 1 ready instr/cycle
– Simple scoreboarding: all warp
elements must be ready
• Instruction broadcast to all
lanes
• Multithreading is the main
latency-hiding mechanism
L09-14
Sze and Emer
Little’s Law
March 6, 2023
Throughput (T) = Number in Flight (N) / Latency (L)
[Figure: instructions in flight between issue and execution]
Example:
64 warps (number of instructions in flight)
1 instruction / cycle (desired throughput)
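Rearranging Little's Law with the slide's numbers gives the implied conclusion (a worked example; the unit labels are mine):

$$
L \;=\; \frac{N}{T} \;=\; \frac{64 \ \text{warp instructions in flight}}{1 \ \text{instruction/cycle}} \;=\; 64 \ \text{cycles}
$$

So an SM with 64 warps in flight can sustain full issue rate as long as the average instruction latency is at most 64 cycles.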
L09-15
Sze and Emer
Number of Active Warps
March 6, 2023
• SMs support a variable number of active warps
based on required registers (and shared memory).
Fewer warps allow more registers per warp, i.e., a
larger context.
– Few large contexts ⇒ fewer register spills
– Many small contexts ⇒ more latency tolerance
– Choice left to the compiler
• Example: Kepler supports up to 64 warps
– Max: 64 warps @ <=32 registers/thread
– Min: 8 warps @ 256 registers/thread
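A quick sanity check of these two endpoints, assuming a fixed 64K-entry register file per SM (the Kepler figure; this assumption is mine, not stated on the slide):

$$
64 \ \text{warps} \times 32 \ \tfrac{\text{threads}}{\text{warp}} \times 32 \ \tfrac{\text{regs}}{\text{thread}} = 65{,}536
\qquad\text{and}\qquad
8 \times 32 \times 256 = 65{,}536
$$

Both configurations exactly fill the same register file, which is why more registers per thread must trade off against fewer active warps.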
L09-16
Sze and Emer
GPU ISA and Compilation
March 6, 2023
• GPU micro-architecture and instruction set change
very frequently
• To achieve compatibility:
– Compiler produces intermediate pseudo-assembler
language (e.g., Nvidia PTX)
– GPU driver JITs the kernel, tailoring it to the specific micro-architecture
• In practice, little performance portability
– Code is often tuned to specific GPU architecture
L09-17
Sze and Emer
Multiple Thread – Single Instruction Multiple Thread
March 6, 2023
[Figure (recap): the multithreaded SIMT pipeline — per-thread PCs and register files behind a shared I$/IR, feeding the lanes' + and * function units and memory]
L09-18
Sze and Emer
Many Memory Types
March 6, 2023
[Figure: threads 0, 1, 2, … each see per-thread memory, scratchpad shared memory, and global memory]
L09-19
Sze and Emer
Private Per Thread Memory
March 6, 2023
• Private memory
– No cross-thread sharing
– Small, fixed size memory
• Can be used for constants
– Multi-bank implementation (can be in global memory)
[Figure: threads 0, 1, 2, … each paired with their own private memory]
L09-20
Sze and Emer
Shared Scratchpad Memory
March 6, 2023
• Shared scratchpad memory (threads share data)
– Small, fixed size memory (16K-64K / ‘core’)
– Banked for high bandwidth
– Fed with address coalescing unit (ACU) + crossbar
• ACU can buffer/coalesce requests
[Figure: threads 0, 1, 2, … access banked shared memory through an ACU + crossbar]
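A minimal CUDA sketch of how a thread block uses this scratchpad (the __shared__ qualifier maps to it); the kernel, array names, and the 256-thread block size are illustrative assumptions, not from the slides:

__global__ void stage_in_scratchpad(const float *in, float *out) {
  __shared__ float tile[256];            // allocated in the SM's scratchpad
  int gid = blockIdx.x * blockDim.x + threadIdx.x;

  tile[threadIdx.x] = in[gid];           // each thread stages one element
  __syncthreads();                       // barrier: whole tile now visible to the block

  // Threads may now read one another's elements at scratchpad bandwidth
  out[gid] = tile[(threadIdx.x + 1) % blockDim.x];
}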
L09-21
Sze and Emer
Memory Access Divergence
March 6, 2023
• All loads are gathers, all stores are scatters
• Address coalescing unit (ACU) detects
sequential and strided patterns, coalesces
memory requests, but complex patterns can
result in multiple lower bandwidth requests
(memory divergence)
• Writing efficient GPU code requires most
accesses to not conflict, even though
programming model allows arbitrary patterns!
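To make the distinction concrete, here is a sketch of two access patterns as the ACU would see them (illustrative kernels, not from the lecture):

__global__ void coalesced(const float *a, float *b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] = a[i];            // adjacent threads in a warp touch adjacent
}                                    // words: the ACU merges them into one wide request

__global__ void divergent(const float *a, float *b, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) b[i] = a[i * stride];   // large stride: requests cannot be merged, so the
}                                    // warp issues many narrow, lower-bandwidth requests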
L09-22
Sze and Emer
Shared Global Memory
March 6, 2023
• Shared global memory
– Large shared memory
– Also suffers from memory divergence
[Figure: threads sending requests through an address coalescing unit (ACU) + crossbar to multiple global memory banks]
L09-23
Sze and Emer
Shared Global Memory
March 6, 2023
• Memory hierarchy with caches
– Cache to save memory bandwidth
– Caches also enable compression/decompression of data
[Figure: requests pass through an ACU + crossbar to per-bank caches (tags/data); hits return buffered data, while misses travel over a network to the global memory banks and back]
L09-24
Sze and Emer
Serialized cache access
March 6, 2023
• Trade latency for power/flexibility
– Only access data bank that contains data
– Facilitate more sophisticated cache organizations
• e.g., greater associativity
[Figure: serialized (two-step) cache access — the address is split into tag, index, and block offset; the tag store is probed first, and only the matching data store is then read, with the results combined]
L09-25
Sze and Emer
GPU Programming Environments
Code for accelerated kernels
• CUDA (Nvidia-only)
– C-like language that runs on GPU
– Libraries: cuDNN, cuBLAS, cuFFT
• OpenCL (open standard)
– C-like language that runs on GPU, CPU or FPGA
– usually less optimized than CUDA
March 6, 2023
L09-26
Sze and Emer
CUDA GPU Thread Model
March 6, 2023
• Single-program multiple data (SPMD) model
• Each context is a thread
– Threads have registers
– Threads have local memory
• Parallel threads packed in blocks
– Blocks have shared memory
– Threads synchronize with barrier
– Blocks run to completion (or abort)
• Grids include independent blocks
– May execute concurrently
– Share global memory, but
– Have limited inter-block synchronization
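A small sketch of how this hierarchy appears in CUDA source (the kernel and variable names are illustrative):

__global__ void scale(float *x, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's position in the grid
  if (i < n) x[i] *= alpha;                        // one element per thread
}

// Host side: a grid of independent blocks, each block a group of threads
// that can share __shared__ memory and synchronize with __syncthreads()
int threads = 256;
int blocks  = (n + threads - 1) / threads;
scale<<<blocks, threads>>>(d_x, 2.0f, n);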
L09-27
Sze and Emer
Hardware Scheduling
March 6, 2023
• Grids can be launched by
CPU or GPU
– Work from multiple CPU threads
and processes
• HW unit schedules grids on
SMs (labeled SMX in diagram)
– Priority-based scheduling
• 32 active grids
– More queued/paused
L09-28
Sze and Emer
GPU Kernel Execution
March 6, 2023
1. Transfer input data from CPU to GPU memory
2. Launch kernel (grid)
3. Wait for kernel to finish (if synchronous)
4. Transfer results to CPU memory
[Figure: CPU and GPU, each with its own memory; arrows show steps 1-4]
• Data transfers can dominate execution time
• Integrated GPUs with a unified address space ⇒ no copies, but…
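A minimal host-side sketch of the four numbered steps (names such as d_in, d_out, and my_kernel are placeholders; error checking omitted). The FC CUDA main program later in the lecture is a complete version of the same pattern:

// 1. Transfer input data from CPU to GPU memory
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
// 2. Launch kernel (grid)
my_kernel<<<blocks, threads>>>(d_in, d_out, n);
// 3. Wait for kernel to finish (if synchronous)
cudaDeviceSynchronize();
// 4. Transfer results to CPU memory
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);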
L09-29
Sze and Emer
Fully-Connected (FC) Layer
[Matrix–vector view: Filters (M × CHW) × Input fmap (CHW × 1) = Output fmap (M × 1)]
• Matrix–Vector Multiply:
• Multiply all inputs in all channels by a weight and sum
March 6, 2023
L09-30
Sze and Emer
Fully-Connected (FC) Layer
[Matrix–vector view: Filters (M × CHW) × Input fmap (CHW × 1) = Output fmap (M × 1)]
• Matrix–Vector Multiply:
• Multiply all inputs in all channels by a weight and sum
March 6, 2023
L09-31
Sze and Emer
Fully Connected Computation
F[M0 C0 H0 W0] F[M0 C0 H0 W1] …
F[M0 C0 H1 W0] F[M0 C0 H1 W1] …
F[M0 C0 H2 W0] F[M0 C0 H2 W1] …
.
.
F[M0 C1 H0 W0] F[M0 C1 H0 W1] …
F[M0 C1 H1 W0] F[M0 C1 H1 W1] …
F[M0 C1 H2 W0] F[M0 C1 H2 W1] …
.
.
.
F[M1 C0 H0 W0] F[M1 C0 H0 W1] …
F[M1 C0 H1 W0] F[M1 C0 H1 W1] …
F[M1 C0 H2 W0] F[M1 C0 H2 W1] …
.
.
.
I[C0 H0 W0] I[C0 H0 W1] …
I[C0 H1 W0] I[C0 H1 W1] …
I[C0 H2 W0] I[C0 H2 W1] …
.
.
I[C1 H0 W0] I[C1 H0 W1] …
I[C1 H1 W0] I[C1 H1 W1] …
I[C1 H2 W0] I[C1 H2 W1] …
.
.
.
March 6, 2023
int i[C*H*W];    // Input activations
int f[M*C*H*W];  // Filter Weights
int o[M];        // Output activations

for (int m = 0; m < M; m++) {
  o[m] = 0;
  int CHWm = C*H*W*m;
  for (int chw = 0; chw < C*H*W; chw++) {
    o[m] += i[chw] * f[CHWm + chw];
  }
}
L09-32
Sze and Emer
FC - CUDA Main Program
int i[C*H*W], *gpu_i;    // Input activations
int f[M*C*H*W], *gpu_f;  // Filter Weights
int o[M], *gpu_o;        // Output activations

// Allocate space on GPU
cudaMalloc((void**) &gpu_i, sizeof(int)*C*H*W);
cudaMalloc((void**) &gpu_f, sizeof(int)*M*C*H*W);
cudaMalloc((void**) &gpu_o, sizeof(int)*M);

// Copy data to GPU
cudaMemcpy(gpu_i, i, sizeof(int)*C*H*W, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_f, f, sizeof(int)*M*C*H*W, cudaMemcpyHostToDevice);

// Run kernel: M/256 + 1 thread blocks, 256 threads per block
fc<<<M/256 + 1, 256>>>(gpu_i, gpu_f, gpu_o, C*H*W, M);

// Copy result back and free device memory
cudaMemcpy(o, gpu_o, sizeof(int)*M, cudaMemcpyDeviceToHost);
cudaFree(gpu_i);
...
March 6, 2023
L09-33
Sze and Emer
FC – CUDA Terminology
• CUDA code launches 256 threads per block
• CUDA vs vector terminology:
– Thread = 1 iteration of scalar loop [1 element in vector loop]
– Block = Body of vectorized loop [VL=256 in this example]
• Warp size = 32 [Number of vector lanes]
– Grid = Vectorizable loop
[ vector terminology ]
March 6, 2023
L09-34
Sze and Emer
FC - CUDA Kernel
__global__
void fc(int *i, int *f, int *o, const int CHW, const int M) {
  // Calculate "global thread id" (tid); tid is the "output channel" number
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int m = tid;
  int sum = 0;
  if (m < M) {
    for (int chw = 0; chw < CHW; chw++) {
      sum += i[chw] * f[(m*CHW) + chw];
    }
    o[m] = sum;
  }
}
Any consequences of f[(m*CHW)+chw]?
Any consequences of f[(chw*M)+m]?
March 6, 2023
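One likely answer to the two questions, framed cautiously: with f[(m*CHW)+chw], the threads of a warp (consecutive m) read words that are CHW apart on the same iteration, so the accesses cannot be coalesced and each load becomes many narrow requests. If the weights were instead stored transposed, f[(chw*M)+m] would have consecutive threads read consecutive words, which the ACU can merge:

// Sketch assuming a transposed weight layout f[chw][m] (an assumption, not the
// layout used above): threads m, m+1, ... of a warp now read adjacent words of f
sum += i[chw] * f[(chw*M) + m];   // coalesced: one wide request per warp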
L09-35
Sze and Emer
Convolution (CONV) Layer
[Figure: CONV layer shapes — N input fmaps, each C × H × W; M filters, each C × R × S; N output fmaps, each M × E × F; batch size N]
March 6, 2023
L09-36
Sze and Emer
Convolution (CONV) Layer
[Figure: 2-filter, 2-channel example — each filter flattened into a row of a 2 × 8 filter matrix, the input fmap rearranged into an 8 × 4 Toeplitz matrix, and their product giving a 2 × 4 output fmap]
GPU Implementation:
• Keep original input activation matrix in main memory
• Conceptually do a tiled matrix-matrix multiplication
• Copy input activations into the scratchpad as small Toeplitz matrix tiles
• On Volta, tile again to use the 'tensor core'
March 6, 2023
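For reference, a minimal sketch of the Toeplitz (im2col) expansion for one input fmap, assuming unit stride and no padding; the function and array names are illustrative, not from the lecture code:

// Expand an input fmap laid out as [C][H][W] into a (C*R*S) x (E*F) Toeplitz matrix
void im2col(const int *in, int *toeplitz, int C, int H, int W, int R, int S) {
  int E = H - R + 1, F = W - S + 1;              // output fmap height/width
  for (int c = 0; c < C; c++)
    for (int r = 0; r < R; r++)
      for (int s = 0; s < S; s++)                // one Toeplitz row per filter element
        for (int e = 0; e < E; e++)
          for (int f = 0; f < F; f++)            // one Toeplitz column per output pixel
            toeplitz[((c*R + r)*S + s)*(E*F) + (e*F + f)] =
                in[(c*H + e + r)*W + (f + s)];
}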
L09-37
Sze and Emer
GV100 – “Tensor Core”
Tensor Core….
• 120 TFLOPS (FP16)
• 400 GFLOPS/W (FP16)
New opcodes – Matrix Multiply Accumulate (HMMA)
How many FP16 operands?
How many multiplies?
How many adds?
March 6, 2023
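For concreteness, a minimal sketch of how a warp drives the tensor core through CUDA's WMMA intrinsics, which compile down to HMMA instructions; the 16×16×16 FP16 fragment shape, matrix layouts, and pointer names are illustrative choices:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile: D = A (16x16, FP16) x B (16x16, FP16) + 0
__global__ void wmma_tile(const half *A, const half *B, float *D) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

  wmma::fill_fragment(acc_frag, 0.0f);                  // accumulator = 0
  wmma::load_matrix_sync(a_frag, A, 16);                // operands spread across the warp
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // issues the HMMA operations
  wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}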
L09-38
Sze and Emer
Tensor Core Integration
[Figure: the multithreaded SIMT pipeline from before, with the tensor core added in addition to the normal function units]
March 6, 2023
L09-39
Sze and Emer
Tensor Core Integration
[Figure: the same pipeline with the + and * units replaced by a Tensor Matrix Multiply Accumulate unit that takes cross-thread operands from the per-thread register files]
March 6, 2023
L09-40
Sze and Emer
Fully-Connected (FC) Layer
[Matrix–matrix view: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
• After flattening, having a batch size of N turns the
matrix-vector operation into a matrix-matrix multiply
March 6, 2023
L09-41
Sze and Emer
Tiled Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), partitioned into 2 × 2 blocks:

[ F0,0  F0,1 ]   [ I0,0  I0,1 ]   [ F0,0·I0,0 + F0,1·I1,0    F0,0·I0,1 + F0,1·I1,1 ]
[ F1,0  F1,1 ] × [ I1,0  I1,1 ] = [ F1,0·I0,0 + F1,1·I1,0    F1,0·I0,1 + F1,1·I1,1 ]
Matrix multiply tiled to fit in tensor core operation
and computation ordered to maximize reuse of data in scratchpad
and then in cache.
March 6, 2023
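A minimal sketch of such a tiled multiply using the scratchpad (shared memory); TILE, the kernel name, and the assumption that M, CHW, and N are multiples of TILE are illustrative choices, and a production version would further tile onto the tensor core as described above:

#define TILE 16

// O (M x N) = F (M x CHW) x I (CHW x N); one TILE x TILE output tile per thread block
__global__ void fc_tiled(const float *F, const float *I, float *O,
                         int M, int CHW, int N) {
  __shared__ float Fs[TILE][TILE];                      // filter tile in scratchpad
  __shared__ float Is[TILE][TILE];                      // input fmap tile in scratchpad

  int row = blockIdx.y * TILE + threadIdx.y;            // output channel m
  int col = blockIdx.x * TILE + threadIdx.x;            // batch element n
  float sum = 0.0f;

  for (int t = 0; t < CHW / TILE; t++) {
    // Each thread stages one element of each operand tile
    Fs[threadIdx.y][threadIdx.x] = F[row * CHW + t * TILE + threadIdx.x];
    Is[threadIdx.y][threadIdx.x] = I[(t * TILE + threadIdx.y) * N + col];
    __syncthreads();                                    // whole tile loaded

    for (int k = 0; k < TILE; k++)                      // each staged element reused TILE times
      sum += Fs[threadIdx.y][k] * Is[k][threadIdx.x];
    __syncthreads();                                    // before overwriting the tiles
  }
  O[row * N + col] = sum;
}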
L09-42
Sze and Emer
Tiled Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), partitioned into 2 × 2 blocks:

[ F0,0  F0,1 ]   [ I0,0  I0,1 ]   [ F0,0·I0,0 + F0,1·I1,0    F0,0·I0,1 + F0,1·I1,1 ]
[ F1,0  F1,1 ] × [ I1,0  I1,1 ] = [ F1,0·I0,0 + F1,1·I1,0    F1,0·I0,1 + F1,1·I1,1 ]
Matrix multiply tiled to fit in tensor core operation
and computation ordered to maximize reuse of data in scratchpad
and then in cache.
March 6, 2023
L09-43
Sze and Emer
Vector vs GPU Terminology
March 6, 2023
[H&P5, Fig 4.25]
L09-44
Sze and Emer
Summary
• GPUs are an intermediate point in the continuum of flexibility
and efficiency
• GPU architecture focuses on throughput (over latency)
– Massive thread-level parallelism, with
• Single instruction for many threads
– Memory hierarchy specialized for throughput
• Shared scratchpads with private address space
• Caches used primarily to reduce bandwidth not latency
– Specialized compute units, with
• Many computes per input operand
• Little’s Law useful for system analysis
– Shows relationship between throughput, latency and tasks in-flight
March 6, 2023
L09-45
Sze and Emer
Next Lecture: Spatial Architectures
Thank you!
March 6, 2023