L09-1
6.5930/1
Hardware Architectures for Deep Learning
GPU Computation
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 6, 2023
L09-2: Why are GPUs interesting?
• Very successful commodity accelerator/co-processor
• GPUs combine two strategies to increase efficiency
  – Massive parallelism
  – Specialization
• Illustrates the tension between performance and programmability in accelerators
• Pervasively applied in deep learning applications
L09-3: Background Reading
• GPUs
  – Computer Architecture: A Quantitative Approach, by Hennessy and Patterson
    • Edition 6: Ch 4: p310-336
    • Edition 5: Ch 4: p288-315
All these books and their online/e-book versions are available through MIT libraries.
L09-4: Programmability vs Efficiency
Spectrum, from more programmable to more efficient: CPU, GPU, FPGA, CGRA, ASIC
FPGA => Field-programmable gate array
CGRA => Coarse-grained reconfigurable array
ASIC => Application-specific integrated circuit
L09-5: CPU vs GPU Attribute Summary
[Table comparing CPU and GPU attributes. Source: Stanford CS231n]
L09-6: CPU vs. GPU Performance
[Chart: ratio of (partially-optimized) CPU vs. CUDA library (cuDNN) performance. Source: Stanford CS231n]
L09-7: CPU vs. GPU Performance
[Chart: ratio of unoptimized CUDA vs. CUDA library (cuDNN) performance. Source: Stanford CS231n]
L09-8: Single Instruction Multiple Thread
SIMT
– Many threads, each with private architectural state or context, e.g., registers.
– A group of threads that issue together is called a warp.
– All threads that issue together execute the same instruction.
– The entire pipeline is an SM, or streaming multiprocessor.
[Diagram: a scalar pipeline (PC, I$, IR, GPR, X/Y operand latches, + and * function units, memory) replicated per thread; green labels mark Nvidia terminology, and "Lane" names one datapath.]
L09-9: Multiple Thread – Single Instruction Multiple Thread
[Diagram: a shared fetch front end (PC, I$, IR) with per-thread PCs feeding multiple lanes, each lane with its own GPR file, X/Y operand latches, and + and * function units, all connected to memory.]
L09-10: Function unit optimization
Specialize function units
[Diagram: each lane's combined +/* datapath is split into dedicated adder and multiplier units.]
L09-11: Function unit optimization
Restriction: Can't issue the same operation twice in a row
[Diagram: a single adder and a single multiplier shared across lanes through input and output crossbars.]
L09-12: Function unit optimization
Pipeline occupancy by unit and cycle (Key: Opcode<instruction number>,<thread(s)>):

Unit  | 0        | 1        | 2        | 3        | 4      | 5
Fetch | Add0,0-1 | Mul1,0-1 | Add2,0-1 | Mul3,0-1 | …      |
GPR0  |          | Add0,0   | Mul1,0   | Add2,0   | Mul3,0 | …
GPR1  |          |          | Add0,1   | Mul1,1   | Add2,1 | …
Adder |          |          | Add0,0   | Add0,1   | Add2,0 | …
Mul   |          |          |          | Mul1,0   | Mul1,1 | …

[Diagram: the same crossbar-shared adder/multiplier pipeline as the previous slide.]
L09-13: Streaming Multiprocessor Overview
• Each SM supports 10s of warps (e.g., 64 in Kepler) with 32 threads/warp
• Fetch 1 instr/cycle
• Issue 1 ready instr/cycle
  – Simple scoreboarding: all warp elements must be ready
• Instruction broadcast to all lanes
• Multithreading is the main latency-hiding mechanism
L09-14: Little's Law
Throughput (T) = Number in Flight (N) / Latency (L)
[Diagram: instructions in flight between issue and execution.]
Example:
  64 warps (number of instructions in flight)
  1 instruction / cycle (desired throughput)
  ⇒ average instruction latency of at most 64 cycles can be tolerated
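To make the arithmetic concrete, here is a minimal sketch (plain C, not from the slides) that rearranges Little's Law to compute how much work must be in flight to sustain a target issue rate:

```c
#include <stdio.h>

/* Little's Law: T = N / L, so N = T * L. */
int main(void) {
    double throughput = 1.0;  /* desired instructions per cycle (T) */
    double latency    = 64.0; /* average instruction latency in cycles (L) */
    double in_flight  = throughput * latency;  /* required N */
    printf("Need %.0f instructions in flight (e.g., one per warp)\n", in_flight);
    return 0;
}
```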
L09-15: Number of Active Warps
• SMs support a variable number of active warps based on required registers (and shared memory). Fewer warps allows more registers per warp, i.e., a larger context.
  – Few large contexts → fewer register spills
  – Many small contexts → more latency tolerance
  – Choice left to the compiler
• Example: Kepler supports up to 64 warps
  – Max: 64 warps @ <=32 registers/thread
  – Min: 8 warps @ 256 registers/thread
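These two operating points are consistent with a fixed register-file budget: Kepler (GK110) provides 65,536 32-bit registers per SM, and both extremes use the whole file, since 64 warps × 32 threads × 32 registers = 65,536 and 8 warps × 32 threads × 256 registers = 65,536.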
L09-16: GPU ISA and Compilation
• GPU micro-architecture and instruction set change very frequently
• To achieve compatibility:
  – Compiler produces an intermediate pseudo-assembly language (e.g., Nvidia PTX)
  – GPU driver JITs the kernel, tailoring it to the specific micro-architecture
• In practice, little performance portability
  – Code is often tuned to a specific GPU architecture
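For illustration (not the slides' code), a minimal sketch using the CUDA driver API, assuming cuda.h and omitting error handling; passing PTX text to cuModuleLoadData is the point where the driver JIT-compiles it for whatever GPU is actually present:

```c
#include <cuda.h>

CUfunction load_kernel_from_ptx(const char *ptx_text, const char *name) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, ptx_text);  // driver JITs PTX to native code here
    cuModuleGetFunction(&fn, mod, name);
    return fn;
}
```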
L09-17: Multiple Thread – Single Instruction Multiple Thread
[Same SIMT pipeline diagram as slide L09-9, repeated as a reminder before discussing memory.]
L09-18: Many Memory Types
[Diagram: threads 0, 1, 2, … each with per-thread memory, a scratchpad shared memory shared among them, and a global memory below.]
L09-19: Private Per Thread Memory
• Private memory
  – No cross-thread sharing
  – Small, fixed-size memory
    • Can be used for constants
  – Multi-bank implementation (can be in global memory)
[Diagram: each thread paired one-to-one with its own private memory bank.]
L09-20: Shared Scratchpad Memory
• Shared scratchpad memory (threads share data)
  – Small, fixed-size memory (16K-64K / 'core')
  – Banked for high bandwidth
  – Fed with address coalescing unit (ACU) + crossbar
    • ACU can buffer/coalesce requests
[Diagram: threads connected through an ACU + crossbar to the shared-memory banks, and back through a second ACU + crossbar.]
L09-21: Memory Access Divergence
• All loads are gathers, all stores are scatters
• The address coalescing unit (ACU) detects sequential and strided patterns and coalesces memory requests, but complex patterns can result in multiple lower-bandwidth requests (memory divergence)
• Writing efficient GPU code requires most accesses to not conflict, even though the programming model allows arbitrary patterns!
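As an illustration (not the slides' code), a sketch of two access patterns a warp might generate; the kernel names are hypothetical:

```cuda
// Assumes arrays of length >= blockDim.x * gridDim.x (and * stride below).
__global__ void coalesced(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];           // adjacent threads touch adjacent words:
                              // the ACU merges these into a few wide requests
}

__global__ void strided(const float *in, float *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];  // with a large stride, each thread's word lands
                              // in a different segment: many narrow requests
                              // (memory divergence)
}
```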
L09-22: Shared Global Memory
• Shared global memory
  – Large shared memory
  – Also suffers from memory divergence
[Diagram: threads connected through an ACU + crossbar to the global-memory banks.]
L09-23: Shared Global Memory
• Memory hierarchy with caches
  – Cache to save memory bandwidth
  – Caches also enable compression/decompression of data
[Diagram: ACU + crossbar feeding per-bank caches (tags/data); hits return buffered data through a network, while misses cross a second crossbar to the global-memory banks.]
L09-24: Serialized cache access
• Trade latency for power/flexibility
  – Only access the data bank that contains the data
  – Facilitate more sophisticated cache organizations
    • e.g., greater associativity
[Diagram: address split into tag / index / block offset; a parallel cache reads the tag store and data store together and combines, while a serialized cache checks the tag store first and then reads only the matching data bank.]
L09-25: GPU Programming Environments
Code for accelerated kernels
• CUDA (Nvidia-only)
  – C-like language that runs on the GPU
  – Libraries: cuDNN, cuBLAS, cuFFT
• OpenCL (open standard)
  – C-like language that runs on GPU, CPU, or FPGA
  – Usually less optimized than CUDA
L09-26: CUDA GPU Thread Model
• Single-program multiple-data (SPMD) model
• Each context is a thread
  – Threads have registers
  – Threads have local memory
• Parallel threads are packed in blocks
  – Blocks have shared memory
  – Threads synchronize with a barrier
  – Blocks run to completion (or abort)
• Grids include independent blocks
  – May execute concurrently
  – Share global memory, but
  – Have limited inter-block synchronization
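A minimal sketch of these concepts in CUDA (illustrative, not the slides' code): block_sum is a hypothetical kernel that assumes it is launched with 256 threads per block (a power of two), so each block reduces its slice and writes one result:

```cuda
__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];             // per-block shared memory
    int t = threadIdx.x;                   // per-thread register state
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                       // barrier: all threads in the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = buf[0];  // one result per (independent) block
}
```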
L09-27: Hardware Scheduling
• Grids can be launched by CPU or GPU
  – Work from multiple CPU threads and processes
• A HW unit schedules grids on SMs (labeled SMX in the diagram)
  – Priority-based scheduling
• 32 active grids
  – More can be queued/paused
L09-28: GPU Kernel Execution
1. Transfer input data from CPU to GPU memory
2. Launch kernel (grid)
3. Wait for kernel to finish (if synchronous)
4. Transfer results to CPU memory
[Diagram: CPU and its memory connected to GPU and its memory, with steps 1-4 marked on the data paths.]
• Data transfers can dominate execution time
• Integrated GPUs with a unified address space → no copies, but…
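One common mitigation is to overlap transfers with execution using streams; a hedged sketch (not the slides' code), assuming pinned host buffers from cudaMallocHost and a placeholder kernel my_kernel(float*, float*, int):

```cuda
void run_async(const float *h_in, float *h_out,
               float *d_in, float *d_out, int n) {
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, s);
    my_kernel<<<(n + 255) / 256, 256, 0, s>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);  // one wait at the end, not after every step
    cudaStreamDestroy(s);
}
```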
L09-29: Fully-Connected (FC) Layer
• Matrix-Vector Multiply:
  Output fmaps (M × 1) = Filters (M × CHW) × Input fmaps (CHW × 1)
• Multiply all inputs in all channels by a weight and sum
L09-31: Fully Connected Computation
[Figure: filter weights F[m, c, h, w] and input activations I[c, h, w] shown flattened in row-major (C, H, W) order.]
int i[C*H*W];    // Input activations
int f[M*C*H*W];  // Filter weights
int o[M];        // Output activations

for (int m = 0; m < M; m++) {
  o[m] = 0;
  int CHWm = C*H*W*m;
  for (int chw = 0; chw < C*H*W; chw++) {
    o[m] += i[chw] * f[CHWm + chw];
  }
}
L09-32: FC - CUDA Main Program
int i[C*H*W], *gpu_i;    // Input activations
int f[M*C*H*W], *gpu_f;  // Filter weights
int o[M], *gpu_o;        // Output activations

// Allocate space on GPU
cudaMalloc((void**) &gpu_i, sizeof(int)*C*H*W);
cudaMalloc((void**) &gpu_f, sizeof(int)*M*C*H*W);
cudaMalloc((void**) &gpu_o, sizeof(int)*M);

// Copy data to GPU
cudaMemcpy(gpu_i, i, sizeof(int)*C*H*W, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_f, f, sizeof(int)*M*C*H*W, cudaMemcpyHostToDevice);

// Run kernel: M/256 + 1 thread blocks (num blocks) of 256 threads (block size)
fc<<<M/256 + 1, 256>>>(gpu_i, gpu_f, gpu_o, C*H*W, M);

// Copy result back and free device memory
cudaMemcpy(o, gpu_o, sizeof(int)*M, cudaMemcpyDeviceToHost);
cudaFree(gpu_i);
...
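Note that M/256 + 1 rounds the block count up with integer division (a common equivalent idiom is (M + 255)/256); when M is an exact multiple of 256 it launches one extra block, which is harmless because the kernel's if (m < M) guard makes the surplus threads do nothing.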
L09-33: FC – CUDA Terminology
• CUDA code launches 256 threads per block
• CUDA vs. vector terminology (brackets give the vector-machine equivalent):
  – Thread = 1 iteration of scalar loop [1 element in vector loop]
  – Block = body of vectorized loop [VL=256 in this example]
    • Warp size = 32 [number of vector lanes]
  – Grid = vectorizable loop
L09-34: FC - CUDA Kernel
__global__
void fc(int *i, int *f, int *o, const int CHW, const int M) {
  // Calculate "global" thread id (tid); tid is the output channel number
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int m = tid;
  int sum = 0;
  if (m < M) {
    for (int chw = 0; chw < CHW; chw++) {
      sum += i[chw] * f[(m*CHW) + chw];
    }
    o[m] = sum;
  }
}

Any consequences of f[(m*CHW)+chw]? Yes, strided references: at a given chw, adjacent threads (adjacent m) touch addresses CHW elements apart, so the accesses do not coalesce.
Any consequences of f[(chw*M)+m]? Yes, a different data layout is required, but adjacent threads then touch adjacent addresses, which do coalesce.
L09-35: Convolution (CONV) Layer
[Figure: N input fmaps (each C × H × W) convolved with M filters (each C × R × S) to produce N output fmaps (each M × E × F); batch size is N.]
L09-36: Convolution (CONV) Layer
[Figure: a small example where a 3×3-per-channel input fmap is rearranged into an 8 × 4 Toeplitz matrix (two channels), multiplied by two flattened 2×2-per-channel filters (a 2 × 8 matrix) to produce a 2 × 4 output fmap.]
GPU Implementation:
• Keep the original input activation matrix in main memory
• Conceptually do a tiled matrix-matrix multiplication
• Copy input activations into the scratchpad as a small Toeplitz-matrix tile
• On Volta, tile again to use the 'tensor core'
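As a sketch of the rearrangement itself (illustrative C for a single channel, not the slides' code), assuming unit stride, no padding, E = H-R+1, F = W-S+1, and row-major arrays:

```c
// Build the (R*S) x (E*F) Toeplitz ("im2col") matrix: each output position
// (e, fo) gets a column holding the R*S input values under the filter window.
void im2col(const int *in, int *col, int H, int W, int R, int S) {
    int E = H - R + 1, F = W - S + 1;
    for (int e = 0; e < E; e++)
        for (int fo = 0; fo < F; fo++)
            for (int r = 0; r < R; r++)
                for (int s = 0; s < S; s++)
                    // row (r*S + s), column (e*F + fo) of the Toeplitz matrix
                    col[(r*S + s) * (E*F) + (e*F + fo)] = in[(e + r)*W + (fo + s)];
}
```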
L09-37: GV100 – "Tensor Core"
Tensor Core…
• 120 TFLOPS (FP16)
• 400 GFLOPS/W (FP16)
New opcodes – Matrix Multiply Accumulate (HMMA)
  How many FP16 operands? Inputs 48 / Outputs 16
  How many multiplies? 64
  How many adds? 64
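From CUDA C++, the tensor cores are reached through the warp-level nvcuda::wmma API, which compiles down to HMMA instructions; a minimal sketch (not the slides' code), assuming sm_70 or later and a single 16×16×16 tile with FP32 accumulation:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);     // C = 0
    wmma::load_matrix_sync(fa, a, 16);  // whole warp loads one 16x16 A tile
    wmma::load_matrix_sync(fb, b, 16);  // ... and one 16x16 B tile
    wmma::mma_sync(acc, fa, fb, acc);   // D = A*B + C on the tensor cores
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```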
L09-38: Tensor Core Integration
In addition to normal function units…
[Diagram: the SIMT pipeline from before, with the tensor core sitting alongside the normal +/* function units in each lane.]
L09-39: Tensor Core Integration
Cross-thread operands
[Diagram: the per-lane +/* units are replaced by a Tensor Matrix Multiply Accumulate unit that reads its operands across the threads' register files.]
L09-40: Fully-Connected (FC) Layer
Output fmaps (M × N) = Filters (M × CHW) × Input fmaps (CHW × N)
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
L09-41: Tiled Fully-Connected (FC) Layer
Output fmaps (M × N) = Filters (M × CHW) × Input fmaps (CHW × N), with each matrix split into 2 × 2 tiles:

[ F0,0 F0,1 ]   [ I0,0 I0,1 ]   [ F0,0·I0,0 + F0,1·I1,0   F0,0·I0,1 + F0,1·I1,1 ]
[ F1,0 F1,1 ] × [ I1,0 I1,1 ] = [ F1,0·I0,0 + F1,1·I1,0   F1,0·I0,1 + F1,1·I1,1 ]

The matrix multiply is tiled to fit in a tensor core operation, and the computation is ordered to maximize reuse of data in the scratchpad and then in the cache.
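A plain-C sketch of the tiling idea (illustrative; on the GPU each T×T tile product would map onto tensor-core ops and the tiles would be staged through the scratchpad), assuming M, N, and K are multiples of the tile size T:

```c
#define T 16
// O[M x N] = F[M x K] * I[K x N]; the loop order keeps one output tile
// resident while tiles of F and I stream past it, maximizing reuse.
void tiled_matmul(const float *F, const float *I, float *O,
                  int M, int N, int K) {
    for (int mt = 0; mt < M; mt += T)
        for (int nt = 0; nt < N; nt += T)
            for (int kt = 0; kt < K; kt += T)  // accumulate tile products
                for (int m = mt; m < mt + T; m++)
                    for (int n = nt; n < nt + T; n++) {
                        float sum = (kt == 0) ? 0.0f : O[m*N + n];
                        for (int k = kt; k < kt + T; k++)
                            sum += F[m*K + k] * I[k*N + n];
                        O[m*N + n] = sum;
                    }
}
```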
L09-43: Vector vs GPU Terminology
[Table mapping vector-machine terms to their GPU equivalents; see H&P5, Fig 4.25.]
L09-44: Summary
• GPUs are an intermediate point in the continuum of flexibility and efficiency
• GPU architecture focuses on throughput (over latency)
  – Massive thread-level parallelism, with
    • A single instruction for many threads
  – Memory hierarchy specialized for throughput
    • Shared scratchpads with a private address space
    • Caches used primarily to reduce bandwidth, not latency
  – Specialized compute units, with
    • Many computes per input operand
• Little's Law is useful for system analysis
  – Shows the relationship between throughput, latency, and tasks in flight
L09-45: Next Lecture: Spatial Architectures
Thank you!