L09-1
6.5930/1
Hardware Architectures for Deep Learning
GPU Computation
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 6, 2023
L09-2
Sze and Emer
Why are GPUs interesting?
March 6, 2023
• Very successful commodity accelerator/co-processor
• GPUs combine two strategies to increase efficiency
– Massive parallelism
– Specialization
• Illustrates tension between performance and
programmability in accelerators
• Pervasively applied in deep learning applications
L09-3
Sze and Emer
Background Reading
• GPUs
– Computer Architecture: A Quantitative Approach,
by Hennessy and Patterson
• Edition 6: Ch 4: p310-336
• Edition 5: Ch 4: p288-315
This book and its online/e-book versions are available through
MIT libraries.
March 6, 2023
L09-4
Sze and Emer
Programmability vs Efficiency
March 6, 2023
More Programmable ← CPU · GPU · FPGA · CGRA · ASIC → More Efficient
FPGA => Field programmable gate array
CGRA => Coarse-grained reconfigurable array
ASIC => Application-specific integrated circuit
L09-5
Sze and Emer
CPU vs GPU Attribute Summary
March 6, 2023
Source: Stanford CS231n
L09-6
Sze and Emer
CPU vs. GPU Performance
March 6, 2023
Source: Stanford CS231n
Ratio of (partially-optimized) CPU vs. CUDA library (cuDNN)
L09-7
Sze and Emer
CPU vs. GPU Performance
March 6, 2023
Source: Stanford CS231n
Ratio of unoptimized CUDA vs. CUDA library (cuDNN)
L09-8
Sze and Emer
Single Instruction Multiple Thread
March 6, 2023
SIMT
– Many threads each with
private architectural state
or context, e.g., registers.
– Group of threads that
issue together called a
warp.
– All threads that issue
together execute same
instruction.
– Entire pipeline is an SM or
streaming multiprocessor
[Figure: two-lane SIMT pipeline — PC, I$, IR feed per-lane GPRs, X/Y operand latches, + and * function units, and memory; green → Nvidia terminology; each datapath copy is a "Lane"]
L09-9
Sze and Emer
Multiple Thread – Single Instruction Multiple Thread
March 6, 2023
[Figure: multithreaded SIMT pipeline — per-thread PCs and GPR contexts share the I$, IR, X/Y operand paths, + and * function units, and memory]
L09-10
Sze and Emer
Function unit optimization
March 6, 2023
[Figure: pipeline redrawn so each function unit type (+, *) is a separate unit fed from the GPRs — "Specialize Function Units"]
L09-11
Sze and Emer
Function unit optimization
March 6, 2023
Restriction: Can’t issue same operation twice in a row
[Figure: pipeline with two IRs — a shared adder and a shared multiplier connected to the per-thread GPRs through input and output crossbars]
L09-12
Sze and Emer
Function unit optimization
March 6, 2023
Pipelining of two-thread warps through the specialized units (Unit vs. Cycle):

Cycle:   0         1         2         3         4         5
Fetch:   Add0,0-1  Mul1,0-1  Add2,0-1  Mul3,0-1  …
GPR0:              Add0,0    Mul1,0    Add2,0    Mul3,0    …
GPR1:                        Add0,1    Mul1,1    Add2,1    …
Adder:                       Add0,0    Add0,1    Add2,0    …
Mul:                                   Mul1,0    Mul1,1    …

Key: Opcode(instruction number),(thread(s)) — e.g., Add0,0-1 is instruction 0 (an add) issuing for threads 0 and 1
[Figure: same two-IR pipeline with crossbars as on the previous slide]
L09-13
Sze and Emer
Streaming Multiprocessor Overview
March 6, 2023
• Each SM supports 10s of
warps (e.g., 64 in Kepler) with
32 threads/warp
• Fetch 1 instr/cycle
• Issue 1 ready instr/cycle
– Simple scoreboarding: all warp
elements must be ready
• Instruction broadcast to all
lanes
• Multithreading is the main
latency-hiding mechanism
L09-14
Sze and Emer
Little’s Law
March 6, 2023
Throughput (T) = Number in Flight (N) / Latency (L)
[Figure: instructions flow from Issue to Execution; N instructions are in flight]
Example:
64 warps (number of instructions in flight)
1 instruction / cycle (desired throughput)
⇒ Can tolerate an average instruction latency of up to N / T = 64 cycles
L09-15
Sze and Emer
Number of Active Warps
March 6, 2023
• SMs support a variable number of active warps
based on required registers (and shared memory).
Fewer warps allow more registers per warp, i.e., a
larger context.
– Few large contexts ⇒ fewer register spills
– Many small contexts ⇒ more latency tolerance
– Choice left to the compiler
• Example: Kepler supports up to 64 warps
– Max: 64 warps @ <=32 registers/thread
– Min: 8 warps @ 256 registers/thread
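A quick sanity check on these two extremes (a sketch, assuming Kepler’s 64K-entry register file per SM, which is not stated on the slide): 64 warps × 32 threads × 32 registers = 65,536 registers, and 8 warps × 32 threads × 256 registers = 65,536 registers — both choices exactly fill the same register file.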
L09-16
Sze and Emer
GPU ISA and Compilation
March 6, 2023
• GPU micro-architecture and instruction set change
very frequently
• To achieve compatibility:
– Compiler produces intermediate pseudo-assembler
language (e.g., Nvidia PTX)
– GPU driver JITs the kernel, tailoring it to the specific
micro-architecture
• In practice, little performance portability
– Code is often tuned to specific GPU architecture
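As an illustrative example of this flow (the nvcc flags are real; the file name is hypothetical):

nvcc -ptx fc.cu -o fc.ptx      # emit virtual PTX assembly instead of machine code
nvcc -arch=sm_70 fc.cu -o fc   # build for a specific micro-architecture; PTX can be embedded
# At launch, the driver JIT-compiles embedded PTX if no machine code matches the installed GPU.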
L09-17
Sze and Emer
Multiple Thread – Single Instruction Multiple Thread
March 6, 2023
[Figure (recap): multithreaded SIMT pipeline with per-thread PCs and GPR contexts]
L09-18
Sze and Emer
Many Memory Types
March 6, 2023
[Figure: threads alongside the three memory types]
• Per Thread Memory
• Scratchpad Shared Memory
• Global Memory
L09-19
Sze and Emer
Private Per Thread Memory
March 6, 2023
• Private memory
– No cross-thread sharing
– Small, fixed size memory
• Can be used for constants
– Multi-bank implementation (can be in global memory)
[Figure: Thread 0, 1, 2, … each paired with its own private memory]
L09-20
Sze and Emer
Shared Scratchpad Memory
March 6, 2023
• Shared scratchpad memory (threads share data)
– Small, fixed size memory (16KB–64KB per ‘core’)
– Banked for high bandwidth
– Fed with address coalescing unit (ACU) + crossbar
• ACU can buffer/coalesce requests
[Figure: Threads 0, 1, 2, … reach shared memory banks through an address coalescing unit (ACU) + crossbar]
L09-21
Sze and Emer
Memory Access Divergence
March 6, 2023
• All loads are gathers, all stores are scatters
• Address coalescing unit (ACU) detects
sequential and strided patterns, coalesces
memory requests, but complex patterns can
result in multiple lower bandwidth requests
(memory divergence)
• Writing efficient GPU code requires most
accesses to not conflict, even though
programming model allows arbitrary patterns!
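A minimal sketch of the contrast (illustrative kernels, not from the slides): consecutive tids touching consecutive addresses coalesce into one wide request per warp, while a strided pattern splits into many narrower requests.

__global__ void copy_coalesced(float *dst, const float *src, int n) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid < n)
    dst[tid] = src[tid];          // warp touches one contiguous region: one wide request
}

__global__ void copy_strided(float *dst, const float *src, int n, int stride) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid < n)
    dst[tid] = src[tid * stride]; // stride > 1: ACU must issue multiple requests per warp
}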
L09-22
Sze and Emer
Shared Global Memory
March 6, 2023
• Shared global memory
– Large shared memory
– Will also suffer from memory divergence
[Figure: Threads 0, 1, 2, … reach global memory banks through an ACU + crossbar]
L09-23
Sze and Emer
Shared Global Memory
March 6, 2023
• Memory hierarchy with caches
– Cache to save memory bandwidth
– Caches also enable compression/decompression of data
[Figure: requests flow through an ACU + crossbar to cache tags/data; hits return through a crossbar, misses travel over a network to global memory banks, and buffered data returns over a network]
L09-24
Sze and Emer
Serialized cache access
March 6, 2023
• Trade latency for power/flexibility
– Only access data bank that contains data
– Facilitate more sophisticated cache organizations
• e.g., greater associativity
[Figure: serialized cache access — the address (Tag / Index / Block Offset) first probes the tag store for a match, then only the matching data store bank is read and the results are combined]
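For a concrete address breakdown (illustrative parameters, not from the slide): a 128 KB, 4-way set-associative cache with 128 B blocks holds 1024 blocks in 256 sets, so an address splits into a 7-bit block offset, an 8-bit index, and the remaining tag bits; only after the tag store reports which way matched is that one data bank read.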
L09-25
Sze and Emer
GPU Programming Environments
Code for accelerated kernels
• CUDA (Nvidia-only)
– C-like language that runs on GPU
– Libraries: cuDNN, cuBLAS, cuFFT
• OpenCL (open standard)
– C-like language that runs on GPU, CPU or FPGA
– usually less optimized than CUDA
March 6, 2023
L09-26
Sze and Emer
CUDA GPU Thread Model
March 6, 2023
• Single-program multiple data (SPMD) model
• Each context is a thread
– Threads have registers
– Threads have local memory
• Parallel threads packed in blocks
– Blocks have shared memory
– Threads synchronize with barrier
– Blocks run to completion (or abort)
• Grids include independent blocks
– May execute concurrently
– Share global memory, but
– Have limited inter-block synchronization
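A small sketch tying these terms together (illustrative kernel): one block of threads reduces its slice of the input in shared memory, synchronizing at barriers; blocks are independent, and each deposits one result in global memory.

__global__ void block_sum(const int *in, int *out) {
  __shared__ int partial[256];                  // per-block shared memory; assumes blockDim.x == 256
  int t = threadIdx.x;
  partial[t] = in[blockIdx.x * blockDim.x + t]; // each thread loads one element
  __syncthreads();                              // barrier across all threads in the block
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (t < s) partial[t] += partial[t + s];    // tree reduction in shared memory
    __syncthreads();
  }
  if (t == 0) out[blockIdx.x] = partial[0];     // one result per (independent) block
}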
L09-27
Sze and Emer
Hardware Scheduling
March 6, 2023
• Grids can be launched by
CPU or GPU
– Work from multiple CPU threads
and processes
• HW unit schedules grids on
SMs (labeled SMX in diagram)
– Priority-based scheduling
• 32 active grids
– More queued/paused
L09-28
Sze and Emer
GPU Kernel Execution
March 6, 2023
1. Transfer input data from CPU to GPU memory
2. Launch kernel (grid)
3. Wait for kernel to finish (if synchronous)
4. Transfer results to CPU memory
[Figure: CPU and GPU, each with its own memory; arrows show steps 1–4]
• Data transfers can dominate execution time
• Integrated GPUs with unified address space
⇒ no copies, but…
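One common mitigation (a sketch using the gpu_i/gpu_f/gpu_o buffers of the FC example below; the stream setup is illustrative): queue asynchronous copies and the kernel launch on a CUDA stream, so the CPU blocks only when the result is actually needed.

cudaStream_t s;
cudaStreamCreate(&s);
cudaMemcpyAsync(gpu_i, i, sizeof(int)*C*H*W, cudaMemcpyHostToDevice, s);
cudaMemcpyAsync(gpu_f, f, sizeof(int)*M*C*H*W, cudaMemcpyHostToDevice, s);
fc<<<M/256 + 1, 256, 0, s>>>(gpu_i, gpu_f, gpu_o, C*H*W, M); // queued behind the copies
cudaMemcpyAsync(o, gpu_o, sizeof(int)*M, cudaMemcpyDeviceToHost, s);
cudaStreamSynchronize(s); // CPU waits only here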
L09-29
Sze and Emer
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmap (CHW × 1) = Output fmap (M × 1)]
• Matrix–Vector Multiply:
• Multiply all inputs in all channels by a weight and sum
March 6, 2023
L09-31
Sze and Emer
Fully Connected Computation
[Figure: weight matrix elements F[m, c, h, w] laid out row by row for each output channel m, and input vector elements I[c, h, w] flattened in the same channel–height–width order]
March 6, 2023
int i[C*H*W];   // Input activations
int f[M*C*H*W]; // Filter Weights
int o[M];       // Output activations

for (m = 0; m < M; m++) {
  o[m] = 0;
  CHWm = C*H*W*m;
  for (chw = 0; chw < C*H*W; chw++) {
    o[m] += i[chw] * f[CHWm + chw];
  }
}
L09-32
Sze and Emer
FC - CUDA Main Program
int i[C*H*W], *gpu_i;   // Input activations
int f[M*C*H*W], *gpu_f; // Filter Weights
int o[M], *gpu_o;       // Output activations

// Allocate space on GPU
cudaMalloc((void**) &gpu_i, sizeof(int)*C*H*W);
cudaMalloc((void**) &gpu_f, sizeof(int)*M*C*H*W);
cudaMalloc((void**) &gpu_o, sizeof(int)*M);

// Copy data to GPU
cudaMemcpy(gpu_i, i, sizeof(int)*C*H*W, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_f, f, sizeof(int)*M*C*H*W, cudaMemcpyHostToDevice);

// Run kernel
fc<<<M/256 + 1, 256>>>(gpu_i, gpu_f, gpu_o, C*H*W, M);

// Copy result back and free device memory
cudaMemcpy(o, gpu_o, sizeof(int)*M, cudaMemcpyDeviceToHost);
cudaFree(gpu_i);
...
Thread block size
Num thread blocks
March 6, 2023
L09-33
Sze and Emer
FC – CUDA Terminology
• CUDA code launches 256 threads per block
• CUDA vs vector terminology:
– Thread = 1 iteration of scalar loop [1 element in vector loop]
– Block = Body of vectorized loop [VL=256 in this example]
• Warp size = 32 [Number of vector lanes]
– Grid = Vectorizable loop
[ vector terminology ]
March 6, 2023
L09-34
Sze and Emer
FC - CUDA Kernel
__global__
void fc(int *i, int *f, int *o, const int CHW, const int M) {
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int m = tid;
  int sum = 0;
  if (m < M) {
    for (int chw = 0; chw < CHW; chw++) {
      sum += i[chw] * f[(m*CHW) + chw];
    }
    o[m] = sum;
  }
}
Any consequences of f[(m*CHW)+chw]?
Any consequences of f[(chw*M)+m]?
tid is “output
channel” number
March 6, 2023
Calculate “global
thread id” (tid)
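One reading of these questions (not spelled out on the slide): with f[(m*CHW)+chw], consecutive threads in a warp (consecutive m) read addresses CHW elements apart on each iteration, so the loads cannot coalesce; with f[(chw*M)+m], consecutive threads read consecutive addresses and each warp load coalesces into one wide request — provided the weights are stored transposed to match.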
L09-35
Sze and Emer
Convolution (CONV) Layer
[Figure: N input fmaps (C × H × W), M filters (C × R × S) each, N output fmaps (M × E × F); batch size N]
March 6, 2023
L09-36
Sze and Emer
Convolution (CONV) Layer
[Figure: 2-channel, 2-filter example — the input fmap is rearranged into a Toeplitz matrix so that Filter × Toeplitz input matrix = Output fmap]
GPU Implementation:
• Keep original input activation matrix in main memory
• Conceptually do tiled matrix-matrix multiplication
• Copy input activations into scratchpad as a small Toeplitz matrix tile
• On Volta tile again to use ‘tensor core’
March 6, 2023
L09-37
Sze and Emer
GV100 – “Tensor Core”
Tensor Core….
• 120 TFLOPS (FP16)
• 400 GFLOPS/W (FP16)
New opcodes – Matrix Multiply Accumulate (HMMA)
How many FP16 operands?
How many multiplies?
How many adds?
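A sketch of the counts (assuming Volta’s 4×4×4 HMMA semantics, D = A×B + C on 4×4 FP16 matrices): 48 FP16 input operands (three 4×4 matrices) plus 16 accumulator outputs; each of the 16 outputs needs 4 multiplies and 4 adds (3 to reduce the products, 1 to accumulate C), i.e., 64 multiplies and 64 adds per instruction.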
March 6, 2023
L09-38
Sze and Emer
Tensor Core Integration
[Figure: multithreaded SIMT pipeline with its normal + and * function units]
In addition to
normal function
units…
March 6, 2023
L09-39
Sze and Emer
Tensor Core Integration
[Figure: the same pipeline with a Tensor Matrix Multiply Accumulate unit on the X/Y operand paths; its operands are gathered across threads ("cross thread operands")]
March 6, 2023
L09-40
Sze and Emer
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
• After flattening, having a batch size of N turns the
matrix-vector operation into a matrix-matrix multiply
March 6, 2023
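Extending the earlier scalar loop to a batch of N (a sketch following the slide’s dimensions; the layout choices are illustrative): each filter row f[m] is now reused across all N input columns, which is exactly the reuse that tiling exploits.

// O (M x N) = F (M x CHW) x I (CHW x N)
for (int m = 0; m < M; m++) {
  for (int n = 0; n < N; n++) {
    int sum = 0;
    for (int chw = 0; chw < C*H*W; chw++) {
      sum += f[m*C*H*W + chw] * i[chw*N + n]; // f row reused for every n
    }
    o[m*N + n] = sum;
  }
}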
L09-41
Sze and Emer
Tiled Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), partitioned into 2×2 blocks:]
| F0,0 F0,1 |   | I0,0 I0,1 |   | F0,0·I0,0 + F0,1·I1,0   F0,0·I0,1 + F0,1·I1,1 |
| F1,0 F1,1 | × | I1,0 I1,1 | = | F1,0·I0,0 + F1,1·I1,0   F1,0·I0,1 + F1,1·I1,1 |
Matrix multiply tiled to fit in tensor core operation
and computation ordered to maximize reuse of data in scratchpad
and then in cache.
March 6, 2023
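A minimal sketch of how one such tile maps onto the tensor core through CUDA’s WMMA API (an illustrative 16×16×16 FP16 tile; in practice a library such as cuBLAS or cuDNN does this): a warp cooperatively loads tile fragments, issues the matrix multiply-accumulate, and stores the result.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile: C_tile += A_tile x B_tile
__global__ void wmma_tile(const half *a, const half *b, float *c,
                          int lda, int ldb, int ldc) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);              // zero the accumulator tile
  wmma::load_matrix_sync(a_frag, a, lda);         // warp-wide cooperative loads
  wmma::load_matrix_sync(b_frag, b, ldb);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // tensor core matrix multiply-accumulate
  wmma::store_matrix_sync(c, c_frag, ldc, wmma::mem_row_major);
}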
L09-43
Sze and Emer
Vector vs GPU Terminology
March 6, 2023
[H&P5, Fig 4.25]
L09-44
Sze and Emer
Summary
• GPUs are an intermediate point in the continuum of flexibility
and efficiency
• GPU architecture focuses on throughput (over latency)
– Massive thread-level parallelism, with
• Single instruction for many threads
– Memory hierarchy specialized for throughput
• Shared scratchpads with private address space
• Caches used primarily to reduce bandwidth not latency
– Specialized compute units, with
• Many computes per input operand
• Little’s Law useful for system analysis
– Shows relationship between throughput, latency and tasks in-flight
March 6, 2023
L09-45
Sze and Emer
Next Lecture: Spatial Architectures
Thank you!
March 6, 2023