L10-1: 6.5930/1 Hardware Architectures for Deep Learning
GPU Computation (continued)
Joel Emer and Vivienne Sze
Acknowledgement: Srini Devadas (MIT)
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 8, 2023
L10-2: Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmap (CHW × 1) = Output fmap (M × 1)]
• Matrix–Vector Multiply:
• Multiply all inputs in all channels by a weight and sum
L10-3: Fully Connected Computation
[Figure: weight layout F[m][c][h][w] enumerated in m-major order (F[M0 C0 H0 W0], F[M0 C0 H0 W1], …; then F[M0 C1 …]; then F[M1 C0 …]) alongside input layout I[c][h][w] in c-major order (I[C0 H0 W0], I[C0 H0 W1], …; then I[C1 …])]
int i[C*H*W];   // Input activations
int f[M*C*H*W]; // Filter weights
int o[M];       // Output activations

for (int m = 0; m < M; m++) {
  o[m] = 0;
  int CHWm = C*H*W*m;
  for (int chw = 0; chw < C*H*W; chw++) {
    o[m] += i[chw] * f[CHWm + chw];
  }
}
L10-4: FC – CUDA Main Program
int i[C*H*W], *gpu_i;   // Input activations
int f[M*C*H*W], *gpu_f; // Filter weights
int o[M], *gpu_o;       // Output activations

// Allocate space on GPU
cudaMalloc((void**) &gpu_i, sizeof(int)*C*H*W);
cudaMalloc((void**) &gpu_f, sizeof(int)*M*C*H*W);
cudaMalloc((void**) &gpu_o, sizeof(int)*M);

// Copy data to GPU
cudaMemcpy(gpu_i, i, sizeof(int)*C*H*W, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_f, f, sizeof(int)*M*C*H*W, cudaMemcpyHostToDevice);

// Run kernel: M/256 + 1 thread blocks, each of thread-block size 256
fc<<<M/256 + 1, 256>>>(gpu_i, gpu_f, gpu_o, C*H*W, M);

// Copy result back and free device memory
cudaMemcpy(o, gpu_o, sizeof(int)*M, cudaMemcpyDeviceToHost);
cudaFree(gpu_i);
...
L10-5: FC – CUDA Terminology
• CUDA code launches 256 threads per block
• CUDA vs. vector terminology (bracketed terms are the vector equivalents):
– Thread = 1 iteration of scalar loop [1 element in vector loop]
– Block = body of vectorized loop [VL = 256 in this example]
• Warp size = 32 [number of vector lanes]
– Grid = vectorizable loop
L10-6: FC – CUDA Kernel
__global__
void fc(int *i, int *f, int *o, const int CHW, const int M){
  // Calculate "global" thread id (tid); tid is the "output channel" number
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int m = tid;
  // Code for one thread:
  int sum = 0;
  if (m < M){
    for (int chw = 0; chw < CHW; chw++) {
      sum += i[chw] * f[(m*CHW) + chw];
    }
    o[m] = sum;
  }
}
Any consequences of f[(m*CHW)+chw]? Yes, strided references.
Any consequences of f[(chw*M)+m]? Yes, a different data layout (see the sketch below).
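A minimal sketch of the second option (my illustration, assuming the weights were written to memory transposed as f[chw*M + m]): at each step of the chw loop, consecutive threads m of a warp then read consecutive words of f, so the references coalesce.

__global__
void fc_coalesced(int *i, int *f, int *o, const int CHW, const int M){
  int m = threadIdx.x + blockIdx.x * blockDim.x;
  int sum = 0;
  if (m < M){
    for (int chw = 0; chw < CHW; chw++) {
      sum += i[chw] * f[(chw*M) + m]; // unit stride across a warp's threads
    }
    o[m] = sum;
  }
}

The tradeoff is exactly the one named above: the kernel gains coalesced accesses only because the host adopted a different data layout for the weights.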
L10-7: Convolution (CONV) Layer
[Figure: many input fmaps (N of them, each C × H × W) convolved with M filters (each C × R × S) to produce many output fmaps (N of them, each M × E × F); batch size N]
L10-8: Convolution (CONV) Layer
[Figure: a 2-channel input fmap rearranged as a Toeplitz matrix, multiplied by a 2-filter weight matrix (channel 1 and channel 2 columns side by side) to give the output fmap]
GPU Implementation:
• Keep original input activation matrix in main memory
• Conceptually do tiled matrix–matrix multiplication
• Copy input activations into the scratchpad as a small Toeplitz-matrix tile (sketched below)
• On Volta, tile again to use the ‘tensor core’
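As a concrete sketch of the Toeplitz-tile copy (my illustration: a hypothetical im2col-style helper assuming unit stride, no padding, and an input laid out [C][H][W]; not the exact library code): each output position (e, f) yields one column of C·R·S input values, so the filters can then be applied as an M × (C·R·S) matrix.

// Copy the receptive field of output position (e, f) into one
// column of a Toeplitz (im2col) tile.
void im2col_column(const int *i, int *col,
                   int C, int H, int W, int R, int S,
                   int e, int f)
{
  int idx = 0;
  for (int c = 0; c < C; c++)
    for (int r = 0; r < R; r++)
      for (int s = 0; s < S; s++)
        col[idx++] = i[(c*H + (e + r))*W + (f + s)];
}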
L10-9: GV100 – “Tensor Core”
Tensor Core:
• 120 TFLOPS (FP16)
• 400 GFLOPS/W (FP16)
New opcodes – Matrix Multiply Accumulate (HMMA), computing D = A × B + C on 4×4 matrices
How many FP16 operands? Inputs 48 (three 4×4 matrices) / Outputs 16
How many multiplies? 64
How many adds? 64
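In CUDA these opcodes are reached through the warp-level WMMA API, which exposes 16×16×16 tiles rather than individual 4×4 HMMA steps. A minimal sketch (assuming compute capability 7.0 or later; the kernel name and tile placement are my choices, and one warp computes one output tile):

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tile_mma(const half *a, const half *b, float *c) {
  // Fragments for C = A x B + C on one 16x16x16 tile
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);       // zero the accumulator tile
  wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // HMMA ops underneath
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}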
L10-10: Tensor Core Integration
[Figure: SIMT pipeline: shared PC, I$, and IR driving per-thread GPRs and parallel + and × function units, backed by memory]
In addition to the normal function units…
L10-11: Tensor Core Integration
[Figure: the same SIMT pipeline with a Tensor Matrix Multiply Accumulate unit in place of the per-thread + and × units]
Cross-thread operands
L10-12: Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
• After flattening, having a batch size of N turns the
matrix-vector operation into a matrix-matrix multiply
L10-13: Tiled Fully-Connected (FC) Layer
[Figure: the M × CHW filter matrix and CHW × N input matrix split into 2 × 2 tiles: [F0,0 F0,1; F1,0 F1,1] × [I0,0 I0,1; I1,0 I1,1] = [F0,0·I0,0 + F0,1·I1,0, F0,0·I0,1 + F0,1·I1,1; F1,0·I0,0 + F1,1·I1,0, F1,0·I0,1 + F1,1·I1,1]]
Matrix multiply tiled to fit in tensor core operation
and computation ordered to maximize reuse of data in scratchpad
and then in cache.
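A host-side sketch of the same idea (my illustration; the tile size T and the loop order are placeholders, not the course's or any library's actual schedule): each T × T tile of F and I is reused many times while it is resident in fast storage.

#define T 64  // tile edge, chosen to fit the fast storage

// O (M x N) = F (M x K) x I (K x N), all row-major
void tiled_matmul(const int *F, const int *I, int *O,
                  int M, int K, int N)
{
  for (int m0 = 0; m0 < M; m0 += T)
    for (int n0 = 0; n0 < N; n0 += T)
      for (int k0 = 0; k0 < K; k0 += T)             // one tile pair at a time
        for (int m = m0; m < m0 + T && m < M; m++)
          for (int n = n0; n < n0 + T && n < N; n++) {
            int sum = (k0 == 0) ? 0 : O[m*N + n];   // running partial sum
            for (int k = k0; k < k0 + T && k < K; k++)
              sum += F[m*K + k] * I[k*N + n];
            O[m*N + n] = sum;
          }
}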
L10-15: Vector vs GPU Terminology
[H&P5, Fig 4.25]
L10-16: Summary
• GPUs are an intermediate point in the continuum of flexibility and efficiency
• GPU architecture focuses on throughput (over latency)
– Massive thread-level parallelism, with
• Single instruction for many threads
– Memory hierarchy specialized for throughput
• Shared scratchpads with private address space
• Caches used primarily to reduce bandwidth, not latency
– Specialized compute units, with
• Many computes per input operand
• Little's Law useful for system analysis
– Shows relationship between throughput, latency and tasks in-flight (e.g., sustaining one memory operand per cycle at 100 cycles of latency requires ~100 requests in flight)
L10-17: 6.5930/1 Hardware Architectures for Deep Learning
Accelerator Architecture
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 8, 2023
L10-18: Moore's (Original) Law
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.” – Gordon Moore
[Moore, “Cramming more components onto integrated circuits”, Electronics 1965]
L10-19: Moore's (Transistor) Law
[Figure: transistor counts of Intel processor chips, 1970–2010, on a log10 scale from 10^3 to 10^10]
[Moore, “Progress in digital integrated electronics”, IEDM 1975]
Number of transistors has been doubling
L10-20: Moore's (Performance) Law
[Leiserson et al., There's Plenty of Room at the Top, Science]
L10-21: Energy and Power Consumption
• Energy Consumption (Joules) = α × C × V²
• Power Consumption (Joules/sec or Watts) = α × C × V² × f
where α is the switching activity factor (between 0 and 1), C is capacitance, V is voltage, and f is frequency
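A worked instance of the two formulas (illustrative numbers of my choosing, not from the slides): with α = 0.1, C = 1 nF of switched capacitance, V = 1.0 V, and f = 1 GHz, the switching energy per cycle is 1e-10 J and the power is 0.1 W.

#include <stdio.h>

int main(void) {
  double alpha = 0.1;  // switching activity factor
  double C = 1e-9;     // capacitance (F)
  double V = 1.0;      // voltage (V)
  double f = 1e9;      // frequency (Hz)
  double energy = alpha * C * V * V;      // Joules (per cycle of switching)
  double power  = alpha * C * V * V * f;  // Watts
  printf("E = %g J, P = %g W\n", energy, power); // E = 1e-10 J, P = 0.1 W
  return 0;
}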
L10-22: Capacitance and Energy Storage
Energy Consumption = α × C × V²
L10-23: Analogy: Water = Charge
[Figure: transistor drawn as a faucet between SOURCE and DRAIN, controlled by the GATE; a full bucket = ‘1’, and the size of the bucket is proportional to the load capacitance]
What affects the rate of flow of the water (i.e. current)?
L10-24: Analogy: Water = Charge
Transistor has multiple states:
• Transistor ON – high flow, low resistance
• Transistor OFF – low flow, high resistance
[Figure: the gate voltage VGS acts as the flow control between SOURCE and DRAIN; current rises with VDS, and increasing VGS raises the curve]
L10-25: Dennard Scaling (idealized)
                          Gen X                    Gen X+1
Gate width                1.0                      0.7
Device area/capacitance   1.0                      0.5 / 0.7
Voltage                   1.0                      0.7
Energy                    ~1.0 × 1.0² = 1.0        ~2 × 0.7 × 0.7² ≈ 0.65
Delay                     1.0                      0.7
Frequency                 1/1.0 = 1.0              1/0.7 = 1.4
Power                     ~1.0 × 1.0² × 1.0 = 1.0  ~2 × 0.7 × 0.7² × 1.4 ≈ 1.0
(The factor of 2 reflects twice as many devices per generation.)
[Dennard et al., “Design of ion-implanted MOSFETs with very small physical dimensions”, JSSC 1974]
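A quick arithmetic check of the Gen X+1 column (a sketch; note 2 × 0.7³ ≈ 0.69, which the slide rounds to 0.65):

#include <stdio.h>

int main(void) {
  double s = 0.7; // linear scale factor per generation, with 2x devices
  printf("energy    ~ %.2f\n", 2 * s * s * s);            // ~0.69
  printf("frequency ~ %.2f\n", 1 / s);                    // ~1.43
  printf("power     ~ %.2f\n", 2 * s * s * s * (1 / s));  // ~0.98 -> ~1.0
  return 0;
}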
L10-26: The End of Historic Scaling
[Leiserson et al., There's Plenty of Room at the Top, Science]
Voltage scaling slowed down → power density increasing!
L10-27: During the Moore + Dennard's Law Era
• Instruction-level parallelism (ILP) was largely mined out by early 2000s
• Voltage (Dennard) scaling ended in 2005
• Hit the power limit wall in 2005
• Performance is coming from parallelism using more transistors since ~2007
• But….
L10-28: Moore's Law in DRAMs
[Figure: Moore's Law in DRAMs; image source: John Hennessy]
After multi-core, specialization (i.e., accelerators) seems to be
the most attractive architectural option to cope with the end of Moore’s Law
L10-29: The High Cost of Data Movement
Fetching operands more expensive than computing on them
[Figure: energy cost of fetching operands vs. computing on them; image source: Bill Dally]
Now the key is how we use our transistors most effectively.
L10-30: Accelerator Design Attributes
• Integration into system
– How is the accelerator integrated into the system?
• Operation specification/sequencing
– How is activity managed within the accelerator?
• Data management
– How is data movement orchestrated in the accelerator?
Remember, however, the world is ‘fractal’!
L10-31: System Integration
L10-32: Accelerator Architectural Choices
• State
– Local state – Is there local state? (i.e. context)
– Global state – e.g., main memory shared with CPU
• Data Operations
– Custom data operations in the accelerator
• Memory Access Operations
– Operations to access shared global memory – Do they exist?
• Control Flow Operations
– How is accelerator sequenced relative to CPU?
L10-33: How to Sequence Accelerator Logic?
• Synchronous
– Accelerated operations inside processor pipeline
• E.g., as a separate function unit
– Control handled by standard control instructions
• Asynchronous
– A standalone logical machine
• Accelerator started by processor that continues running
What factors militate for one form or the other?
• Existence of concurrent activity
• Latency of operation
• Size of logic
• Size of operands
L10-34: Eight Alternatives
Architectural semantics: three bits (Asynchronous / Accesses memory / Has context), giving eight combinations, 000 through 111.
Characteristics of the “no memory access” choice?
• Good for smaller operands
• Simpler, e.g., no virtual memory
• No ‘Little’s Law’ storage requirement
L10-35: Eight Alternatives
Architectural semantics: the same eight combinations of Asynchronous / Accesses memory / Has context.
Characteristics of the “no context” choice?
• Simpler, no context-switch mechanism
• Long operations run to completion
• More limited reuse opportunities
L10-36: Accelerator Architectural Alternatives
Architectural semantics (Asynchronous / Accesses memory / Has context) → Example
0 0 0 → New function unit, like tensor core in GPU
0 0 1 → Accumulating data reduction instruction
0 1 0 → Memory-to-memory vector unit
0 1 1 → Register-based vector unit including load/store ops
1 0 0 → Complex function calculator?
1 0 1 → Security co-processor (TPM)
1 1 0 → Network adapter
1 1 1 → GPU with virtual memory
L10-38: Operation Sequencing
L10-39: Accelerator Taxonomy
[Taxonomy tree: Accelerator Architecture → Temporally Programmed → CPU, GPU]
L10-40: Multiprocessor
[Figure: multicore chip with per-core L2 caches, shared L3 caches, and DRAM]
Inter-processing-element communication is through the cache hierarchy
L10-41: Highly-Parallel Compute Paradigms
Temporal Architecture (SIMD/SIMT): [Figure: memory hierarchy and register file feeding a grid of ALUs under centralized control]
Spatial Architecture (Dataflow Processing): [Figure: memory hierarchy feeding a grid of directly interconnected ALUs]
L10-42: Spatial Architecture for DNN
[Figure: DRAM → Global Buffer (100–500 kB) → 2-D array of Processing Elements (PEs), each an ALU with control and a register file (0.5–1.0 kB)]
Local memory hierarchy:
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF)
L10-43: Parallel Control: Temporal vs Spatial
• Style dictates many later hardware choices
• Temporal: Dynamic instruction stream, shared memory
• Spatial: Tiny processing elements (PEs), direct communication
[Figure: a program and its iterations mapped in the temporal style vs. the spatial style]
L10-44: Memory Access: Spatial vs Temporal
[Figure: four plots of number of loads and stores vs. number of ALUs (1×1 through 8×8 arrays) for Edit Distance, Gaussian Blur, Desaturate, and Malvar Demosaic, each comparing Temporal, Spatial, and Algorithmic Min]
Large benefits with even small array sizes
L10-45: Accelerator Taxonomy
[Taxonomy tree: Accelerator Architecture → Temporally Programmed (CPU, GPU) and Spatially Programmed (FPGA, Triggered Instructions, RAW, AsAP, PicoChip, TRIPS, WaveScalar, DySER, TTA)]
L10-46: Accelerator Taxonomy
[Taxonomy tree: Spatially Programmed → Fine (logic) Grained → FPGA]
L10-47: Field Programmable Gate Arrays
[Figure: FPGA logic cell: a RAM-backed Look-Up Table (LUT) feeding a latch]
A 2-input LUT holds a 4-entry truth table, e.g.:
And: 00→0, 01→0, 10→0, 11→1
Or:  00→0, 01→1, 10→1, 11→1
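A software sketch of what a LUT is (my illustration, not FPGA vendor code): the "program" is just the truth table stored in RAM, and the inputs form the address that selects the output bit.

#include <stdio.h>

typedef struct { unsigned char table[4]; } Lut2; // 2-input LUT

unsigned char lut_eval(const Lut2 *l, int a, int b) {
  return l->table[(a << 1) | b]; // inputs address the RAM
}

int main(void) {
  Lut2 and_lut = {{0, 0, 0, 1}}; // 00->0 01->0 10->0 11->1
  Lut2 or_lut  = {{0, 1, 1, 1}}; // 00->0 01->1 10->1 11->1
  printf("AND(1,1)=%d OR(1,0)=%d\n",
         lut_eval(&and_lut, 1, 1), lut_eval(&or_lut, 1, 0)); // 1, 1
  return 0;
}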
L10-48: Microsoft Project Catapult
Configurable Cloud (MICRO 2016) for Azure
Accelerate and reduce latency for
• Bing search
• Software defined network
• Encryption and Decryption
L10-49: Microsoft Brainwave Neural Processor
[Figure: Brainwave neural processor; source: Microsoft]
L10-50: Heterogeneous Blocks
• Add specific purpose logic on FPGA
– Efficient if used (better area, speed, power),
wasted if not
• Soft fabric
– LUT, flops, addition, subtraction, carry logic
– Convert LUT to memories or shift registers
• Memory block (BRAM)
– Configure word and address size (aspect ratio)
– Combine memory blocks to large blocks
– Significant part for FPGA area
– Dual port memories (FIFO)
• Multipliers / MACs → DSP
• CPUs and processing elements
L10-51: Accelerator Taxonomy
[Taxonomy tree: Spatially Programmed → Fine (logic) Grained (FPGA) and Coarse (ALU) Grained (Triggered Instructions, RAW, AsAP, PicoChip, TRIPS, WaveScalar, DySER, TTA)]
L10-52: Programmable Accelerators
Many Programmable Accelerators look like an array of PEs, but have dramatically
different architectures, programming models and capabilities
[Figure: 2-D array of Processing Elements (PEs)]
L10-53: Accelerator Taxonomy
[Taxonomy tree: Coarse (ALU) Grained → Fixed-operation → TPU, NVDLA]
L10-54: Fixed Operation – Systolic Array
• Each PE hard-wired to one operation
• Purely pipelined operation
– no backpressure in pipeline
• Attributes
– High-concurrency
– Regular design, but
– Regular parallelism only! (see the sketch below)
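A tiny cycle-by-cycle sketch of an output-stationary systolic multiply (my illustration, not any product's actual design): A values flow right, B values flow down with skewed injection, and each PE performs one hard-wired multiply-accumulate per cycle.

#include <stdio.h>
#define N 2

int main(void) {
  int A[N][N] = {{1,2},{3,4}}, B[N][N] = {{5,6},{7,8}}, C[N][N] = {{0}};
  int a_reg[N][N] = {{0}}, b_reg[N][N] = {{0}}; // inter-PE pipeline registers
  for (int t = 0; t <= 3*N - 3; t++) {          // cycles until fully drained
    for (int r = N-1; r >= 0; r--)
      for (int c = N-1; c >= 0; c--) {
        // Left/top edges inject skewed operands; interior PEs take their
        // neighbors' registered values from the previous cycle.
        int a = (c == 0) ? ((t-r >= 0 && t-r < N) ? A[r][t-r] : 0)
                         : a_reg[r][c-1];
        int b = (r == 0) ? ((t-c >= 0 && t-c < N) ? B[t-c][c] : 0)
                         : b_reg[r-1][c];
        C[r][c] += a * b;                  // the PE's single fixed operation
        a_reg[r][c] = a; b_reg[r][c] = b;  // pass operands to neighbors
      }
  }
  printf("%d %d / %d %d\n", C[0][0], C[0][1], C[1][0], C[1][1]); // 19 22 / 43 50
  return 0;
}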
L10-55: Configurable Systolic Array – WARP
Source: WARP Architecture and Implementation, ISCA 1986
L10-56: Fixed Operation – Google TPU
Systolic array does 8-bit 256×256 matrix-multiply accumulate
Source: Google
L10-57: Accelerator Taxonomy
[Taxonomy tree: Coarse (ALU) Grained → Fixed-operation (TPU, NVDLA) and Configured-operation (WARP, DySER, TRIPS, WaveScalar, TTA)]
L10-58: Single Configured Operation – DySER
Source: Dynamically Specialized Datapaths for Energy Efficient Computing, HPCA 2011
L10-59: Accelerator Taxonomy
[Taxonomy tree: Coarse (ALU) Grained → Fixed-operation (TPU, NVDLA), Configured-operation (WARP, DySER, TRIPS, WaveScalar, TTA), and PC-based (Wave, RAW, AsAP, PicoChip)]
L10-60: PC-based Control – Wave Computing
Source: Wave Computing, Hot Chips ‘17
L10-61: Accelerator Taxonomy
[Taxonomy tree: Coarse (ALU) Grained → Fixed-operation (TPU, NVDLA), Configured-operation (WARP, DySER, TRIPS, WaveScalar, TTA), PC-based (Wave, RAW, AsAP, PicoChip), and Triggered operations (Triggered Instructions)]
L10-62: Guarded Actions
• Program consists of rules that may perform computations and read/write state
• Each rule specifies conditions (guard) under which it is allowed to fire
• Separates description and execution of data (rule body) from control (guards)
• A scheduler is generated (or provided by hardware) that evaluates the guards and schedules rule execution
• Sources of Parallelism
– Intra-Rule parallelism
– Inter-Rule parallelism
– Scheduler overlap with Rule execution
– Parallel access to state

reg A; reg B; reg C;

rule X (A > 0 && B != C)
{
A <= B + 1;
B <= B - 1;
C <= B * A;
}

rule Y (…) {…}
rule Z (…) {…}

[A scheduler evaluates the guards and fires enabled rules]
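A minimal software sketch of this execution model (my illustration in C, not a hardware description; a real scheduler evaluates all guards concurrently): test each rule's guard and fire an enabled rule, with every right-hand side reading the pre-fire state.

#include <stdbool.h>
#include <stdio.h>

int A = 3, B = 5, C = 0; // the shared state registers

static bool guard_X(void) { return A > 0 && B != C; }

static void fire_X(void) {
  // Like rule X above: every right-hand side reads the old state.
  int a = B + 1, b = B - 1, c = B * A;
  A = a; B = b; C = c;
}

int main(void) {
  for (int step = 0; step < 4 && guard_X(); step++) { // the "scheduler"
    fire_X();
    printf("A=%d B=%d C=%d\n", A, B, C);
  }
  return 0;
}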
L10-63: Triggered Instructions (TI)
• Restrict guarded actions down to an efficient ISA core:

doPass when (p_did_cmp && !p_cur_is_larger)  // Trigger: reads any number of 1-bit predicates (when can this happen?)
  %out0.data = %in0.first;                   // Operation: read/write data regs, channel control (what does it do?)
  %in0.deq;
  p_did_cmp = false;                         // Predicate op: writes 1-bit preds, data-dependent (what can happen next?)

No program counter or branch instructions
L10-64: Triggered Instruction Scheduler
• Use combinational logic to evaluate triggers in parallel
• Decide winners if more than one instruction is ready
• Based on architectural fairness policy
• Could pick multiple non-conflicting instructions to issue (superscalar)
• Note: no wires toggle unless status changes
[Figure: predicates p0–p3 feed trigger evaluators 0–n in parallel; the “can trigger” signals pass through priority-based trigger resolution to pick the “will trigger” operation sent to the datapath]
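The trigger-resolution stage sketched in C (my illustration; a fixed-priority pick, whereas the bullets above note that a real design applies an architectural fairness policy and may issue several non-conflicting instructions):

#include <stdio.h>

// One bit per instruction: "can trigger". Returns the index that
// "will trigger", or 32 if nothing is ready.
unsigned resolve(unsigned can_trigger) {
  for (unsigned i = 0; i < 32; i++)
    if (can_trigger & (1u << i)) return i; // lowest index wins
  return 32;
}

int main(void) {
  unsigned can = 0xC; // instructions 2 and 3 are ready
  printf("will trigger: %u\n", resolve(can)); // 2
  return 0;
}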
L10-65: Next Lecture: Dataflows
Thank you!