Large Scale Data Center Solution Guide: eBGP-Based Design (Dhiman Chowdhury)
Network automation provides IT administrators and network operators with significant benefits. This solution guide describes an approach to building data centers on a Layer 3 BGP-routed design. It also summarizes some design philosophies for data centers and explains why eBGP is well suited to them.
■ Large-scale data center requirements
■ Large-scale data center topologies
■ Large-scale data center routing
■ eBGP-routed large-scale Clos topology-based data centers
A Journey to Boot Linux on Raspberry Pi (Jian-Hong Pan)
Each processor/chip architecture has its own procedure for booting the kernel. It works with a designed partition layout and vendor-specific firmware/bootloaders in the boot partition. We can learn the relevant details from the Raspbian image for the Raspberry Pi, a board that is easy to obtain. However, the diversity of these special boot procedures and their specific firmware/bootloaders increases complexity for distribution maintainers. It would be great if there were a more generic way, applicable to most chip architectures and boards, to boot up the system.
After looking at several Linux distributions, we learned that U-Boot may play a role in the solution. It splits the boot procedure into hardware-specific and generic system parts, which helps distribution maintainers deploy the generic system, including device trees, with OSTree.
Let’s take a deep dive into this magical boot procedure!
One of the great challenges of monitoring any large cluster is deciding how much data to collect and how often to collect it. Those responsible for managing the cloud infrastructure want everything collected centrally, which places limits on how much and how often. Developers, on the other hand, want as much detail as they can get, at as high a frequency as is reasonable, without impacting overall cloud performance.
To address these seemingly conflicting requirements, we've chosen a hybrid model at HP. Like many others, we have a centralized monitoring system that records a set of key system metrics for all servers at one-minute granularity. At the same time, we do fine-grained local monitoring of hundreds of metrics on each server every second, so when a problem needs more detail than is available centrally, one can go to the servers in question and see exactly what was going on at any specific time.
The tool of choice for this fine-grained monitoring is the open-source tool collectl, which also has an extensible API. It is through this API that we've developed a Swift monitoring capability that not only captures the number of GETs, PUTs, etc. every second, but, using collectl's colmux utility, can also display them in a top-like format to show exactly what all the object and/or proxy servers are doing in real time.
We've also developed a second capability that lets one see what the virtual machines on each compute node are doing in terms of CPU, disk, and network traffic. This data can also be displayed in real time with colmux.
This talk will briefly introduce the audience to collectl's capabilities, but more importantly show how it can be used to augment any existing centralized monitoring infrastructure.
Speakers
Mark Seger
Final project report on grocery store management system.pdf (Kamal Acharya)
In today’s fast-changing business environment, it is extremely important to be able to respond to client needs in the most effective and timely manner, and customers increasingly expect to find your business online with instant access to your products or services.
Online Grocery Store is an e-commerce website that retails various grocery products. The project lets visitors view the available products and enables registered users to purchase desired products instantly using the Paytm and UPI payment processors (Instant Pay), or to place an order using the Cash on Delivery (Pay Later) option. It also gives administrators and managers easy access to view orders placed with the Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of technologies must be studied and understood. These include multi-tiered architecture, server- and client-side scripting techniques, implementation technologies, programming languages (such as PHP, HTML, CSS, and JavaScript), and MySQL relational databases. The objective of this project is to develop a basic shopping-cart website for consumers and to understand the technologies used to build such a website.
This document will discuss each of the underlying technologies used to create and implement an e-commerce website.
CFD Simulation of By-pass Flow in a HRSG Module by R&R Consult.pptx (R&R Consult)
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: a large natural gas-fired power plant, which uses waste heat to generate steam and electricity, was puzzled that its boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue of reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Sachpazis: Terzaghi Bearing Capacity Estimation in Simple Terms with Calculati... (Dr. Costas Sachpazis)
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The calculation HTML code is included.
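For reference, the classic Terzaghi expression for a strip footing is usually written as q_ult = c·N_c + γ·D_f·N_q + 0.5·γ·B·N_γ, where c is the soil cohesion, γ the unit weight, D_f the foundation depth, B the footing width, and N_c, N_q, N_γ the bearing capacity factors that depend on the friction angle. This is the standard general form; the notation and corrections used in the report itself may differ.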
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Cosmetic shop management system project report.pdf (Kamal Acharya)
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it is tough to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. The system includes various function programs to carry out the tasks mentioned above.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system deals with the automation of the general workflow and administration processes of the shop. The main processes of the system focus on customer requests, where the system is able to search for the most appropriate products and deliver them to the customers. It helps employees quickly identify the cosmetic products that have reached the minimum quantity, keeps track of the expiry date of each cosmetic product, and helps employees find the rack number in which a product is placed. It is also a faster and more efficient way of working.
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
L09-handout.pdf
1. L09-1
6.5930/1
Hardware Architectures for Deep Learning
GPU Computation
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
March 6, 2023
2. L09-2
Sze and Emer
Why are GPUs interesting?
March 6, 2023
• Very successful commodity accelerator/co-processor
• GPUs combine two strategies to increase efficiency
– Massive parallelism
– Specialization
• Illustrates tension between performance and
programmability in accelerators
• Pervasively applied in deep learning applications
3. L09-3
Sze and Emer
Background Reading
• GPUs
– Computer Architecture: A Quantitative Approach,
by Hennessy and Patterson
• Edition 6: Ch 4: p310-336
• Edition 5: Ch 4: p288-315
All these books and their online/e-book versions are available through
MIT libraries.
March 6, 2023
4. L09-4
Sze and Emer
Programmability vs Efficiency
March 6, 2023
More programmable <--------------------------------------------------> More efficient
CPU          GPU          FPGA          CGRA          ASIC
FPGA => Field programmable gate array
CGRA => Coarse-grained reconfigurable array
ASIC => Application-specific integrated circuit
6. L09-6
Sze and Emer
CPU vs. GPU Performance
March 6, 2023
Source: Stanford CS231n
Ratio of (partially-optimized) CPU vs. CUDA library (cuDNN)
7. L09-7
Sze and Emer
CPU vs. GPU Performance
March 6, 2023
Source: Stanford CS231n
Ratio of unoptimized CUDA vs. CUDA library (cuDNN)
8. L09-8
Sze and Emer
Single Instruction Multiple Thread
March 6, 2023
SIMT
– Many threads, each with private architectural state or context, e.g., registers.
– A group of threads that issue together is called a warp.
– All threads that issue together execute the same instruction.
– The entire pipeline is an SM, or streaming multiprocessor.
[Diagram: a lane (green = Nvidia terminology) consisting of PC, I$, IR, a GPR file, + and * function units, and memory; several such lanes make up the SM.]
9. L09-9
Sze and Emer
Multiple Thread – Single Instruction Multiple Thread
March 6, 2023
[Diagram: multi-threaded SIMT pipeline: many per-thread contexts (PCs and GPR1 register files) share the instruction fetch path (PC, I$, IR) and the + and * function-unit lanes and memory.]
10. L09-10
Sze and Emer
Function unit optimization
March 6, 2023
[Diagram: "Specialize Function Units": the baseline lane with combined + and * units is split so the adder and the multiplier become separate, specialized function units.]
11. L09-11
Sze and Emer
Function unit optimization
March 6, 2023
Restriction: Can’t issue same operation twice in a row
[Diagram: the pipeline with separate specialized adder and multiplier units, fed from the register files and memory through crossbars, with an additional IR stage.]
12. L09-12
Sze and Emer
Function unit optimization
March 6, 2023
Cycle:   0         1         2         3         4         5
Fetch:   Add0,0-1  Mul1,0-1  Add2,0-1  Mul3,0-1  …
GPR0:    Add0,0    Mul1,0    Add2,0    Mul3,0    …
GPR1:    Add0,1    Mul1,1    Add2,1    …
Adder:   Add0,0    Add0,1    Add2,0    …
Mul:     Mul1,0    Mul1,1    …
Key: Opcode_inum, thread(s)   (rows = unit, columns = cycle)
[Diagram: the datapath with specialized adder and multiplier units, register-file banks GPR0/GPR1, crossbars, and memory, executing the schedule above.]
13. L09-13
Sze and Emer
Streaming Multiprocessor Overview
March 6, 2023
• Each SM supports tens of warps (e.g., 64 in Kepler) with 32 threads/warp
• Fetch 1 instruction/cycle
• Issue 1 ready instruction/cycle
– Simple scoreboarding: all warp elements must be ready
• Instruction broadcast to all lanes
• Multithreading is the main latency-hiding mechanism
14. L09-14
Sze and Emer
Little’s Law
March 6, 2023
Throughput (T) = Number in Flight (N) / Latency (L)
[Diagram: instructions issue into execution with N in flight.]
Example:
64 warps (number of instructions in flight)
1 instruction / cycle (desired throughput)
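Plugging these numbers into Little's Law: to sustain T = 1 instruction per cycle with N = 64 warps in flight, the SM can tolerate an average instruction latency of L = N / T = 64 cycles while still issuing every cycle.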
15. L09-15
Sze and Emer
Number of Active Warps
March 6, 2023
• SMs support a variable number of active warps based on required registers (and shared memory). Fewer warps allow more registers per warp, i.e., a larger context.
– Few large contexts -> fewer register spills
– Many small contexts -> more latency tolerance
– Choice left to the compiler
• Example: Kepler supports up to 64 warps
– Max: 64 warps @ <= 32 registers/thread
– Min: 8 warps @ 256 registers/thread
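Both extremes correspond to the same physical register-file capacity: 64 warps × 32 threads × 32 registers = 8 warps × 32 threads × 256 registers = 65,536 registers, matching Kepler's 64K-entry (256 KB) register file per SM.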
16. L09-16
Sze and Emer
GPU ISA and Compilation
March 6, 2023
• GPU micro-architecture and instruction set change
very frequently
• To achieve compatibility:
– Compiler produces intermediate pseudo-assembler
language (e.g., Nvidia PTX)
– GPU driver JITs kernel, tailoring it to specific micro-
architecture
• In practice, little performance portability
– Code is often tuned to specific GPU architecture
17. L09-17
Sze and Emer
Multiple Thread – Single Instruction Multiple Thread
March 6, 2023
[Diagram: the multi-threaded SIMT pipeline shown earlier, repeated as a lead-in to the memory discussion.]
18. L09-18
Sze and Emer
Many Memory Types
March 6, 2023
[Diagram: threads 0, 1, 2, … each with private per-thread memory, plus a scratchpad shared memory and a global memory.]
19. L09-19
Sze and Emer
Private Per Thread Memory
March 6, 2023
• Private memory
– No cross-thread sharing
– Small, fixed size memory
• Can be used for constants
– Multi-bank implementation (can be in global memory)
[Diagram: Thread 0, 1, 2, … each mapped to its own private per-thread memory.]
20. L09-20
Sze and Emer
Shared Scratchpad Memory
March 6, 2023
• Shared scratchpad memory (threads share data)
– Small, fixed size memory (16K-64K / ‘core’)
– Banked for high bandwidth
– Fed with address coalescing unit (ACU) + crossbar
• ACU can buffer/coalesce requests
[Diagram: threads access multiple shared-memory banks through an address coalescing unit (ACU) plus crossbar.]
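As a rough sketch (not from the handout) of how this scratchpad is used from CUDA, a thread block can stage data in shared memory and synchronize before reusing it; the kernel name, array names, and tile size are illustrative:

#define TILE 256   // threads per block; also elements staged per block (assumption)

__global__ void stage_and_scale(const float *in, float *out, int n, float alpha)
{
    // One scratchpad tile per thread block, banked for high bandwidth.
    __shared__ float tile[TILE];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced load from global memory into the shared scratchpad.
    if (idx < n)
        tile[threadIdx.x] = in[idx];

    // Barrier: all threads in the block now see the staged data.
    __syncthreads();

    // Threads may read any element of the tile (data sharing);
    // here each simply scales its own element.
    if (idx < n)
        out[idx] = alpha * tile[threadIdx.x];
}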
21. L09-21
Sze and Emer
Memory Access Divergence
March 6, 2023
• All loads are gathers, all stores are scatters
• The address coalescing unit (ACU) detects sequential and strided patterns and coalesces memory requests, but complex patterns can result in multiple lower-bandwidth requests (memory divergence)
• Writing efficient GPU code requires most accesses not to conflict, even though the programming model allows arbitrary patterns (see the sketch below)
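To make the divergence point concrete, here is a hedged sketch (not from the handout) contrasting a coalesced and a strided access pattern; the kernel names and the stride are illustrative:

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads touch adjacent addresses: one wide request per warp.
    if (i < n)
        out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads touch addresses 'stride' elements apart: the ACU cannot
    // merge them, so the warp generates many narrow (divergent) requests.
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}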
22. L09-22
Sze and Emer
Shared Global Memory
March 6, 2023
• Shared global memory
– Large shared memory
– Will also suffer from memory divergence
[Diagram: threads access banked global memory through address coalescing units (ACU) and crossbars, as with the scratchpad.]
23. L09-23
Sze and Emer
Shared Global Memory
March 6, 2023
• Memory hierarchy with caches
– Cache to save memory bandwidth
– Caches also enable compression/decompression of data
[Diagram: requests pass through the ACU/crossbar to cache tag/data arrays; hits return data directly, while misses travel over a network to the global memory banks and return buffered data.]
24. L09-24
Sze and Emer
Serialized cache access
March 6, 2023
• Trade latency for power/flexibility
– Only access data bank that contains data
– Facilitate more sophisticated cache organizations
• e.g., greater associativity
[Diagram: the address is split into tag, index, and block offset; the tag store is probed first and, on a match, only the data store bank holding the block is accessed, with the results then combined.]
25. L09-25
Sze and Emer
GPU Programming Environments
Code for accelerated kernels
• CUDA (Nvidia-only)
– C-like language that runs on GPU
– Libraries: cuDNN, cuBLAS, cuFFT
• OpenCL (open standard)
– C-like language that runs on GPU, CPU or FPGA
– usually less optimized than CUDA
March 6, 2023
26. L09-26
Sze and Emer
CUDA GPU Thread Model
March 6, 2023
• Single-program multiple data (SPMD) model
• Each context is a thread
– Threads have registers
– Threads have local memory
• Parallel threads packed in blocks
– Blocks have shared memory
– Threads synchronize with barrier
– Blocks run to completion (or abort)
• Grids include independent blocks
– May execute concurrently
– Share global memory, but
– Have limited inter-block synchronization
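As a small, hedged illustration of this hierarchy (not from the handout), the snippet below launches a grid of blocks in which each thread handles one element; the vadd kernel and launch parameters are hypothetical:

__global__ void vadd(const float *a, const float *b, float *c, int n)
{
    // Each thread in the grid computes one output element (SPMD).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side: pick a block size, derive the grid size, launch.
// dim3 block(256);
// dim3 grid((n + block.x - 1) / block.x);
// vadd<<<grid, block>>>(dev_a, dev_b, dev_c, n);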
27. L09-27
Sze and Emer
Hardware Scheduling
March 6, 2023
• Grids can be launched by
CPU or GPU
– Work from multiple CPU threads
and processes
• HW unit schedules grids on
SMs (labeled SMX in diagram)
– Priority-based scheduling
• 32 active grids
– More queued/paused
28. L09-28
Sze and Emer
GPU Kernel Execution
March 6, 2023
1. Transfer input data from CPU to GPU memory
2. Launch kernel (grid)
3. Wait for kernel to finish (if synchronous)
4. Transfer results to CPU memory
[Diagram: CPU and CPU memory connected to GPU and GPU memory, with steps 1-4 marked on the transfers and the kernel launch.]
• Data transfers can dominate execution time
• Integrated GPUs with a unified address space need no copies, but…
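On CUDA systems that support managed (unified) memory, the explicit copies can be replaced by a single allocation visible to both sides; a minimal sketch, assuming such a device and a hypothetical scale kernel, inside host code:

__global__ void scale(float *d, int n) {          // hypothetical kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int n = 1 << 20;
float *data;
cudaMallocManaged(&data, n * sizeof(float));      // one pointer, visible to CPU and GPU
for (int i = 0; i < n; i++) data[i] = (float)i;   // CPU writes inputs directly
scale<<<(n + 255) / 256, 256>>>(data, n);         // GPU works on the same allocation
cudaDeviceSynchronize();                          // ensure results are visible to the CPU
cudaFree(data);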
29. L09-29
Sze and Emer
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmap (CHW × 1) = Output fmap (M × 1)
• Matrix–Vector Multiply:
• Multiply all inputs in all channels by a weight and sum
March 6, 2023
32. L09-32
Sze and Emer
FC - CUDA Main Program
int i[C*H*W], *gpu_i;    // Input activations
int f[M*C*H*W], *gpu_f;  // Filter weights
int o[M], *gpu_o;        // Output activations

// Allocate space on GPU
cudaMalloc((void**) &gpu_i, sizeof(int)*C*H*W);
cudaMalloc((void**) &gpu_f, sizeof(int)*M*C*H*W);
cudaMalloc((void**) &gpu_o, sizeof(int)*M);

// Copy data to GPU
cudaMemcpy(gpu_i, i, sizeof(int)*C*H*W, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_f, f, sizeof(int)*M*C*H*W, cudaMemcpyHostToDevice);

// Run kernel: M/256 + 1 thread blocks, 256 threads per block
fc<<<M/256 + 1, 256>>>(gpu_i, gpu_f, gpu_o, C*H*W, M);

// Copy result back and free device memory
cudaMemcpy(o, gpu_o, sizeof(int)*M, cudaMemcpyDeviceToHost);
cudaFree(gpu_i);
...
March 6, 2023
33. L09-33
Sze and Emer
FC – CUDA Terminology
• CUDA code launches 256 threads per block
• CUDA vs vector terminology:
– Thread = 1 iteration of scalar loop [1 element in vector loop]
– Block = Body of vectorized loop [VL=256 in this example]
• Warp size = 32 [Number of vector lanes]
– Grid = Vectorizable loop
[ vector terminology ]
March 6, 2023
34. L09-34
Sze and Emer
FC - CUDA Kernel
__global__
void fc(int *i, int *f, int *o, const int CHW, const int M){
  // Calculate "global" thread id (tid); tid is the "output channel" number
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int m = tid;
  int sum = 0;
  if (m < M){
    for (int chw = 0; chw < CHW; chw++) {
      sum += i[chw] * f[(m*CHW) + chw];
    }
    o[m] = sum;
  }
}

Any consequences of f[(m*CHW)+chw]?
Any consequences of f[(chw*M)+m]?

March 6, 2023
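A hedged note on those two questions (my reading, not stated on the slide): with f[(m*CHW)+chw], threads in a warp have consecutive m, so in any given iteration they read filter elements CHW apart, a strided and poorly coalesced pattern. With f[(chw*M)+m], consecutive threads read consecutive addresses in every iteration, so the warp's filter loads coalesce into wide requests.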
35. L09-35
Sze and Emer
Convolution (CONV) Layer
[Diagram: many (N) input fmaps, each C × H × W; M filters, each C × R × S; many (N) output fmaps, each M × E × F; batch size N.]
March 6, 2023
36. L09-36
Sze and Emer
Convolution (CONV) Layer
[Figure: worked example in which a 2-channel input fmap is unrolled into a Toeplitz matrix and multiplied by two flattened filters (Filter 1, Filter 2) to produce the output fmap.]
GPU Implementation:
• Keep original input activation matrix in main memory
• Conceptually do tiled matrix-matrix multiplication
• Copy input activations into the scratchpad as a small Toeplitz-matrix tile
• On Volta tile again to use ‘tensor core’
March 6, 2023
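A hedged sketch (not from the handout) of how the Toeplitz (im2col) unrolling can be expressed as a CUDA kernel for stride-1, unpadded convolution; all names and the row-major layout are illustrative assumptions:

// Input  I: C x H x W (flattened, row-major)
// Output T: (C*R*S) x (E*F) Toeplitz matrix, E = H-R+1, F = W-S+1
__global__ void im2col(const float *I, float *T,
                       int C, int H, int W, int R, int S)
{
    int E = H - R + 1, F = W - S + 1;
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // one output position per thread
    if (col >= E * F) return;

    int e = col / F, fcol = col % F;
    for (int c = 0; c < C; c++)
        for (int r = 0; r < R; r++)
            for (int s = 0; s < S; s++) {
                int row = (c * R + r) * S + s;
                T[row * (E * F) + col] = I[(c * H + (e + r)) * W + (fcol + s)];
            }
}

// Launch with enough threads to cover E*F output positions; multiplying the
// (M x C*R*S) filter matrix by T then yields the (M x E*F) output fmap.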
37. L09-37
Sze and Emer
GV100 – “Tensor Core”
Tensor Core….
• 120 TFLOPS (FP16)
• 400 GFLOPS/W (FP16)
New opcodes – Matrix Multiply Accumulate (HMMA)
How many FP16 operands?
How many multiplies?
How many adds?
March 6, 2023
38. L09-38
Sze and Emer
Tensor Core Integration
[Diagram: the multi-threaded SIMT pipeline shown earlier, with its normal + and * function units.]
In addition to normal function units…
March 6, 2023
39. L09-39
Sze and Emer
Tensor Core Integration
[Diagram: …a Tensor Matrix Multiply Accumulate unit is added alongside the pipeline; it takes cross-thread operands gathered from the register files of the threads in a warp.]
March 6, 2023
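For context, a minimal sketch (not from the handout) of how CUDA code drives these units through the warp-level WMMA intrinsics; the 16x16x16 FP16 tile shape, pointer names, and the sm_70-or-newer target are assumptions:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile: D = A x B + 0, A/B in FP16, accumulate in FP32.
__global__ void wmma_tile(const half *A, const half *B, float *D,
                          int lda, int ldb, int ldd)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // clear the accumulator
    wmma::load_matrix_sync(a_frag, A, lda);              // operands spread across the warp
    wmma::load_matrix_sync(b_frag, B, ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // issues the matrix-multiply-accumulate
    wmma::store_matrix_sync(D, acc_frag, ldd, wmma::mem_row_major);
}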
40. L09-40
Sze and Emer
Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)
• After flattening, having a batch size of N turns the
matrix-vector operation into a matrix-matrix multiply
March 6, 2023
41. L09-41
Sze and Emer
Tiled Fully-Connected (FC) Layer
Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), partitioned into 2×2 blocks:

[ F0,0  F0,1 ]   [ I0,0  I0,1 ]   [ F0,0 I0,0 + F0,1 I1,0    F0,0 I0,1 + F0,1 I1,1 ]
[ F1,0  F1,1 ] x [ I1,0  I1,1 ] = [ F1,0 I0,0 + F1,1 I1,0    F1,0 I0,1 + F1,1 I1,1 ]
Matrix multiply tiled to fit in tensor core operation
and computation ordered to maximize reuse of data in scratchpad
and then in cache.
March 6, 2023
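To illustrate the tiling idea at the scratchpad level, here is a hedged sketch (not from the handout) of a shared-memory-tiled matrix multiply; TILE, the names, and the square-matrix assumption (n a multiple of TILE) are mine:

#define TILE 16

// C = A x B for n x n row-major matrices, n assumed to be a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // scratchpad tiles reused by the whole block
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        // Each thread stages one element of the A and B tiles (coalesced loads).
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Every element staged in the scratchpad is reused TILE times.
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = sum;
}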
44. L09-44
Sze and Emer
Summary
• GPUs are an intermediate point in the continuum of flexibility
and efficiency
• GPU architecture focuses on throughput (over latency)
– Massive thread-level parallelism, with
• Single instruction for many threads
– Memory hierarchy specialized for throughput
• Shared scratchpads with private address space
• Caches used primarily to reduce bandwidth not latency
– Specialized compute units, with
• Many computes per input operand
• Little’s Law useful for system analysis
– Shows relationship between throughput, latency and tasks in-flight
March 6, 2023