This lecture from MIT's 6.5930/1 (Hardware Architectures for Deep Learning) covers the impact of the memory hierarchy on kernel computation: an overview of caches, structuring algorithms to work well in caches using tiling, and storage technologies. It shows how caching, prefetching, and multi-level caches reduce memory access latency, and how tiling improves the computational intensity and performance of operations such as fully-connected layers.
1. L06-1
Sze and Emer
6.5930/1
Hardware Architectures for Deep Learning
Kernel Computation -
Impact of Memory Hierarchy
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 22, 2023
2. L06-2
Goals of Today's Lecture
• Understand the impact of the memory hierarchy
– Overview of caches
– Structuring algorithms to work well in caches using tiling
– Storage technologies
3. L06-3
Readings for this Week
• Efficient Processing of Deep Neural Networks
– Chapter 4 of https://doi.org/10.1007/978-3-031-01766-7
4. L06-4
Simple Pipelined µArchitecture
[Diagram: five-stage pipeline: PC -> IMEM -> IR/GPR -> operand registers X, Y -> ALU (+, *) -> DMEM]
Warning: Objects in PowerPoint may be larger than they appear
What are the consequences of putting a large memory (e.g., megabytes) directly in the pipeline?
– Long latency => dependency stalls
– Large energy consumption
5. L06-5
Pipelined µArchitecture with Caches
[Diagram: the same pipeline with an instruction cache (I$) and data cache (D$) in place of the large memories, backed by Memory]
Instruction cache (I$) and data cache (D$) hold memory data for reuse in a small, energy-efficient buffer
6. L06-6
Direct Mapped Cache
The address is partitioned into multiple fields: Tag (t bits), Index (k bits), and Offset (b bits); the tag and index together form the block number, and the offset is the block offset.
– The index picks one of the 2^k cache lines (rows)
– A valid bit indicates that the line's data block is valid
– Each data block consists of multiple words; the offset selects the desired word or byte
– HIT: valid bit set and a tag match means the data is in the cache
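The field partitioning above can be sketched in a few lines of Python. The 64 KB capacity and 64-byte blocks are assumptions carried over from a later slide, and the function name `split_address` is made up for illustration:

```python
# Decompose a byte address into (tag, index, offset) for a direct-mapped cache.
# Assumed geometry: 64 KB capacity, 64-byte blocks.
CACHE_BYTES = 64 * 1024
BLOCK_BYTES = 64                            # => b = 6 offset bits
NUM_LINES = CACHE_BYTES // BLOCK_BYTES      # 1024 lines => k = 10 index bits

B = BLOCK_BYTES.bit_length() - 1            # offset bits (6)
K = NUM_LINES.bit_length() - 1              # index bits (10)

def split_address(addr):
    offset = addr & (BLOCK_BYTES - 1)       # low b bits select word/byte in block
    index = (addr >> B) & (NUM_LINES - 1)   # next k bits pick the cache row
    tag = addr >> (B + K)                   # remaining high bits are compared on lookup
    return tag, index, offset

tag, index, offset = split_address(0xDEADBEEF)
```

Note that tag + index is exactly the block number from the slide: shifting the address right by b bits recovers it.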
7. L06-7
Cache Operation
Look at the data address and search the cache tags to find a match. Then if…
– Found in cache (a.k.a. HIT): return a copy of the data from the cache
– Not in cache (a.k.a. MISS): read the block of data from main memory, wait…, then return the data to the processor and update the cache
Metric: Hit Rate = #Hits / (#Hits + #Misses)
8. L06-8
Treatment of Writes
• Cache hit:
– write through: write both cache & memory
• generally higher traffic, but simplifies the cache in the processor pipeline
– write back: write cache only (memory is written only when the entry is evicted)
• a dirty bit per block can further reduce the traffic
• Cache miss:
– no write allocate: only write to main memory
– write allocate (aka fetch on write): fetch the block into the cache
• Common combinations:
– write through and no write allocate
– write back with write allocate
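A toy illustration of the traffic difference between the two hit policies. The one-block "cache" and the function `memory_writes` are hypothetical, purely for counting:

```python
# Count main-memory writes for a run of cache-hit writes to one block.
def memory_writes(policy, writes_to_block):
    if policy == "write-through":
        # Every cache-hit write also writes memory.
        return writes_to_block
    elif policy == "write-back":
        # With a dirty bit, the block is written to memory at most once,
        # on eviction, and only if it was modified.
        return 1 if writes_to_block > 0 else 0
    raise ValueError(policy)

assert memory_writes("write-through", 100) == 100
assert memory_writes("write-back", 100) == 1
```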
9. L06-9
Cache Locality
Caches implicitly try to optimize data movement by exploiting two common properties of memory references:
– Spatial Locality: if a location is referenced, it is likely that locations near it will be referenced in the near future.
• Exploited by having a block size larger than a word, which also amortizes fill overheads by getting more bytes with one access
– Temporal Locality: if a location is referenced, it is likely to be referenced again in the near future.
• Exploited by holding blocks for future access
10. L06-10
Fully Connected (FC) Computation
int i[C*H*W];   # Input activations
int f[M*C*H*W]; # Filter Weights
int o[M];       # Output activations

CHWm = -C*H*W
for m in [0, M):            # M iterations
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):  # C*H*W iterations
        o[m] += i[chw] * f[CHWm + chw]

Total: M × C*H*W iterations => M*C*H*W loads of each of the weights and the input activations
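The pseudocode above can be made runnable Python, keeping the same pointer-increment structure; the dimensions are made-up small values:

```python
# Naive fully-connected layer: o[m] = sum over chw of i[chw] * f[m*CHW + chw]
C, H, W, M = 2, 3, 3, 4
CHW = C * H * W

i = [float(x) for x in range(CHW)]       # input activations
f = [0.01 * x for x in range(M * CHW)]   # filter weights, stored row-major by m
o = [0.0] * M                            # output activations

CHWm = -CHW                              # running base offset of filter row m
for m in range(M):
    o[m] = 0.0
    CHWm += CHW                          # advance to filter row m
    for chw in range(CHW):
        o[m] += i[chw] * f[CHWm + chw]
```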
11. L06-11
Impact of spatial locality
• Typical in-pipeline cache size
– 64K bytes => 16K FP32 words
– 64 byte blocks => 16 FP32 words/block
Hit rate of long sequential reference streams due to spatial locality? 15/16 => ~94%
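The arithmetic behind the 15/16 answer, as a check:

```python
# A long sequential stream misses once per block fill, then gets
# (words_per_block - 1) hits from spatial locality within the block.
BLOCK_BYTES, WORD_BYTES = 64, 4                  # 16 FP32 words per block
words_per_block = BLOCK_BYTES // WORD_BYTES
hit_rate = (words_per_block - 1) / words_per_block   # 15/16, about 94%
```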
12. L06-12
FC – Data Reference Pattern
[Diagram, not drawn to scale: the M*CHW weights F[M0 ------], F[M1 ------], F[M2 ------], F[M3 ------], F[M4 ------], … are streamed once per output m = 0, 1, 2, 3, …, while the CHW input activations I[C0 H0 W0], I[C0 H0 W1], … are re-read for every m]
Weight locality… Spatial? Yes. Temporal? No. No reuse!
Input activation locality… Spatial? Yes. Temporal? It depends.
14. L06-14
Amount of temporal locality
• Typical layer size:
– H, W = 256; C = 128
Size of input activations? 256x256x128x4 bytes => 32MB
• Typical in-pipeline cache size
– 64K bytes => 16K FP32 words
– 64 byte blocks => 16 FP32 words/block
What does this imply for the input activation hit rate? No temporal locality, since 32MB > 64K bytes
15. L06-15
Computational Intensity – Naïve FC
CHWm = -C*H*W;
for m in [0, M):
    o[m] = 0;
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]

Number of MACs: M*C*H*W
Input activation accesses: M*C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

Computational Intensity = MACs / Data Words
= (M × C × H × W) / (M × C × H × W + M × C × H × W + M)
= 1 / (2 + 1/(C × H × W))
~ 1/2
16. L06-16
Computational Intensity – Ideal FC
CHWm = -C*H*W;
for m in [0, M):
    o[m] = 0;
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]

Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

Computational Intensity = MACs / Data Words
= (M × C × H × W) / (M × C × H × W + C × H × W + M)
= 1 / (1 + 1/M + 1/(C × H × W))
~ 1
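A quick numeric check of the two intensities, using the typical layer sizes from the earlier slide. M = 128 is an assumed value, since the slides do not fix M:

```python
# Computational intensity (MACs per data word) for the FC layer.
M, C, H, W = 128, 128, 256, 256    # H, W, C from the slides; M assumed
macs = M * C * H * W

# Naive: every MAC re-loads one input activation and one weight.
naive_words = M * C * H * W + M * C * H * W + M
# Ideal: each input activation is loaded only once.
ideal_words = M * C * H * W + C * H * W + M

ci_naive = macs / naive_words      # just under 1/2
ci_ideal = macs / ideal_words      # just under 1
```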
17. L06-17
Einsum for strip-mined FC
Original: O_m = I_chw × F_m,chw
Split the chw rank into a tile rank and an intra-tile rank (tile size T):
I_chw1,chw0 = I_chw, with chw1 = chw / T and chw0 = chw mod T
F_m,chw1,chw0 = F_m,chw
Strip-mined: O_m = I_chw1,chw0 × F_m,chw1,chw0
18. L06-18
Fully Connected – Strip Mined
Untiled:
for m in [0, M):
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHW*m + chw]
Inner loop working set = C*H*W

Strip mined (with CHW1*CHW0 = C*H*W):
// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1+chw0
            o[m] += i[chw] * f[CHW*m + chw]
Inner loop working set = CHW0

Just considering input activations, what value should CHW0 be? Less than the cache size
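A runnable version of the strip-mined loop nest, checked against the untiled computation. The sizes and the tile size CHW0 are illustrative; in practice CHW0 would be chosen to fit the cache:

```python
# Strip-mined FC loop nest from the slide, made runnable.
C, H, W, M = 2, 4, 4, 3
CHW = C * H * W
CHW0 = 8                 # tile size; must divide CHW in this simple sketch
CHW1 = CHW // CHW0

i = [float(x % 7) for x in range(CHW)]
f = [0.1 * (x % 5) for x in range(M * CHW)]
o = [0.0] * M

for chw1 in range(CHW1):          # Level 1: walk tiles of the input
    for m in range(M):            # revisit every output once per tile
        for chw0 in range(CHW0):  # Level 0: working set = CHW0 inputs
            chw = CHW0 * chw1 + chw0
            o[m] += i[chw] * f[CHW * m + chw]
```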
19. L06-19
FC – Strip Mined Data Reference Pattern
[Diagram, not drawn to scale: untiled vs. tiled reference patterns for the M*CHW weights F[M0 ------] … F[M4 ------] and the CHW input activations I[C0 H0 W0], I[C0 H0 W1], …; in the tiled order, input re-references within a tile become cache hits]
21. L06-21
Computational Intensity – Strip Mined
// Level 1
for chw1 in [0, CHW1):
    for m in [0, M):
        // Level 0
        for chw0 in [0, CHW0):
            chw = CHW0*chw1+chw0
            o[m] += i[chw] * f[CHW*m + chw]

Number of MACs: M*C*H*W
Input activation accesses: C*H*W
Filter weight accesses: M*C*H*W
Output activation accesses: M

Computational Intensity
= (M × C × H × W) / (M × C × H × W + C × H × W + M)
= 1 / (1 + 1/M + 1/(C × H × W))
~ 1
22. L06-22
Associative Cache
[Diagram: two-way set-associative cache. The address is split into Tag (t bits), Index (k bits), and Block Offset (b bits); the index selects a set, the tag is compared against the tags of both ways (each with a valid bit), and the data word or byte is picked from the 'way' that 'hits']
Allows multiple streams to be resident at the same time
23. L06-23
Cache Miss Pipeline Diagram
HIT:
ld r6, w(r5)    IF ID RF EX D$ WB
mul r7,r4,r6       IF ID RF EX D$ WB
MISS:
ld r6, w(r5)    IF ID RF EX D$ MISS - MEM WB
mul r7,r4,r6       IF ID RF stall stall EX D$ WB
On a miss, the dependent mul stalls until the load's data returns from memory.
24. L06-24
Avoiding Cache Miss Stalls
• Reorganize code so loads are far ahead of use
– Requires a huge amount of unrolling
– Consumes lots of registers
• Add 'prefetch' instructions that just load the cache
– Consumes instruction issue slots
• Add hardware that automatically loads the cache
25. L06-25
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon a miss on b
• One Block Lookahead (OBL) scheme:
– Initiate a prefetch for block b + 1 when block b is accessed
– Can extend to N-block lookahead
• Strided prefetch:
– If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc.
Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
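A minimal sketch of the stride-detection idea behind strided prefetch. The function `next_prefetch` is hypothetical and not modeled on any particular machine's prefetcher:

```python
# Given a history of accessed block numbers, predict the next block to
# prefetch: after seeing b, b+N, b+2N, prefetch b+3N.
def next_prefetch(block_history):
    if len(block_history) < 3:
        return None                      # not enough history to detect a stride
    a, b, c = block_history[-3:]
    if b - a == c - b and b != a:        # consistent non-zero stride
        return c + (b - a)
    return None

assert next_prefetch([10, 12, 14]) == 16    # stride 2 detected
assert next_prefetch([10, 11, 14]) is None  # no consistent stride
```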
26. L06-26
Multi-level Caches
• A memory cannot be both large and fast
• Add levels of cache to reduce the miss penalty
– Each level can have longer latency than the level above
– So, increase the size of the cache at each level
CPU <-> L1 <-> L2 <-> DRAM
Metrics:
– Local miss rate = misses in this cache / accesses to this cache
– Global miss rate = misses in this cache / CPU memory accesses
– Misses per instruction = misses in this cache / number of instructions
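The local and global miss rate metrics can be illustrated with hypothetical counts:

```python
# Two-level hierarchy with made-up counts.
cpu_accesses = 1000
l1_misses = 100       # these 100 accesses go on to the L2
l2_misses = 20        # of those, 20 also miss in L2 and go to DRAM

l1_local = l1_misses / cpu_accesses    # 0.10 (L1 sees all CPU accesses)
l2_local = l2_misses / l1_misses       # 0.20 (L2 only sees L1 misses)
l2_global = l2_misses / cpu_accesses   # 0.02

# The global miss rate is the product of the local miss rates above it.
assert abs(l2_global - l1_local * l2_local) < 1e-12
```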
28. L06-28
FC Layer – Multichannel
[Diagram: N input fmaps (C × H × W each) and M filters (C × H × W each, i.e., each filter is the same size as an input fmap) produce N output fmaps of size M × 1 × 1]
29. L06-29
Fully-Connected (FC) Layer
[Matrix view: Filters (M × CHW) × Input fmaps (CHW × 1) = Output fmaps (M × 1)]
30. L06-30
Fully-Connected (FC) Layer
[Matrix view: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
31. L06-31
FC Einsum Notation
[Matrix view: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N)]
O_n,m = F_m,chw × I_n,chw
35. L06-35
Fully-Connected (FC) Layer
[Matrix view: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), with the naive computation walking whole rows and columns]
• After flattening, having a batch size of N turns the matrix-vector operation into a matrix-matrix multiply
How much temporal locality for the naïve implementation? None
38. L06-38
Tiled Fully-Connected (FC) Layer
[Matrix view: Filters (M × CHW) × Input fmaps (CHW × N) = Output fmaps (M × N), partitioned into 2×2 tiles:]
F0,0 F0,1     I0,0 I0,1     F0,0·I0,0 + F0,1·I1,0   F0,0·I0,1 + F0,1·I1,1
F1,0 F1,1  ×  I1,0 I1,1  =  F1,0·I0,0 + F1,1·I1,0   F1,0·I0,1 + F1,1·I1,1
Matrix multiply tiled to fit in cache, and computation ordered to maximize reuse of data in cache
42. L06-42
Einsum for tiled FC
Original: O_n,m = I_n,chw × F_m,chw
Split each rank into a tile rank and an intra-tile rank:
I_n,chw → I_n1,chw1,n0,chw0
F_m,chw → F_m1,chw1,m0,chw0
Tiled: O_n1,m1,n0,m0 = I_n1,chw1,n0,chw0 × F_m1,chw1,m0,chw0
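The tiled einsum corresponds to a loop nest like the following pure-Python sketch with illustrative sizes: the tile loops n1, m1, chw1 walk tiles, while n0, m0, chw0 work within a tile, so each tile of I and F is reused while it would still be resident in the cache:

```python
# Tiled matrix multiply O = I x F^T (the tiled FC of the slides).
N, M, CHW = 4, 4, 8            # illustrative sizes
N0, M0, CHW0 = 2, 2, 4         # tile sizes (chosen to divide evenly here)

I = [[float((n * CHW + c) % 5) for c in range(CHW)] for n in range(N)]
F = [[0.1 * ((m * CHW + c) % 7) for c in range(CHW)] for m in range(M)]
O = [[0.0] * M for _ in range(N)]

for n1 in range(N // N0):
    for m1 in range(M // M0):
        for c1 in range(CHW // CHW0):    # accumulate partial tile products
            for n0 in range(N0):
                for m0 in range(M0):
                    n, m = n1 * N0 + n0, m1 * M0 + m0
                    for c0 in range(CHW0):
                        c = c1 * CHW0 + c0
                        O[n][m] += I[n][c] * F[m][c]
```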
43. L06-43
Fully-Connected (FC) Layer
• Implementation: Matrix Multiplication (GEMM)
– CPU: OpenBLAS, Intel MKL, etc.
– GPU: cuBLAS, cuDNN, etc.
• The library will note the shape of the matrix multiply and select an implementation optimized for that shape.
• Optimization usually involves proper tiling to the storage hierarchy.
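As a concrete instance of the GEMM formulation, NumPy (which dispatches to a BLAS such as OpenBLAS or MKL) can be used; the sizes here are illustrative:

```python
# The FC layer as a single GEMM call; the library handles the tiling.
import numpy as np

N, M, CHW = 8, 16, 32
I = np.arange(N * CHW, dtype=np.float32).reshape(N, CHW)   # flattened input fmaps
F = np.ones((M, CHW), dtype=np.float32)                    # filters
O = I @ F.T                                                # (N, M) output fmaps
```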
45. L06-45
Overview of Memories
Memories consist of arrays of cells that each hold a value.
• Types of Memories/Storage
– Latches/Flip-Flops (Registers)
– SRAM (Register File, Caches)
– DRAM (Main Memory)
– Flash (Storage)
46. L06-46
Elements of Memory Operation
Implementations vary based on:
– How does a memory cell hold a value?
– How is a value obtained from a memory cell?
– How is a value set in a memory cell?
– How is an array constructed out of individual cells?
• This results in tradeoffs between cost, density, speed, energy and power consumption
47. L06-47
Latches/Flip-Flops
• Fast and low latency
• Located with logic (e.g., the PC, IR, GPR, X and Y registers in the CPU pipeline)
[Image: D flip-flop; source: 6.111]
48. L06-48
Latches/Flip-Flops (< 0.5 kB)
• Fast and low latency
• Located with logic
• Not very dense
– 10+ transistors per bit
– Usually used for arrays smaller than 0.5 kB
[Image: array of flip-flops with a read address [A2:A0]; D flip-flop; source: 6.111]
49. L06-49
Latches/Flip-Flops (< 0.5 kB)
[Figure only: the flip-flop array with read address [A2:A0], shown in the context of the CPU pipeline]
50. L06-50
SRAM
• Higher density than registers
– Usually 6 transistors per bit-cell
• Less robust and slower than latches/flip-flops
– Bit cell size 0.75um2 in 14nm
[Image: IC wafer]
53. L06-53
SRAM Power Dominated by Bit Line
Measured SRAM power breakdown @ VDD=0.6V: Bit-lines (BL) 56%, Word-line (WL) 6%, Sensing Ntwk. 15%, Other 22%
Larger array → longer bit-lines → higher capacitance → higher power
Image Source: Mahmut Sinangil
54. L06-54
DRAM
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
55. L06-55
DRAM (GB)
• Higher density than SRAM
– 1 transistor per bit-cell
– Needs periodic refresh
• Special device process
– Usually off-chip (except eDRAM – which is pricey!)
– Off-chip interconnect has much higher capacitance
56. L06-56
Flash (100GB to TB)
• More dense than DRAM
• Non-volatile
– Needs a high-powered write (changes the VTH of a transistor)
57. L06-57
Flash Memory
Cell types: Single Level Cell (SLC), Multi-level cell (MLC), Ternary level cell (TLC)
– 48-layer TLC, Aug 2015: 256 Gb per die (for SSD)
58. L06-58
Memory Tradeoffs
– Density: function of circuit type (smaller → denser)
– Cost/bit: function of circuit type (smaller → cheaper)
– Energy/access/bit: function of total capacity (smaller → less energy) and circuit type (smaller → less energy)
– Latency: function of circuit type (smaller → slower) and total capacity (smaller → faster)
– Bandwidth: increases with parallelism
Most attributes tend to improve with technology scaling, lower voltage and sometimes smaller capacitors
59. L06-59
Summary
• Reduce main memory accesses with caches
– Main memory (i.e., DRAM) is slow and has high energy consumption
– Caches exploit spatial and temporal locality
• Tiling to reduce cache misses
– Possible since the processing order does not affect the result (the MAC accumulations are commutative)
– Add levels to the loop nest to improve temporal locality
– The size of a tile depends on the cache size and cache associativity
• Tradeoffs in storage technology
– Various tradeoffs in cost, speed, energy, capacity…
– Different technologies are appropriate at different spots in the design