Cache Optimizations
Vani Kandhasamy
AP, PSG Tech
Cache Performance Measure
AMAT = Hit Time + Miss Rate * Miss Penalty
Cache Performance Measure
• Hit rate: fraction of cache accesses that result in a hit
• Miss rate: 1 − hit rate
• Hit time: time to access the cache
• Miss penalty: time to fetch/replace a block from the lower
level plus the time taken to transfer it to the cache
–access time: time to access the lower level
–transfer time: time to transfer the block
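A quick worked example with assumed numbers (not from the slides): a hit time of 1 clock cycle, a miss rate of 5%, and a miss penalty of 100 clock cycles give
AMAT = 1 + 0.05 * 100 = 6 clock cycles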
Improving Cache Performance
AMAT = Hit Time + Miss Rate * Miss Penalty
• Reducing miss rate
• Reducing cache miss Penalty
• Reducing hit time
Three Cs Model
❖ Compulsory:
• The very first access to a block cannot hit in the cache
(unless the block is prefetched)
• Also called cold-start or first-reference misses
❖ Capacity:
• Cache cannot contain all blocks accessed by the
program
❖ Conflict (collision):
• Multiple memory locations mapped to the same cache
location
6 Basic Cache Optimizations
• Reducing Miss Rate
1. Larger Block size (compulsory misses)
2. Larger Cache size (capacity misses)
3. Higher Associativity (conflict misses)
• Reducing Miss Penalty
4. Multilevel Caches
5. Giving Reads Priority over Writes
• Reducing hit time
6. Avoiding Address Translation during Indexing of the
Cache
#1. Larger Block size
❖Advantage
+ Reduce compulsory misses
(by better utilizing spatial locality)
❖Disadvantage
- Increase miss penalty
‐ Could increase conflict and capacity misses
Larger block size → no. of blocks ↓
#1. Larger Block size
• Optimal block size depends on the latency and
bandwidth of lower‐level memory
- High bandwidth and high latency encourage large
blocks
- Low bandwidth and low latency encourage smaller
block sizes
Problem
Assume the memory subsystem takes 80 clock cycles of
overhead and then delivers 16 bytes every 2 clock cycles.
Which block size has the smallest AMAT for each cache
size given below? Assume the hit time is 1 clock cycle.
Solution
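The block-size/miss-rate table referenced above is not reproduced here, so only the method is sketched. Under the stated assumptions, delivering a B-byte block costs 80 cycles of overhead plus 2 cycles per 16 bytes:
Miss Penalty(B) = 80 + 2 * (B/16) = 80 + B/8 clock cycles
AMAT(B, cache size) = 1 + Miss Rate(B, cache size) * (80 + B/8)
For example, a 64-byte block has a miss penalty of 80 + 8 = 88 cycles. Larger blocks lower the miss rate up to a point but raise this penalty; the block size with the smallest AMAT is found by evaluating the formula for each entry of the miss-rate table.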
#2. Larger Cache size
Cache size↑ → miss rate↓; hit time↑
❖Advantage
+ A larger cache reduces capacity misses
❖Disadvantage
‐ Longer hit time
‐ Higher power and cost
#3. Higher Associativity
Associativity ↑ → miss rate↓; hit time↑
2:1 rule of thumb:
A direct-mapped cache of size N has about the same miss rate
as a two-way set-associative cache of size N/2
❖Advantage
+ Reduce conflict misses
❖Disadvantage
‐ Increase hit time
‐ Increased power consumption
AMAT vs Associativity
#4. Multilevel caches
✓ CPUs keep getting faster
✓ Caches therefore need to be BIGGER and FASTER
Solution:
• Fast 1st-level cache (matching the clock cycle
time of the fast processor)
• Large 2nd-level cache (to catch misses from the
1st-level cache)
#4. Multilevel caches
Q1. How do we analyze performance?
AMAT = Hit Time L1 + Miss Rate L1 * Miss Penalty L1
Miss Penalty L1 = Hit Time L2 + Miss Rate L2 * Miss Penalty L2
#4. Multilevel caches
Local miss rate: misses in this cache / accesses that reach this cache
Global miss rate: misses in this cache / total memory accesses from the CPU
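The two measures are related: the global miss rate of L2 equals Miss Rate L1 * (local) Miss Rate L2. With the figures used in the problem below, that is 4% * 50% = 2%.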
#4. Multilevel caches
Alternative measure: average memory stalls per instruction
= Misses per instruction L1 * Hit Time L2 + Misses per instruction L2 * Miss Penalty L2
#4. Multilevel caches
Q2. How do we place blocks in multilevel caches?
• Multilevel inclusion: every block in L1 is also present in L2
• Multilevel exclusion: a block resides in at most one of the levels
❖Advantage
+ Reduce miss penalty
❖Disadvantage
‐ High complexity
Problem
Suppose that in 1000 memory references there are 40 misses in the
first-level cache and 20 misses in the second-level cache. The L1 hit
time is 1 clock cycle, the L2 hit time is 10 clock cycles, the L2 miss
penalty is 200 clock cycles, and there are 1.5 memory references per
instruction (so 60 L1 misses and 30 L2 misses per 1000 instructions).
What are the AMAT and the average memory stall cycles per instruction?
Solution
1. AMAT = Hit Time L1 + Miss Rate L1 * (Hit Time L2 + Miss Rate L2 * Miss Penalty L2)
   = 1 + 4% * (10 + 50% * 200) = 5.4 clock cycles
2. Average memory stalls per instruction
   = Misses per instruction L1 * Hit Time L2 + Misses per instruction L2 * Miss Penalty L2
   = (60/1000) * 10 + (30/1000) * 200 = 6.6 clock cycles
#5. Giving Reads Priority over Writes
SW R3, 512(R0) ;cache index 0
LW R1, 1024(R0) ;cache index 0
LW R2, 512(R0) ;cache index 0
• Assume a direct-mapped, write-through cache in which addresses 512
and 1024 map to the same block, and a write buffer that is not checked
on a read miss.
• Will R2 always equal R3? Not necessarily: if the read of address 512
is serviced from memory before the buffered write of R3 reaches memory,
R2 receives the stale value.
#5. Giving Reads Priority over Writes
• Write Through
✓ Wait until the write buffer is empty
before servicing the read miss, or
✓ Check the contents of the write buffer
on a read miss; if there is no conflict,
let the read miss continue (sketched after this slide).
• Write Back
✓ Read miss replacing a dirty block
– Normal: write the dirty block to
memory, and then do the read
– Instead, copy the dirty block to a
write buffer, then do the read,
and then perform the write
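A minimal C sketch of the write-buffer check described above, assuming a small buffer of block-sized entries; the structure and function names are illustrative, not a specific processor's design.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                       /* assumed write-buffer depth */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;                   /* block waiting to be written to memory */
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, scan the write buffer. If no pending write targets the
   missed block, the read miss can go to memory immediately; otherwise the
   simplest safe policy is to drain the buffer (or forward the buffered data). */
bool read_miss_may_proceed(uint64_t miss_block_addr)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid &&
            write_buffer[i].block_addr == miss_block_addr)
            return false;                  /* conflict: wait for the write first */
    return true;                           /* no conflict: continue the read miss */
}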
#6. Avoiding Address Translation during Indexing of the
Cache
• CPU → Virtual address
• Virtual address → TLB
• TLB → Physical address
#6. Avoiding Address Translation during Indexing of the
Cache
• Page Table Entry (PTE)
is checked in TLB
#6. Avoiding Address Translation during Indexing of the
Cache
• The PTE is checked in the TLB
• On a TLB miss, the PTE is loaded from the page table (PT)
#6. Avoiding Address Translation during Indexing of the
Cache
• Cache is indexed and
tagged by the physical
address (physically indexed, physically tagged)
• TLB is indexed by the
virtual address
• Slow: the TLB translation must complete
before the cache can even be indexed
#6. Avoiding Address Translation during Indexing of the
Cache
• Cache is indexed and
tagged by the virtual
address (virtually indexed, virtually tagged)
• TLB is indexed by the
virtual address
• Synonym (alias) problem: two different virtual
addresses can map to the same physical address,
so the same data may end up in two cache locations
#6. Avoiding Address Translation during Indexing of the
Cache
• Cache is indexed by the
virtual address but
tagged by the physical
address (virtually indexed, physically tagged)
• TLB is indexed by the
virtual address
• The cache index and the TLB lookup proceed in parallel;
the physical tag is compared once translation completes
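One standard constraint that follows from this scheme: the index bits must come entirely from the page offset, so
cache size <= page size * associativity
For example, with 4 KB pages an 8-way set-associative L1 can be virtually indexed and physically tagged up to 8 * 4 KB = 32 KB.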
Summary
Advanced Optimization Techniques
• Reducing hit time
1. Small and simple
caches
2. Way prediction
• Increasing cache
bandwidth
3. Pipelined caches
4. Multibanked caches
5. Non-blocking caches
• Reducing Miss Penalty
6. Critical word first
7. Merging write buffers
• Reducing Miss Rate
8. Compiler optimizations
• Reducing miss penalty
or miss rate via
parallelism
9. Hardware prefetching
10. Compiler prefetching
#1 Small and simple first level caches
Cache size ↑ → hit time ↑
L1 cache size and associativity (figure)
#2 Way Prediction
Q1. How can we combine the fast hit time of a direct-mapped cache with
the lower conflict misses of a 2-way set-associative cache?
• Way prediction: predict the
way in order to pre-set the multiplexer
- keep extra bits in the cache to
predict the "way" (the block
within the set) of the next cache
access (see the sketch below).
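A minimal C sketch of way prediction for a 2-way set-associative cache; the sizes, the 1-bit-per-set predictor, and all names are illustrative assumptions, not a particular processor's design.

#include <stdbool.h>
#include <stdint.h>

#define SETS 64                            /* assumed: 64 sets, 2 ways, 64 B blocks */
#define WAYS 2

struct line { bool valid; uint64_t tag; };

static struct line cache[SETS][WAYS];
static uint8_t     predicted_way[SETS];    /* one prediction bit per set */

/* Returns 0 for a fast hit in the predicted way, 1 for a slower hit in the
   other way (predictor retrained), and -1 for a miss. */
int cache_access(uint64_t addr)
{
    uint32_t set = (uint32_t)((addr >> 6) % SETS);
    uint64_t tag = addr >> 12;             /* 6 offset bits + 6 index bits */
    uint8_t  p   = predicted_way[set];

    if (cache[set][p].valid && cache[set][p].tag == tag)
        return 0;                          /* predicted way: direct-mapped speed */

    uint8_t other = (uint8_t)(1 - p);
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        predicted_way[set] = other;        /* retrain the predictor */
        return 1;                          /* hit, but with extra latency */
    }
    return -1;                             /* miss: block must be fetched */
}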
#3 Pipelined caches
• Pipeline the cache access to maintain bandwidth
(at the cost of higher latency)
• Simple 3-stage pipeline
✓access the cache
✓tag check & hit/miss logic
✓data transfer back to the CPU
#4. Multibanked caches
• Organize the cache as independent banks that can be
accessed simultaneously
#5. Non-blocking caches
• A blocking cache stalls the pipeline on a cache
miss
• A non-blocking cache permits further cache
accesses while a miss is being handled
• This allows the CPU to continue executing
instructions while the miss is serviced
"hit under miss" → the processor allows only 1
outstanding miss
"miss under miss" → the processor allows multiple
outstanding misses
#5. Non-blocking caches
Challenges
–Maintaining order when multiple outstanding misses
may return out of order
–Handling a load or store to an address with an already
pending miss (the requests must be merged; see the sketch below)
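A minimal C sketch of how miss status holding registers (MSHRs) support "miss under miss" and request merging; the entry counts and names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define MSHRS       8                      /* assumed max outstanding misses */
#define MAX_TARGETS 4                      /* assumed loads/stores merged per miss */

struct mshr {
    bool     valid;
    uint64_t block_addr;                   /* block already requested from memory */
    int      num_targets;
    uint64_t target_addr[MAX_TARGETS];     /* accesses waiting for this block */
};

static struct mshr mshrs[MSHRS];

/* On a cache miss: merge with a pending miss to the same block if one exists,
   otherwise allocate a new MSHR. Returns false if the cache must stall
   (all MSHRs busy, or no room left to merge). */
bool handle_miss(uint64_t addr)
{
    uint64_t block = addr >> 6;            /* assumed 64-byte blocks */

    for (int i = 0; i < MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block) {
            if (mshrs[i].num_targets == MAX_TARGETS)
                return false;              /* cannot merge any more targets */
            mshrs[i].target_addr[mshrs[i].num_targets++] = addr;
            return true;                   /* merged: no new memory request */
        }

    for (int i = 0; i < MSHRS; i++)
        if (!mshrs[i].valid) {
            mshrs[i].valid          = true;
            mshrs[i].block_addr     = block;
            mshrs[i].num_targets    = 1;
            mshrs[i].target_addr[0] = addr;
            /* a real design would send the request to memory here */
            return true;
        }

    return false;                          /* all MSHRs in use: stall */
}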
#6. Critical Word First
• Critical word first
– Request missed word from memory first
– Send it to the processor as soon as it arrives
– Rest of cache line comes after “critical word”
#6. Early Restart
• Early restart
– Data returns from memory in order
– Send the missed word to the processor as soon as it arrives
– The processor restarts as soon as the needed word is returned
#7. Merging write buffers
• The write buffer lets the processor continue while
waiting for the write to complete
• If the buffer contains modified blocks, the addresses
can be checked to see whether the new data matches
an existing write-buffer entry
• If so, the new data is merged into that entry (see the sketch below)
#7. Merging write buffers
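A minimal C sketch of write merging, assuming a 4-entry buffer of 64-byte blocks tracked at 8-byte-word granularity; the names and sizes are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define MERGE_ENTRIES   4
#define WORDS_PER_BLOCK 8                  /* assumed 64-byte block = 8 x 8-byte words */

struct merge_entry {
    bool     valid;
    uint64_t block_addr;                   /* block-aligned address */
    uint64_t word[WORDS_PER_BLOCK];
    uint8_t  word_valid;                   /* bitmask of words holding new data */
};

static struct merge_entry wbuf[MERGE_ENTRIES];

/* Buffer one 8-byte store. If a store to the same block is already waiting,
   merge into that entry instead of consuming a new one. */
bool buffer_write(uint64_t addr, uint64_t value)
{
    uint64_t block = addr & ~63ULL;
    int      w     = (int)((addr >> 3) & (WORDS_PER_BLOCK - 1));

    for (int i = 0; i < MERGE_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].block_addr == block) {
            wbuf[i].word[w]     = value;   /* merge into the existing entry */
            wbuf[i].word_valid |= (uint8_t)(1u << w);
            return true;
        }

    for (int i = 0; i < MERGE_ENTRIES; i++)
        if (!wbuf[i].valid) {              /* allocate a fresh entry */
            wbuf[i].valid      = true;
            wbuf[i].block_addr = block;
            wbuf[i].word[w]    = value;
            wbuf[i].word_valid = (uint8_t)(1u << w);
            return true;
        }

    return false;                          /* buffer full: the store must stall */
}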
#8.Compiler Optimizations
• Group accesses to the same data together to improve temporal locality
• Reorder data accesses to improve spatial locality
– Loop interchange: change the nesting of loops to access
data in the order it is stored in memory
– Loop fusion: combine loops that touch the same data so the accesses are grouped
– Blocking: improve temporal and spatial locality by
repeatedly accessing small "blocks" of data instead of walking
whole columns or rows
Loop Interchange Example
/* Before */
for (j = 0; j < 3; j = j+1)
    for (i = 0; i < 3; i = i+1)
        x[i][j] = 2 * x[i][j];
/* After */
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 3; j = j+1)
        x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every word.
Which type of locality does this improve? (Spatial locality: C stores
x[][] in row-major order, so the "After" loop touches consecutive words.)
Loop Fusion
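The fused-loop example on the original slide was an image; the following is an illustrative sketch in the style of the other examples, assuming two loop nests that traverse the same arrays.

/* Before: two separate loop nests touch a[][] and c[][] twice */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1.0/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After fusion: a[i][j] and c[i][j] are reused while still in the cache,
   improving temporal locality */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1.0/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }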
Blocking
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
• The two inner loops:
– Read all N×N elements of z[]
– Read N elements of one row of y[] repeatedly
– Write N elements of one row of x[]
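For contrast, a sketch of a blocked version of the same loop nest (the block factor B is an assumed tuning parameter chosen so a B×B sub-block fits in the cache; x[][] is assumed initialized to zero, since partial sums are now accumulated):

/* Operate on B x B sub-blocks so the touched pieces of y, z and x are
   reused from the cache before being evicted */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < jj+B && j < N; j = j+1) {
                r = 0;
                for (k = kk; k < kk+B && k < N; k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;     /* accumulate the partial sum */
            }

The working set per block step is roughly one B×B sub-block of z plus B-wide strips of y and x, instead of all of z, which is what reduces capacity misses.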
How Does Prefetching Work?
#9. Hardware Prefetching
• Prefetch on miss
- Prefetch block b+1 upon a miss on block b
• One Block Lookahead (OBL) scheme
- Initiate a prefetch for block b+1 when block b is
accessed (a sketch of the OBL logic follows)
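A minimal C model of the OBL logic, assuming 64-byte blocks; cache_contains() and issue_fill() are hypothetical helpers standing in for the cache lookup and the memory request, stubbed here so the sketch compiles.

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BITS 6                       /* assumed 64-byte blocks */

static bool cache_contains(uint64_t blk) { (void)blk; return false; }  /* stub */
static void issue_fill(uint64_t blk)     { (void)blk; }                /* stub */

/* One Block Lookahead: on an access to block b, prefetch block b+1.
   For the prefetch-on-miss variant, call this only when the access misses. */
void obl_prefetch(uint64_t addr)
{
    uint64_t b = addr >> BLOCK_BITS;       /* block containing the access */
    if (!cache_contains(b + 1))
        issue_fill(b + 1);                 /* bring in the next block early */
}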
#10. Compiler-Controlled Prefetch
• No prefetching
for (i = 0; i < N; i++) {
    sum += a[i]*b[i];
}
• Simple prefetching
for (i = 0; i < N; i++) {
    fetch(&a[i+1]);
    fetch(&b[i+1]);
    sum += a[i]*b[i];
}
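The fetch() calls above are pseudo-ops from the slide. One way to realize them with a real compiler intrinsic is GCC/Clang's __builtin_prefetch; the function wrapper, the element types, and the prefetch distance of 16 are assumptions for illustration, not from the slide.

/* Dot product with software prefetching (GCC/Clang).
   Second argument 0 = prefetch for read; third argument 1 = low temporal locality. */
double dot_prefetch(const double *a, const double *b, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        if (i + 16 < N) {                  /* assumed prefetch distance */
            __builtin_prefetch(&a[i + 16], 0, 1);
            __builtin_prefetch(&b[i + 16], 0, 1);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

In practice the prefetch distance is tuned so data arrives just before it is used: too small a distance hides no latency, too large a distance risks eviction before use.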
#10. Compiler-Controlled Prefetch
Summary
