Cache Optimizations
Vani Kandhasamy
AP, PSG Tech
Cache Performance Measure
AMAT = Hit Time + Miss Rate * Miss Penalty
Cache Performance Measure
• Hit rate: fraction of cache accesses that result in a hit
• Miss rate: 1 − hit rate
• Hit time: time to access the cache
• Miss penalty: time to fetch/replace a block from the lower
level plus the time taken to transfer it to the cache
–access time: time to access the lower level
–transfer time: time to transfer the block
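A quick worked example with assumed numbers (not from the slides): a hit time of 1 clock cycle, a miss rate of 5%, and a miss penalty of 100 clock cycles give
AMAT = 1 + 0.05 * 100 = 6 clock cycles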
Improving Cache Performance
AMAT = Hit Time + Miss Rate * Miss Penalty
• Reducing miss rate
• Reducing cache miss Penalty
• Reducing hit time
Three Cs Model
❖ Compulsory:
• The very first access to a block cannot hit in the cache
(unless the block is prefetched)
• Also called cold-start or first-reference misses
❖ Capacity:
• Cache cannot contain all blocks accessed by the
program
❖ Conflict (collision):
• Multiple memory locations mapped to the same cache
location
6 Basic Cache Optimizations
• Reducing Miss Rate
1. Larger Block size (compulsory misses)
2. Larger Cache size (capacity misses)
3. Higher Associativity (conflict misses)
• Reducing Miss Penalty
4. Multilevel Caches
5. Giving Reads Priority over Writes
• Reducing hit time
6. Avoiding Address Translation during Indexing of the
Cache
#1. Larger Block size
❖Advantage
+ Reduce compulsory misses
(by better utilizing spatial locality)
❖Disadvantage
- Increase miss penalty
‐ Could increase conflict and capacity misses
Larger block size → no. of blocks ↓
#1. Larger Block size
• Optimal block size depends on the latency and
bandwidth of lower‐level memory
- High bandwidth and high latency encourage large
blocks
- Low bandwidth and low latency encourage smaller
block sizes
Problem
Assume the memory subsystem takes 80 clock cycles of
overhead and then delivers 16 bytes every 2 clock cycles.
Which block size has the smallest AMAT for each cache
size given below? Assume the hit time is 1 clock cycle.
Solution
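The block-size/miss-rate table referenced above is not reproduced here, so only the method is sketched. Under the stated assumptions, delivering a B-byte block costs 80 cycles of overhead plus 2 cycles per 16 bytes:
Miss Penalty(B) = 80 + 2 * (B/16) = 80 + B/8 clock cycles
AMAT(B, cache size) = 1 + Miss Rate(B, cache size) * (80 + B/8)
For example, a 64-byte block has a miss penalty of 80 + 8 = 88 cycles. Larger blocks lower the miss rate up to a point but raise this penalty; the block size with the smallest AMAT is found by evaluating the formula for each entry of the miss-rate table.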
#2. Larger Cache size
Cache size↑ → miss rate↓; hit time↑
❖Advantage
+ A larger cache reduces capacity misses
❖Disadvantage
‐ Longer hit time
‐ Higher power and cost
#3. Higher Associativity
Associativity ↑ → miss rate↓; hit time↑
2:1 rule of thumb:
A direct-mapped cache of size N has about the same miss rate
as a two-way set-associative cache of size N/2
❖Advantage
+ Reduce conflict misses
❖Disadvantage
‐ Increase hit time
‐ Increased power consumption
AMAT vs Associativity
#4. Multilevel caches
✓ CPUs keep getting faster
✓ Caches therefore need to be BIGGER and FASTER
Solution:
• Fast 1st-level cache (matching the clock cycle
time of the fast processor)
• Large 2nd-level cache (to catch misses from the
1st-level cache)
#4. Multilevel caches
Q1. How do we analyze performance?
AMAT = Hit Time L1 + Miss Rate L1 * Miss Penalty L1
Miss Penalty L1 = Hit Time L2 + Miss Rate L2 * Miss Penalty L2
#4. Multilevel caches
Local miss rate: misses in this cache / accesses that reach this cache
Global miss rate: misses in this cache / total memory accesses from the CPU
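The two measures are related: the global miss rate of L2 equals Miss Rate L1 * (local) Miss Rate L2. With the figures used in the problem below, that is 4% * 50% = 2%.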
#4. Multilevel caches
Alternative measure: average memory stalls per instruction
= Misses per instruction L1 * Hit Time L2 + Misses per instruction L2 * Miss Penalty L2
#4. Multilevel caches
Q2. How do we place blocks in multilevel caches?
• Multilevel inclusion: every block in L1 is also present in L2
• Multilevel exclusion: a block resides in at most one of the levels
❖Advantage
+ Reduce miss penalty
❖Disadvantage
‐ High complexity
Problem
Suppose that in 1000 memory references there are 40 misses in the
first-level cache and 20 misses in the second-level cache. The L1 hit
time is 1 clock cycle, the L2 hit time is 10 clock cycles, the L2 miss
penalty is 200 clock cycles, and there are 1.5 memory references per
instruction (so 60 L1 misses and 30 L2 misses per 1000 instructions).
What are the AMAT and the average memory stall cycles per instruction?
Solution
1. AMAT = Hit Time L1 + Miss Rate L1 * (Hit Time L2 + Miss Rate L2 * Miss Penalty L2)
   = 1 + 4% * (10 + 50% * 200) = 5.4 clock cycles
2. Average memory stalls per instruction
   = Misses per instruction L1 * Hit Time L2 + Misses per instruction L2 * Miss Penalty L2
   = (60/1000) * 10 + (30/1000) * 200 = 6.6 clock cycles
#5. Giving Reads Priority over Writes
SW R3, 512(R0) ;cache index 0
LW R1, 1024(R0) ;cache index 0
LW R2, 512(R0) ;cache index 0
• Assume a direct-mapped, write-through cache in which addresses 512
and 1024 map to the same block, and a write buffer that is not checked
on a read miss.
• Will R2 always equal R3? Not necessarily: if the read of address 512
is serviced from memory before the buffered write of R3 reaches memory,
R2 receives the stale value.
#5. Giving Reads Priority over Writes
• Write Through
✓ Wait until the write buffer is empty
before servicing the read miss, or
✓ Check the contents of the write buffer
on a read miss; if there is no conflict,
let the read miss continue (sketched after this slide).
• Write Back
✓ Read miss replacing a dirty block
– Normal: write the dirty block to
memory, and then do the read
– Instead, copy the dirty block to a
write buffer, then do the read,
and then perform the write
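A minimal C sketch of the write-buffer check described above, assuming a small buffer of block-sized entries; the structure and function names are illustrative, not a specific processor's design.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                       /* assumed write-buffer depth */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;                   /* block waiting to be written to memory */
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, scan the write buffer. If no pending write targets the
   missed block, the read miss can go to memory immediately; otherwise the
   simplest safe policy is to drain the buffer (or forward the buffered data). */
bool read_miss_may_proceed(uint64_t miss_block_addr)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid &&
            write_buffer[i].block_addr == miss_block_addr)
            return false;                  /* conflict: wait for the write first */
    return true;                           /* no conflict: continue the read miss */
}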
#6. Avoiding Address Translation during Indexing of the
Cache
• CPU → Virtual address
• Virtual address → TLB
• TLB → Physical address
#6. Avoiding Address Translation during Indexing of the
Cache
• Page Table Entry (PTE)
is checked in TLB
#6. Avoiding Address Translation during Indexing of the
Cache
• The PTE is checked in the TLB
• On a TLB miss, the PTE is loaded from the page table (PT)
#6. Avoiding Address Translation during Indexing of the
Cache
• Cache is indexed and
tagged by the physical
address (physically indexed, physically tagged)
• TLB is indexed by the
virtual address
• Slow: the TLB translation must complete
before the cache can even be indexed
#6. Avoiding Address Translation during Indexing of the
Cache
• Cache is indexed and
tagged by the virtual
address (virtually indexed, virtually tagged)
• TLB is indexed by the
virtual address
• Synonym (alias) problem: two different virtual
addresses can map to the same physical address,
so the same data may end up in two cache locations
#6. Avoiding Address Translation during Indexing of the
Cache
• Cache is indexed by the
virtual address but
tagged by the physical
address (virtually indexed, physically tagged)
• TLB is indexed by the
virtual address
• The cache index and the TLB lookup proceed in parallel;
the physical tag is compared once translation completes
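One standard constraint that follows from this scheme: the index bits must come entirely from the page offset, so
cache size <= page size * associativity
For example, with 4 KB pages an 8-way set-associative L1 can be virtually indexed and physically tagged up to 8 * 4 KB = 32 KB.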
Summary
Advanced Optimization Techniques
• Reducing hit time
1. Small and simple
caches
2. Way prediction
• Increasing cache
bandwidth
3. Pipelined caches
4. Multibanked caches
5. Non-blocking caches
• Reducing Miss Penalty
6. Critical word first
7. Merging write buffers
• Reducing Miss Rate
8. Compiler optimizations
• Reducing miss penalty
or miss rate via
parallelism
9. Hardware prefetching
10. Compiler prefetching
#1 Small and simple first level caches
Cache size ↑ → hit time ↑
L1 cache size and associativity (figure)
#2 Way Prediction
Q1. How can we combine the fast hit time of a direct-mapped cache with
the lower conflict misses of a 2-way set-associative cache?
• Way prediction: predict the
way in order to pre-set the multiplexer
- keep extra bits in the cache to
predict the "way" (the block
within the set) of the next cache
access (see the sketch below).
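A minimal C sketch of way prediction for a 2-way set-associative cache; the sizes, the 1-bit-per-set predictor, and all names are illustrative assumptions, not a particular processor's design.

#include <stdbool.h>
#include <stdint.h>

#define SETS 64                            /* assumed: 64 sets, 2 ways, 64 B blocks */
#define WAYS 2

struct line { bool valid; uint64_t tag; };

static struct line cache[SETS][WAYS];
static uint8_t     predicted_way[SETS];    /* one prediction bit per set */

/* Returns 0 for a fast hit in the predicted way, 1 for a slower hit in the
   other way (predictor retrained), and -1 for a miss. */
int cache_access(uint64_t addr)
{
    uint32_t set = (uint32_t)((addr >> 6) % SETS);
    uint64_t tag = addr >> 12;             /* 6 offset bits + 6 index bits */
    uint8_t  p   = predicted_way[set];

    if (cache[set][p].valid && cache[set][p].tag == tag)
        return 0;                          /* predicted way: direct-mapped speed */

    uint8_t other = (uint8_t)(1 - p);
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        predicted_way[set] = other;        /* retrain the predictor */
        return 1;                          /* hit, but with extra latency */
    }
    return -1;                             /* miss: block must be fetched */
}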
#3 Pipelined caches
• Pipeline the cache access to maintain bandwidth
(at the cost of higher latency)
• Simple 3-stage pipeline
✓access the cache
✓tag check & hit/miss logic
✓data transfer back to the CPU
#4. Multibanked caches
• Organize the cache as independent banks that can be
accessed simultaneously
#5. Non-blocking caches
• A blocking cache stalls the pipeline on a cache
miss
• A non-blocking cache permits further cache
accesses while a miss is being handled
• This allows the CPU to continue executing
instructions while the miss is serviced
"hit under miss" → the processor allows only 1
outstanding miss
"miss under miss" → the processor allows multiple
outstanding misses
#5. Non-blocking caches
Challenges
–Maintaining order when multiple outstanding misses
may return out of order
–Handling a load or store to an address with an already
pending miss (the requests must be merged; see the sketch below)
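A minimal C sketch of how miss status holding registers (MSHRs) support "miss under miss" and request merging; the entry counts and names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define MSHRS       8                      /* assumed max outstanding misses */
#define MAX_TARGETS 4                      /* assumed loads/stores merged per miss */

struct mshr {
    bool     valid;
    uint64_t block_addr;                   /* block already requested from memory */
    int      num_targets;
    uint64_t target_addr[MAX_TARGETS];     /* accesses waiting for this block */
};

static struct mshr mshrs[MSHRS];

/* On a cache miss: merge with a pending miss to the same block if one exists,
   otherwise allocate a new MSHR. Returns false if the cache must stall
   (all MSHRs busy, or no room left to merge). */
bool handle_miss(uint64_t addr)
{
    uint64_t block = addr >> 6;            /* assumed 64-byte blocks */

    for (int i = 0; i < MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block) {
            if (mshrs[i].num_targets == MAX_TARGETS)
                return false;              /* cannot merge any more targets */
            mshrs[i].target_addr[mshrs[i].num_targets++] = addr;
            return true;                   /* merged: no new memory request */
        }

    for (int i = 0; i < MSHRS; i++)
        if (!mshrs[i].valid) {
            mshrs[i].valid          = true;
            mshrs[i].block_addr     = block;
            mshrs[i].num_targets    = 1;
            mshrs[i].target_addr[0] = addr;
            /* a real design would send the request to memory here */
            return true;
        }

    return false;                          /* all MSHRs in use: stall */
}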
#6. Critical Word First
• Critical word first
– Request missed word from memory first
– Send it to the processor as soon as it arrives
– Rest of cache line comes after “critical word”
#6. Early Restart
• Early restart
– Data returns from memory in order
– Send the missed word to the processor as soon as it arrives
– The processor restarts as soon as the needed word is returned
#7. Merging write buffers
• The write buffer lets the processor continue while
waiting for the write to complete
• If the buffer contains modified blocks, the addresses
can be checked to see whether the new data matches
an existing write-buffer entry
• If so, the new data is merged into that entry (see the sketch below)
#7. Merging write buffers
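A minimal C sketch of write merging, assuming a 4-entry buffer of 64-byte blocks tracked at 8-byte-word granularity; the names and sizes are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define MERGE_ENTRIES   4
#define WORDS_PER_BLOCK 8                  /* assumed 64-byte block = 8 x 8-byte words */

struct merge_entry {
    bool     valid;
    uint64_t block_addr;                   /* block-aligned address */
    uint64_t word[WORDS_PER_BLOCK];
    uint8_t  word_valid;                   /* bitmask of words holding new data */
};

static struct merge_entry wbuf[MERGE_ENTRIES];

/* Buffer one 8-byte store. If a store to the same block is already waiting,
   merge into that entry instead of consuming a new one. */
bool buffer_write(uint64_t addr, uint64_t value)
{
    uint64_t block = addr & ~63ULL;
    int      w     = (int)((addr >> 3) & (WORDS_PER_BLOCK - 1));

    for (int i = 0; i < MERGE_ENTRIES; i++)
        if (wbuf[i].valid && wbuf[i].block_addr == block) {
            wbuf[i].word[w]     = value;   /* merge into the existing entry */
            wbuf[i].word_valid |= (uint8_t)(1u << w);
            return true;
        }

    for (int i = 0; i < MERGE_ENTRIES; i++)
        if (!wbuf[i].valid) {              /* allocate a fresh entry */
            wbuf[i].valid      = true;
            wbuf[i].block_addr = block;
            wbuf[i].word[w]    = value;
            wbuf[i].word_valid = (uint8_t)(1u << w);
            return true;
        }

    return false;                          /* buffer full: the store must stall */
}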
#8.Compiler Optimizations
• Group accesses to the same data together to improve temporal locality
• Reorder data accesses to improve spatial locality
– Loop interchange: change the nesting of loops to access
data in the order it is stored in memory
– Loop fusion: combine loops that touch the same data so the accesses are grouped
– Blocking: improve temporal and spatial locality by
repeatedly accessing small "blocks" of data instead of walking
whole columns or rows
Loop Interchange Example
/* Before */
for (j = 0; j < 3; j = j+1)
    for (i = 0; i < 3; i = i+1)
        x[i][j] = 2 * x[i][j];
/* After */
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 3; j = j+1)
        x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every word.
Which type of locality does this improve? (Spatial locality: C stores
x[][] in row-major order, so the "After" loop touches consecutive words.)
Loop Fusion
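The fused-loop example on the original slide was an image; the following is an illustrative sketch in the style of the other examples, assuming two loop nests that traverse the same arrays.

/* Before: two separate loop nests touch a[][] and c[][] twice */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1.0/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After fusion: a[i][j] and c[i][j] are reused while still in the cache,
   improving temporal locality */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1.0/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }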
Blocking
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
• The two inner loops:
– Read all N×N elements of z[]
– Read N elements of one row of y[] repeatedly
– Write N elements of one row of x[]
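For contrast, a sketch of a blocked version of the same loop nest (the block factor B is an assumed tuning parameter chosen so a B×B sub-block fits in the cache; x[][] is assumed initialized to zero, since partial sums are now accumulated):

/* Operate on B x B sub-blocks so the touched pieces of y, z and x are
   reused from the cache before being evicted */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < jj+B && j < N; j = j+1) {
                r = 0;
                for (k = kk; k < kk+B && k < N; k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;     /* accumulate the partial sum */
            }

The working set per block step is roughly one B×B sub-block of z plus B-wide strips of y and x, instead of all of z, which is what reduces capacity misses.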
How Does Prefetching Work?
#9. Hardware Prefetching
• Prefetch on miss
- Prefetch block b+1 upon a miss on block b
• One Block Lookahead (OBL) scheme
- Initiate a prefetch for block b+1 when block b is
accessed (a sketch of the OBL logic follows)
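A minimal C model of the OBL logic, assuming 64-byte blocks; cache_contains() and issue_fill() are hypothetical helpers standing in for the cache lookup and the memory request, stubbed here so the sketch compiles.

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BITS 6                       /* assumed 64-byte blocks */

static bool cache_contains(uint64_t blk) { (void)blk; return false; }  /* stub */
static void issue_fill(uint64_t blk)     { (void)blk; }                /* stub */

/* One Block Lookahead: on an access to block b, prefetch block b+1.
   For the prefetch-on-miss variant, call this only when the access misses. */
void obl_prefetch(uint64_t addr)
{
    uint64_t b = addr >> BLOCK_BITS;       /* block containing the access */
    if (!cache_contains(b + 1))
        issue_fill(b + 1);                 /* bring in the next block early */
}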
#10. Compiler-Controlled Prefetch
• No prefetching
for (i = 0; i < N; i++) {
    sum += a[i]*b[i];
}
• Simple prefetching
for (i = 0; i < N; i++) {
    fetch(&a[i+1]);
    fetch(&b[i+1]);
    sum += a[i]*b[i];
}
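The fetch() calls above are pseudo-ops from the slide. One way to realize them with a real compiler intrinsic is GCC/Clang's __builtin_prefetch; the function wrapper, the element types, and the prefetch distance of 16 are assumptions for illustration, not from the slide.

/* Dot product with software prefetching (GCC/Clang).
   Second argument 0 = prefetch for read; third argument 1 = low temporal locality. */
double dot_prefetch(const double *a, const double *b, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        if (i + 16 < N) {                  /* assumed prefetch distance */
            __builtin_prefetch(&a[i + 16], 0, 1);
            __builtin_prefetch(&b[i + 16], 0, 1);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

In practice the prefetch distance is tuned so data arrives just before it is used: too small a distance hides no latency, too large a distance risks eviction before use.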
#10. Compiler-Controlled Prefetch
Summary
