Multicolumn Cache: aiming for both
high hit rate and low latency
Xiaodong Zhang
CSE 3421 additional notes for Memory Systems
CSE3421 Slide 2
Basic concepts of hardware cache
 Cache design is memory-address centric
 The default address length is 32 bits; each memory address is unique and points to a byte (byte addressable)
 Each memory address refers to the first byte of its content, whose default length is a word (4 bytes, or 32 bits)
 A cache is a small storage inclusively built into the CPU chip
 It contains copies of data from memory (and disks)
 A cache block can be multiple words long
 Besides data storage, each block needs a tag, a valid (V) bit, and other metadata
 For a given memory address, the CPU knows where it maps
 By direct, set-associative, or fully associative mapping
 How to handle a write hit, write miss, read hit, and read miss?
 How to resolve the conflict between high hit rates and low latency?
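As a concrete illustration of how a byte address maps into a cache, here is a minimal Python sketch of splitting a 32-bit address into tag, set index, and block offset. The cache geometry (16-byte blocks, 1024 sets, direct mapped) is an illustrative assumption, not taken from the slides.

```python
BLOCK_BYTES = 16    # assumed: 4-word blocks, 4 bytes per word
NUM_SETS    = 1024  # assumed: direct-mapped cache with 1024 blocks

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1  # 4 bits of byte offset
INDEX_BITS  = NUM_SETS.bit_length() - 1     # 10 bits of set index

def split_address(addr):
    """Decompose a 32-bit byte address into (tag, set index, byte offset)."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

For example, under these assumed parameters the address `0x00402ABC` splits into tag `0x100`, set index `683`, and byte offset `12`.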
Three steps for a cache miss:
 Sending a memory address (bus)
 Loading data from memory (no bus)
 Transferring data to CPU (bus)
Connections between CPU-Caches and DRAM memory
Trade-offs between High Hit Ratios and Low Latency
 A set-associative cache achieves high hit ratios: 30% higher than that of a direct-mapped cache.
 But it suffers higher access times because
 the multiplexing logic adds delay during way selection, and
 tag checking, selection, and data dispatching are sequential.
 A direct-mapped cache loads data and checks the tag in parallel, minimizing the access time.
 High power consumption is another concern for set-associative caches.
 Can we get both high hit ratios and low latency, at low power?
Best Case of Way Prediction: First Hit
Cost: way prediction only
[Diagram: the address (tag | set | offset) indexes the set; way prediction selects one of way0–way3, its tag is compared (=?), and a 4:1 mux delivers the data to the CPU.]
Way Prediction: Non-first Hit (in Set)
Cost: way prediction + selection in set
[Diagram: the predicted way's tag mismatches (≠), so the remaining ways in the set are compared; the matching way's data goes through the 4:1 mux to the CPU.]
Worst Case of Way-prediction: Miss
Cost: way prediction + selection in set + miss
[Diagram: the predicted way's tag mismatches (≠) and so do all the other ways in the set (≠); the request goes to lower-level memory, and the returned data is forwarded to the CPU.]
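The three way-prediction cases above can be sketched as one read routine. This is a simplified software model (the function name and data layout are assumptions, not the hardware datapath): a first hit probes only the predicted way, a non-first hit falls back to comparing the other ways, and a miss must go to lower-level memory.

```python
def predicted_read(data_ways, tags, predicted_way, tag):
    """Read from one 4-way set using a way prediction (simplified model)."""
    # Best case: the predicted way's tag matches -> direct-mapped cost.
    if tags[predicted_way] == tag:
        return data_ways[predicted_way], "first hit"
    # Non-first hit: pay the extra cost of checking the remaining ways.
    for way, t in enumerate(tags):
        if way != predicted_way and t == tag:
            return data_ways[way], "non-first hit"
    # Worst case: no way matches -> fetch from lower-level memory.
    return None, "miss"
```

A usage sketch: with `tags = [5, 9, 2, 7]` and a prediction of way 1, a read with tag 9 is a first hit, tag 2 is a non-first hit, and tag 3 is a miss.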
Making a low-latency, high hit-rate cache a reality
 A write to a set-associative cache:
1. Map to the cache set
2. Find an empty way (block) to write, or
3. Write over the least recently used (LRU) way (block)
 A read in a set-associative cache:
1. Map to the cache set
2. Compare the tags, and select the way that matches
3. Otherwise, serve the cache miss by going to memory
 Why is way prediction hard?
 We do not know in which way the cache block is located
 Can we enforce which way to use on both writes and reads?
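The write steps above can be sketched as a placement routine for one set; the `lru_order` bookkeeping list (least recently used first, most recently used last) is an assumed software stand-in for hardware LRU state.

```python
def place_block(tags, valid, lru_order, new_tag):
    """Choose a way in one set for an incoming block (write steps 2-3)."""
    for way, v in enumerate(valid):
        if not v:                 # an empty way exists: use it
            victim = way
            break
    else:
        victim = lru_order[0]     # otherwise evict the LRU way
    tags[victim] = new_tag
    valid[victim] = True
    if victim in lru_order:       # mark the chosen way most recently used
        lru_order.remove(victim)
    lru_order.append(victim)
    return victim
```

Note that the routine must scan or search the ways; this "finding" cost is exactly what the next slide identifies as the latency source.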
Basic Ideas
 Conventional set-associative cache:
 maps to a set of multiple ways,
 finds a way (LRU) to write,
 searches all ways for a read.
 Latency comes from "finding" or "searching" among the ways.
 Can we decisively and directly find the way?
 If so, the search would go straight to that way.
 How do we make this happen?
 Multi-column cache
 Multi-column cache
Multi-column Caches: Fast and High Hit Ratio
 Zhang et al., IEEE Micro, 1997.
 Objectives:
 Each first hit is equivalent to a direct-mapped access.
 Maximize the number of first hits.
 Minimize the latency of non-first-hits.
 Additional hardware should be simple and low cost.
Range of Set Associative Caches
 For a fixed-size cache, each doubling of the associativity:
 doubles the number of blocks per set (the number of ways),
 halves the number of sets, decreasing the set index by 1 bit,
 and increases the size of the tag by 1 bit.
 Note: the word offset selects among the multiple words in a block; word offset + byte offset together address within a long cache block.
[Diagram: address fields Tag | Set Index | Byte offset. The set index selects the set, the tag is used for the compare, and the offsets select within the cache block. Increasing associativity shrinks the set index and grows the tag, up to fully associative (only one set; the tag is all the bits except the block and byte offsets). Decreasing associativity grows the set index, down to direct mapped (only one way), which has the smallest tags and needs only a single comparator.]
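The doubling rule above can be checked numerically; the 64KB capacity, 32-byte blocks, and 32-bit addresses below are illustrative assumptions.

```python
CACHE_BYTES = 64 * 1024   # assumed total capacity
BLOCK_BYTES = 32          # assumed block size
ADDR_BITS   = 32          # assumed address width

def index_and_tag_bits(ways):
    """Return (set index bits, tag bits) for a given associativity."""
    sets = CACHE_BYTES // (BLOCK_BYTES * ways)
    offset_bits = BLOCK_BYTES.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return index_bits, ADDR_BITS - index_bits - offset_bits
```

Going from direct mapped (11 index bits, 16 tag bits) to 2-way (10, 17) to 4-way (9, 18) shows the one-bit shift from index to tag at each doubling.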
Multi-Column Caches: Major Location
 The low tag bits together with the "set bits" form a direct-mapped index, which identifies a unique location: the Major Location.
 A major-location mapping = direct mapping.
[Diagram: address = Tag | Set | Offset.]
Between a direct-mapped and a set-associative cache of the same size, a set of index bits moves into the tag and goes unused. Let's use those bits to direct-map to a block (way) within the set.
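Under the slides' scheme, the extra direct-map bits are the low bits of the set-associative tag; here is a minimal sketch of picking the major location, assuming a power-of-two number of ways.

```python
WAYS = 4  # assumed power-of-two associativity

def major_location(sa_tag):
    # The low log2(WAYS) bits of the set-associative tag are exactly the
    # bits the direct-mapped index would have used: they select the way.
    return sa_tag & (WAYS - 1)
```

With the deck's 4-bit example tags, tag 0001 direct-maps to way 1 and tag 1011 to way 3, matching the worked example on the operations slide.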
Comparisons of Three Caches
Implicit direct mapping in a set-associative cache
 A standard set-associative cache only maps to a set.
 A multicolumn cache uses the full direct-mapped index bits for both writes and reads in a set-associative cache:
 the set index remains the same;
 the bits that moved from the index into the tag are used to pick which way in the set;
 these bits are already available, for free.
 The major location holds a most recently used (MRU) block, either just loaded from memory or just accessed.
Multi-column Caches: Selected Locations
 Multiple blocks can be direct-mapped to the same major location, but only the MRU block occupies it.
 The non-MRU blocks are stored in other empty locations in the set: Selected Locations.
 If those "other locations" are in use as their own major locations, there is no space for selected ones.
 Swap:
 A block in a selected location is swapped into the major location when it becomes MRU.
 A block in the major location is swapped out to a selected location after a new block is loaded into it from memory.
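Both swap rules reduce to exchanging two ways within a set; a minimal sketch, assuming per-set tag and data arrays indexed by way:

```python
# Used in two situations:
#  - a hit at a selected location: swap it with the major location so the
#    major location again holds the MRU block;
#  - a miss: the new block takes the major location, and the old major
#    block is swapped out to a selected location.
def swap_blocks(tags, data, major, selected):
    """Exchange the blocks at a major location and a selected location."""
    tags[major], tags[selected] = tags[selected], tags[major]
    data[major], data[selected] = data[selected], data[major]
```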
Multi-Column: Indexing Selected Locations
 The selected locations associated with each major location are indexed for a fast search.
[Diagram: locations 0–3 of a set, with one bit vector per major location (way):
Bit Vector 3 (way 3): 0000
Bit Vector 2 (way 2): 0000
Bit Vector 1 (way 1): 0101
Bit Vector 0 (way 0): 0000
Major Location 1 has two selected locations, at 0 and 2.]
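The bit-vector index above can be modeled directly; `bit_vectors` below encodes the slide's example state (one vector per major location, with bit k set when location k holds a selected block for that major location).

```python
# One 4-bit vector per major location (way) of a 4-way set.
bit_vectors = [0b0000, 0b0101, 0b0000, 0b0000]  # the slide's example state

def selected_locations(major):
    """List the set locations holding selected blocks for a major location."""
    vec = bit_vectors[major]
    return [loc for loc in range(4) if (vec >> loc) & 1]
```

For major location 1, the lookup yields locations 0 and 2, matching the slide's example.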
Summary of Multi-column Caches
 The major location in a set is the direct-mapped location of the MRU block.
 A selected location holds a block that direct-maps to the same major location but is not MRU.
 A selected-location index is maintained for each major location.
 A "swap" ensures that the block in the major location is always the MRU one.
Multi-column Cache Operations
Worked example (4-bit tags, set index 10):
 Initial set contents: way 0 = 0000, way 1 = 1101, way 2 = 0111, way 3 = 1011.
 Reference 1, tag 0001: its major location (low tag bits 01) is way 1, which holds tag 1101, so the block is not at the major location! The selected-location index shows no selected location for it either, so the miss is served from memory and 0001 is placed at the major location, displacing 1101. The set becomes 0000 0001 0111 1011.
 Reference 2, tag 1011: its major location (low tag bits 11) is way 3, which holds tag 1011. First hit!
Performance Summary of Multi-column Caches
 The hit ratio to the major locations is about 90%.
 The overall hit ratio is higher than that of a direct-mapped cache, thanks to high associativity, while the average access time stays close to that of a direct-mapped cache.
 A first hit is equivalent to a direct-mapped access.
 Non-first hits are faster than in set-associative caches.
 It thus outperforms conventional set-associative caches.
Comparing First-hit Ratios between
Multicolumn and MRU Caches
[Chart: 64KB data cache hit rates on the program mgrid at 4-, 8-, and 16-way associativity, comparing the overall hit rate, the MRU first-hit rate, and the multicolumn first-hit rate.]
Note: What is an MRU cache?
An MRU cache adds a small bit map to a set-associative cache (1 bit per set for 2-way, 2 bits for 4-way, ...) that remembers the most recently used way and predicts it for the next access. This bit map serves as the way prediction.
Multi-column Technique is Critical for
Low Power Caches by targeting specific ways
[Chart: Intel processor trends from 1985 to 2003 on a log scale: clock frequency (MHz), transistor count (M), and power (W) for the 80386, 80486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4. Source: Intel.com]
The impact of multicolumn caches
 The concept of multicolumn caches has influenced R&D on set-associative cache products:
 tag-less caches
 K-way direct-mapped caches
 direct-to-data caches
 highly associative caches for low-power processors
 way-memoization caches
 way selection based on tag bits
 In the ARM Cortex-R chip, the cache mapping "directly accesses one of the ways in a set"