Memory Hierarchy
• Computer memory is organized in a hierarchy.
• This is done to keep up with the speed of the processor and hence increase performance.
• Closest to the processor are the processor registers.
• Then comes the cache memory, followed by main memory.
SRAM and DRAM
• Both are random access memories and are volatile, i.e. a constant power supply is required to avoid data loss.
• DRAM: each cell is made up of a capacitor and a transistor. The transistor acts as a switch, and the data is held as charge on the capacitor. Requires periodic charge refreshing to retain data. Lower cost per bit. Used for large memories.
• SRAM: each cell is made up of 4 transistors, cross-connected in an arrangement that produces a stable logic state. Higher cost per bit. Used for small memories.
Principles of Locality
• Programs access only a small portion of their address space at any given instant, so two principles are exploited to increase performance:
◦ A) Temporal locality: locality in time, i.e. if an item is referenced, it will tend to be referenced again soon.
◦ B) Spatial locality: locality in space, i.e. if an item is referenced, its neighboring items will tend to be referenced soon.
Mapping Functions
• There are three main types of memory mapping functions:
◦ 1) Direct Mapped
◦ 2) Fully Associative
◦ 3) Set Associative
• For the coming explanations, let us assume 1 GB main memory, 128 KB cache memory and a cache line size of 32 B.
Direct Mapping
• Each memory block is mapped to a single cache line. For the purpose of cache access, each main memory address can be viewed as consisting of three fields:
  TAG (s - r bits) | LINE or SLOT (r bits) | OFFSET (w bits)
• No two blocks that map to the same line have the same Tag field.
• Check the contents of the cache by finding the line and checking its Tag.
For the given example, we have:
• 1 GB main memory = 2^30 bytes
• Cache size = 128 KB = 2^17 bytes
• Block size = 32 B = 2^5 bytes
• No. of cache lines = 2^17 / 2^5 = 2^12, thus 12 bits are required to select one of 2^12 lines.
• The offset covers 2^5 bytes, thus 5 bits are required to locate an individual byte.
• With a 32-bit address, Tag bits = 32 - 12 - 5 = 15 bits. (1 GB of memory alone needs only 30 address bits, which would leave a 13-bit tag.)
  TAG 15 | LINE 12 | OFFSET 5
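The field split above can be sketched in a few lines of Python. This is only an illustration under the example's assumptions (32-bit address, 2^12 lines, 2^5-byte blocks); the function name and the sample address are hypothetical, not part of the slides.

```python
ADDRESS_BITS = 32   # address width assumed by the example
LINE_BITS = 12      # 2^12 cache lines
OFFSET_BITS = 5     # 2^5-byte blocks

def split_direct_mapped(address):
    """Return (tag, line, offset) for a direct-mapped cache."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    line = (address >> OFFSET_BITS) & ((1 << LINE_BITS) - 1)
    tag = address >> (OFFSET_BITS + LINE_BITS)
    return tag, line, offset

# Whatever is left after the line and offset fields is the tag.
tag_bits = ADDRESS_BITS - LINE_BITS - OFFSET_BITS

tag, line, offset = split_direct_mapped(0x12345678)  # hypothetical address
print(tag_bits, hex(tag), hex(line), hex(offset))
```

The line field selects the cache line directly, so a lookup needs only one tag comparison.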
Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• No. of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s - r) bits
Mapping Function
• The j-th block of main memory maps to the i-th cache line, where i = j modulo m (m = no. of cache lines).
Pros and Cons
• Simple
• Inexpensive
• Fixed location for a given block
• If a program repeatedly accesses 2 blocks that map to the same line, cache misses (conflict misses) are very high.
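The conflict-miss drawback can be illustrated with a minimal sketch of a direct-mapped cache that tracks only which block each line holds. The function name and the two access traces are assumptions made for illustration.

```python
def direct_mapped_misses(block_numbers, num_lines):
    """Count misses for a sequence of block numbers in a direct-mapped cache."""
    lines = [None] * num_lines        # block currently held by each line
    misses = 0
    for block in block_numbers:
        index = block % num_lines     # mapping function: i = j modulo m
        if lines[index] != block:
            misses += 1
            lines[index] = block      # the only candidate line is overwritten
    return misses

NUM_LINES = 4096                      # 2^12 lines, as in the running example

# Blocks 0 and 4096 both map to line 0, so they keep evicting each other.
conflicting = [0, NUM_LINES] * 4
# Blocks 0 and 1 map to different lines and miss only on first use.
friendly = [0, 1] * 4

print(direct_mapped_misses(conflicting, NUM_LINES))
print(direct_mapped_misses(friendly, NUM_LINES))
```

In the first trace every access misses even though the cache is almost empty, which is exactly the situation the last bullet describes.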
Fully Associative Mapping
• A main memory block can load into any line of the cache.
• The memory address is interpreted as a tag and a word offset:
  TAG (s bits) | OFFSET (w bits)
• The tag uniquely identifies a block of memory.
• Every line's tag is examined for a match.
• Cache searching gets expensive and consumes more power because of the parallel comparators.
For the given example, we have:
• 1 GB main memory = 2^30 bytes
• Cache size = 128 KB = 2^17 bytes
• Block size = 32 B = 2^5 bytes
• The offset covers 2^5 bytes, thus 5 bits are required to locate an individual byte.
• With a 32-bit address, Tag bits = 32 - 5 = 27 bits.
  TAG 27 | OFFSET 5
Fully Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• No. of blocks in main memory = 2^(s+w) / 2^w = 2^s
• Number of lines in cache = total number of cache blocks
• Size of tag = s bits
Pros and Cons
• There is flexibility as to which block to replace when a new block is read into the cache.
• The complex circuitry required for parallel tag comparison is, however, a major disadvantage.
Set Associative Mapping
• The cache is divided into a number of sets.
• Each set contains a number of lines.
• A given block maps to any line in a given set, e.g. block B can be in any line of set i.
• With 2 lines per set, the mapping is 2-way associative: a given block can be in one of 2 lines in only one set.
  TAG (s - d bits) | SET (d bits) | OFFSET (w bits)
For the given example, we have:
• 1 GB main memory = 2^30 bytes
• Cache size = 128 KB = 2^17 bytes
• Block size = 32 B = 2^5 bytes
• Let it be a 2-way set associative cache.
• No. of sets = 2^17 / (2 * 2^5) = 2^11, thus 11 bits are required to select one of 2^11 sets, each containing 2 lines.
• The offset covers 2^5 bytes, thus 5 bits are required to locate an individual byte.
• With a 32-bit address, Tag bits = 32 - 11 - 5 = 16 bits.
  TAG 16 | SET 11 | OFFSET 5
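The same arithmetic generalizes to any associativity. The sketch below is a hypothetical helper, not part of the slides, and assumes power-of-two sizes and the 32-bit address used in the examples. Direct mapping is the 1-way case; fully associative is the case where one set holds every line.

```python
def field_widths(address_bits, cache_bytes, block_bytes, ways):
    """Return (tag_bits, set_bits, offset_bits) for a `ways`-way cache.

    Assumes all sizes are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1      # log2(block size)
    num_sets = cache_bytes // (block_bytes * ways)
    set_bits = num_sets.bit_length() - 1            # log2(number of sets)
    tag_bits = address_bits - set_bits - offset_bits
    return tag_bits, set_bits, offset_bits

CACHE = 128 * 1024   # 128 KB cache
BLOCK = 32           # 32 B line
LINES = CACHE // BLOCK

print(field_widths(32, CACHE, BLOCK, 1))       # direct mapped
print(field_widths(32, CACHE, BLOCK, 2))       # 2-way set associative
print(field_widths(32, CACHE, BLOCK, LINES))   # fully associative (one set)
```

Doubling the associativity halves the number of sets, so one bit moves from the set field to the tag.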
Set Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Block size = line size = 2^w words or bytes
• Number of blocks in main memory = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Number of lines in cache = kv = k * 2^d
• Size of tag = (s - d) bits
Mapping Function
• The j-th block of main memory maps to the i-th set, where i = j modulo v (v = no. of sets).
• Within the set, the block can be mapped to any cache line.
Pros and Cons
• After simulating the hit ratio for direct mapped and (2-, 4-, 8-way) set associative caches, we observe a significant difference in performance at least up to a cache size of 64 KB, set associative being the better one.
• Beyond that size, both mappings give approximately the same hit ratio, while the complexity of the cache still increases in proportion to the associativity.
N-way Set Associative Cache vs. Direct Mapped Cache
• N comparators vs. 1
• Extra mux delay for the data
• Data comes after hit/miss is determined
• In a direct mapped cache, the cache block is available before hit/miss is determined
• Number of misses: DM > SA > FA
• Access latency (the time to perform a read or write operation, i.e. the time from the instant an address is presented to memory to the instant the data have been stored or made available): DM < SA < FA
Types of Misses
Compulsory Misses
• When a program is started, the cache is completely empty, so the first access to a block will always be a miss, as the block has to be brought into the cache from memory at least once.
• Also called first reference misses.
• Can't be avoided easily.
Capacity Misses
• The cache cannot hold all the blocks needed during the execution of a program.
• These misses occur when blocks are discarded and later retrieved.
• They occur because the cache is limited in size.
• For a fully associative cache, this is the major cause of misses.
Conflict Misses
• Occur because multiple distinct memory locations map to the same cache location.
• Thus in the case of DM or SA, they occur because a block is discarded and later retrieved.
• In DM, this is a repeated phenomenon, as two blocks that map to the same cache line can be accessed alternately, thereby decreasing the hit ratio.
• This phenomenon is called
Coherence Misses
• Occur when other processors update memory, which in turn invalidates the data block held in another processor's cache.
Replacement Algorithms
• For a direct mapped cache, since each block maps to only one line, we have no choice but to replace that line itself.
• Hence there isn't any replacement policy for DM.
• For SA and FA, a few replacement policies:
◦ Optimal
◦ Random
◦ Arrival
◦ Frequency
◦ Recently Used
Optimal
• This is the ideal benchmarking replacement strategy.
• All other policies are compared to it.
• It is not implemented, but used just for comparison purposes.
Random
• The block to be replaced is picked randomly.
• Minimum hardware complexity: just a pseudo-random number generator is required.
• Access time is not affected by the replacement circuit.
• Not suitable for high performance systems.
Arrival - FIFO
• For an N-way set associative cache
• Implementation 1
◦ Use an N-bit register per cache line to store arrival time information
◦ On a cache miss, the registers of all cache lines in the set are compared to choose the victim cache line
• Implementation 2
◦ Maintain a FIFO queue: a register with (log2 N) bits per cache line
◦ On a cache miss, the cache line whose register value is 0 will be the victim
◦ Decrement all other registers in the set by 1 and set the victim's register to N - 1
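Implementation 2 can be sketched as follows. This is a minimal software model, not a hardware description; the class name is hypothetical, and the starting counter values 0..N-1 are an assumption so that empty lines are filled in order.

```python
class FifoSet:
    """One N-way set using the per-line counter scheme (Implementation 2)."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = [None] * ways
        # Counter value 0 marks the oldest line (assumed initial order 0..N-1).
        self.age = list(range(ways))

    def access(self, tag):
        """Return True on a hit; on a miss, replace the oldest line."""
        if tag in self.tags:
            return True                    # hits do not change FIFO order
        victim = self.age.index(0)         # line whose register holds 0
        for i in range(self.ways):
            if i != victim:
                self.age[i] -= 1           # every other line gets older
        self.age[victim] = self.ways - 1   # the new line is the youngest
        self.tags[victim] = tag
        return False

fifo = FifoSet(4)
results = [fifo.access(t) for t in "ABCDAE"]
print(results, fifo.tags)
```

Note that the hit on A does not protect it: A arrived first, so it is still the line replaced when E misses.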
FIFO: Advantages & Disadvantages
• Advantages
◦ Low hardware complexity
◦ Better cache hit performance than Random replacement
◦ The cache access time is not affected by the replacement strategy (not in the critical path)
• Disadvantages
◦ Cache hit performance is poor compared to LRU and frequency based replacement schemes
◦ Not suitable for high performance systems
◦ Replacement circuit complexity increases with increase
Frequency - Least Frequently Used
• Requires a register per cache line to save the number of references (frequency count).
• On a cache hit, increase the frequency count in the corresponding register by 1.
• On a cache miss, the victim is the cache line with the minimum frequency count in the set.
• Reset the register of the victim cache line to 0.
• LFU cannot differentiate between past
Least Frequently Used - Dynamic Aging (LFU-DA)
• When any frequency count register in the set reaches its maximum value, all the frequency count registers in that set are shifted one position right (divided by 2).
• The rest is the same as LFU.
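A minimal sketch of one set under LFU with dynamic aging, following the rules above. The class name and the 2-bit counter width are assumptions chosen to keep the example short; ties are broken by the lowest line index.

```python
class LfuDaSet:
    """One set with per-line frequency counters and dynamic aging."""

    def __init__(self, ways, counter_bits=2):
        self.tags = [None] * ways
        self.count = [0] * ways
        self.max_count = (1 << counter_bits) - 1   # saturation value

    def access(self, tag):
        """Return True on a hit; on a miss, replace the least-used line."""
        if tag in self.tags:
            i = self.tags.index(tag)
            self.count[i] += 1
            if self.count[i] == self.max_count:
                # Dynamic aging: shift every counter in the set right by one.
                self.count = [c >> 1 for c in self.count]
            return True
        victim = self.count.index(min(self.count))  # minimum frequency count
        self.tags[victim] = tag
        self.count[victim] = 0                      # reset the victim's register
        return False

lfu = LfuDaSet(2)
for t in ["A", "A", "A", "A", "B", "C"]:
    lfu.access(t)
print(lfu.tags, lfu.count)
```

In this trace A is referenced often, so the later miss on C replaces B rather than A.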
LFU: Advantages & Disadvantages
• Advantages
◦ For small and medium caches, LFU works better than FIFO and Random replacement
◦ Suitable for high performance systems whose memory access pattern follows frequency order
• Disadvantages
◦ The register must be updated on every cache access
◦ Affects the critical path
◦ The replacement circuit becomes more complicated when
Least Recently Used Policy
• Most widely used replacement strategy
• Replaces the least recently used cache line
• Implemented by two techniques:
◦ Square Matrix Implementation
◦ Counter Implementation
Square Matrix Implementation
• N^2 bits per set (D flip-flops) store the LRU information.
• The cache line corresponding to the row with all zeros is the victim cache line for replacement.
• On a cache hit, all the bits in the corresponding row are set to 1 and all the bits in the corresponding column are set to 0.
• On a cache miss, a priority encoder selects the cache line corresponding to the row with all zeros for replacement.
• Used when the associativity is low.
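The matrix update rule can be modeled in software as below. This is a sketch only: the class and method names are hypothetical, and it assumes that a line filled on a miss is marked as just used in the same way as a hit.

```python
class MatrixLruSet:
    """One N-way set tracking LRU order with an N x N bit matrix."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = [None] * ways
        self.bits = [[0] * ways for _ in range(ways)]

    def _touch(self, line):
        for col in range(self.ways):
            self.bits[line][col] = 1   # set the whole row to 1
        for row in range(self.ways):
            self.bits[row][line] = 0   # then clear the whole column

    def victim(self):
        """Return the first line whose row is all zeros (priority encoder)."""
        for line, row in enumerate(self.bits):
            if not any(row):
                return line

    def access(self, tag):
        """Return True on a hit; on a miss, replace the LRU line."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        line = self.victim()
        self.tags[line] = tag
        self._touch(line)              # assumption: a fill counts as a use
        return False

lru = MatrixLruSet(4)
for t in "ABCDA":
    lru.access(t)
print(lru.victim())   # index of the least recently used line
```

After the trace A, B, C, D, A the all-zero row belongs to B, since A was refreshed by its hit.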
Counter Implementation
• N registers of log2 N bits each for N-way set associativity, one register per line. Thus N log2 N bits are used.
• The cache line whose counter is 0 is the victim cache line for replacement.
• On a hit, every cache line whose counter is greater than that of the hit cache line is decremented by 1, and the counter of the hit cache line is set to N - 1.
• If miss, the cache whose count value
Look Policy
• Look Through: access the cache; if the data is not found, access the lower level.
• Look Aside: send the request to the cache and its lower level at the same time.
Write Policy
Need for a Write Policy:
• A block in the cache might have been updated, but the corresponding update in main memory might not have been done.
• Multiple CPUs have individual caches, so a write by one can invalidate the data in another processor's cache.
• I/O may be able to read and write directly to main memory.
Write Through
• In this technique, all write operations are made to main memory as well as to the cache, ensuring MM is always valid.
• Any other processor-cache module may monitor traffic to MM to maintain consistency.
DISADVANTAGE
• It generates memory traffic and may create a bottleneck.
• Bottleneck: a delay in data transmission due to limited bandwidth, so information is not relayed at the speed at which it is processed.
Pseudo Write Through
• Also called Write Buffer
• The processor writes data into the cache and the write buffer.
• The memory controller writes the contents of the buffer to memory.
• The buffer is a FIFO (typical number of entries: 4).
• After the write is complete, the buffer is flushed.
Write Back
• In this technique, updates are made only in the cache.
• When an update is made, a dirty bit (or use bit) associated with the line is set.
• When a block is replaced, it is written back to main memory if and only if the dirty bit is set.
• Thus it minimizes memory writes.
DISADVANTAGE
• Portions of MM are invalid, hence I/O should be allowed access only through the cache.
• This makes for complex circuitry and a potential bottleneck.
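The difference in memory traffic between the two policies can be sketched with a deliberately tiny model. The function name is hypothetical, and the cache is assumed to hold a single line so that writing a different block always forces a replacement.

```python
def memory_writes(written_blocks, policy):
    """Count main-memory writes caused by a sequence of CPU writes.

    Hypothetical model: the cache holds one line, so a write to a
    different block replaces the current one.
    """
    if policy == "write-through":
        return len(written_blocks)        # every write also goes to memory
    cached, dirty, count = None, False, 0
    for block in written_blocks:
        if block != cached:
            if dirty:
                count += 1                # dirty victim is written back
            cached = block
        dirty = True                      # update is made only in the cache
    return count + (1 if dirty else 0)    # final flush of the last dirty line

trace = [7, 7, 7, 7, 9, 9]                # block numbers written by the CPU
print(memory_writes(trace, "write-through"))
print(memory_writes(trace, "write-back"))
```

Repeated writes to the same block cost one memory write under write back but one per store under write through.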
Cache Coherency
• This is required only in the case of multiprocessors where each CPU has its own cache.
Why is it needed?
• Whatever the write policy, if data is modified in one cache, the copy in another cache becomes invalid if that cache holds the same data.
• Hence we need to maintain cache coherency to obtain correct results.
Approaches towards Cache Coherency
1) Bus watching with write through:
• The cache controller monitors writes to shared memory locations that also reside in its cache.
• If any such write is made, the controller invalidates the cache entry.
• This approach depends on the use of a write through policy.
2) Hardware Transparency:
• Additional hardware ensures that all updates to main memory via a cache are reflected in all caches.
3) Non-cacheable memory:
• Only a portion of main memory is shared by more than 1 processor, and this is designated as non-cacheable.
• Here, all accesses to shared memory are cache misses, as it is never copied to the cache.
Cache Optimization
Reducing the miss penalty
1. Multilevel caches
2. Critical word first
3. Priority to read miss over writes
4. Merging write buffers
5. Victim caches
Multilevel Cache
• The inclusion of an on-chip cache left open the question of whether an additional external cache is still desirable.
• The answer is yes! The reasons are:
◦ If there is no L2 cache and the processor requests a memory location not in the L1 cache, it must access the DRAM or ROM. Due to the relatively slow bus speed, performance degrades.
◦ If an L2 SRAM cache is included, the frequently missed information can be retrieved quickly. SRAM is also fast enough to match the bus speed, giving zero-wait-state transactions.
• The L2 cache does not use the system bus as the path for transfers between L2 and the processor, but a separate data path, to reduce the burden on the system bus.
• A series of simulations has shown that an L2 cache is most efficient when it is double the size of the L1 cache, as otherwise its contents will be similar to those of L1.
• Due to the continued shrinkage of processor components, many processors can accommodate the L2 cache on chip, giving rise to the opportunity to include an L3 cache.
• The only disadvantage of a multilevel cache is that it complicates the design,
Cache Performance
• Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
• Average memory stalls per instruction = Misses per instruction_L1 × Hit time_L2 + Misses per instruction_L2 × Miss penalty_L2
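A short worked example of the average memory access time formula for a two-level cache. The numbers below are hypothetical values picked for easy arithmetic, not measurements from the slides, and Miss rate L2 is taken as the local miss rate of L2.

```python
def amat(hit_time_l1, miss_rate_l1, hit_time_l2, miss_rate_l2, miss_penalty_l2):
    """Average memory access time for a two-level cache, in cycles."""
    l1_miss_penalty = hit_time_l2 + miss_rate_l2 * miss_penalty_l2
    return hit_time_l1 + miss_rate_l1 * l1_miss_penalty

# Assumed values: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 20% L2 local miss rate, 100-cycle penalty for going to main memory.
with_l2 = amat(1, 0.05, 10, 0.2, 100)

# Without L2, every L1 miss pays the full 100-cycle memory penalty.
without_l2 = 1 + 0.05 * 100

print(with_l2, without_l2)
```

With these assumed numbers an L1 miss costs 10 + 0.2 × 100 = 30 cycles on average instead of 100, which is the miss-penalty reduction a second level is meant to provide.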
Unified vs. Split Cache
• Earlier, the same cache was used for data as well as instructions, i.e. a unified cache.
• Now we have separate caches for data and instructions, i.e. a split cache.
• Thus, if the processor attempts to fetch an instruction from main memory, it first consults the instruction L1 cache, and similarly for data.
Advantages of Unified Cache
• It balances the load between data and instructions automatically.
• That is, if execution involves more instruction fetches, the cache will tend to fill up with instructions, and if execution involves more data fetches, the cache tends to fill up with data.
• Only one cache needs to be designed.
Advantages of Split Cache
• Useful for parallel instruction execution and pre-fetching of predicted future instructions.
• Eliminates contention for the cache between the instruction fetch/decode unit and the execution unit, thereby supporting pipelining.
• The processor will fetch instructions ahead of time and fill the buffer, or pipeline.
• E.g. superscalar machines: Pentium and PowerPC.
Critical Word First
• This policy involves sending the requested word first and then transferring the rest, thus getting the data to the processor in the 1st cycle.
• Assume that 1 block = 16 bytes and 1 cycle transfers 4 bytes. Thus at least 4 cycles are required to transfer the block.
• If the processor demands the 2nd byte, why should it wait for the entire block to be transferred? We can first send the word containing that byte and then the rest of the block with the remaining bytes.
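The transfer order for the example above (16-byte block, 4 bytes per cycle) can be sketched as follows. The function name is hypothetical, and the wrap-around order of the remaining words is an assumption; the slides only say the requested word goes first.

```python
def transfer_order(requested_byte, block_bytes=16, bytes_per_cycle=4):
    """Return the order in which the words of a block are sent.

    The word holding the requested byte goes first; the remaining words
    follow in wrap-around order (an assumed ordering).
    """
    words = block_bytes // bytes_per_cycle
    first = (requested_byte % block_bytes) // bytes_per_cycle
    return [(first + i) % words for i in range(words)]

print(transfer_order(9))   # byte 9 lives in word 2 of the block
print(transfer_order(1))   # the 2nd byte lives in word 0
```

In both cases the requested word arrives in the first of the 4 transfer cycles, so the processor can resume while the rest of the block is still arriving.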
Priority to Read Miss over Writes
• Write buffer:
◦ Using write buffers creates RAW (read-after-write) conflicts with reads on cache misses.
◦ Simply waiting for the write buffer to empty increases the read miss penalty by 50%.
◦ Instead, check the contents of the write buffer on a read miss; if there are no conflicts and the memory system is available, allow the read miss to continue. If there is a conflict, flush the buffer before the read.
• Write back:
◦ Case: a read miss replaces a dirty block.
◦ Normal: write the dirty block to memory, and then do the read.
◦ Instead: copy the dirty block to a write buffer, then do the read
Victim Cache
• How can the fast hit time of DM be combined with reduced conflict misses?
• Add a small fully associative buffer (cache) to hold data discarded from the cache: the victim cache.
• This small fully associative cache collects the spilled-out data.
• Blocks that are discarded because of a miss (victims) are stored in the victim cache, which is checked on a cache miss.
• If the block is found there, swap the data block between the victim cache and the main cache.
• Replacement always happens with the LRU block of the victim cache. The block that we want to transfer is made MRU.
• Then the block from the cache comes to the victim cache and is made MRU.
• The block which was transferred to the cache is now made LRU.
• If there is a miss in the victim cache as well, then MM is referenced.
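The interaction between a direct-mapped cache and its victim cache can be sketched as below. This is a simplified model with hypothetical names: on a victim hit the swapped-in block is simply removed from the victim cache rather than kept and marked LRU, and only block numbers are tracked, not data.

```python
from collections import OrderedDict

class VictimCachedDM:
    """Direct-mapped cache backed by a small fully associative victim cache."""

    def __init__(self, num_lines, victim_entries):
        self.lines = [None] * num_lines
        self.victims = OrderedDict()      # oldest (LRU) entry comes first
        self.victim_entries = victim_entries

    def access(self, block):
        """Return 'hit', 'victim-hit', or 'miss' for a block number."""
        index = block % len(self.lines)
        resident = self.lines[index]
        if resident == block:
            return "hit"
        if block in self.victims:
            del self.victims[block]       # swap: block returns to main cache
            result = "victim-hit"
        else:
            result = "miss"               # block is fetched from main memory
        self.lines[index] = block
        if resident is not None:
            self.victims[resident] = True             # evicted block enters as MRU
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)      # drop the LRU victim
        return result

cache = VictimCachedDM(num_lines=4, victim_entries=2)
print([cache.access(b) for b in [0, 4, 0, 4]])
```

Blocks 0 and 4 conflict on line 0, but after the first two misses they are served from the victim cache instead of main memory.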
Cache Optimization
Reducing hit time
1. Small and simple caches
2. Way prediction cache
3. Trace cache
4. Avoid address translation during indexing of the cache
Cache Optimization
Reducing miss rate
1) Changing cache configurations
2) Compiler optimization
Cache Optimization
Reducing miss penalty or miss rate via parallelism
1) Hardware prefetching
2) Compiler prefetching