

  1. 1. CME212 Lecture 15 Caches and Memory CME212 – Introduction to Large-Scale Computing in Engineering High Performance Computing and Programming
  2. 2. Memory <ul><li>Writable? </li></ul><ul><ul><li>Read-Only (ROM) </li></ul></ul><ul><ul><li>Read-Write </li></ul></ul><ul><li>Accessing </li></ul><ul><ul><li>Random Access (RAM) </li></ul></ul><ul><ul><li>Sequential Access (Tapes) </li></ul></ul><ul><li>Lifetime </li></ul><ul><ul><li>Volatile (needs power) </li></ul></ul><ul><ul><li>Non-Volatile (can be powered off) </li></ul></ul>
  3. 3. Conventional RAM <ul><li>Dynamic RAM (DRAM) </li></ul><ul><ul><li>Works in refresh cycles </li></ul></ul><ul><ul><li>Few transistors means low cost </li></ul></ul><ul><li>Static RAM (SRAM) </li></ul><ul><ul><li>More transistors than DRAM </li></ul></ul><ul><ul><li>More expensive </li></ul></ul><ul><ul><li>No refresh means much faster </li></ul></ul>
  4. 4. Flash Memory <ul><li>Non-volatile memory </li></ul><ul><ul><li>Charged electrons in fields, quantum tunneling </li></ul></ul><ul><li>Cheap NAND Flash has only sequential access </li></ul><ul><li>Finite number of ”flashes” </li></ul><ul><li>Problems with writes </li></ul><ul><ul><li>Can only be written in blocks </li></ul></ul><ul><li>Used in cameras, MP3-players </li></ul>
  5. 5. Disk Operation (single-platter view) The disk surface spins at a fixed rotational rate around the spindle. By moving radially, the arm can position the read/write head over any track. The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
  6. 6. Disk Operation (multi-platter view) arm read/write heads move in unison from cylinder to cylinder spindle
  7. 7. CPU-Memory Gap (image from Sun Microsystems)
  8. 8. The CPU-Memory Gap <ul><li>Cheap memory must be built out of few transistors </li></ul><ul><li>The most common main memory type is called DRAM (dynamic RAM) which saves transistors by operating in refresh cycles </li></ul><ul><li>The other type, SRAM (static RAM) uses another, more expensive design without refreshing </li></ul><ul><li>The clock frequency of CPUs increases at a much higher rate than that of DRAM </li></ul><ul><li>Conclusion: CPU must wait for data to pass through the memory system </li></ul>
  9. 9. Implications for Pipelines <ul><li>Waiting for data stalls the pipeline </li></ul><ul><li>Common DRAM latency is about 150 cycles </li></ul><ul><li>UNACCEPTABLE! </li></ul><ul><li>We would need a lot of registers to keep this latency hidden </li></ul><ul><li>Solution: cache memories </li></ul><ul><li>A cache memory is a smaller, faster SRAM memory that acts as temporary storage to hide the DRAM latency </li></ul>
  10. 10. Webster Definition of “cache” <ul><li>cache 'kash n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT] </li></ul><ul><li>1a: a hiding place esp. for concealing and preserving provisions or implements </li></ul><ul><li>1b: a secure place of storage 2: something hidden or stored in a cache </li></ul>
  11. 11. Cache Memory (figure: CPU, cache, memory (DRAM)) <ul><li>Cache: small but fast, close to the CPU </li></ul><ul><li>Memory (DRAM): large and slow (cheap), far away from the CPU </li></ul>
  12. 12. Basics of Caches <ul><li>Caches hold copies of the memory </li></ul><ul><ul><li>Need to be synchronized with memory </li></ul></ul><ul><ul><li>This is handled transparently to the CPU </li></ul></ul><ul><li>Caches have a limited capacity </li></ul><ul><ul><li>Cannot fit the entire memory at one time </li></ul></ul><ul><li>Caches work because of the principle of locality </li></ul>
  13. 13. General Principles of Computer Programs <ul><li>Principle of locality: </li></ul><ul><ul><li>Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of its program. </li></ul></ul><ul><li>We can predict what instructions and data a program will use based on its history </li></ul><ul><li>Temporal locality , recently accessed items are likely to be accessed in the near future </li></ul><ul><li>Spatial locality , items whose addresses are near one another tend to be referenced close together in time. </li></ul>
  14. 14. Cache Knowledge Useful When... <ul><li>Designing a new computer </li></ul><ul><li>Writing an optimized program </li></ul><ul><ul><li>or compiler </li></ul></ul><ul><ul><li>or operating system … </li></ul></ul><ul><li>Implementing software caching </li></ul><ul><ul><li>Web caches </li></ul></ul><ul><ul><li>Proxies </li></ul></ul><ul><ul><li>File systems </li></ul></ul>
  15. 15. Cache Concepts <ul><li>Requests for data are sent to the memory subsystem </li></ul><ul><ul><li>They either hit or miss in a cache </li></ul></ul><ul><ul><li>On a miss we need to get a copy from memory </li></ul></ul><ul><li>Caches have finite capacity </li></ul><ul><ul><li>Data needs to be replaced </li></ul></ul><ul><ul><li>How do we find our victim? </li></ul></ul><ul><li>Caches need to be fast </li></ul><ul><ul><li>How do we verify if data is in the cache or not? </li></ul></ul>
  16. 16. Details of Caching <ul><li>Every piece of data is identified using an address </li></ul><ul><li>We can store the address in a “phone book” to find a piece of data </li></ul><ul><li>When the CPU sends out a request for data, we need a fast mechanism to find out if we have a hit or miss </li></ul>
  17. 17. Mapping Strategies <ul><li>In a direct mapped cache each piece of data has a given location </li></ul><ul><li>In a fully associative cache any piece of data can go anywhere (parallel search) </li></ul><ul><li>In a set associative cache any piece of data can go anywhere within a subset </li></ul><ul><ul><li>Data is directly mapped to sets </li></ul></ul><ul><ul><li>Each set is associative (must be searched) </li></ul></ul>
  18. 18. Set Associativity <ul><li>The address space is divided among the sets modulo the number of sets in the cache </li></ul><ul><li>Exact mapping given some bits of address </li></ul><ul><li>Example: </li></ul><ul><ul><li>4-way set associative, each set holds 256 bytes </li></ul></ul><ul><ul><li>Address space is 800 bytes (in hex), or 2048 bytes (decimal) </li></ul></ul><ul><ul><li>Bits 8 and 9 (counting from bit 0, i.e. the 9th and 10th bits) identify the set; the highest bit specifies a tag </li></ul></ul><ul><ul><li>Addresses that share a set can conflict: set0 holds 000-0FF and 400-4FF, set1 holds 100-1FF and 500-5FF, set2 holds 200-2FF and 600-6FF, set3 holds 300-3FF and 700-7FF </li></ul></ul>
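The set mapping in this example can be sketched in a few lines of C. This is only an illustration under the slide's numbers: the names `set_index`, `tag_of`, `OFFSET_BITS` and `SET_BITS` are hypothetical, and the sketch assumes an 8-bit offset within each 256-byte range, two set bits directly above it, and the remaining high bit as the tag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants matching the slide's example:
   256-byte ranges, 4 sets, addresses 0x000-0x7FF. */
enum { OFFSET_BITS = 8, SET_BITS = 2 };

/* Set index: the two address bits just above the 8-bit offset. */
unsigned set_index(uint32_t addr)
{
    assert(addr < 0x800u);              /* stay inside the example's address space */
    return (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1u);
}

/* Tag: whatever is left above the set bits (a single bit here). */
unsigned tag_of(uint32_t addr)
{
    assert(addr < 0x800u);
    return addr >> (OFFSET_BITS + SET_BITS);
}
```

For instance, `set_index(0x0A0)` and `set_index(0x4A0)` both select set 0, and only `tag_of` tells the two addresses apart, which is the potential conflict shown on the slide.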
  19. 19. Address Book Cache: Looking for Tommy’s Telephone Number. (Figure: an address book with one page per letter, ..., T, U, V, X, Y, Z, Å, Ä, Ö. The entry TOMMY 12345 sits on page T. The first letter is the indexing function, the name is the “address tag”, and the number is the “data”.) One entry per page => direct-mapped cache with 28 entries
  20. 20. Address Book Cache: Looking for Tommy’s Number. (Figure: the index T selects the page holding OMMY 12345; the stored tag is compared with the requested name TOMMY (EQ?).)
  21. 21. Address Book Cache: Looking for Tomas’ Number. (Figure: the index T selects the page holding OMMY 12345; the comparison with TOMAS fails (EQ?).) Miss! Look up Tomas’ number in the telephone directory
  22. 22. Address Book Cache: Looking for Tomas’ Number. Replace TOMMY’s data with TOMAS’ data. There is no other choice (direct mapped). (Figure: page T now holds OMAS 23457.)
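The address book walk-through can be mimicked with a tiny direct-mapped table. This is a sketch under simplifying assumptions, not the lecture's own code: the names `book_lookup`, `book_insert` and `struct entry` are invented for illustration, only the letters A to Z are handled (the slide's book has 28 pages), and an ASCII-style contiguous alphabet is assumed.

```c
#include <assert.h>
#include <string.h>

#define ENTRIES  26   /* assumption: A-Z only; the slide's book has 28 pages */
#define NAME_LEN 32

struct entry {
    int  valid;            /* 0 until something is stored on this page */
    char tag[NAME_LEN];    /* rest of the name after the first letter  */
    long number;           /* the cached "data"                        */
};

static struct entry book[ENTRIES];   /* zero-initialized: all pages empty */

/* Indexing function: the first letter picks the page. */
static int page_of(const char *name)
{
    assert(name[0] >= 'A' && name[0] <= 'Z');
    return name[0] - 'A';
}

/* Returns 1 on a hit and stores the number in *out; returns 0 on a miss. */
int book_lookup(const char *name, long *out)
{
    const struct entry *e = &book[page_of(name)];
    if (e->valid && strcmp(e->tag, name + 1) == 0) {
        *out = e->number;
        return 1;
    }
    return 0;
}

/* Direct mapped: the new entry overwrites whatever was on that page. */
void book_insert(const char *name, long number)
{
    struct entry *e = &book[page_of(name)];
    e->valid = 1;
    strncpy(e->tag, name + 1, NAME_LEN - 1);
    e->tag[NAME_LEN - 1] = '\0';
    e->number = number;
}
```

After `book_insert("TOMMY", 12345)`, a lookup of `"TOMAS"` misses because the tags differ, and inserting `"TOMAS"` then evicts `"TOMMY"` since both names index page T.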
  23. 23. Cache Blocks <ul><li>To speed up the lookup process data is allocated in cache blocks consisting of several consecutively stored words </li></ul><ul><li>When you access a word you will always allocate several neighboring words in the cache </li></ul><ul><li>Works well due to the principle of locality </li></ul>
  24. 24. Cache Blocks and Miss Ratios <ul><li>Consider a C array of 1024 doubles (double *array) </li></ul><ul><ul><li>A pointer to a start address of a contiguous region in memory </li></ul></ul><ul><ul><li>Block size is 32 bytes which equals 4 array elements </li></ul></ul><ul><ul><li>Loop through the array with an index increment of one (stride-1) </li></ul></ul><ul><li>Every 4th element (i = 0, i = 4, i = 8, ...) is a cache miss: 256 misses in total, a miss ratio of 25% </li></ul>
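The miss count on this slide can be reproduced with a small model. The sketch below is an illustration only: `count_misses` is a hypothetical helper, and it assumes the array starts on a block boundary, the cache is empty at the start, and a block stays cached until the loop has moved past it.

```c
#include <assert.h>
#include <stddef.h>

/* Count misses for one pass over n_elems elements of elem_bytes each,
   visiting every stride-th element. A miss is counted whenever the
   access lands in a different block than the previous access. */
size_t count_misses(size_t n_elems, size_t elem_bytes,
                    size_t block_bytes, size_t stride)
{
    assert(elem_bytes > 0 && block_bytes > 0 && stride > 0);
    size_t misses = 0;
    size_t last_block = (size_t)-1;          /* sentinel: no block loaded yet */
    for (size_t i = 0; i < n_elems; i += stride) {
        size_t block = (i * elem_bytes) / block_bytes;
        if (block != last_block) {           /* first touch of this block */
            misses++;
            last_block = block;
        }
    }
    return misses;
}
```

With 8-byte doubles, `count_misses(1024, 8, 32, 1)` gives 256 misses out of 1024 accesses (25%). With a stride of 4 the same 256 misses are spread over only 256 accesses, so every access misses.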
  25. 25. Consequences of Cache Blocks <ul><li>Works well because of principle of locality </li></ul><ul><ul><li>Codes with high degree of spatial locality reuse data within blocks </li></ul></ul><ul><li>We should aim for stride-1 access pattern </li></ul><ul><li>Structs should be packed and aligned to cache blocks </li></ul><ul><ul><li>Compiler can help </li></ul></ul><ul><ul><li>Fill out structs using dummy data </li></ul></ul>
  26. 26. Who to Replace? Picking a “victim” <ul><li>Least-recently used (LRU) </li></ul><ul><ul><li>Considered the “best” algorithm (which is not always true…) </li></ul></ul><ul><ul><li>Only practical up to ~4-way associative </li></ul></ul><ul><li>Pseudo-LRU </li></ul><ul><ul><li>Based on coarse time stamps. </li></ul></ul><ul><li>Random replacement </li></ul>
  27. 27. The Memory Hierarchy <ul><li>Extend the caching idea and create a hierarchy of caches </li></ul><ul><li>Arranged into levels </li></ul><ul><li>L1 – level 1 cache </li></ul><ul><li>L2 – level 2 cache </li></ul><ul><li>Caches are often of increasing size </li></ul><ul><li>Hide the latency of cheaper memory </li></ul>
  28. 28. Memory/Storage (figure): registers and caches (SRAM), main memory (DRAM), disk and virtual memory. Access times in 2000: 1ns, 1ns, 3ns, 10ns, 150ns, 5 000 000ns. Capacities: 1kB, 64kB, 4MB, 1GB, 1TB. Access times in 1982: 200ns, 200ns, 200ns, 10 000 000ns.
  29. 29. An Example Memory Hierarchy <ul><li>L0: registers. CPU registers hold words retrieved from L1 cache. </li></ul><ul><li>L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory. </li></ul><ul><li>L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory. </li></ul><ul><li>L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks. </li></ul><ul><li>L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers. </li></ul><ul><li>L5: remote secondary storage (distributed file systems, Web servers). </li></ul><ul><li>Toward L0: smaller, faster, and costlier (per byte) storage devices. Toward L5: larger, slower, and cheaper (per byte) storage devices. </li></ul>
  30. 30. Caching in a Memory Hierarchy <ul><li>Level k+1: the larger, slower, cheaper storage device is partitioned into blocks (numbered 0 to 15 in the figure). </li></ul><ul><li>Level k: the smaller, faster, more expensive device caches a subset of the blocks from level k+1 (blocks 8, 9, 14 and 3 in the figure). </li></ul><ul><li>Data is copied between levels in block-sized transfer units (blocks 4 and 10 in the figure). </li></ul>
  31. 31. General Caching Concepts <ul><li>Program needs object d, which is stored in some block b. </li></ul><ul><li>Cache hit </li></ul><ul><ul><li>Program finds b in the cache at level k. e.g., block 14. </li></ul></ul><ul><li>Cache miss </li></ul><ul><ul><li>b is not at level k, so level k cache must fetch it from level k+1. e.g., block 12. </li></ul></ul><ul><ul><li>If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”? </li></ul></ul><ul><ul><ul><li>Placement policy: where can the new block go? </li></ul></ul></ul><ul><ul><ul><li>Replacement policy: which block should be evicted? E.g., LRU </li></ul></ul></ul>
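  32. 32. Block Sizes in a Typical Memory Hierarchy <ul><li>Register: capacity 32 bits, block size 4 bytes, 1 line, 1 32-bit integer per block </li></ul><ul><li>L1 Cache: capacity 64kB, block size 32 bytes, 2048 lines, 8 32-bit integers per block </li></ul><ul><li>L2 Cache: capacity 2MB, block size 64 bytes, 32768 lines, 16 32-bit integers per block </li></ul>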
  33. 33. Address Translation <ul><li>Translation is expensive since we need to keep track of many pages on a multi-tasking multi-user system </li></ul><ul><ul><li>Need to search or index the page table that maintains this information </li></ul></ul><ul><li>Introduce the Translation Lookaside Buffer (TLB) to remember the most recent translations </li></ul><ul><ul><li>The TLB is a small on-chip cache </li></ul></ul><ul><ul><li>If we have an entry in the TLB the page is probably in physical memory </li></ul></ul><ul><ul><li>Translation is much quicker (faster access time) </li></ul></ul>
  34. 34. Page Sizes and TLB Reach <ul><li>Typical page sizes are 8kB or 4kB </li></ul><ul><li>TLBs typically hold 256 or 512 entries </li></ul><ul><li>The TLB reach is the amount of data that the pages mapped by the TLB can cover </li></ul><ul><ul><li>Multiply page size by number of entries </li></ul></ul>
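As a quick worked example of the rule on this slide, the following sketch computes the reach for the page sizes and entry counts mentioned above. The function name `tlb_reach` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* TLB reach in bytes: page size multiplied by the number of TLB entries. */
size_t tlb_reach(size_t page_bytes, size_t entries)
{
    assert(page_bytes > 0 && entries > 0);
    return page_bytes * entries;
}
```

With 4kB pages and 256 entries the reach is 1MB (`tlb_reach(4096, 256)`); with 8kB pages and 512 entries it is 4MB. A working set larger than the reach will cause TLB misses even if the data itself fits in main memory.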
  35. 35. General Caching Concepts <ul><li>Types of cache misses: </li></ul><ul><ul><li>Cold (compulsory) miss </li></ul></ul><ul><ul><ul><li>Cold misses occur because the cache is empty. </li></ul></ul></ul><ul><ul><li>Capacity miss </li></ul></ul><ul><ul><ul><li>Occurs when the set of active cache blocks (working set) is larger than the cache. </li></ul></ul></ul><ul><ul><li>Conflict miss </li></ul></ul><ul><ul><ul><li>Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. </li></ul></ul></ul><ul><ul><ul><li>E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time. </li></ul></ul></ul>
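The conflict-miss example can be checked with a small simulation. The sketch below assumes a direct-mapped level k cache with 4 block slots where block b is placed in slot b mod 4; that placement rule and the name `direct_mapped_misses` are assumptions made for illustration, not something the slide states.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4u   /* assumption: level k holds 4 blocks, block b goes to slot b % 4 */

/* Count misses for a sequence of block references, starting from an empty cache. */
size_t direct_mapped_misses(const int *refs, size_t n)
{
    int slot[SLOTS];
    for (size_t s = 0; s < SLOTS; s++)
        slot[s] = -1;                       /* -1 marks an empty slot */

    size_t misses = 0;
    for (size_t r = 0; r < n; r++) {
        assert(refs[r] >= 0);               /* block numbers are non-negative */
        size_t s = (size_t)refs[r] % SLOTS;
        if (slot[s] != refs[r]) {           /* cold or conflict miss */
            misses++;
            slot[s] = refs[r];              /* evict the previous occupant */
        }
    }
    return misses;
}
```

Under this model the sequence 0, 8, 0, 8, 0, 8 misses all six times because both blocks compete for slot 0, while 0, 1, 0, 1, 0, 1 only takes the two cold misses.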
  36. 36. Caches in Hierarchies <ul><li>To synchronize data in hierarchies, caches can be either: </li></ul><ul><li>Write-through </li></ul><ul><ul><li>Reflect change immediately </li></ul></ul><ul><ul><li>L1 is often write-through </li></ul></ul><ul><li>Write-back </li></ul><ul><ul><li>Synchronize data later, at a given signal (typically when a modified block is evicted) </li></ul></ul><ul><ul><li>Less traffic </li></ul></ul>
  37. 37. Cache Performance Metrics <ul><li>Miss Rate </li></ul><ul><ul><li>Fraction of memory references not found in cache (misses/references) </li></ul></ul><ul><ul><li>Typical numbers: </li></ul></ul><ul><ul><ul><li>3-10% for L1 </li></ul></ul></ul><ul><ul><ul><li>can be quite small (e.g., < 1%) for L2, depending on size, etc. </li></ul></ul></ul><ul><li>Hit Time </li></ul><ul><ul><li>Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache) </li></ul></ul><ul><ul><li>Typical numbers: </li></ul></ul><ul><ul><ul><li>1 clock cycle for L1 </li></ul></ul></ul><ul><ul><ul><li>3-8 clock cycles for L2 </li></ul></ul></ul><ul><li>Miss Penalty </li></ul><ul><ul><li>Additional time required because of a miss </li></ul></ul><ul><ul><ul><li>Typically 25-100 cycles for main memory </li></ul></ul></ul>
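These three metrics are commonly combined into an average memory access time: hit time plus miss rate times miss penalty. That formula is the standard textbook one and is not stated on the slide; the function name `avg_access_time` is hypothetical.

```c
#include <assert.h>

/* Average access time in cycles for a single cache level:
   every access pays the hit time, and a fraction miss_rate
   of the accesses additionally pays the miss penalty. */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

Using numbers from the ranges above, a 1-cycle hit time, a 5% miss rate and a 100-cycle penalty give about 6 cycles per access on average, so even a small miss rate dominates the cost.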
  38. 38. Caches and Performance <ul><li>Caches are extremely important for performance </li></ul><ul><ul><li>Level 1 latency is usually 1 or 2 cycles </li></ul></ul><ul><li>Caches only work well for programs with nice locality properties </li></ul><ul><li>Caching can be used in other areas as well, example: web-caching (proxies) </li></ul><ul><li>Modern CPUs have two or three levels of caches. </li></ul><ul><ul><li>Largest caches are tens of megabytes </li></ul></ul><ul><li>Most of the chip area is used for caches </li></ul>
  39. 39. Nested Multi-dim Arrays <ul><li>Dimensions are stacked consecutively using an index mapping </li></ul><ul><li>Consider a square two-dimensional array of size N by N </li></ul>
  40. 40. Row or Column-wise Order <ul><li>If you allocate a static multi-dimensional array in C the rows of your array will be stored consecutively </li></ul><ul><li>This is called row-wise ordering </li></ul><ul><li>Row-wise or row-major ordering means column index should vary fastest (i,j) </li></ul><ul><li>Column-wise or column-major ordering means that the row index should vary fastest </li></ul><ul><ul><li>Used in Fortran </li></ul></ul>
  41. 41. Row-Major Ordering <ul><li>(i,j) loop will give stride-1 access </li></ul><ul><li>(j,i) loop will give stride-N access </li></ul>Array(i,j) -> (i*N+j)
  42. 42. Column-Major Ordering <ul><li>(i,j) will give stride-N </li></ul><ul><li>(j,i) will give stride-1 </li></ul>Array(i,j) ->(i+j*N)
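The two index mappings can be written as small helper functions for an N by N array stored in one flat block. The names `row_major_index` and `col_major_index` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major: elements of a row are adjacent, so j should vary fastest. */
size_t row_major_index(size_t i, size_t j, size_t n)
{
    assert(i < n && j < n);
    return i * n + j;
}

/* Column-major: elements of a column are adjacent, so i should vary fastest. */
size_t col_major_index(size_t i, size_t j, size_t n)
{
    assert(i < n && j < n);
    return i + j * n;
}
```

For N = 4, element (1,2) lands at offset 6 in row-major order and at offset 9 in column-major order. Stepping j by one moves the row-major offset by 1 (stride-1) but the column-major offset by N (stride-N).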
  43. 43. Dynamically Allocated Arrays <ul><li>If you use a nested array you can choose row-major or column-major using your indexing function (i+N*j) or (i*N+j) </li></ul><ul><li>For multi-level arrays there is no guarantee that the rows (the second indirection) will be stored consecutively </li></ul><ul><li>You can still achieve this using some pointer arithmetic (page 92 in Oliveira) </li></ul>
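One common way to get consecutive rows with a multi-level array is to allocate all elements in one block and point each row into it. The sketch below shows that general technique; it is not necessarily the version given in the cited book, and `alloc_matrix` and `free_matrix` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an m-by-n matrix of doubles as row pointers into ONE
   contiguous, zero-initialized block, so a[i][j] works and the rows
   are stored consecutively. Returns NULL on allocation failure. */
double **alloc_matrix(size_t m, size_t n)
{
    assert(m > 0 && n > 0);
    double **rows = malloc(m * sizeof *rows);
    double *data = calloc(m * n, sizeof *data);
    if (rows == NULL || data == NULL) {
        free(rows);
        free(data);
        return NULL;
    }
    for (size_t i = 0; i < m; i++)
        rows[i] = data + i * n;        /* pointer arithmetic into the block */
    return rows;
}

/* Release a matrix created by alloc_matrix. */
void free_matrix(double **rows)
{
    if (rows != NULL) {
        free(rows[0]);                 /* the contiguous data block */
        free(rows);
    }
}
```

With this layout `a[1]` is exactly `a[0] + n`, so an (i,j) loop walks memory with stride 1 across row boundaries as well.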
  44. 44. Data caches, example <ul><li>double x[m][n]; </li></ul><ul><li>register double sum = 0.0; </li></ul><ul><li>for( i = 0; i < m; i++ ){ </li></ul><ul><li>for( j = 0; j < n; j++) { </li></ul><ul><li>sum = sum + x[i][j]; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Assumptions: </li></ul><ul><li>Only one data cache </li></ul><ul><li>A cache block contains 4 double elements </li></ul><ul><li>The i,j,sum variables stay in registers </li></ul>
  45. 45. Storage visualization, (i,j)-loop for( i = 0; i < m; i++ ) { for( j = 0; j < n; j++) { sum = sum + x[i][j]; } } (Figure: the m-by-n array with row index i and column index j, traversed along rows; MISS marks the first access to each cache block.)
  46. 46. Storage visualization, (j,i)-loop for( j = 0; j < n; j++ ) { for( i = 0; i < m; i++) { sum = sum + x[i][j]; } } (Figure: the same m-by-n array traversed along columns; consecutive accesses are each marked MISS.)
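The two loop orders can be put side by side for experimentation. This sketch uses a flat array indexed as `x[i * n + j]`, which is the same layout as the slide's `double x[m][n]`; the function names `sum_ij` and `sum_ji` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* x points to m*n doubles stored row by row, as a C array x[m][n] is. */
double sum_ij(size_t m, size_t n, const double *x)
{
    assert(x != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            sum += x[i * n + j];       /* inner loop is stride-1 */
    return sum;
}

double sum_ji(size_t m, size_t n, const double *x)
{
    assert(x != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            sum += x[i * n + j];       /* inner loop is stride-n */
    return sum;
}
```

Both functions visit the same elements; only the order differs, so floating-point rounding may make the results differ slightly for general data. Timing them on an array much larger than the cache is a simple way to observe the effect of the access pattern, though the size of the gap depends on the machine.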
  47. 47. Cache Thrashing <ul><li>The start addresses of x and y might map to the same set </li></ul><ul><li>Accesses to y will conflict with x </li></ul><ul><ul><li>No data will be mapped to the other sets </li></ul></ul><ul><ul><li>Only one set will be used (small part of the cache) </li></ul></ul><ul><ul><li>Index bits are the same for x and y </li></ul></ul><ul><li>Solution: array padding </li></ul><ul><ul><li>Make one array larger </li></ul></ul><ul><ul><li>Distance between arrays will not be a power of 2 </li></ul></ul><ul><ul><li>Same thing can happen in set associative caches </li></ul></ul>float dotprod(float x[256], float y[256]) { float sum = 0.0; int i; for( i=0; i < 256; i++ ) sum += x[i] * y[i]; return sum; } Beware of array sizes that are powers of two!
  48. 48. Array Padding <ul><li>Used to reduce thrashing </li></ul><ul><ul><li>Especially important for multi-dimensional arrays </li></ul></ul><ul><li>Allocate more space </li></ul><ul><ul><li>Which isn’t used in computations </li></ul></ul><ul><ul><li>Will shift subsequent arrays so that the distance between them is no longer a power of two </li></ul></ul><ul><li>Typical padding </li></ul><ul><ul><li>Use a small odd number of elements (often a prime) like 13, 21, 31 </li></ul></ul><ul><ul><li>Verify effect experimentally </li></ul></ul>
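The effect of padding on set mapping can be illustrated with a small model. Everything about the cache here is an assumption chosen so the numbers line up with the 256-element float arrays of the thrashing example: a direct-mapped cache with 32 sets of 32-byte blocks (1kB), 4-byte floats, and x starting on a block boundary. The names `set_of` and `colliding_elements` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical direct-mapped cache: 32 sets of 32-byte blocks (1kB). */
enum { BLOCK_BYTES = 32, NUM_SETS = 32 };

/* Set that a byte offset (measured from the start of x) maps to. */
size_t set_of(size_t byte_offset)
{
    return (byte_offset / BLOCK_BYTES) % NUM_SETS;
}

/* Number of indices i in [0, n) for which x[i] and y[i] map to the
   same set, when y starts y_offset bytes after the start of x. */
size_t colliding_elements(size_t n, size_t y_offset)
{
    assert(n > 0);
    size_t collisions = 0;
    for (size_t i = 0; i < n; i++) {
        size_t x_byte = i * sizeof(float);
        if (set_of(x_byte) == set_of(y_offset + x_byte))
            collisions++;
    }
    return collisions;
}
```

In this model, placing y directly after x (an offset of `256 * sizeof(float)` bytes, exactly the cache size) makes every pair x[i], y[i] collide, while padding x by 13 extra floats removes all collisions. Real caches differ in size and associativity, so the result should still be verified experimentally.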