Memory Hierarchy


    1. Memory Hierarchy
    2. Random-Access Memory
       • Static RAM (SRAM)
         - Each cell stores a bit with a six-transistor circuit.
         - Retains its value indefinitely, as long as it is kept powered.
         - Relatively insensitive to disturbances such as electrical noise.
         - Faster and more expensive than DRAM.
       • Dynamic RAM (DRAM)
         - Each cell stores a bit with a capacitor and a transistor.
         - Value must be refreshed every 10-100 ms.
         - Sensitive to disturbances.
         - Slower and cheaper than SRAM.

               Tran. per bit   Access time   Persistent?   Sensitive?   Cost   Applications
         SRAM  6               1x            Yes           No           100x   Cache memories
         DRAM  1               10x           No            Yes          1x     Main memories, frame buffers
    3. Conventional DRAM Organization
       • d x w DRAM:
         - d·w total bits organized as d supercells of size w bits
       [Figure: a 16 x 8 DRAM chip organized as a 4 x 4 array of supercells. The memory controller (to the CPU) drives a 2-bit addr bus and an 8-bit data bus; a selected row is copied into the internal row buffer, from which supercell (2,1) is read.]
    4. Reading DRAM Supercell (2,1)
       - Step 1(a): The row access strobe (RAS) selects row 2.
       - Step 1(b): Row 2 is copied from the DRAM array to the internal row buffer.
       [Figure: the memory controller drives RAS = 2 on the 2-bit addr lines of the 16 x 8 DRAM chip; row 2 lands in the internal row buffer.]
    5. Reading DRAM Supercell (2,1)
       - Step 2(a): The column access strobe (CAS) selects column 1.
       - Step 2(b): Supercell (2,1) is copied from the row buffer to the data lines, and eventually back to the CPU. (A toy model of this two-step read follows below.)
       [Figure: the memory controller drives CAS = 1; supercell (2,1) moves from the internal row buffer onto the 8-bit data lines toward the CPU.]
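    A minimal C sketch of the two-step RAS/CAS read described above, assuming the toy 16 x 8 chip from the figure; dram_t, ras, and cas are illustrative names, not a real memory-controller interface.

        #include <stdint.h>
        #include <stdio.h>

        #define ROWS 4
        #define COLS 4

        typedef struct {
            uint8_t cells[ROWS][COLS]; /* the DRAM array: 16 supercells of 8 bits */
            uint8_t row_buffer[COLS];  /* internal row buffer */
        } dram_t;

        /* Step 1: RAS selects a row and copies it into the internal row buffer. */
        static void ras(dram_t *d, int row)
        {
            for (int c = 0; c < COLS; c++)
                d->row_buffer[c] = d->cells[row][c];
        }

        /* Step 2: CAS selects a column; the supercell goes out on the data lines. */
        static uint8_t cas(dram_t *d, int col)
        {
            return d->row_buffer[col];
        }

        int main(void)
        {
            dram_t d = {0};
            d.cells[2][1] = 0xAB;        /* value stored in supercell (2,1), illustrative */

            ras(&d, 2);                  /* RAS = 2: row 2 -> row buffer    */
            uint8_t data = cas(&d, 1);   /* CAS = 1: supercell (2,1) -> CPU */
            printf("supercell (2,1) = 0x%02X\n", data);
            return 0;
        }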
    6. Memory Modules
       [Figure: a 64 MB memory module built from eight 8M x 8 DRAMs. To read the 64-bit doubleword at main memory address A, the memory controller sends (row i, col j) to every chip; DRAM 0 supplies bits 0-7, DRAM 1 bits 8-15, ..., DRAM 7 bits 56-63, and the controller assembles the eight supercells (i,j) into the doubleword returned to the CPU.]
    7. Enhanced DRAMs
       • All enhanced DRAMs are built around the conventional DRAM core.
         - Fast page mode DRAM (FPM DRAM)
           - Accesses the contents of a row with [RAS, CAS, CAS, CAS, CAS] instead of [(RAS,CAS), (RAS,CAS), (RAS,CAS), (RAS,CAS)].
         - Extended data out DRAM (EDO DRAM)
           - Enhanced FPM DRAM with more closely spaced CAS signals.
         - Synchronous DRAM (SDRAM)
           - Driven by the rising clock edge instead of asynchronous control signals.
         - Double data-rate synchronous DRAM (DDR SDRAM)
           - Enhancement of SDRAM that uses both clock edges as control signals.
         - Video RAM (VRAM)
           - Like FPM DRAM, but output is produced by shifting the row buffer.
           - Dual ported (allows concurrent reads and writes).
    8. Registers vs. Data Cache (1)
       • Registers
         - Explicitly managed by the compiler.
         - The compiler can use information available at compile time to preload data into registers and to purge data more effectively.
         - Outperform a data cache by nearly a factor of two in both speed and cost.
         - Not easy to allocate for objects requiring multiple storage units.
         - Aliasing problem: values that may be referenced through pointers cannot safely be kept in registers.
    9. Registers vs. Data Cache (2)
       • Data Caches
         - Based on the "locality of reference" of programs
           - Temporal vs. spatial
         - Take dynamic program behavior into account
         - Invisible to programmers
         - Architecture-independent
           - Some architectures make the cache visible to the ISA.
         - Coherency problem in multiprocessor systems
    10. • Vanilla SDRAM
          - FSB 100 MHz, 133 MHz (PC100, PC133)
          - Memory bandwidth = FSB x 1 transfer per clock cycle x 64 bits per transfer
            - FSB 100 MHz, 8 bytes/clock = 800 MB/s
        • DDR SDRAM
          - Double data rate
          - PC1600: vanilla SDRAM @ 100 MHz x 2
          - PC2100: vanilla SDRAM @ 133 MHz x 2
        • RDRAM
          - PC600, PC700, PC800
          - PC600: 600 MHz x 32 bits = 2.4 GB/s
        (A small bandwidth calculator is sketched below.)
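    A minimal C sketch of the bandwidth arithmetic above: peak bandwidth is bus clock x transfers per clock x bus width in bytes. The function name and parameters are illustrative.

        #include <stdio.h>

        /* Peak bandwidth in MB/s (1 MB = 10^6 bytes, as on the slide). */
        static double peak_mb_per_s(double clock_mhz, int transfers_per_clock,
                                    int bus_width_bits)
        {
            return clock_mhz * 1e6 * transfers_per_clock * (bus_width_bits / 8.0) / 1e6;
        }

        int main(void)
        {
            /* Vanilla SDRAM, PC100: 100 MHz x 1 transfer/clock x 64 bits */
            printf("PC100  SDRAM: %.0f MB/s\n", peak_mb_per_s(100, 1, 64));  /* 800   */
            /* DDR SDRAM, PC2100: 133 MHz x 2 transfers/clock x 64 bits */
            printf("PC2100 DDR:   %.0f MB/s\n", peak_mb_per_s(133, 2, 64));  /* ~2128 */
            /* RDRAM, PC600 as quoted on the slide: 600 MHz x 32 bits */
            printf("PC600  RDRAM: %.0f MB/s\n", peak_mb_per_s(600, 1, 32));  /* 2400  */
            return 0;
        }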
    11. Nonvolatile Memories
       • Nonvolatile memories retain their value even if powered off.
         - The generic name is read-only memory (ROM).
         - Misleading, because some ROMs can be both read and modified.
       • Types of ROMs
         - Programmable ROM (PROM)
         - Erasable programmable ROM (EPROM)
         - Electrically erasable PROM (EEPROM)
         - Flash memory
       • Firmware
         - Program stored in a ROM
           - Boot-time code, BIOS (Basic Input/Output System)
           - Graphics cards, disk controllers, etc.
    12. Flash Memory (© Samsung Electronics, Co.)
    13. Flash Memory Characteristics
       • Operations
         - Read
         - Write or Program: changes a bit from 1 to 0
         - Erase: changes bits from 0 back to 1
       • Unit
         - Page (sector): the management or program unit
         - Block: the erase unit
       [Figure: successive writes clear individual bits of a page from 1 to 0; an erase resets the entire block to all 1s. A sketch of these semantics follows below.]
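    A minimal C sketch of the program/erase asymmetry above, assuming illustrative page and block sizes: programming can only clear bits (1 to 0), so rewriting a page requires erasing its whole block first.

        #include <stdint.h>
        #include <string.h>

        #define PAGE_SIZE        512      /* program unit (bytes), illustrative  */
        #define PAGES_PER_BLOCK  32       /* erase unit = 32 pages, illustrative */

        typedef struct {
            uint8_t pages[PAGES_PER_BLOCK][PAGE_SIZE];
        } flash_block_t;

        /* Erase: every bit in the block goes back to 1. */
        static void flash_erase(flash_block_t *blk)
        {
            memset(blk, 0xFF, sizeof *blk);
        }

        /* Program: bits can only move from 1 to 0, so the new contents are
         * ANDed with what is already in the page; a previously programmed page
         * must be erased (together with its whole block) before being rewritten. */
        static void flash_program(flash_block_t *blk, int page, const uint8_t *data)
        {
            for (int i = 0; i < PAGE_SIZE; i++)
                blk->pages[page][i] &= data[i];
        }

        int main(void)
        {
            static flash_block_t blk;
            uint8_t data[PAGE_SIZE];

            memset(data, 0xA5, sizeof data);  /* illustrative payload */
            flash_erase(&blk);                /* whole block -> all 1s */
            flash_program(&blk, 0, data);     /* page 0: 0xFF & 0xA5 = 0xA5 */
            return 0;
        }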
    14. NOR vs. NAND Flash (1)
       • NOR Flash
         - Random, direct-access interface
         - Fast random reads
         - Slow erase and write
         - Boot images, BIOS, cellular phones, etc.
       • NAND Flash
         - I/O-mapped access
         - Smaller cell size, lower cost
         - Smaller erase blocks
         - Better erase and write performance
         - Solid-state file storage, MP3 players, digital cameras, etc.
    15. NOR vs. NAND Flash (2)
       • Characteristics of various memory devices
       [Table: device comparison, not captured in this transcript.]
    16. Flash Advantages
       • Non-volatile
       • Small
       • Light-weight
       • Low-power
       • Robust
       • Fast read access times (compared to disks)
    17. Flash Drawbacks
       • Much slower write access times
       • No in-place update
         - A write must be preceded by an erase operation.
         - Erase operations can only be performed on a much larger unit than the write operation.
       • Limited lifetime
         - Typically 100,000 - 1,000,000 program/erase cycles
       • Bad blocks (for NAND)
    18. Flash Memory Applications
       • Code memory (NOR): fast random access, XIP (execute in place)
         - BIOS/networking (PC/router/hub)
         - Telecommunications (switches)
         - Cellular phones (code & data)
         - POS / PDA / PCA (code & data)
       • Mass storage (NAND): low cost and high density
         - Memory cards (mobile computers)
         - Solid-state disks (rugged & reliable storage)
         - Digital cameras (still & moving pictures)
         - Voice/audio recorders (near CD quality)
    19. Flash-based Data Storage (1)
       • MultiMediaCard (MMC) / CompactFlash
         - An on-card microprocessor provides many capabilities:
           - Host independence from the details of erasing and programming flash memory
           - A sophisticated system for handling errors (bad blocks, ECC)
           - Power management for low-power operation
    20. Flash-based Data Storage (2)
       • FFD 2.5" from M-Systems
         - Solid-state flash disk in a 2.5" disk form factor
         - Up to 90 GB
         - ATA-6: interface speed of 100 MB/s
         - 40 MB/s sustained read/write rates
         - Released: March 10, 2004
         - ~$40,000 for 90 GB
       • Benefits
         - Reliable and robust: no mechanical parts
         - Small, light-weight, low power consumption
    21. Typical Bus Structure
       • A bus is a collection of parallel wires that carry address, data, and control signals.
       • Buses are typically shared by multiple devices.
       [Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge, which connects over the memory bus to main memory.]
    22. Modern PC Architecture
       [Figure: chipset diagram with the Memory Controller Hub (MCH) and the I/O Controller Hub (ICH).]
    23. Disk Geometry
       - Disks consist of platters, each with two surfaces.
       - Each surface consists of concentric rings called tracks.
       - Each track consists of sectors separated by gaps.
       [Figure: one surface spinning around the spindle, showing track k, its sectors, and the gaps between them.]
    24. Multiple-Platter View
       • Aligned tracks form a cylinder.
       [Figure: three platters (surfaces 0-5) on one spindle; cylinder k is the set of aligned tracks, and the read/write heads on the arm move in unison from cylinder to cylinder.]
    25. Disk Operation (Single-Platter View)
       • The disk surface spins at a fixed rotational rate.
       • The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
       • By moving radially, the arm can position the read/write head over any track.
    26. Disk Device (3)
       • Hard Disk Internals
         - Our Boeing 747 will fly at an altitude of only a few mm, at a speed of approximately 65 mph, periodically landing and taking off.
         - And still the surface of the runway, which consists of a few mm-thick layers, will stay intact for years.
    27. Disk Access Time
       • Average time to access a target sector is approximated by:
         - Taccess = Tavg seek + Tavg rotation + Tavg transfer
       • Seek time
         - Time to position the heads over the cylinder containing the target sector.
         - Typically 9 ms
       • Rotational latency
         - Time waiting for the first bit of the target sector to pass under the read/write head.
         - Tavg rotation = 1/2 x (1/RPM) x 60 sec/1 min
       • Transfer time
         - Time to read the bits in the target sector.
         - Tavg transfer = (1/RPM) x 1/(avg # sectors/track) x 60 sec/1 min
       • Important points:
         - Access time is dominated by seek time.
         - The first bit in a sector is the most expensive; the rest are free.
       (A worked example of these formulas follows below.)
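    A minimal C sketch of the access-time formulas above, assuming an illustrative 7,200 RPM disk with a 9 ms average seek and 400 sectors per track; the printed breakdown shows that seek time dominates.

        #include <stdio.h>

        int main(void)
        {
            double rpm = 7200.0, avg_seek_ms = 9.0, sectors_per_track = 400.0;

            /* Tavg rotation = 1/2 x (60 / RPM) seconds, expressed in ms */
            double avg_rotation_ms = 0.5 * (60.0 / rpm) * 1000.0;

            /* Tavg transfer = (60 / RPM) x (1 / sectors per track) seconds, in ms */
            double avg_transfer_ms = (60.0 / rpm) * (1.0 / sectors_per_track) * 1000.0;

            double t_access = avg_seek_ms + avg_rotation_ms + avg_transfer_ms;
            printf("Tavg seek     = %.3f ms\n", avg_seek_ms);      /* 9.000 */
            printf("Tavg rotation = %.3f ms\n", avg_rotation_ms);  /* 4.167 */
            printf("Tavg transfer = %.3f ms\n", avg_transfer_ms);  /* 0.021 */
            printf("Taccess       = %.3f ms\n", t_access);         /* ~13.2 */
            return 0;
        }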
    28. Hard Disk Data Sheet

       model                   Barracuda ATA II   Cheetah 73
       capacity                30 GB              73 GB
       platters                3                  12
       heads                   6                  24
       RPM                     7,200              10,025
       sector size             512 B              512 B
       sectors/track           63                 463
       tracks/in               21,368             18,145
       seek time (read)        8.2 ms             5.85 ms
       seek time (write)       9.5 ms             6.35 ms
       track-to-track (read)   1.2 ms             0.6 ms
       track-to-track (write)  1.9 ms             0.9 ms
    29. Logical Disk Blocks
       • Modern disks present a simple abstract view of their complex sector geometry:
         - The set of available sectors is modeled as a sequence of block-sized logical blocks (0, 1, 2, ...).
       • Mapping between logical blocks and actual (physical) sectors
         - Maintained by a hardware/firmware device called the disk controller.
         - Converts requests for logical blocks into (surface, track, sector) triples.
       • The disk controller also performs some intelligent functions
         - Buffering, caching, prefetching, scheduling, etc.
       (A simplified version of this translation is sketched below.)
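    A minimal C sketch of the kind of logical-block-to-geometry translation a disk controller performs. The fixed geometry and function name are illustrative; real controllers use per-zone sector counts and remap bad sectors.

        #include <stdio.h>

        #define SURFACES            6        /* illustrative geometry */
        #define TRACKS_PER_SURFACE  21368
        #define SECTORS_PER_TRACK   63

        typedef struct { int surface, track, sector; } chs_t;

        /* Map a logical block number onto a (surface, track, sector) triple. */
        static chs_t logical_to_chs(long block)
        {
            chs_t p;
            p.sector  = block % SECTORS_PER_TRACK;
            p.surface = (block / SECTORS_PER_TRACK) % SURFACES;
            p.track   = block / (SECTORS_PER_TRACK * SURFACES);
            return p;
        }

        int main(void)
        {
            chs_t p = logical_to_chs(123456);
            printf("block 123456 -> surface %d, track %d, sector %d\n",
                   p.surface, p.track, p.sector);
            return 0;
        }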
    30. I/O Bus
       [Figure: the CPU chip connects via the system bus and I/O bridge to main memory (memory bus) and to the I/O bus, which hosts a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]
    31. Reading a Disk Sector (1)
       • The CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.
    32. Reading a Disk Sector (2)
       • The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.
    33. Reading a Disk Sector (3)
       • When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU).
    34. Storage vs. CPU Trends

       DRAM   metric             1980     1985     1990    1995    2000    2000:1980
              $/MB               8,000    880      100     30      1       8,000
              access (ns)        375      200      100     70      60      6
              typical size (MB)  0.064    0.256    4       16      64      1,000

       SRAM   metric             1980     1985     1990    1995    2000    2000:1980
              $/MB               19,200   2,900    320     256     100     190
              access (ns)        300      150      35      15      2       100

       Disk   metric             1980     1985     1990    1995    2000    2000:1980
              $/MB               500      100      8       0.30    0.05    10,000
              access (ms)        87       75       28      10      8       11
              typical size (MB)  1        10       160     1,000   9,000   9,000

       CPU                       1980     1985     1990    1995    2000    2000:1980
              processor          8080     286      386     Pent    P-III
              clock rate (MHz)   1        6        20      150     750     750
              cycle time (ns)    1,000    166      50      6       1.6     750
    35. The CPU-Memory Gap
       • The increasing gap between DRAM, disk, and CPU speeds.
    36. Locality
       • Principle of Locality:
         - Temporal locality: recently referenced items are likely to be referenced in the near future.
         - Spatial locality: items with nearby addresses tend to be referenced close together in time.
       • Locality example:

             sum = 0;
             for (i = 0; i < n; i++)
                 sum += a[i];
             return sum;

         - Data
           - Referencing array elements in succession: spatial locality
           - Referencing sum each iteration: temporal locality
         - Instructions
           - Referencing instructions in sequence: spatial locality
           - Cycling through the loop repeatedly: temporal locality
    37. Memory Hierarchies
       • Some fundamental and enduring properties of hardware and software:
         - Fast storage technologies cost more per byte and have less capacity.
         - The gap between CPU and main memory speed is widening.
         - Well-written programs tend to exhibit good locality.
       • Together, these properties suggest an approach for organizing memory and storage systems known as a memory hierarchy.
    38. An Example Memory Hierarchy
       Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit at the bottom.
       - L0: registers — CPU registers hold words retrieved from the L1 cache.
       - L1: on-chip L1 cache (SRAM) — holds cache lines retrieved from the L2 cache.
       - L2: off-chip L2 cache (SRAM) — holds cache lines retrieved from main memory.
       - L3: main memory (DRAM) — holds disk blocks retrieved from local disks.
       - L4: local secondary storage (local disks) — holds files retrieved from disks on remote network servers.
       - L5: remote secondary storage (distributed file systems, Web servers).
    39. Caching
       • Cache
         - A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
         - Fundamental idea of a memory hierarchy:
           - For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
       • Why do memory hierarchies work?
         - Programs tend to access the data at level k more often than they access the data at level k+1.
         - Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
         - Net effect: a large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
    40. Caching in a Memory Hierarchy
       • Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1.
       • Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks.
       • Data is copied between levels in block-sized transfer units.
       [Figure: level k+1 partitioned into blocks 0-15; level k holds copies of a small subset of them.]
    41. General Caching Concepts
       - The program needs object d, which is stored in some block b.
       - Cache hit
         - The program finds b in the cache at level k (e.g., block 14).
       - Cache miss
         - b is not at level k, so level k must fetch it from level k+1 (e.g., block 12).
         - If the level-k cache is full, then some current block (a "victim") must be replaced (evicted).
    42. Examples of Caching in the Hierarchy

       Cache Type            What Cached            Where Cached          Latency (cycles)   Managed By
       Registers             4-byte word            CPU registers         0                  Compiler
       TLB                   Address translations   On-chip TLB           0                  Hardware
       L1 cache              32-byte block          On-chip L1            1                  Hardware
       L2 cache              32-byte block          Off-chip L2           10                 Hardware
       Virtual memory        4-KB page              Main memory           100                Hardware + OS
       Buffer cache          Parts of files         Main memory           100                OS
       Network buffer cache  Parts of files         Local disk            10,000,000         AFS/NFS client
       Browser cache         Web pages              Local disk            10,000,000         Web browser
       Web cache             Web pages              Remote server disks   1,000,000,000      Web proxy server
    43. Cache Memories
    44. Cache Memories
       • Cache memories are small, fast SRAM-based memories managed automatically in hardware.
         - They hold frequently accessed blocks of main memory.
       • The CPU looks first for data in L1, then in L2, then in main memory.
       • Typical bus structure:
       [Figure: the CPU chip (register file, ALU, L1 cache) connects over a cache bus to the L2 cache, and through the bus interface over the system bus, I/O bridge, and memory bus to main memory.]
    45. Inserting L1 Cache
       • The tiny, very fast CPU register file has room for four 4-byte words.
       • The transfer unit between the CPU register file and the cache is a 4-byte block.
       • The small, fast L1 cache has room for two 4-word blocks (lines 0 and 1).
       • The transfer unit between the cache and main memory is a 4-word block (16 bytes).
       • The big, slow main memory has room for many 4-word blocks (e.g., block 10 = "a b c d", block 21 = "p q r s", block 30 = "w x y z").
    46. General Organization of a Cache Memory
       - The cache is an array of S = 2^s sets.
       - Each set contains one or more lines (E lines per set).
       - Each line holds a block of data (B = 2^b bytes per block), along with 1 valid bit and t tag bits.
       - Cache size: C = B x E x S data bytes.
    47. Addressing Caches
       • An m-bit address A is divided into three fields: <tag> (t bits), <set index> (s bits), and <block offset> (b bits).
       • The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>.
       • The word contents begin at offset <block offset> bytes from the beginning of the block.
       (A sketch of this field extraction follows below.)
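    A minimal C sketch of this field extraction, assuming illustrative values of s and b; it simply masks and shifts the address.

        #include <stdio.h>
        #include <stdint.h>

        /* Split an address into <tag>, <set index>, <block offset> for a cache
         * with S = 2^s sets and B = 2^b bytes per block. */
        static void split_address(uint32_t addr, int s, int b)
        {
            uint32_t offset = addr & ((1u << b) - 1);         /* low b bits          */
            uint32_t set    = (addr >> b) & ((1u << s) - 1);  /* next s bits         */
            uint32_t tag    = addr >> (s + b);                /* remaining high bits */
            printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n",
                   (unsigned)addr, (unsigned)tag, (unsigned)set, (unsigned)offset);
        }

        int main(void)
        {
            /* e.g. a cache with S = 2^8 = 256 sets and B = 2^5 = 32-byte blocks */
            split_address(0x12345678, 8, 5);
            return 0;
        }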
    48. Direct-Mapped Cache
       • The simplest kind of cache.
       • Characterized by exactly one line per set (E = 1).
    49. Accessing Direct-Mapped Caches
       • Set selection
         - Use the set index bits to determine the set of interest.
       [Figure: the s set-index bits of the address select one set out of set 0 ... set S-1.]
    50. Accessing Direct-Mapped Caches
       • Line matching and word selection
         - Line matching: find a valid line in the selected set with a matching tag.
         - Word selection: then extract the word.
       - (1) The valid bit must be set.
       - (2) The tag bits in the cache line must match the tag bits in the address.
       - (3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
    51. Direct-Mapped Cache Simulation
       • M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 line/set
       • Address format: t = 1 tag bit, s = 2 set-index bits, b = 1 offset bit
       • Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
         - (1) 0 [0000]: miss — set 0 loads block M[0-1] (tag 0)
         - (2) 1 [0001]: hit — set 0 already holds M[0-1]
         - (3) 13 [1101]: miss — set 2 loads M[12-13] (tag 1)
         - (4) 8 [1000]: miss — set 0 evicts M[0-1], loads M[8-9] (tag 1)
         - (5) 0 [0000]: miss — set 0 evicts M[8-9], loads M[0-1] (tag 0)
       (The sketch below reproduces this trace.)
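    A minimal C sketch that reproduces the trace above for the same parameters (S = 4 sets, B = 2 bytes/block, E = 1 line/set); the structures and names are illustrative.

        #include <stdio.h>

        #define S 4            /* sets */
        #define B 2            /* bytes per block */

        struct line { int valid; int tag; };

        int main(void)
        {
            struct line cache[S] = {0};
            int trace[] = {0, 1, 13, 8, 0};

            for (int i = 0; i < 5; i++) {
                int addr = trace[i];
                int set = (addr / B) % S;     /* middle bits: set index */
                int tag = addr / (B * S);     /* high bits: tag         */
                if (cache[set].valid && cache[set].tag == tag) {
                    printf("addr %2d: hit  (set %d)\n", addr, set);
                } else {
                    printf("addr %2d: miss (set %d, load block M[%d-%d])\n",
                           addr, set, (addr / B) * B, (addr / B) * B + B - 1);
                    cache[set].valid = 1;
                    cache[set].tag = tag;
                }
            }
            return 0;
        }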
    52. Why Use Middle Bits as Index?
       • High-order bit indexing
         - Adjacent memory lines would map to the same cache entry.
         - Poor use of spatial locality.
       • Middle-order bit indexing
         - Consecutive memory lines map to different cache lines.
         - The cache can hold a C-byte region of the address space at one time.
       [Figure: a 4-line cache against the 16 four-bit addresses 0000-1111, contrasting how high-order and middle-order indexing distribute consecutive lines across the cache.]
    53. Set Associative Caches
       • Characterized by more than one line per set (e.g., E = 2 lines per set).
    54. Accessing Set Associative Caches
       • Set selection
         - Identical to a direct-mapped cache: the set index bits select the set.
    55. Accessing Set Associative Caches
       • Line matching and word selection
         - Must compare the tag in each valid line in the selected set.
       - (1) The valid bit must be set.
       - (2) The tag bits in one of the cache lines must match the tag bits in the address.
       - (3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte.
       (A sketch of this per-set tag search follows below.)
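    A minimal C sketch of the per-set tag search in a 2-way set; the types and names are illustrative.

        #include <stdint.h>
        #include <stddef.h>

        #define E 2                                  /* lines per set (2-way) */

        struct line { int valid; uint32_t tag; uint8_t block[32]; };

        /* Returns a pointer to the matching line in this set, or NULL on a miss. */
        static struct line *lookup(struct line set[E], uint32_t tag)
        {
            for (int i = 0; i < E; i++)
                if (set[i].valid && set[i].tag == tag)  /* (1) valid and (2) tag match */
                    return &set[i];
            return NULL;                                /* miss: fetch from next level */
        }

        int main(void)
        {
            struct line set0[E] = { {1, 0x0110, {0}}, {1, 0x1001, {0}} };
            struct line *hit = lookup(set0, 0x0110);    /* matches the first line */
            return hit ? 0 : 1;
        }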
    56. Multi-Level Caches
       • Options: separate data and instruction caches, or a unified cache.

                      Regs     L1 d-cache / L1 i-cache   Unified L2 cache   Memory         Disk
       size:          200 B    8-64 KB                   1-4 MB SRAM        128 MB DRAM    30 GB
       speed:         3 ns     3 ns                      6 ns               60 ns          8 ms
       $/Mbyte:                                          $100/MB            $1.50/MB       $0.05/MB
       line size:     8 B      32 B                      32 B               8 KB
       (larger, slower, and cheaper as you move right)
    57. Intel Pentium Cache Hierarchy
       • Processor chip
         - Regs
         - L1 data cache: 16 KB, 4-way set associative, write-through, 32 B lines, 1-cycle latency
         - L1 instruction cache: 16 KB, 4-way, 32 B lines
         - L2 unified cache: 128 KB - 2 MB, 4-way set associative, write-back, write-allocate, 32 B lines
       • Main memory: up to 4 GB
    58. Cache Performance Metrics
       • Miss rate
         - Fraction of memory references not found in the cache (misses/references).
         - Typical numbers:
           - 3-10% for L1
           - Can be quite small (e.g., < 1%) for L2, depending on size, etc.
       • Hit time
         - Time to deliver a line in the cache to the processor (includes the time to determine whether the line is in the cache).
         - Typical numbers:
           - 1 clock cycle for L1
           - 3-8 clock cycles for L2
       • Miss penalty
         - Additional time required because of a miss.
           - Typically 25-100 cycles for main memory.
       (These metrics combine into an average access time; see the sketch below.)
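    A minimal C sketch that combines these metrics into an average memory access time, AMAT = hit time + miss rate x miss penalty; the numbers are drawn from the typical ranges quoted above, not from measurements.

        #include <stdio.h>

        int main(void)
        {
            double l1_hit_time = 1.0;      /* cycles, typical L1 hit time above      */
            double l1_miss_rate = 0.05;    /* 5%, within the 3-10% L1 range above    */
            double l1_miss_penalty = 50.0; /* cycles to main memory, within 25-100   */

            double amat = l1_hit_time + l1_miss_rate * l1_miss_penalty;
            printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 50 = 3.5 cycles */
            return 0;
        }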
    59. Writing Cache-Friendly Code
       • Repeated references to variables are good (temporal locality).
       • Stride-1 reference patterns are good (spatial locality).
       • Examples (cold cache, 4-byte words, 4-word cache blocks):

             int sumarrayrows(int a[M][N])
             {
                 int i, j, sum = 0;

                 for (i = 0; i < M; i++)
                     for (j = 0; j < N; j++)
                         sum += a[i][j];
                 return sum;
             }

         Miss rate = 1/4 = 25%

             int sumarraycols(int a[M][N])
             {
                 int i, j, sum = 0;

                 for (j = 0; j < N; j++)
                     for (i = 0; i < M; i++)
                         sum += a[i][j];
                 return sum;
             }

         Miss rate = 100%
    60. The Memory Mountain
       • Read throughput (read bandwidth)
         - Number of bytes read from memory per second (MB/s).
       • Memory mountain
         - Measured read throughput as a function of spatial and temporal locality.
         - A compact way to characterize memory system performance.
    61. Memory Mountain Test Function

        /* The test function */
        void test(int elems, int stride)
        {
            int i, result = 0;
            volatile int sink;

            for (i = 0; i < elems; i += stride)
                result += data[i];
            sink = result; /* So compiler doesn't optimize away the loop */
        }

        /* Run test(elems, stride) and return read throughput (MB/s) */
        double run(int size, int stride, double Mhz)
        {
            double cycles;
            int elems = size / sizeof(int);

            test(elems, stride);                      /* warm up the cache        */
            cycles = fcyc2(test, elems, stride, 0);   /* call test(elems, stride) */
            return (size / stride) / (cycles / Mhz);  /* convert cycles to MB/s   */
        }
    62. Memory Mountain Main Routine

        /* mountain.c - Generate the memory mountain. */
        #define MINBYTES (1 << 10)   /* Working set size ranges from 1 KB */
        #define MAXBYTES (1 << 23)   /* ... up to 8 MB */
        #define MAXSTRIDE 16         /* Strides range from 1 to 16 */
        #define MAXELEMS MAXBYTES/sizeof(int)

        int data[MAXELEMS];          /* The array we'll be traversing */

        int main()
        {
            int size;        /* Working set size (in bytes) */
            int stride;      /* Stride (in array elements) */
            double Mhz;      /* Clock frequency */

            init_data(data, MAXELEMS);   /* Initialize each element in data to 1 */
            Mhz = mhz(0);                /* Estimate the clock frequency */
            for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
                for (stride = 1; stride <= MAXSTRIDE; stride++)
                    printf("%.1f ", run(size, stride, Mhz));
                printf("\n");
            }
            exit(0);
        }
    63. The Memory Mountain
       [Figure: the measured memory mountain — read throughput (MB/s) plotted against working-set size and stride.]
    64. Ridges of Temporal Locality
       • Slice through the memory mountain with stride = 1
         - Illuminates the read throughputs of the different caches and of main memory.
    65. A Slope of Spatial Locality
       • Slice through the memory mountain with size = 256 KB
         - Shows the cache block size.
    66. Concluding Observations
       • The programmer can optimize for cache performance:
         - How data structures are organized
         - How data are accessed
           - Nested loop structure
           - Blocking is a general technique
       • All systems favor "cache-friendly code"
         - Getting the absolute optimum performance is very platform-specific
           - Cache sizes, line sizes, associativities, etc.
         - You can get most of the advantage with generic code
           - Keep the working set reasonably small (temporal locality)
           - Use small strides (spatial locality)
