4. Levels in Memory Hierarchy

[Figure: the hierarchy runs CPU registers → cache → main memory → disk, growing larger, slower, and cheaper at each level. Recoverable figures:]

             Register     Cache        Main Memory   Disk
size:        32 B         32 KB-4 MB   4096 MB       1 TB
speed:       0.3 ns       2 ns         7.5 ns        8 ms
$/MB:        -            $75/MB       $0.014/MB     $0.00012/MB
line size:   4 B          32 B         4 KB          -

Transfers between adjacent levels move 8 B (register-cache), 32 B (cache-memory), and 4 KB (memory-disk) at a time. The cache manages the register/cache/memory boundary; virtual memory manages the memory/disk boundary.
6. Primary memory
• Memory is the workspace of the CPU
• When a file is loaded into memory, it is a copy of the file that is actually loaded
• Consists of a no. of cells, each having a number (its address)
• n cells → addresses 0 to n-1
• Same no. of bits in each cell
• Adjacent cells have consecutive addresses
• An m-bit address → 2^m addressable cells
• A portion of the RAM address space is mapped into one or more ROM chips
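Example: with m = 32, the CPU can address 2^32 = 4,294,967,296 cells; at one byte per cell, that is 4 GiB of addressable memory.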
8. SRAM (Static RAM)
• Constructed using flip‐flops
• 6 transistors for each bit of storage
• Very fast
• Contents are retained as long as power is kept on
• Expensive
• Used in level 2 cache
9. DRAM (Dynamic RAM)
• No flip-flops
• Array of cells, each consisting of a transistor and a capacitor
• Capacitors can be charged or discharged, allowing 0s and 1s to be stored
• Electric charge tends to leak out ⇒ each bit in a DRAM must be reloaded (refreshed) every few milliseconds (~15 ms) to prevent data from leaking away
• Refreshing takes several CPU cycles to complete (less than 1% of overall bandwidth)
• High density (cells about 30 times smaller than SRAM's)
• Used in main memories
• Slower than SRAM
• Inexpensive (about 30 times cheaper than SRAM)
10. SDRAM (Synchronous DRAM)
• Hybrid of SRAM and DRAM
• Runs in synchronization with the system bus
• Driven by a single synchronous clock
• Used in large caches, main memories
11. DDR (Double Data Rate) SDRAM
• An upgrade to standard SDRAM
• Performs 2 transfers per clock cycle (one at the rising edge, one at the falling edge) without doubling the actual clock rate
12. Dual channel DDR
• Technique in which 2 DDR DIMMs are installed at one time and function as a single bank, doubling the bandwidth of a single module
• DDR2 SDRAM
– A faster version of DDR SDRAM (doubles the data rate of DDR)
– Lower power consumption than DDR
– Achieves higher throughput by using differential pairs of signal wires
– The additional signals add to the pin count
• DDR3 SDRAM
– An improved version of DDR2 SDRAM
– Same no. of pins as in DDR2, but not compatible with DDR2
– Can transfer data at twice the rate of DDR2
– The DDR3 standard allows chip sizes of 512 megabits to 8 gigabits (max module size: 16 GB)
16. SDRAM and DDR DIMM
• Buffered Module
– Has additional buffer circuits between the memory chips and the connector to buffer signals
– New motherboards are not designed to use buffered modules
• Unbuffered Module
– Allows memory controller signals to pass directly to the memory chips with no interference
– Fast and efficient design
– Most motherboards are designed to use unbuffered modules
17. SDRAM and DDR DIMM
• Registered Module
– Uses register chips on the module that act as an interface between the RAM chips and the chipset
– Used in systems designed to accept extremely large amounts of RAM (server motherboards)
19. Memory errors
• Hard errors
– Permanent failures
– How to fix? (replace the chip)
• Soft errors
– Non-permanent failures
– Occur at infrequent intervals
– How to fix? (restart the system)
• The best way to deal with soft errors is to increase the system's fault tolerance (implement ways of detecting and correcting errors)
20. Techniques used for fault tolerance
• Parity
• ECC (Error Correcting Code)
21. Parity Checking
• 9 bits are used in the memory chip to store 1 byte of information
• The extra bit (parity bit) keeps tabs on the other 8 bits
• Parity can only detect errors, but cannot correct them
22. Odd parity standard for error checking
• The parity generator/checker is part of the CPU or located in a special chip on the motherboard
• The parity checker evaluates the 8 data bits by adding up the no. of 1s in the byte
• If an even no. of 1s is found, the parity generator creates a 1 and stores it as the parity bit in the memory chip
• If the sum is odd, the parity bit is 0
• If a (9-bit) byte has an even no. of 1s, that byte must have an error
• The system cannot tell which bit or bits have changed
• If 2 bits changed, the bad byte could pass unnoticed
• Multiple-bit errors in a single byte are very rare
• The system halts when a parity check error is detected
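The odd-parity rule above is easy to express in code. A minimal C sketch (illustrative only, not from the slides) of parity generation and checking for one byte:

    #include <stdio.h>
    #include <stdint.h>

    /* Count the 1 bits in a byte. */
    static int ones(uint8_t b) {
        int n = 0;
        for (; b; b >>= 1) n += b & 1;
        return n;
    }

    /* Odd parity: the stored parity bit makes the total number of 1s
       in the 9 stored bits odd. */
    static int odd_parity_bit(uint8_t data) {
        return (ones(data) % 2 == 0) ? 1 : 0;
    }

    /* A stored 9-bit byte is in error if its total count of 1s is even. */
    static int parity_error(uint8_t data, int parity) {
        return ((ones(data) + parity) % 2) == 0;
    }

    int main(void) {
        uint8_t b = 0x55;              /* 0101 0101: four 1s (even) */
        int p = odd_parity_bit(b);     /* generator stores parity bit 1 */
        printf("parity bit = %d, error = %d\n", p, parity_error(b, p));
        b ^= 0x04;                     /* a soft error flips one bit */
        printf("after bit flip: error = %d\n", parity_error(b, p));
        return 0;
    }

As the slides note, a second bit flip would restore the odd count and pass unnoticed.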
23. ECC- Error Correcting Code
• Successor to parity checking
• Can detect and correct memory errors
• Only a single-bit error can be corrected, though double-bit errors can be detected
• This type of ECC is known as single-bit error correction, double-bit error detection (SEC-DED)
• SEC-DED requires an additional 7 check bits over 32 bits in a 4-byte system, or 8 check bits over 64 bits in an 8-byte system
• With ECC, the memory controller calculates the check bits on a memory-write operation and, on a read operation, compares the stored check bits with freshly calculated ones
• The cost of the additional ECC logic in the memory controller is not significant
• It affects memory performance on writes
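The check-bit counts quoted above follow from the Hamming condition for single-bit correction: k check bits cover m data bits when 2^k ≥ m + k + 1, and double-bit detection adds one more bit. For m = 32: 2^6 = 64 ≥ 32 + 6 + 1, so 6 + 1 = 7 check bits; for m = 64: 2^7 = 128 ≥ 64 + 7 + 1, so 7 + 1 = 8 check bits.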
25. Cache Memory
• A high-speed, small memory
• The most frequently used memory words are kept in it
• When the CPU needs a word, it first checks the cache; if the word is not found there, it checks main memory
26. Locality Principle
• Memory references made in any short time interval tend to use only a small fraction of the total memory
• The locality principle is the basis for all caching systems
27. Locality Principle
Let
c – cache access time
m – main memory access time
h – hit ratio (fraction of all references that can be satisfied out of the cache)
miss ratio = 1 - h

Average memory access time = c + (1 - h) × m

h = 1: no main-memory references; h = 0: all references go to main memory

Example:
Suppose a word is read k times in a short interval.
First reference: memory; other k - 1 references: cache
h = (k - 1)/k
Average memory access time = c + m/k
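As a quick numeric check, take the illustrative values from the hierarchy table (c = 2 ns, m = 7.5 ns) and an assumed hit ratio h = 0.9: average access time = 2 + 0.1 × 7.5 = 2.75 ns, far closer to cache speed than to memory speed.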
28. Cache Memory
• Main memories and caches are divided into fixed-size blocks
• Cache lines – blocks inside the cache
• On a cache miss, the entire cache line is loaded into the cache from memory
• Example (see the sketch after this list):
– A 64 KB cache can be divided into 1K lines of 64 bytes, 2K lines of 32 bytes, etc.
– With a 64-byte cache line size, a reference to memory address 270 would pull the line containing bytes 256 to 319 into a cache line
• Unified cache
– Instructions and data use the same cache
• Split cache
– Instructions in one cache and data in another
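A minimal C sketch of the line arithmetic in the example above (the 64-byte line size is the slide's assumption):

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE 64u   /* bytes per cache line, as in the example */

    int main(void) {
        uint32_t addr = 270;
        uint32_t base = addr & ~(LINE_SIZE - 1);   /* start of the containing line */
        printf("address %u lies in line %u (bytes %u to %u)\n",
               addr, addr / LINE_SIZE, base, base + LINE_SIZE - 1);
        /* prints: address 270 lies in line 4 (bytes 256 to 319) */
        return 0;
    }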
30. Example
Suppose a cache is 10 times faster than main
memory, and suppose cache can be used 90%
of the time. How much speedup do we gain by
using the cache?
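One way to work this, treating it as an Amdahl's-law problem (the cache covers 90% of accesses with a 10× speedup): overall speedup = 1 / ((1 - 0.9) + 0.9/10) = 1/0.19 ≈ 5.3.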
31. Memory stall cycles
The no. of clock cycles during which the CPU is stalled waiting for a memory access:

CPU time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

Memory stall cycles = No. of misses × Miss penalty
= IC × Misses per instruction × Miss penalty
= IC × Memory accesses per instruction × Miss ratio × Miss penalty
32. Example
Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of instructions. If the miss penalty is 25 clock cycles and the miss ratio is 2%, how much faster would the machine be if all memory accesses were cache hits?
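A sketch of the solution using the stall-cycle formula from the previous slide: memory accesses per instruction = 1 (instruction fetch) + 0.4 (loads/stores) = 1.4, so memory stall cycles per instruction = 1.4 × 0.02 × 25 = 0.7. Effective CPI = 2.0 + 0.7 = 2.7, so the machine with all cache hits would be 2.7/2.0 = 1.35 times faster.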
35. Direct mapped cache
• A single-level direct-mapped cache
• The example cache contains 2048 entries
• Each entry (row) in the cache can hold exactly one cache line from main memory
• With a 32-byte cache line size, the cache can hold 64 KB
36. Direct mapped cache
• Valid bit: indicates whether there is any valid data in this entry (invalid at boot time)
• Tag field: a 16-bit value identifying the line of memory from which the data came
• Data field: contains a copy of the data in memory
• A given memory word is stored in exactly one place within the cache
• To store/retrieve data from the cache, the (32-bit virtual) memory address is divided into 4 components:
• TAG – corresponds to the Tag bits stored in the cache entry
• LINE – indicates which cache entry holds the corresponding data
• WORD – tells which word within a line is referenced
• BYTE – if only a single byte is requested, it tells which byte within the word is needed
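A minimal C sketch of this decomposition, assuming the widths implied by the slide (16-bit TAG, 11-bit LINE for 2048 entries, 3-bit WORD for eight 4-byte words per 32-byte line, 2-bit BYTE):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x0001234F;            /* an arbitrary example address */
        uint32_t byte = addr & 0x3;            /* bits 1..0            */
        uint32_t word = (addr >> 2) & 0x7;     /* bits 4..2            */
        uint32_t line = (addr >> 5) & 0x7FF;   /* bits 15..5 (11 bits) */
        uint32_t tag  = addr >> 16;            /* bits 31..16          */
        printf("tag=%u line=%u word=%u byte=%u\n", tag, line, word, byte);
        return 0;
    }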
37. When the CPU produces a memory address
• Hardware extracts the 11 LINE bits from the address, indexes into the cache, and finds the corresponding cache entry
• Valid entry? → the TAG field of the memory address is compared with the Tag field of the cache entry
– If they agree, the requested word is extracted from the cache
• Invalid entry? → the 32-byte cache line is fetched from memory and stored in the cache entry, discarding what was there
• The most common type of cache
• Performs quite efficiently
• Puts consecutive memory lines in consecutive cache entries
• Two memory lines whose addresses differ by 64 KB cannot be stored in the cache at the same time
38. Set Associative cache
• A cache with n possible entries for each address is called
an n‐way set associative cache
• 4‐way set associative cache (2048 addresses)
• More complicated than a direct‐mapped cache
– n cache entries have to be checked
39. Set Associative cache
• 2‐way, 4‐way caches perform well enough to make this
extra circuitry worthwhile
• Memory block b is mapped into cache address "b mod 2048" and may be stored in any of the n entries at that address
• Direct-mapped cache: one-way set associative
• Fully associative cache: N-way set associative, where N is the total no. of blocks in the cache
• High associativity does not improve performance much over low-associativity caches
• When a new entry is brought into the cache, which of the present items should be discarded?
40. Set Associative cache
• The LRU (Least Recently Used) algorithm is used
– Keep an ordering of each set of locations that could be accessed from a given memory location
– Whenever any of the present lines is accessed, the list is updated, making that entry the most recently accessed
– When an entry must be replaced, the one at the end of the list is discarded (see the sketch below)
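A minimal C sketch of this bookkeeping for one 4-way set (an illustrative list-based version; real hardware uses compact approximations):

    #include <stdio.h>

    #define WAYS 4

    /* order[0] is the most recently used way; order[WAYS-1] is the victim. */
    static int order[WAYS] = {0, 1, 2, 3};

    static void touch(int way) {              /* called on every access to `way` */
        int i, j;
        for (i = 0; i < WAYS && order[i] != way; i++) ;
        for (j = i; j > 0; j--) order[j] = order[j - 1];  /* shift others down */
        order[0] = way;                        /* mark as most recently used */
    }

    static int victim(void) { return order[WAYS - 1]; }

    int main(void) {
        touch(2); touch(0); touch(2); touch(3);
        printf("replace way %d\n", victim()); /* way 1: least recently used */
        return 0;
    }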
41. Cache policies
• Write back – data are only written to main memory when the line is forced out of the cache
• Write through – data are written to main memory at the same time they are cached
• The write-through approach is simpler to implement and more reliable
– Memory is always up to date
– But it creates more write traffic to memory
43. Magnetic Disk
• Purpose:
– Long term, nonvolatile storage
– Large, inexpensive, and slow
– Lowest level in the memory hierarchy
• Two major types:
– Floppy disk
– Hard disk
• Both types of disks:
– Rely on a rotating platter coated with a magnetic surface
– Use a moveable read/write head to access the disk
• Advantages of hard disks over floppy disks:
– Platters are more rigid (metal or glass) so they can be larger
– Higher density because the head position can be controlled more precisely
– Higher data rate because the disk spins faster
– Can incorporate more than one platter
44. Magnetic Disk
• A stack of platters, each surface with a magnetic coating
• Typical numbers (depending on the disk size):
– 500 to 2,000 tracks per surface
– 32 to 128 sectors per track
• A sector is the smallest unit that can be read or written
• Traditionally all tracks have the same number of sectors
• Constant bit density: record more sectors on the outer tracks
47. Magnetic Disk Characteristic
• Disk head: each side of a platter has a separate disk head
• Cylinder: all the tracks under the heads at a given point on all surfaces
• Reading/writing data is a three-stage process:
– Seek time: position the arm over the proper track
– Rotational latency: wait for the desired sector to rotate under the read/write head
– Transfer time: transfer a block of bits (sector) under the read/write head
• Average seek time as reported by the industry:
– Typically in the range of 8 ms to 15 ms
– (Sum of the times for all possible seeks) / (total no. of possible seeks)
• Due to locality of disk references, the actual average seek time may:
– Only be 25% to 33% of the advertised number
48. Typical Numbers of a Magnetic Disk
• Rotational latency:
– Most disks rotate at 3,600/5,400/7,200 RPM
– Approximately 16 ms per revolution (at 3,600 RPM)
– The average latency to the desired information is halfway around the disk: 8 ms
• Transfer time is a function of:
– Transfer size (usually a sector): 1 KB/sector
– Rotation speed: 3,600 RPM to 5,400 RPM to 7,200 RPM
– Recording density: typical diameter ranges from 2 to 14 in
– Typical values: 2 to 4 MB per second
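For example, at 7,200 RPM one revolution takes 60/7,200 s ≈ 8.33 ms, so the average rotational latency is about 4.17 ms, the figure quoted in the drive example below.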
49. Disk I/O Performance
Disk Access Time =
Seek time + Rotational Latency
+ Transfer time + Controller Time
+ Queueing Delay
50. Disk I/O Performance
• Disk Access Time = Seek time + Rotational
Latency + Transfer time + Controller Time +
Queueing Delay
• Estimating queue length:
– Utilization: U = Request Rate / Service Rate
– Mean Queue Length = U / (1 - U)
– As Request Rate → Service Rate, Mean Queue Length → infinity
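For example, at U = 0.8 the mean queue length is 0.8/0.2 = 4 outstanding requests; at U = 0.98 it is already 49.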
51. Example
• Setup parameters:
– 16,383 cylinders, 63 sectors per track, 3 platters, 6 heads
• Bytes per sector: 512
• RPM: 7,200
• Transfer mode: 66.6 MB/s
• Average seek time: 9.0 ms (read), 9.5 ms (write)
• Average latency: 4.17 ms
• Physical dimensions: 1'' x 4'' x 5.75''
• Interleave: 1:1
52. Disk performance
• Preamble: allows the head to be synchronized before a read/write
• ECC (Error Correction Code): corrects errors
• Unformatted capacity: preambles, ECCs, and inter-sector gaps are counted as data
• Disk performance depends on
– Seek time – time to move the arm to the desired track
– Rotational latency – time needed for the requested sector to rotate under the head (rotational speeds: 5,400, 7,200, 10,000, 15,000 RPM)
– Transfer time – time needed to transfer a block of bits under the head (e.g., 40 MB/s)
53. Disk performance
• Disk controller
– The chip that controls the drive. Its tasks include accepting commands (READ, WRITE, FORMAT) from software, controlling arm motion, and detecting and correcting errors
• Controller time
– The overhead the disk controller imposes in performing an I/O access

Avg. disk access time = avg. seek time + avg. rotational delay + transfer time + controller overhead
54. Example
• The advertised average seek time of a disk is 5 ms, its transfer rate is 40 MB per second, and it rotates at 10,000 RPM. Controller overhead is 0.1 ms. Calculate the average time to read a 512-byte sector.
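A sketch of the solution using the access-time formula above: average rotational delay = 0.5 × (60/10,000) s = 3 ms; transfer time = 512 B / (40 MB/s) ≈ 0.013 ms; average read time ≈ 5 + 3 + 0.013 + 0.1 ≈ 8.11 ms.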
55. RAID (Redundant Array of Inexpensive Disks)
• A disk organization used to improve
performance of storage systems
• An array of disks controlled by a controller
(RAID Controller)
• Data are distributed over disks (striping) to
allow parallel operation
56. RAID 0- No redundancy
• No redundancy to tolerate disk failure
• Each strip has k sectors (say)
– Strip 0: sectors 0 to k-1
– Strip 1: sectors k to 2k-1, etc.
• Works well with large accesses
• Less reliable than having a single large disk
57. Example (RAID 0)
• Suppose the RAID consists of 4 disks, each with an MTTF (mean time to failure) of 20,000 hours.
– Some drive will fail once every 5,000 hours on average
– A single large drive with an MTTF of 20,000 hours is 4 times as reliable
58. RAID 1 (Mirroring)
• Uses twice as many disks as RAID 0 (first half: primary, second half: backup)
• Duplicates all disks
• On a write, every strip is written twice
• Excellent fault tolerance (if a disk fails, the backup copy is used)
• Requires more disks
59. RAID 3 (Bit‐Interleaved Parity)
• Reads/writes go to all disks in the group, with one extra disk (the parity disk) holding check information in case of a failure
• The parity disk contains the (modulo-2) sum of all data on the other disks
• If a disk fails, its contents are recovered by subtracting the data on the good disks from the parity disk (see the sketch below)
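The modulo-2 sum is an XOR, which makes recovery a pure XOR as well. A minimal C sketch with 4 illustrative data bytes, one per disk:

    #include <stdio.h>
    #include <stdint.h>

    #define DISKS 4   /* data disks; a fifth disk would hold the parity */

    int main(void) {
        uint8_t data[DISKS] = {0x12, 0x34, 0x56, 0x78};
        uint8_t parity = 0;
        for (int i = 0; i < DISKS; i++)
            parity ^= data[i];                 /* modulo-2 sum of all data */

        /* Disk 2 fails: recover its byte by XOR-ing the parity
           with the data on the surviving disks. */
        uint8_t recovered = parity;
        for (int i = 0; i < DISKS; i++)
            if (i != 2) recovered ^= data[i];
        printf("recovered 0x%02X (expected 0x%02X)\n", recovered, data[2]);
        return 0;
    }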
60. RAID 4 (Block-Interleaved Parity)
• RAID 4 is much like RAID 3, with strip-for-strip parity written onto an extra disk
– A write involves accessing 2 disks instead of all of them
– The parity disk must be updated on every write
61. RAID 5 (Block-Interleaved Distributed Parity)
• In RAID 5, parity information is spread throughout all disks
• Multiple writes can occur simultaneously as long as the stripe units involved are not located on the same disks; this is not possible in RAID 4
63. Virtual Memory
• Virtual memory is a memory management technique developed for multitasking kernels
• Separation of user logical memory from physical memory
• The logical address space can therefore be much larger than the physical address space
64. A System with Physical Memory Only
• Examples:
– Most Cray machines, early PCs, nearly all embedded systems, etc.
• Addresses generated by the CPU correspond directly to bytes in physical memory
[Figure: the CPU sends physical addresses 0 to N-1 directly to memory]
65. A System with Virtual Memory
• Examples:
– Workstations, servers, modern PCs, etc.
• Address translation: hardware converts virtual addresses to physical ones via an OS-managed lookup table (the page table)
[Figure: the CPU issues virtual addresses; the page table maps each one to a physical address 0 to P-1 in memory, or to a location on disk]
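A minimal C sketch of that translation step, assuming 4 KB pages and a flat one-level page table (real page tables are multi-level and carry permission bits):

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u    /* assumed 4 KB pages */
    #define NUM_PAGES 16u      /* toy virtual address space: 16 pages */

    /* pt[vpn] holds the physical page number, or -1 if the page is on disk.
       Unlisted entries default to physical page 0 in this toy table. */
    static int pt[NUM_PAGES] = {3, 7, -1, 5};

    int main(void) {
        uint32_t vaddr  = 1 * PAGE_SIZE + 42;   /* virtual page 1, offset 42 */
        uint32_t vpn    = vaddr / PAGE_SIZE;
        uint32_t offset = vaddr % PAGE_SIZE;
        if (pt[vpn] < 0) {
            printf("page fault: OS must fetch page %u from disk\n", vpn);
        } else {
            uint32_t paddr = (uint32_t)pt[vpn] * PAGE_SIZE + offset;
            printf("virtual 0x%X -> physical 0x%X\n", vaddr, paddr);
        }
        return 0;
    }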