1
Cache Memory
2
Outline
• General concepts
• 3 ways to organize cache memory
• Issues with writes
• Writing cache-friendly code
• Memory mountain
• Suggested Reading: 6.4, 6.5, 6.6
3
6.4 Cache Memories
4
Cache Memory
• History
– At the very beginning: 3 levels
• Registers, main memory, disk storage
– 10 years later: 4 levels
• Registers, SRAM cache, main DRAM memory, disk storage
– Modern processors: 4~5 levels
• Registers, SRAM L1, L2 (, L3) cache, main DRAM memory, disk storage
– Cache memories
• are small, fast SRAM-based memories
• are managed by hardware automatically
• can be on-chip, on-die, off-chip
5
Cache Memory
Figure 6.24 P488
[Figure: the CPU chip holds the register file, ALU, L1 cache, and bus interface; the L2 cache is reached over the cache bus; main memory is reached over the system bus, the I/O bridge, and the memory bus]
6
Cache Memory
• L1 cache is on-chip
• L2 cache was off-chip several years ago
• L3 cache can be off-chip or on-chip
• CPU looks first for data in L1, then in L2, then in main memory
– Frequently accessed blocks of main memory are held in caches
7
Inserting an L1 cache between the CPU and main memory
• The tiny, very fast CPU register file has room for four 4-byte words.
• The transfer unit between the CPU register file and the cache is a 4-byte block.
• The small fast L1 cache has room for two 4-word blocks (line 0 and line 1).
• The transfer unit between the cache and main memory is a 4-word block (16 bytes).
• The big slow main memory has room for many 4-word blocks (e.g. block 10: a b c d, block 21: p q r s, block 30: w x y z).
8
6.4.1 Generic Cache Memory Organization
Figure 6.25 P488
• Cache is an array of sets: S = 2^s sets (set 0, set 1, ..., set S-1).
• Each set contains one or more lines: E lines per set.
• Each line holds a block of data: B = 2^b bytes per cache block (bytes 0, 1, ..., B–1).
• Each line also has 1 valid bit and t tag bits.
9
Addressing caches
Figure 6.25 P488
Address A (bit m-1 down to bit 0): <tag> (t bits) | <set index> (s bits) | <block offset> (b bits)
[Figure: sets 0 to S-1; each line has a valid bit v, a tag, and block bytes 0 to B–1]
The word at address A is in the cache if
the tag bits in one of the <valid> lines in
set <set index> match <tag>.
The word contents begin at offset
<block offset> bytes from the beginning
of the block.
10
Cache Memory
Fundamental parameters
Parameter      Description
S = 2^s        Number of sets
E              Number of lines per set
B = 2^b        Block size (bytes)
m = log2(M)    Number of physical (main memory) address bits
11
Cache Memory
Derived quantities
Parameter          Description
M = 2^m            Maximum number of unique memory addresses
s = log2(S)        Number of set index bits
b = log2(B)        Number of block offset bits
t = m - (s + b)    Number of tag bits
C = B × E × S      Cache size (bytes), not including overhead such as the valid and tag bits
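The parameters above can be turned into a few shifts and masks. The following is a minimal sketch, assuming addresses fit in 64 bits; the names `addr_fields`, `split_address`, and `cache_size` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of an m-bit address. */
typedef struct {
    uint64_t tag;
    uint64_t set_index;
    uint64_t block_offset;
} addr_fields;

/* Split addr into <tag> <set index> <block offset>,
   given s set index bits and b block offset bits. */
addr_fields split_address(uint64_t addr, unsigned s, unsigned b)
{
    addr_fields f;
    assert(s + b < 64);                       /* keep every shift well defined */
    f.block_offset = addr & ((1ULL << b) - 1);
    f.set_index    = (addr >> b) & ((1ULL << s) - 1);
    f.tag          = addr >> (s + b);
    return f;
}

/* Cache size in bytes, C = B * E * S (valid and tag bits not counted). */
uint64_t cache_size(uint64_t S, uint64_t E, uint64_t B)
{
    return B * E * S;
}
```

For example, with s = 2 and b = 1, `split_address(13, 2, 1)` yields tag 1, set index 2, and block offset 1, since 13 is 1101 in binary.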
12
6.4.2 Direct-mapped cache
Figure 6.27 P490
• Simplest kind of cache
• Characterized by exactly one line per set.
[Figure: sets 0 to S-1, each with E = 1 line per set; every line has a valid bit, a tag, and a cache block]
13
Accessing direct-mapped caches
Figure 6.28 P491
• Set selection
– Use the set index bits to determine the set of
interest
[Figure: the address (bit m-1 down to bit 0) has t tag bits, s set index bits, and b block offset bits; the set index bits 00001 select set 1 among sets 0 to S-1]
14
Accessing direct-mapped caches
• Line matching and word extraction
– find a valid line in the selected set with a matching
tag (line matching)
– then extract the word (word selection)
15
Accessing direct-mapped caches
Figure 6.29 P491
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and block offset selects starting byte.
[Figure: address with tag 0110, set index i, and block offset 100; selected set (i) holds a valid line with tag 0110 whose block has bytes 0 to 7, with words w0 to w3 in bytes 4 to 7]
16
Line Replacement on Misses in Direct-Mapped Caches
• If cache misses
– Retrieve the requested block from the next level in
the memory hierarchy
– Store the new block in one of the cache lines of
the set indicated by the set index bits
17
Line Replacement on Misses in Direct-Mapped Caches
• If the set is full of valid cache lines
– One of the existing lines must be evicted
• For a direct-mapped cache
– Each set contains only one line
– Current line is replaced by the newly fetched line
18
Direct-mapped cache simulation P492
• M=16 byte addresses
• B=2 bytes/block, S=4 sets, E=1 entry/set
19
Direct-mapped cache simulation P493
M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/set
Address bits: t=1 (tag), s=2 (set index), b=1 (block offset)
Address trace (reads):
0 [0000] 1 [0001] 13 [1101] 8 [1000] 0 [0000]
(1) 0 [0000] (miss): set 0 loads v=1, tag=0, data m[1] m[0]
(2) 13 [1101] (miss): set 2 loads v=1, tag=1, data m[13] m[12]
(3) 8 [1000] (miss): set 0 is replaced by v=1, tag=1, data m[9] m[8]
(4) 0 [0000] (miss): set 0 is replaced by v=1, tag=0, data m[1] m[0]
The read of 1 [0001] between (1) and (2) hits, because m[1] was loaded together with m[0].
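The trace on this slide can be replayed in a few lines of C. This is a minimal sketch, assuming the slide's parameters (t=1, s=2, b=1, E=1) and tracking only valid bits and tags; `dm_line`, `dm_read`, and `dm_count_hits` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define S_BITS 2                  /* s = 2, so S = 4 sets */
#define B_BITS 1                  /* b = 1, so B = 2 bytes per block */
#define NUM_SETS (1 << S_BITS)

typedef struct {
    bool valid;
    unsigned tag;
} dm_line;                        /* hypothetical name; data bytes are omitted */

/* Read one address: returns true on a hit, loads the line on a miss. */
bool dm_read(dm_line cache[NUM_SETS], unsigned addr)
{
    unsigned set = (addr >> B_BITS) & (NUM_SETS - 1);
    unsigned tag = addr >> (S_BITS + B_BITS);

    assert(addr < 16);                     /* M = 16 byte addresses */
    if (cache[set].valid && cache[set].tag == tag)
        return true;                       /* valid bit set and tags match */

    cache[set].valid = true;               /* miss: replace the only line */
    cache[set].tag = tag;
    return false;
}

/* Replay a trace of n reads on an empty cache and count the hits. */
int dm_count_hits(const unsigned *trace, int n)
{
    dm_line cache[NUM_SETS] = {{false, 0}};
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (dm_read(cache, trace[i]))
            hits++;
    return hits;
}
```

Replaying 0, 1, 13, 8, 0 gives one hit (the read of address 1) and four misses; addresses 0 and 8 keep evicting each other because both map to set 0.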
20
Direct-mapped cache simulation
Figure 6.30 P493
            Address bits
Address     Tag bits   Index bits   Offset bits   Block number
(decimal)   (t=1)      (s=2)        (b=1)         (decimal)
0           0          00           0             0
1           0          00           1             0
2           0          01           0             1
3           0          01           1             1
4           0          10           0             2
5           0          10           1             2
6           0          11           0             3
7           0          11           1             3
8           1          00           0             4
9           1          00           1             4
10          1          01           0             5
11          1          01           1             5
12          1          10           0             6
13          1          10           1             6
14          1          11           0             7
15          1          11           1             7
21
Why use middle bits as index?
• High-Order Bit Indexing
– Adjacent memory lines would
map to same cache entry
– Poor use of spatial locality
• Middle-Order Bit Indexing
– Consecutive memory lines
map to different cache lines
– Can hold C-byte region of
address space in cache at one
time
Figure 6.31 P497
[Figure: a 4-line cache (lines 00, 01, 10, 11) next to the 16 memory lines 0000 to 1111, shown once with high-order bit indexing and once with middle-order bit indexing]
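The difference between the two indexing schemes can be checked with a small experiment. This sketch assumes the 16 values 0000 to 1111 are memory line numbers (block offset bits already stripped) and that the cache has 4 lines; all function names are hypothetical.

```c
#include <assert.h>

#define M_BITS 4   /* assumed: 4-bit memory line numbers, 0000 to 1111 */
#define S_BITS 2   /* 4-line cache, so 2 index bits */

/* High-order bit indexing: the top s bits select the cache line. */
unsigned high_order_index(unsigned line)
{
    assert(line < (1u << M_BITS));
    return line >> (M_BITS - S_BITS);
}

/* Middle-order bit indexing: the bits just above the block offset select
   the cache line. The offset bits are assumed to be stripped already. */
unsigned middle_order_index(unsigned line)
{
    assert(line < (1u << M_BITS));
    return line & ((1u << S_BITS) - 1);
}

/* Count how many distinct cache lines are used by n consecutive
   memory lines starting at line number first. */
int distinct_lines(unsigned (*index_of)(unsigned), unsigned first, unsigned n)
{
    int used[1 << S_BITS] = {0};
    int count = 0;
    for (unsigned k = 0; k < n; k++) {
        unsigned idx = index_of(first + k);
        if (!used[idx]) {
            used[idx] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, four consecutive memory lines share a single cache line with high-order indexing but spread over all four cache lines with middle-order indexing.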
22
6.4.3 Set associative caches
• Characterized by more than one line per set
Figure 6.32 P498
[Figure: sets 0 to S-1, each with E = 2 lines per set; every line has a valid bit, a tag, and a cache block]
23
Accessing set associative caches
• Set selection
– identical to direct-mapped cache
Figure 6.33 P498
[Figure: the address (bit m-1 down to bit 0) has t tag bits, s set index bits, and b block offset bits; the set index bits 00001 select set 1 among sets 0 to S-1, each of which holds two lines]
24
Accessing set associative caches
• Line matching and word selection
– must compare the tag in each valid line in the
selected set.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and block offset selects starting byte.
Figure 6.34 P499
[Figure: address with tag 0110, set index i, and block offset 100; selected set (i) holds two valid lines with tags 1001 and 0110; the line with tag 0110 matches, and its block (bytes 0 to 7) holds words w0 to w3 in bytes 4 to 7]
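Line matching in a set associative cache amounts to a search over the lines of the selected set. This is a minimal sketch, assuming E = 2 and leaving out the data bytes; `sa_line`, `sa_set`, and `sa_find_line` are hypothetical names. Real hardware compares all tags in parallel, whereas this loop checks them one by one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define E_LINES 2                 /* E = 2 lines per set, as in the figure */

typedef struct {
    bool valid;
    unsigned tag;
} sa_line;                        /* hypothetical name; data bytes omitted */

typedef struct {
    sa_line lines[E_LINES];
} sa_set;

/* Return the index of the matching line in the selected set,
   or -1 on a miss. A line matches only if it is valid AND its tag
   equals the tag bits of the address. */
int sa_find_line(const sa_set *set, unsigned tag)
{
    assert(set != NULL);
    for (int e = 0; e < E_LINES; e++)
        if (set->lines[e].valid && set->lines[e].tag == tag)
            return e;
    return -1;
}
```

With the set from the figure (valid lines tagged 1001 and 0110), a lookup for tag 0110 finds the second line, and any other tag misses.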
25
6.4.4 Fully associative caches
• Characterized by a single set that contains all of the cache lines
• No set index bits in the address
Figures 6.35 and 6.36 P500
[Figure: set 0 is the one and only set, with E = C/B lines, each having a valid bit, a tag, and a cache block; the address has only t tag bits and b block offset bits]
26
Accessing fully associative caches
• Line matching and word selection
– must compare the tag in each valid line
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and block offset selects starting byte.
Figure 6.37 P500
[Figure: address with tag 0110 and block offset 100; the cache holds four lines with (valid, tag) = (1, 1001), (0, 0110), (1, 0110), (0, 1110); the block of the matching line (bytes 0 to 7) holds words w0 to w3 in bytes 4 to 7]
27
6.4.5 Issues with Writes
• Write hits
– 1) Write through
• Cache updates its copy
• Immediately writes the corresponding cache block to
memory
– 2) Write back
• Defers the memory update as long as possible
• Writing the updated block to memory only when it is
evicted from the cache
• Maintains a dirty bit for each cache line
28
Issues with Writes
• Write misses
– 1) Write-allocate
• Loads the corresponding memory block into the cache
• Then updates the cache block
– 2) No-write-allocate
• Bypasses the cache
• Writes the word directly to memory
• Combination
– Write through, no-write-allocate
– Write back, write-allocate
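The write back, write-allocate combination can be sketched with a one-line cache. This is an illustration under assumed sizes (4-byte blocks, 64-byte memory), not a model of any real cache; `wb_line` and `wb_write` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 4                 /* assumed tiny block, in bytes */
#define MEM_SIZE   64                /* assumed tiny main memory, in bytes */

typedef struct {
    bool valid, dirty;
    unsigned block_number;           /* which memory block the line holds */
    unsigned char data[BLOCK_SIZE];
} wb_line;

/* Write one byte through a single-line cache (hypothetical, for illustration).
   Returns the number of block transfers to or from memory. */
int wb_write(wb_line *line, unsigned char *mem, unsigned addr, unsigned char value)
{
    unsigned block = addr / BLOCK_SIZE;
    int transfers = 0;

    assert(addr < MEM_SIZE);
    if (!line->valid || line->block_number != block) {        /* write miss */
        if (line->valid && line->dirty) {                     /* evict: write back */
            memcpy(mem + line->block_number * BLOCK_SIZE, line->data, BLOCK_SIZE);
            transfers++;
        }
        memcpy(line->data, mem + block * BLOCK_SIZE, BLOCK_SIZE); /* write-allocate */
        transfers++;
        line->valid = true;
        line->dirty = false;
        line->block_number = block;
    }
    line->data[addr % BLOCK_SIZE] = value;   /* update only the cached copy */
    line->dirty = true;                      /* memory is stale until eviction */
    return transfers;
}
```

A write hit costs no memory traffic here, and memory sees the new values only when the dirty line is evicted, which is why each line needs a dirty bit.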
29
6.4.6 Multi-level caches
Figure 6.38 P504
Hierarchy: regs, TLB, L1 I-cache and L1 D-cache, L2 cache, memory, disk
             L1 cache    L2 cache       Memory         Disk
size:        8-64 KB     1-4 MB SRAM    128 MB DRAM    30 GB
speed:       3 ns        6 ns           60 ns          8 ms
$/Mbyte:                 $100/MB        $1.50/MB       $0.05/MB
line size:   32 B        32 B           8 KB
• Moving away from the processor: larger, slower, cheaper
• Also: larger line size, higher associativity, more likely to write back
• Options: separate data and instruction caches, or a unified cache
30
6.4.7 Cache performance metrics P505
• Miss Rate
– fraction of memory references not found in cache
(misses/references)
– Typical numbers:
3-10% for L1
• Hit Rate
– fraction of memory references found in cache (1 -
miss rate)
31
Cache performance metrics
• Hit Time
– time to deliver a line in the cache to the processor
(includes time to determine whether the line is in
the cache)
– Typical numbers:
1-2 clock cycles for L1
5-10 clock cycles for L2
• Miss Penalty
– additional time required because of a miss
• Typically 25-100 cycles for main memory
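These metrics are often combined into an average cost per memory reference. The sketch below uses the common single-level model (hit time plus miss rate times miss penalty), which is an assumption added here and not a formula from the slides; `average_access_cycles` is a hypothetical name.

```c
#include <assert.h>

/* Average cycles per reference for one cache level:
   hit time plus the miss penalty weighted by the miss rate.
   miss_rate is a fraction between 0 and 1. */
double average_access_cycles(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, the model gives about 6 cycles per reference, so a small change in miss rate has a large effect.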
32
Cache performance metrics P505
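• 1) Cache size
– Hit rate vs. hit time
• 2) Block size
– Spatial locality vs. temporal locality
• 3) Associativity
– Higher associativity reduces thrashing
– But it raises cost, can slow down hits, and can increase the miss penalty
• 4) Write strategy
– Write-through: simpler, and read misses are cheaper because they never trigger a write to memory
– Write-back: fewer transfers to memory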
33
6.5 Writing Cache-Friendly Code
34
Writing Cache-Friendly Code
• Principles
– Programs with better locality will tend to have
lower miss rates
– Programs with lower miss rates will tend to run
faster than programs with higher miss rates
35
Writing Cache-Friendly Code
• Basic approach
– Make the common case go fast
• Programs often spend most of their time in a few core
functions.
• These functions often spend most of their time in a few
loops
– Minimize the number of cache misses in each inner
loop
• All other things being equal, loops with lower miss rates run faster
36
Writing Cache-Friendly Code P507
int sumvec(int v[N])
{
    int i, sum = 0 ;    /* Temporal locality: these variables are usually put in registers */
    for (i = 0 ; i < N ; i++)
        sum += v[i] ;
    return sum ;
}
v[i]                             i=0    i=1    i=2    i=3    i=4    i=5    i=6    i=7
Access order, [h]it or [m]iss    1[m]   2[h]   3[h]   4[h]   5[m]   6[h]   7[h]   8[h]
37
Writing cache-friendly code
• Temporal locality
– Repeated references to local variables are good
because the compiler can cache them in the
register file
38
Writing cache-friendly code
• Spatial locality
– Stride-1 reference patterns are good because
caches at all levels of the memory hierarchy store
data as contiguous blocks
• Spatial locality is especially important in
programs that operate on multidimensional
arrays
39
Writing cache-friendly code P508
• Example (M=4, N=8, 10 cycles/iter)
int sumvec(int a[M][N])
{
    int i, j, sum = 0 ;
    for (i = 0 ; i < M ; i++)           /* row by row: stride-1 references */
        for (j = 0 ; j < N ; j++)
            sum += a[i][j] ;
    return sum ;
}
40
Writing cache-friendly code
a[i][j]   j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7
i=0       1[m]    2[h]    3[h]    4[h]    5[m]    6[h]    7[h]    8[h]
i=1       9[m]    10[h]   11[h]   12[h]   13[m]   14[h]   15[h]   16[h]
i=2       17[m]   18[h]   19[h]   20[h]   21[m]   22[h]   23[h]   24[h]
i=3       25[m]   26[h]   27[h]   28[h]   29[m]   30[h]   31[h]   32[h]
41
Writing cache-friendly code P508
• Example (M=4, N=8, 20 cycles/iter)
int sumvec(int v[M][N])
{
    int i, j, sum = 0 ;
    for (j = 0 ; j < N ; j++)           /* column by column: stride-N references */
        for (i = 0 ; i < M ; i++)
            sum += v[i][j] ;
    return sum ;
}
42
Writing cache-friendly code
a[i][j]   j=0     j=1     j=2     j=3     j=4     j=5     j=6     j=7
i=0       1[m]    5[m]    9[m]    13[m]   17[m]   21[m]   25[m]   29[m]
i=1       2[m]    6[m]    10[m]   14[m]   18[m]   22[m]   26[m]   30[m]
i=2       3[m]    7[m]    11[m]   15[m]   19[m]   23[m]   27[m]   31[m]
i=3       4[m]    8[m]    12[m]   16[m]   20[m]   24[m]   28[m]   32[m]
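The hit and miss patterns of the two loop orders can be reproduced with a tiny cache model. This sketch assumes a direct-mapped cache with two lines of 4 ints each and an array that starts on a block boundary; these cache parameters and the name `count_misses` are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define M 4
#define N 8
#define WORDS_PER_BLOCK 4      /* assumed: 4 ints per cache block */
#define NUM_LINES 2            /* assumed: direct-mapped cache with 2 lines */

/* Count misses when summing an M x N int array stored in row-major order.
   row_major_loop selects the i-outer (true) or j-outer (false) loop nest. */
int count_misses(bool row_major_loop)
{
    bool valid[NUM_LINES] = {false};
    int held_block[NUM_LINES] = {0};
    int misses = 0;

    for (int outer = 0; outer < (row_major_loop ? M : N); outer++) {
        for (int inner = 0; inner < (row_major_loop ? N : M); inner++) {
            int i = row_major_loop ? outer : inner;
            int j = row_major_loop ? inner : outer;
            int block = (i * N + j) / WORDS_PER_BLOCK;  /* block holding a[i][j] */
            int line = block % NUM_LINES;
            if (!valid[line] || held_block[line] != block) {
                valid[line] = true;                     /* miss: load the block */
                held_block[line] = block;
                misses++;
            }
        }
    }
    return misses;
}
```

Under these assumptions the row-wise loop misses 8 times out of 32 references (one miss per block), while the column-wise loop misses on all 32, because each block is evicted before its next word is used.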
43
6.6 Putting it Together: The Impact of
Caches on Program Performance
6.6.1 The Memory Mountain
44
The Memory Mountain P512
• Read throughput (read bandwidth)
– The rate at which a program reads data from the
memory system
• Memory mountain
– A two-dimensional function of read bandwidth
versus temporal and spatial locality
– Characterizes the capabilities of the memory
system for each computer
45
Memory mountain main routine
Figure 6.41 P513
/* mountain.c - Generate the memory mountain. */
/* init_data(), mhz(), and fcyc2() are helper routines not shown on these slides */
#include <stdio.h>
#include <stdlib.h>

#define MINBYTES (1 << 10)   /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)   /* ... up to 8 MB */
#define MAXSTRIDE 16         /* Strides range from 1 to 16 */
#define MAXELEMS (MAXBYTES / sizeof(int))

int data[MAXELEMS];          /* The array we'll be traversing */
46
Memory mountain main routine
int main()
{
    int size;      /* Working set size (in bytes) */
    int stride;    /* Stride (in array elements) */
    double Mhz;    /* Clock frequency */

    init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */
    Mhz = mhz(0);               /* Estimate the clock frequency */
47
Memory mountain main routine
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));   /* one column per stride */
        printf("\n");                                   /* one row per working set size */
    }
    exit(0);
}
48
Memory mountain test function
Figure 6.40 P512
/* The test function */
void test (int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}
49
Memory mountain test function
/* Run test (elems, stride) and return read throughput (MB/s) */
double run (int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test (elems, stride);                   /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0); /* call test (elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
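The routines above rely on a cycle counter and helper code that the slides do not show. As a rough stand-in, the same measurement can be sketched with the standard `clock()` function; this is a much coarser timer, and `sketch_data`, `sum_with_stride`, and `read_throughput` are hypothetical names, not part of the original program.

```c
#include <assert.h>
#include <time.h>

#define SKETCH_ELEMS (1 << 16)     /* assumed: 64 K ints, i.e. 256 KB */
int sketch_data[SKETCH_ELEMS];     /* hypothetical stand-in for data[];
                                      contents do not matter for timing */

/* Sum the first elems entries of sketch_data using the given stride. */
static int sum_with_stride(int elems, int stride)
{
    int result = 0;
    for (int i = 0; i < elems; i += stride)
        result += sketch_data[i];
    return result;
}

/* Estimate read throughput in MB/s using clock() instead of a cycle counter. */
double read_throughput(int elems, int stride)
{
    assert(elems > 0 && elems <= SKETCH_ELEMS && stride > 0);
    volatile int sink = sum_with_stride(elems, stride);   /* warm up the cache */
    long passes = 0;
    clock_t start = clock();
    clock_t now;

    do {                               /* repeat until about 50 ms of CPU time */
        sink = sum_with_stride(elems, stride);
        passes++;
        now = clock();
    } while (now - start < CLOCKS_PER_SEC / 20);
    (void)sink;

    double seconds = (double)(now - start) / CLOCKS_PER_SEC;
    double bytes = (double)passes * ((elems + stride - 1) / stride) * sizeof(int);
    return bytes / seconds / 1e6;      /* bytes actually read per second, in MB/s */
}
```

Calling `read_throughput(elems, stride)` for several sizes and strides gives a coarse version of the mountain; the absolute numbers depend on the machine, the compiler flags, and the timer resolution.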
50
The Memory Mountain
• Data
– Size
• MAXBYTES(8M) bytes or MAXELEMS(2M) words
– Partially accessed
• Working set: from 8MB to 1KB
• Stride: from 1 to 16
51
The Memory Mountain
Figure 6.42 P514
[Figure: memory mountain for a Pentium III Xeon, 550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache. Axes: stride (words) from s1 to s15, working set size (bytes) from 2k to 8m, read throughput (MB/s) from 0 to 1200. Annotations: ridges of temporal locality (L1, L2, mem) and slopes of spatial locality.]
52
Ridges of temporal locality
• Slice through the memory mountain with
stride=1
– illuminates read throughputs of different caches
and memory
Ridge: the crest of a mountain
53
Ridges of temporal locality
Figure 6.43 P515
[Figure: read throughput (MB/s, 0 to 1200) versus working set size (bytes, 8m down to 1k), with three regions: L1 cache region, L2 cache region, main memory region]
54
A slope of spatial locality
• Slice through memory mountain with
size=256KB
– shows cache block size.
55
A slope of spatial locality
Figure 6.44 P516
[Figure: read throughput (MB/s, 0 to 800) versus stride (words, s1 to s16), annotated "one access per cache line"]

Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Memory caching

  • 2. 2 Outline • General concepts • 3 ways to organize cache memory • Issues with writes • Writing cache-friendly code • Memory mountain • Suggested Reading: 6.4, 6.5, 6.6
  • 4. 4 Cache Memory • History – At the very beginning, 3 levels • Registers, main memory, disk storage – 10 years later, 4 levels • Registers, SRAM cache, main DRAM memory, disk storage – Modern processors, 4~5 levels • Registers, SRAM L1, L2 (, L3) cache, main DRAM memory, disk storage – Cache memories • are small, fast SRAM-based memories • are managed automatically by hardware • can be on-chip, on-die, or off-chip
  • 5. 5 Cache Memory (Figure 6.24, p. 488) [Figure: the CPU chip contains the register file, ALU, L1 cache, and bus interface; a cache bus connects it to the L2 cache, and the system bus, I/O bridge, and memory bus connect it to main memory.]
  • 6. 6 Cache Memory • L1 cache is on-chip • L2 cache was off-chip several years ago • L3 cache can be off-chip or on-chip • CPU looks first for data in L1, then in L2, then in main memory – Frequently accessed blocks of main memory are held in the caches
  • 7. 7 Inserting an L1 cache between the CPU and main memory • The tiny, very fast CPU register file has room for four 4-byte words. • The transfer unit between the CPU register file and the cache is a 4-byte block. • The small, fast L1 cache has room for two 4-word blocks (line 0 and line 1). • The transfer unit between the cache and main memory is a 4-word block (16 bytes). • The big, slow main memory has room for many 4-word blocks (for example block 10 = a b c d, block 21 = p q r s, block 30 = w x y z).
  • 8. 8 6.4.1 Generic Cache Memory Organization (Figure 6.25, p. 488) • Cache is an array of sets: S = 2^s sets (set 0 ... set S-1). • Each set contains one or more lines: E lines per set. • Each line holds a block of data: B = 2^b bytes per cache block (bytes 0 ... B-1), plus 1 valid bit and t tag bits per line.
  • 9. 9 Addressing caches (Figure 6.25, p. 488) • Address A (bits m-1 ... 0): <tag> (t bits), <set index> (s bits), <block offset> (b bits) • The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>. • The word contents begin at offset <block offset> bytes from the beginning of the block.
  • 10. 10 Cache Memory: fundamental parameters • S = 2^s: number of sets • E: number of lines per set • B = 2^b: block size (bytes) • m = log2(M): number of physical (main memory) address bits
  • 11. 11 Cache Memory: derived quantities • M = 2^m: maximum number of unique memory addresses • s = log2(S): number of set index bits • b = log2(B): number of block offset bits • t = m - (s + b): number of tag bits • C = B × E × S: cache size (bytes), not including overhead such as the valid and tag bits
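The parameter definitions above can be turned into a few lines of C. This is only a sketch: the struct and function names (`cache_params`, `addr_fields`, `split_address`, and so on) are made up for illustration and do not come from the slides, and S and B are assumed to be powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Cache geometry in the slides' notation (names are illustrative). */
typedef struct {
    unsigned S;  /* number of sets            */
    unsigned E;  /* lines per set             */
    unsigned B;  /* block size in bytes       */
    unsigned m;  /* number of address bits    */
} cache_params;

/* The three fields the cache sees in an address. */
typedef struct {
    uint64_t tag;
    uint64_t set_index;
    uint64_t block_offset;
} addr_fields;

/* log2 of a power of two, e.g. 8 -> 3. */
unsigned log2_exact(unsigned x)
{
    unsigned n = 0;
    assert(x != 0 && (x & (x - 1)) == 0);  /* S and B must be powers of two */
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* C = B * E * S, excluding valid and tag bits. */
unsigned cache_size(cache_params p)
{
    return p.B * p.E * p.S;
}

/* t = m - (s + b). */
unsigned tag_bits(cache_params p)
{
    return p.m - (log2_exact(p.S) + log2_exact(p.B));
}

/* Split an address into <tag> <set index> <block offset>. */
addr_fields split_address(cache_params p, uint64_t addr)
{
    unsigned s = log2_exact(p.S);
    unsigned b = log2_exact(p.B);
    addr_fields f;

    f.block_offset = addr & (p.B - 1);        /* low b bits    */
    f.set_index = (addr >> b) & (p.S - 1);    /* middle s bits */
    f.tag = addr >> (s + b);                  /* high t bits   */
    return f;
}
```

With S=4, E=1, B=2, m=4 (the geometry used in the direct-mapped simulation slides), address 13 [1101] splits into tag 1, set index 2, block offset 1.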
  • 12. 12 6.4.2 Direct-mapped cache (Figure 6.27, p. 490) • Simplest kind of cache • Characterized by exactly one line per set (E = 1) [Figure: sets 0 ... S-1, each holding one line with a valid bit, a tag, and a cache block.]
  • 13. 13 Accessing direct-mapped caches (Figure 6.28, p. 491) • Set selection – Use the set index bits to determine the set of interest [Figure: an address with t tag bits, s set index bits, and b block offset bits; set index 00001 selects set 1.]
  • 14. 14 Accessing direct-mapped caches • Line matching and word extraction – find a valid line in the selected set with a matching tag (line matching) – then extract the word (word selection)
  • 15. 15 Accessing direct-mapped caches (Figure 6.29, p. 491) • (1) The valid bit must be set. • (2) The tag bits in the cache line must match the tag bits in the address. • (3) If (1) and (2), then cache hit, and the block offset selects the starting byte. [Figure: the address has tag 0110, set index i, and block offset 100; the line in selected set i is valid with tag 0110, and its 8-byte block holds w0 ... w3 in bytes 4 to 7.]
  • 16. 16 Line Replacement on Misses in Direct-Mapped Caches • If the cache misses – Retrieve the requested block from the next level in the memory hierarchy – Store the new block in one of the cache lines of the set indicated by the set index bits
  • 17. 17 Line Replacement on Misses in Direct-Mapped Caches • If the set is full of valid cache lines – One of the existing lines must be evicted • For a direct-mapped cache – Each set contains only one line – The current line is replaced by the newly fetched line
  • 18. 18 Direct-mapped cache simulation P492 • M=16 byte addresses • B=2 bytes/block, S=4 sets, E=1 entry/set
  • 19. 19 Direct-mapped cache simulation (p. 493) • M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/set; address bits: t=1, s=2, b=1 • Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000] • (1) 0 [0000] (miss): set 0 gets v=1, tag=0, data m[0] m[1]; the following read of 1 [0001] hits in this block • (2) 13 [1101] (miss): set 2 gets v=1, tag=1, data m[12] m[13] • (3) 8 [1000] (miss): set 0 is replaced with v=1, tag=1, data m[8] m[9] • (4) 0 [0000] (miss): set 0 is replaced again with v=1, tag=0, data m[0] m[1]
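  • 20. 20 Direct-mapped cache simulation (Figure 6.30, p. 493) • Columns: address (decimal): tag bits (t=1), index bits (s=2), offset bits (b=1), block number (decimal) • 0: 0, 00, 0, block 0 • 1: 0, 00, 1, block 0 • 2: 0, 01, 0, block 1 • 3: 0, 01, 1, block 1 • 4: 0, 10, 0, block 2 • 5: 0, 10, 1, block 2 • 6: 0, 11, 0, block 3 • 7: 0, 11, 1, block 3 • 8: 1, 00, 0, block 4 • 9: 1, 00, 1, block 4 • 10: 1, 01, 0, block 5 • 11: 1, 01, 1, block 5 • 12: 1, 10, 0, block 6 • 13: 1, 10, 1, block 6 • 14: 1, 11, 0, block 7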
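The simulation on the preceding slides can be replayed with a small C model. This is a minimal sketch that tracks only valid bits and tags (no data); the names `dm_cache`, `dm_init`, and `dm_access` are hypothetical, and the geometry is hard-wired to the slides' example (M=16, B=2, S=4, E=1, so t=1, s=2, b=1).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DM_SETS   4   /* S = 4 sets        */
#define DM_S_BITS 2   /* s = log2(S)       */
#define DM_B_BITS 1   /* b = log2(B), B=2  */
#define DM_ADDRS  16  /* M = 16 addresses  */

typedef struct {
    bool valid;
    unsigned tag;
} dm_line;

typedef struct {
    dm_line sets[DM_SETS];  /* E = 1: exactly one line per set */
} dm_cache;

/* Start with every line invalid (a cold cache). */
void dm_init(dm_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Read one address. Returns true on a hit. On a miss the current line
 * of the selected set is replaced by the newly fetched block. */
bool dm_access(dm_cache *c, unsigned addr)
{
    assert(addr < DM_ADDRS);
    unsigned set = (addr >> DM_B_BITS) & (DM_SETS - 1);
    unsigned tag = addr >> (DM_S_BITS + DM_B_BITS);
    dm_line *line = &c->sets[set];

    if (line->valid && line->tag == tag)
        return true;            /* valid bit set and tags match */

    line->valid = true;         /* miss: install the new block */
    line->tag = tag;
    return false;
}
```

Feeding the trace 0, 1, 13, 8, 0 into `dm_access` on a freshly initialized cache gives miss, hit, miss, miss, miss, which agrees with the four misses shown on the slide: addresses 0 and 8 share set 0 but have different tags, so each evicts the other.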
  • 21. 21 Why use middle bits as index? (Figure 6.31, p. 497) • High-Order Bit Indexing – Adjacent memory lines would map to the same cache entry – Poor use of spatial locality • Middle-Order Bit Indexing – Consecutive memory lines map to different cache lines – Can hold a C-byte region of the address space in the cache at one time [Figure: a 4-line cache (lines 00, 01, 10, 11) and the 16 memory lines 0000 ... 1111, shown under high-order and middle-order bit indexing.]
  • 22. 22 6.4.3 Set associative caches (Figure 6.32, p. 498) • Characterized by more than one line per set [Figure: sets 0 ... S-1 with E=2 lines per set, each line holding a valid bit, a tag, and a cache block.]
  • 23. 23 Accessing set associative caches (Figure 6.33, p. 498) • Set selection – identical to direct-mapped cache [Figure: an address with t tag bits, s set index bits, and b block offset bits; set index 00001 selects set 1, which holds two lines.]
  • 24. 24 Accessing set associative caches (Figure 6.34, p. 499) • Line matching and word selection – must compare the tag in each valid line in the selected set • (1) The valid bit must be set. • (2) The tag bits in one of the cache lines must match the tag bits in the address. • (3) If (1) and (2), then cache hit, and the block offset selects the starting byte. [Figure: the address has tag 0110, set index i, and block offset 100; selected set i holds valid lines with tags 1001 and 0110, and the matching block holds w0 ... w3 in bytes 4 to 7.]
  • 25. 25 6.4.4 Fully associative caches (Figures 6.35 and 6.36, p. 500) • Characterized by a single set that contains all of the lines: E = C/B lines in the one and only set (set 0) • No set index bits in the address: only t tag bits and b block offset bits
  • 26. 26 Accessing fully associative caches (Figure 6.37, p. 500) • Line matching and word selection – must compare the tag in each valid line • (1) The valid bit must be set. • (2) The tag bits in one of the cache lines must match the tag bits in the address. • (3) If (1) and (2), then cache hit, and the block offset selects the starting byte. [Figure: the address has tag 0110 and block offset 100; the entire cache is searched for a valid line with tag 0110, whose block holds w0 ... w3 in bytes 4 to 7.]
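Line matching in a set associative cache, and in a fully associative cache (which is the special case of one set holding every line), amounts to a search over the E lines of the selected set. The following is a sketch under that reading; `sa_line` and `sa_find_line` are illustrative names, and real hardware compares all tags in parallel rather than in a loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool valid;
    unsigned tag;
} sa_line;

/* Search the E lines of one set for a valid line whose tag matches.
 * Returns the index of the matching line, or -1 on a miss. */
int sa_find_line(const sa_line *set, int E, unsigned tag)
{
    assert(set != NULL && E > 0);
    for (int i = 0; i < E; i++) {
        if (set[i].valid && set[i].tag == tag)  /* conditions (1) and (2) */
            return i;
    }
    return -1;
}
```

For a hypothetical set with lines (valid, tag) = (1, 1001), (0, 0110), (1, 0110), (0, 1110), a lookup for tag 0110 matches the third line: the second line has the same tag but is not valid, so it must not be reported as a hit.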
  • 27. 27 6.4.5 Issues with Writes • Write hits – 1) Write through • Cache updates its copy • Immediately writes the corresponding cache block to memory – 2) Write back • Defers the memory update as long as possible • Writing the updated block to memory only when it is evicted from the cache • Maintains a dirty bit for each cache line
  • 28. 28 Issues with Writes • Write misses – 1) Write-allocate • Loads the corresponding memory block into the cache • Then updates the cache block – 2) No-write-allocate • Bypasses the cache • Writes the word directly to memory • Combination – Write through, no-write-allocate – Write back, write-allocate
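The two combinations named above can be compared by counting how many writes actually reach memory. This is a rough sketch, not a model of any real processor: it assumes a hypothetical direct-mapped cache with four one-block sets, tracks whole blocks only, and the names `wp_cache` and `wp_write` are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WP_SETS 4  /* assumed: direct-mapped, four one-block sets */

typedef struct {
    bool valid;
    bool dirty;    /* set when the cached copy is newer than memory */
    unsigned tag;
} wp_line;

typedef struct {
    wp_line lines[WP_SETS];
    bool write_back;      /* true: write back,     false: write through     */
    bool write_allocate;  /* true: write-allocate, false: no-write-allocate */
    unsigned mem_reads;   /* blocks fetched from memory */
    unsigned mem_writes;  /* writes that reached memory */
} wp_cache;

/* Perform one write to the given memory block number. */
void wp_write(wp_cache *c, unsigned block)
{
    assert(c != NULL);
    wp_line *line = &c->lines[block % WP_SETS];
    unsigned tag = block / WP_SETS;
    bool hit = line->valid && line->tag == tag;

    if (!hit && !c->write_allocate) {
        c->mem_writes++;          /* write miss: bypass the cache */
        return;
    }
    if (!hit) {
        if (line->valid && line->dirty)
            c->mem_writes++;      /* evicted dirty block is written back */
        c->mem_reads++;           /* load the block being written */
        line->valid = true;
        line->dirty = false;
        line->tag = tag;
    }
    if (c->write_back)
        line->dirty = true;       /* defer the memory update */
    else
        c->mem_writes++;          /* update memory immediately */
}
```

In this model, ten writes to block 0 followed by one write to block 4 (which maps to the same set) cost one memory write with write back plus write-allocate, because only the eviction of the dirty block touches memory. With write through plus no-write-allocate, the same sequence costs eleven memory writes.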
  • 29. 29 6.4.6 Multi-level caches (Figure 6.38, p. 504) • Processor: regs, TLB, L1 I-cache, L1 D-cache • L1 caches: size 8-64 KB, speed 3 ns, line size 32 B • L2 cache: size 1-4 MB SRAM, speed 6 ns, $100/MB, line size 32 B • Memory: size 128 MB DRAM, speed 60 ns, $1.50/MB, line size 8 KB • Disk: size 30 GB, speed 8 ms, $0.05/MB • Farther from the processor: larger, slower, cheaper; larger line size, higher associativity, more likely to write back • Options: separate data and instruction caches, or a unified cache
  • 30. 30 6.4.7 Cache performance metrics P505 • Miss Rate – fraction of memory references not found in cache (misses/references) – Typical numbers: 3-10% for L1 • Hit Rate – fraction of memory references found in cache (1 - miss rate)
  • 31. 31 Cache performance metrics • Hit Time – time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache) – Typical numbers: 1-2 clock cycles for L1, 5-10 clock cycles for L2 • Miss Penalty – additional time required because of a miss – Typically 25-100 cycles for main memory
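The three metrics are commonly combined into an average access time of hit time + miss rate × miss penalty. The slides do not state this formula, so the sketch below is an added illustration with a made-up function name, and it models a single cache level only.

```c
#include <assert.h>

/* Average cycles per memory reference for a single cache level:
 * every reference pays the hit time, and the fraction that misses
 * additionally pays the miss penalty. */
double avg_access_cycles(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);  /* a rate is a fraction */
    return hit_time + miss_rate * miss_penalty;
}
```

Using numbers from the typical ranges on the slides, a 1-cycle hit time, a 3% miss rate, and a 100-cycle miss penalty give about 4 cycles per reference, while a 10% miss rate gives about 11. Even a small change in miss rate dominates the average.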
  • 32. 32 Cache performance metrics (p. 505): impact of cache parameters • 1> Cache size – Hit rate vs. hit time • 2> Block size – Spatial locality vs. temporal locality • 3> Associativity – Thrashing – Cost – Speed – Miss penalty • 4> Write strategy – Write through: simpler, cheaper read misses – Write back: fewer transfers
  • 34. 34 Writing Cache-Friendly Code • Principles – Programs with better locality will tend to have lower miss rates – Programs with lower miss rates will tend to run faster than programs with higher miss rates
  • 35. 35 Writing Cache-Friendly Code • Basic approach – Make the common case go fast • Programs often spend most of their time in a few core functions. • These functions often spend most of their time in a few loops – Minimize the number of cache misses in each inner loop • All things being equal
  • 36. 36 Writing Cache-Friendly Code (p. 507) • int sumvec(int v[N]) { int i, sum = 0; for (i = 0; i < N; i++) sum += v[i]; return sum; } • Temporal locality: the local variables i and sum are usually put in registers • Access order for v[i], i = 0 ... 7, with [h]it or [m]iss: 1[m] 2[h] 3[h] 4[h] 5[m] 6[h] 7[h] 8[h]
  • 37. 37 Writing cache-friendly code • Temporal locality – Repeated references to local variables are good because the compiler can cache them in the register file
  • 38. 38 Writing cache-friendly code • Spatial locality – Stride-1 reference patterns are good because caches at all levels of the memory hierarchy store data as contiguous blocks • Spatial locality is especially important in programs that operate on multidimensional arrays
  • 39. 39 Writing cache-friendly code P508 • Example (M=4, N=8, 10cycles/iter) int sumvec(int a[M][N]) { int i, j, sum = 0 ; for (i = 0 ; i < M ; i++) for ( j = 0 ; j < N ; j++ ) sum += a[i][j] ; return sum ; }
  • 40. 40 Writing cache-friendly code • Access order for a[i][j], columns j = 0 ... 7, with [h]it or [m]iss • i=0: 1[m] 2[h] 3[h] 4[h] 5[m] 6[h] 7[h] 8[h] • i=1: 9[m] 10[h] 11[h] 12[h] 13[m] 14[h] 15[h] 16[h] • i=2: 17[m] 18[h] 19[h] 20[h] 21[m] 22[h] 23[h] 24[h] • i=3: 25[m] 26[h] 27[h] 28[h] 29[m] 30[h] 31[h] 32[h]
  • 41. 41 Writing cache-friendly code P508 • Example (M=4, N=8, 20cycles/iter) int sumvec(int v[M][N]) { int i, j, sum = 0 ; for ( j = 0 ; j < N ; j++ ) for ( i = 0 ; i < M ; i++ ) sum += v[i][j] ; return sum ; }
  • 42. 42 Writing cache-friendly code • Access order for a[i][j], columns j = 0 ... 7, with [h]it or [m]iss • i=0: 1[m] 5[m] 9[m] 13[m] 17[m] 21[m] 25[m] 29[m] • i=1: 2[m] 6[m] 10[m] 14[m] 18[m] 22[m] 26[m] 30[m] • i=2: 3[m] 7[m] 11[m] 15[m] 19[m] 23[m] 27[m] 31[m] • i=3: 4[m] 8[m] 12[m] 16[m] 20[m] 24[m] 28[m] 32[m]
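The hit and miss patterns for the two loop orders can be reproduced with a toy cache model. The sketch below assumes a cold, direct-mapped cache with two 16-byte lines (the two 4-word blocks of the earlier L1 example), 4-byte ints, and an array that starts on a block boundary; the slides do not spell these assumptions out for this example, and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define LO_M 4             /* rows, as on the slides       */
#define LO_N 8             /* columns, as on the slides    */
#define LO_WORD_BYTES 4    /* assumed sizeof(int)          */
#define LO_BLOCK_BYTES 16  /* one block = 4 words          */
#define LO_LINES 2         /* assumed: 2 direct-mapped lines */

typedef struct {
    bool valid;
    unsigned tag;
} lo_line;

/* Count the misses when summing a[LO_M][LO_N] from a cold cache.
 * row_major = true  : for i { for j { a[i][j] } }  (stride-1)
 * row_major = false : for j { for i { a[i][j] } }  (stride-N) */
int lo_count_misses(bool row_major)
{
    lo_line cache[LO_LINES] = {{false, 0}, {false, 0}};
    int outer_n = row_major ? LO_M : LO_N;
    int inner_n = row_major ? LO_N : LO_M;
    int misses = 0;

    for (int outer = 0; outer < outer_n; outer++) {
        for (int inner = 0; inner < inner_n; inner++) {
            int i = row_major ? outer : inner;
            int j = row_major ? inner : outer;
            /* Byte offset of a[i][j] from the start of the array. */
            unsigned addr = (unsigned)(i * LO_N + j) * LO_WORD_BYTES;
            unsigned block = addr / LO_BLOCK_BYTES;
            lo_line *line = &cache[block % LO_LINES];
            unsigned tag = block / LO_LINES;

            if (!(line->valid && line->tag == tag)) {
                misses++;            /* fetch the block, evicting the old one */
                line->valid = true;
                line->tag = tag;
            }
        }
    }
    assert(misses <= LO_M * LO_N);   /* cannot miss more often than we access */
    return misses;
}
```

Under these assumptions the row-wise order misses 8 times out of 32 (one miss per 4-word block) and the column-wise order misses on all 32 accesses, because the four rows of a column map to the same cache line and keep evicting one another.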
  • 43. 43 6.6 Putting it Together: The Impact of Caches on Program Performance 6.6.1 The Memory Mountain
  • 44. 44 The Memory Mountain (p. 512) • Read throughput (read bandwidth) – The rate at which a program reads data from the memory system • Memory mountain – A two-dimensional function of read bandwidth versus temporal and spatial locality – Characterizes the capabilities of the memory system for each computer
  • 45. 45 Memory mountain main routine Figure 6.41 P513 /* mountain.c - Generate the memory mountain. */ #define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */ #define MAXBYTES (1 << 23) /* ... up to 8 MB */ #define MAXSTRIDE 16 /* Strides range from 1 to 16 */ #define MAXELEMS MAXBYTES/sizeof(int) int data[MAXELEMS]; /* The array we'll be traversing */
  • 46. 46 Memory mountain main routine int main() { int size; /* Working set size (in bytes) */ int stride; /* Stride (in array elements) */ double Mhz; /* Clock frequency */ init_data(data, MAXELEMS); /* Initialize each element in data to 1 */ Mhz = mhz(0); /* Estimate the clock frequency */
  • 47. 47 Memory mountain main routine for (size = MAXBYTES; size >= MINBYTES; size >>= 1) { for (stride = 1; stride <= MAXSTRIDE; stride++) printf("%.1f\t", run(size, stride, Mhz)); printf("\n"); } exit(0); }
  • 48. 48 Memory mountain test function Figure 6.40 P512 /* The test function */ void test (int elems, int stride) { int i, result = 0; volatile int sink; for (i = 0; i < elems; i += stride) result += data[i]; sink = result; /* So compiler doesn't optimize away the loop */ }
  • 49. 49 Memory mountain test function /* Run test (elems, stride) and return read throughput (MB/s) */ double run (int size, int stride, double Mhz) { double cycles; int elems = size / sizeof(int); test (elems, stride); /* warm up the cache */ cycles = fcyc2(test, elems, stride, 0); /* call test (elems,stride) */ return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */ }
  • 50. 50 The Memory Mountain • Data – Size • MAXBYTES(8M) bytes or MAXELEMS(2M) words – Partially accessed • Working set: from 8MB to 1KB • Stride: from 1 to 16
  • 51. 51 The Memory Mountain (Figure 6.42, p. 514) • Pentium III Xeon, 550 MHz • 16 KB on-chip L1 d-cache • 16 KB on-chip L1 i-cache • 512 KB off-chip unified L2 cache [Figure: read throughput (MB/s, 0 to 1200) as a function of stride (words, s1 to s15) and working set size (bytes, 2k to 8m), showing ridges of temporal locality for L1, L2, and main memory, and slopes of spatial locality.]
  • 52. 52 Ridges of temporal locality • Slice through the memory mountain with stride=1 – illuminates read throughputs of different caches and memory
  • 53. 53 Ridges of temporal locality (Figure 6.43, p. 515) [Figure: read throughput (MB/s, 0 to 1200) versus working set size (bytes, 8m down to 1k), with the main memory region, L2 cache region, and L1 cache region marked.]
  • 54. 54 A slope of spatial locality • Slice through memory mountain with size=256KB – shows cache block size.
  • 55. 55 A slope of spatial locality (Figure 6.44, p. 516) [Figure: read throughput (MB/s, 0 to 800) versus stride (words, s1 to s16), annotated "one access per cache line".]