A CPU cache reduces memory access latency. To benefit from it, the CPU has to predict which data to fetch ahead of time so that the instruction and data caches stay filled with relevant information. This happens behind the scenes, and we all benefit from it even without knowing. To profit from the cache deliberately, we will cover the basics of CPU caching and see how a programmer can measure and optimize cache usage.
See also YouTube video: youtube.com/watch?v=gl2WjsbeGsI
6. What is average DRAM latency?
[Chart: DRAM latency (ns, 0-25) vs transfer speed (MT/s, 0-2500)]
7. And end to end?
[Diagram, built up over slides 7-15: a DRAM array of rows (row 0 ... row N) with data, address and control lines. Reading switches the target row from inactive to active (latency) and the previously active row back to inactive (more latency). A DRAM controller sits between the DRAM and the CPU core.]
end to end latency: 50-100ns
17. Is 50ns a lot?
What                 Duration
Reference            50ns
1 clock cycle@2MHz   50ns
1 clock cycle@1GHz   1ns
1 clock cycle@5GHz   0.2ns
Fun fact: 2MHz was the clock rate of the 8080 CPU
Fun fact: the 50ns-vs-0.2ns ratio is like 1 heartbeat vs boiling 1 liter of water
22. And how to counter that?
[Diagram, built up over slides 22-25: the same DRAM + controller + CPU core picture, with a cache inserted between the DRAM controller and the CPU core.]
end to end latency: 50-100ns
DRAM <-> Cache latency: 50-100ns
Cache latency: 1 clock cycle
28. Associativity or how to map memory
Must be fast, die-size and power efficient
[Diagram, built up over slides 28-40: level Ln+1 with lines 0-7 mapping into level Ln with lines 0-3.]
Example: Direct-mapping
Selection: address mod 4
Consequences:
- simple: fast, die-size and power efficient
- good best case: optimal sequential traversal
- worst case: jumping every nth line evicts on every access
Example: 2-way set associative
Selection: address mod 2 picks the set (0 or 1)
Then: Least Recently Used picks the way within the set
Consequences:
- complexity grows with the number of ways
- roughly 15% fewer cache misses
- avoids N-parallel stalls
56. Keeping caches hot
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
// task: find by type
57. Naive C-style!
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
struct data {
    int32_t type;
    string name;
    observer_ptr<data> parent;
    map<string, string> params;
};
// task: find by type (vector is sorted by type)
// (assumes data is comparable with int32_t for equal_range)
pair<size_t, size_t>
find_by_type(vector<data> const& x, int32_t type) {
    auto r = equal_range(begin(x), end(x), type);
    return {r.first - begin(x), r.second - begin(x)};
}
58. Layout
sizeof(data);     // : 64B
offsetof(type);   // : 0B
offsetof(name);   // : 8B
offsetof(parent); // : 40B
offsetof(params); // : 48B
alignof(data);    // : 64B
// Cache line: 64B
[Cache line diagram: one object per 64B line]
Type #1 | (pad) | Name #1 | Parent #1 | Params #1
Type #2 | (pad) | Name #2 | Parent #2 | Params #2
59. C++ style!
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
struct data {
    string name;
    observer_ptr<data> parent;
    map<string, string> params;
};
boost::flat_map<int32_t, data>::equal_range;
std::map<int32_t, data>::equal_range;
std::unordered_map<int32_t, data>::equal_range;
60. AMD Athlon(TM) II X3 440, 12GiB RAM DDR3-1333, clang 8.0.1-3
[Chart: time (ns, 0-10000) vs number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, unordered-map-clang]
62. Will a separate array do better?
struct arr {
    vector<int32_t> types;
    vector<entry> entries;
};
pair<size_t, size_t>
find_by_type(arr const& d, int32_t type) {
    auto b = begin(d.types);
    auto r = equal_range(b, end(d.types), type);
    return {r.first - b, r.second - b};
}
63. Layout
sizeof(type);  // : 4B (int32_t)
alignof(data); // : 4B
// Cache line: 64B
[Cache line diagram: 16 consecutive types per 64B line]
Type #1 | Type #2 | ... | Type #16
64. AMD Athlon(TM) II X3 440, 12GiB RAM DDR3-1333, clang 8.0.1-3
[Chart: time (ns, 0-10000) vs number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, optimized-clang, unordered-map-clang]
71. Bibliography I
Micron Technology, Inc. Speed vs. Latency. White paper, Micron Technology, Inc.
David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2013.