A CPU cache reduces memory access latency. To benefit from it, the CPU has to predict which data to fetch ahead of time so that the instruction and data caches stay filled with relevant information. This happens behind the scenes, and we all benefit from it even without knowing. To profit from the cache deliberately, we will cover the basics of CPU caching and see how a programmer can measure and optimize cache usage.
See also YouTube video: youtube.com/watch?v=gl2WjsbeGsI
6. What is average DRAM latency?
[Chart: DRAM latency (ns, 0-25) vs transfer speed (MT/s, 0-2500)]
7. And end to end?
[Diagram, built up over slides 7-15: a DRAM array of rows (row 0 ... row N) with data, address and control lines. Reading switches the target row from inactive to active (latency) and the previously active row back to inactive (more latency). A DRAM controller sits between the DRAM and the CPU core.]
end to end latency: 50-100ns
17. Is 50ns a lot?
What                 Duration
Reference            50ns
1 clock cycle@2MHz   50ns
1 clock cycle@1GHz   1ns
1 clock cycle@5GHz   0.2ns
Fun fact: 2MHz was the clock rate of the 8080 CPU
Fun fact: the 50ns-vs-0.2ns ratio is like 1 heartbeat vs boiling 1 liter of water
22. And how to counter that?
[Diagram, built up over slides 22-25: the same DRAM + controller + CPU core picture, with a cache inserted between the DRAM controller and the CPU core.]
end to end latency: 50-100ns
DRAM <-> Cache latency: 50-100ns
Cache latency: 1 clock cycle
28. Associativity or how to map memory
Must be fast, die-size and power efficient
[Diagram, built up over slides 28-40: level Ln+1 with lines 0-7 mapping into level Ln with lines 0-3.]
Example: Direct-mapping
Selection: address mod 4
Consequences:
- simple: fast, die-size and power efficient
- good best case: optimal sequential traversal
- worst case: jumping every nth line evicts on every access
Example: 2-way set associative
Selection: address mod 2 picks the set (0 or 1)
Then: Least Recently Used picks the way within the set
Consequences:
- complexity grows with the number of ways
- roughly 15% fewer cache misses
- avoids N-parallel stalls
56. Keeping caches hot
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
// task: find by type
57. Naive C-style!
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
struct data {
    int32_t type;
    string name;
    observer_ptr<data> parent;
    map<string, string> params;
};
// task: find by type (vector is sorted by type)
// (assumes data is comparable with int32_t for equal_range)
pair<size_t, size_t>
find_by_type(vector<data> const& x, int32_t type) {
    auto r = equal_range(begin(x), end(x), type);
    return {r.first - begin(x), r.second - begin(x)};
}
58. Layout
sizeof(data);     // : 64B
offsetof(type);   // : 0B
offsetof(name);   // : 8B
offsetof(parent); // : 40B
offsetof(params); // : 48B
alignof(data);    // : 64B
// Cache line: 64B
[Cache line diagram: one object per 64B line]
Type #1 | (pad) | Name #1 | Parent #1 | Params #1
Type #2 | (pad) | Name #2 | Parent #2 | Params #2
59. C++ style!
// data: vector of objects composed of:
// int32_t type, string name, data* parent,
// map<string, string> params
struct data {
    string name;
    observer_ptr<data> parent;
    map<string, string> params;
};
boost::flat_map<int32_t, data>::equal_range;
std::map<int32_t, data>::equal_range;
std::unordered_map<int32_t, data>::equal_range;
60. AMD Athlon(TM) II X3 440, 12GiB RAM DDR3-1333, clang 8.0.1-3
[Chart: time (ns, 0-10000) vs number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, unordered-map-clang]
62. Will a separate array do better?
struct arr {
    vector<int32_t> types;
    vector<entry> entries;
};
pair<size_t, size_t>
find_by_type(arr const& d, int32_t type) {
    auto b = begin(d.types);
    auto r = equal_range(b, end(d.types), type);
    return {r.first - b, r.second - b};
}
63. Layout
sizeof(type);  // : 4B (int32_t)
alignof(data); // : 4B
// Cache line: 64B
[Cache line diagram: 16 consecutive types per 64B line]
Type #1 | Type #2 | ... | Type #16
64. AMD Athlon(TM) II X3 440, 12GiB RAM DDR3-1333, clang 8.0.1-3
[Chart: time (ns, 0-10000) vs number of objects (1e+04 to 1e+08) for flat-map-clang, map-clang, naive-clang, optimized-clang, unordered-map-clang]
71. Bibliography I
Micron Technology, Inc. Speed vs. Latency. White paper, Micron Technology, Inc.
David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2013.