Modern CPUs and Caches - A Starting Point for Programmers
Yaser Zhian
Fanafzar Game Studio
IGDI, Workshop 07, January 2nd, 2013
   Some notes about the subject
   CPUs and their gimmicks
   Caches and their importance
   How CPU and OS handle memory logically




   These are very complex subjects
     Expect very few details and much simplification
   These are very complicated subjects
     Expect much generalization and omission
   No time
     Even a full course would be hilariously insufficient
   Not an expert
     Sorry! Can’t help much.
   Just a pile of loosely related stuff
   Pressure for performance
   Backwards compatibility
   Cost/power/etc.
   The ridiculous “numbers game”
   Law of diminishing returns
   Latency vs. Throughput



   You can always solve your bandwidth
    (throughput) problems with money, but it is
    rarely so for lag (latency.)
   Relative rates of improvement, latency vs.
    bandwidth (from David Patterson’s keynote, HPEC 2004):
     CPU, 80286 till Pentium 4: 21x vs. 2250x
     Ethernet, 10Mb till 10Gb: 16x vs. 1000x
     Disk, 3600 till 15000rpm: 8x vs. 143x
     DRAM, plain till DDR: 4x vs. 120x
   At the simplest level, the von Neumann
    model stipulates:
     Program is data and is stored in memory along
      with data (departing from Turing’s model)
     Program is executed sequentially
   Not the way computers function anymore…
     Abstraction still used for thinking about programs
     But it’s leaky as heck!
   “Not Your Father’s von Neumann Machine!”
   Speed of Light: can’t send and receive signals to
    and from all parts of the die in a cycle anymore
   Power: more transistors lead to more power,
    which leads to much more heat
   Memory: the CPU isn’t even close to the
    bottleneck anymore. “All your base are belong
    to” memory
   Complexity: adding more transistors for more
    sophisticated operation won’t give much of a
    speedup (e.g. doubling transistors might give
    2%.)

   Family introduced with 8086 in 1978
   Today, new members are still fully binary
    backward-compatible with that puny machine
    (5MHz clock, 20-bit addressing, 16-bit regs.)
   It had very few registers
   It had segmented memory addressing (joy!)
   It had many complex instructions and several
    addressing modes

   1982 (80286): Protected mode, MMU
   1985 (80386): 32-bit ISA, Paging
   1989 (80486): Pipelining, Cache, Integrated FPU
   1993 (Pentium): Superscalar, 64-bit bus, MMX
   1995 (P-Pro): μ-ops, OoO Exec., Register
    Renaming, Speculative Exec.
   1998/99 (K6-2, PIII): 3DNow!/SSE
   2003 (Opteron): 64-bit ISA
   2006 (Core 2): Multi-core
   Registers got expanded from (all 16-bit, not really
    general purpose)
     AX, BX, CX, DX
     SI, DI, BP, SP
     CS, DS, ES, SS, Flags, IP
   To
     16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI,
      R8-R15) plus RIP and Flags and others
     16 x 128-bit XMM regs. (XMM0-...)
         ▪ Or 16 x 256-bit YMM regs. (YMM0-...)
     More than a thousand logically different instructions (the
      usual, plus string processing, cryptography, CRC, complex
      numbers, etc.)
   The Fetch-Decode-Execute-Retire Cycle
   Strategies for more performance:
     More complex instructions, doing more in
      hardware (CISCing things up)
     Faster CPU clock rates (the free lunch)
     Instruction-Level Parallelism (SIMD + gimmicks)
     Adding cores (free lunch is over!)
   And then, there are gimmicks…

   Pipelining
   µ-ops
   Superscalar Pipelines
   Out-of-order Execution
   Speculative Execution
   Register Renaming
   Branch Prediction
   Prefetching
   Store Buffer
   Trace Cache
   …

Classic sequential execution:
   Lengths of instruction executions vary a lot (5-10x
     is usual; several orders of magnitude also happen.)

 [Diagram: instructions 1 through 4 executing strictly one
  after another, each finishing before the next one starts.]

It’s really more like this for the CPU:
   Instructions may have many sub-parts, and they
     engage different parts of the CPU

 [Diagram: each instruction split into Fetch (F), Decode (D),
  Execute (E) and Retire (R) stages, still running strictly one
  instruction after another.]

So why not do this:
   This is called “pipelining”
   It increases throughput (significantly)
   Doesn’t decrease latency for single instructions

 [Diagram: the F/D/E/R stages of instructions 1 through 4
  overlapped, each instruction starting one stage behind the
  previous one.]

But it has its own share of problems
   Hazards, stalls, flushing, etc.
   Execution of i2 depends on the result of i1
   After i2, we jump and then i3, i4, … are flushed out

     add EAX,120
     jmp [EAX]
     mov [4*EBX+42],EDX
     add ECX,[EAX]

 [Diagram: i2’s Execute stage stalls until i1 produces EAX;
  once the jump resolves, the partially executed i3 and i4 are
  flushed out of the pipeline.]

   Instructions are broken up into simple,
    orthogonal µ-ops
     mov EAX,EDX might generate only one µ-op
     mov EAX,[EDX] might generate two:
      1. µld tmp0,[EDX]
      2. µmov EAX,tmp0
     add [EAX],EDX probably generates three:
      1. µld tmp0,[EAX]
      2. µadd tmp0,EDX
      3. µst [EAX],tmp0

   The CPU, then, gets two layers:
     The one that breaks up operations into µ-ops
     The one that executes µ-ops
   The part that executes µ-ops can be simpler
    (more RISCy) and therefore faster.
   More complex instructions can be supported
    without (much) complicating the CPU
   The pipelining (and other gimmicks) can
    happen at the µ-op level
   CPUs that issue (or retire) more than one
    instruction per cycle are called Superscalar
   Can be thought of as a pipeline with more
    than one line
   Simplest form: integer pipe plus floating-point
    pipe
   These days, CPUs do 4 or more
   Obviously requires more of each type of
    operational unit in the CPU
   To keep your pipeline from stalling as much as
    possible, issue the next instructions even if you
    can’t start the current one.
   But of course, only if there are no hazards
    (dependencies) and there are operational
    units available.
    add RAX,RAX
    add RAX,RBX
    add RCX,RDX     (this one can be, and is, started
                     before the previous instruction)

   This obviously also applies at the µ-op level:

    mov RAX,[mem0]
    mul RAX,42
    add RAX,[mem1]   (fetching [mem1] starts long before the
                      result of the multiply becomes available)

    push RAX
    call Func        (pushing RAX is sub RSP,8 and then
                      mov [RSP],RAX; since the call instruction
                      needs RSP too, it only waits for the
                      subtraction, not the store, before starting)

   Consider this:
    mov RAX,[mem0]
    mul RAX,42
    mov [mem1],RAX
    mov RAX,[mem2]
    add RAX,7
    mov [mem3],RAX
   Logically, the two parts are totally separate.
   However, the use of RAX will stall the pipeline.
   Modern CPUs have a lot of temporary,
    unnamed registers at their disposal.
   They will detect the logical independence,
    and will use one of those in the second block
    instead of RAX.
   And they will track which reg. is which, where.
   In effect, they are renaming another register
    to RAX.
   There might not even be a real RAX!
   This is, for once, simpler than it might seem!
   Every time a register is assigned to, a new
    temporary register is used in its stead.
   Consider this:

    mov RAX,[cached]
    mov RBX,[uncached]
    add RBX,RAX
    mul RAX,42        (rename happens here)
    mov [mem0],RAX
    mov [mem1],RBX

   Renaming on the mul means that it won’t clobber RAX
    (which we need for the add, which is itself waiting on the
    load of [uncached]), so we can do the multiply and reach
    the first store much sooner.

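A minimal C++ sketch (mine, not from the deck) of what this renaming machinery buys you: both loops below do the same number of multiply-adds, but the first is one long dependency chain, while the second is four independent chains that the CPU can keep in flight at once. On a typical out-of-order x86 the second version runs several times faster.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    using std::uint64_t;

    // One long dependency chain: every step needs the previous result,
    // so the multiplies cannot overlap no matter how many units exist.
    static uint64_t serial_chain(uint64_t n) {
        uint64_t acc = 1;
        for (uint64_t i = 0; i < n; ++i)
            acc = acc * 3 + 1;
        return acc;
    }

    // Four independent chains: renaming gives each its own physical
    // register, so several multiplies can be in flight per cycle.
    static uint64_t independent_chains(uint64_t n) {
        uint64_t a = 1, b = 1, c = 1, d = 1;
        for (uint64_t i = 0; i < n; i += 4) {
            a = a * 3 + 1;
            b = b * 3 + 1;
            c = c * 3 + 1;
            d = d * 3 + 1;
        }
        return a ^ b ^ c ^ d;
    }

    int main() {
        const uint64_t n = 400000000;  // same op count for both versions
        auto t0 = std::chrono::steady_clock::now();
        uint64_t r1 = serial_chain(n);
        auto t1 = std::chrono::steady_clock::now();
        uint64_t r2 = independent_chains(n);
        auto t2 = std::chrono::steady_clock::now();
        using ms = std::chrono::milliseconds;
        std::printf("serial: %lld ms  independent: %lld ms  (%llu %llu)\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
            (unsigned long long)r1, (unsigned long long)r2);
    }
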
   The CPU always depends on knowing where
    the next instruction is, so it can go ahead and
    work on it.
   That’s why branches in code are anathema to
    modern, deep pipelines and all the gimmicks
    they pull.
   If only the CPU could somehow guess where
    the target of each branch is going to be…
   That’s where branch prediction comes in.
   So the CPU guesses the target of a jump (if it
    doesn’t know for sure,) and continues to
    speculatively execute instructions from there.
   For a conditional jump, the CPU must also
    predict whether the branch is taken or not.
   If the CPU is right, the pipeline flows
    smoothly. If not, the pipeline must be flushed,
    and much time and many resources are wasted on a
    misprediction.
   In this code:
    cmp RAX,0
    jne [RBX]
    both the target and whether the jump happens
    or not must be predicted.
   The above can effectively jump anywhere!
   But usually branches are closer to this:
    cmp RAX,0
    jne somewhere_specific
   Which can only have two possible targets.

   In a simple form, when a branch is executed,
    its target is stored in a table called the BTB (or
    Branch Target Buffer.) When that branch is
    encountered again, the target address is
    predicted to be the value read from the BTB.
   As you might guess, this doesn’t work for
    many situations (e.g. alternating branch.)
   Also, the size of the BTB is limited, so the CPU
    will forget about the last target of some jumps.
 A simple expansion on the previous idea is to use a
  saturating counter along with each entry of the BTB.
 For example, with a 2-bit counter,
     Branch is predicted not to be taken if the counter is 0 or 1.
     The branch is predicted to be taken if the counter is 2 or 3.
     Each time it is taken, counter is incremented, and vice versa.
 [Diagram: the four counter states – Strongly Not Taken,
  Weakly Not Taken, Weakly Taken, Strongly Taken – with a
  taken branch (T) moving one state toward Strongly Taken
  and a not-taken branch (NT) moving one state the other
  way, saturating at both ends.]

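A tiny C++ model of such a counter (my sketch, not from the deck); feeding it the alternating taken/not-taken pattern discussed below reproduces the worst case:

    #include <cstdio>

    // 2-bit saturating counter: states 0..3.
    // 0 or 1 => predict not taken; 2 or 3 => predict taken.
    struct TwoBitCounter {
        int state = 1;  // start at "weakly not taken" (binary 01)
        bool predict() const { return state >= 2; }
        void update(bool taken) {
            if (taken) { if (state < 3) ++state; }
            else       { if (state > 0) --state; }
        }
    };

    int main() {
        TwoBitCounter c;
        int miss = 0;
        const int runs = 1000;
        for (int i = 0; i < runs; ++i) {
            bool taken = (i % 2 == 0);  // alternating: T, NT, T, NT, ...
            if (c.predict() != taken) ++miss;
            c.update(taken);
        }
        // Starting at 01 with a taken branch first: 100% mispredicted.
        std::printf("mispredicted %d of %d\n", miss, runs);
    }
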
   But this behaves very badly in common situations.
   For an alternating branch,
     If the counter starts in 00 or 11, it will mispredict 50%.
     If the counter starts in 01, and the first time the branch
      is taken, it will mispredict 100%!
   As an improvement, we can store the history of
    the last N occurrences of the branch in the BTB,
    and use 2^N counters, one for each of the
    possible history patterns.

   For N=4 and 2-bit counters, we’ll have:
     This is an extremely cool method of doing branch
      prediction!

 [Diagram: the 4-bit branch history (e.g. 0010) indexes a
  table of sixteen 2-bit counters; the selected counter
  supplies the prediction (0 or 1).]

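Here is a hedged C++ sketch (mine) of that scheme: the last N outcomes select one of 2^N 2-bit counters. On the alternating branch that defeated the single counter, the two recurring history patterns (…0101 and …1010) each train their own counter, so after a short warm-up the mispredictions stop:

    #include <cstdio>

    struct HistoryPredictor {
        static const unsigned N = 4;        // bits of branch history
        unsigned history = 0;               // last N outcomes, newest in bit 0
        int counters[1u << N] = {};         // one 2-bit counter per pattern
        bool predict() const { return counters[history] >= 2; }
        void update(bool taken) {
            int& c = counters[history];
            if (taken) { if (c < 3) ++c; }
            else       { if (c > 0) --c; }
            history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << N) - 1);
        }
    };

    int main() {
        HistoryPredictor p;
        int miss = 0;
        const int runs = 1000;
        for (int i = 0; i < runs; ++i) {
            bool taken = (i % 2 == 0);      // the same alternating branch
            if (p.predict() != taken) ++miss;
            p.update(taken);
        }
        std::printf("mispredicted %d of %d\n", miss, runs);  // only warm-up misses
    }
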
   Some predictions are simpler:
     For each ret instruction, the target is somewhere
      on the stack (pushed before.) Modern CPUs keep
      track of return addresses in an internal return
       stack buffer. Each time a call is executed, an
       entry is pushed there, and the matching ret uses
       it to predict the return address.
     On a cold encounter (a.k.a. static prediction) a
      branch is sometimes predicted to
      ▪ fall through if it goes forward.
      ▪ be taken if it goes backward.

   Best general advice is to arrange your code so
    that the most common path for branches is
    “not taken” (see the sketch below.) This improves
    the effectiveness of code prefetching and the trace cache.
   Branch prediction, register renaming and
    speculative execution work extremely well
    together.


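One concrete way to follow this advice in GCC or Clang is __builtin_expect, which tells the compiler which way a branch usually goes so it can lay the common path out as straight-line, fall-through code. A sketch under the assumption that negative inputs are rare; the function and names are made up for illustration (C++20 offers [[likely]]/[[unlikely]] attributes for the same purpose):

    #include <cstddef>

    // Evaluates to x, but tells GCC/Clang that x is expected to be false.
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    // Hypothetical hot loop: bad (negative) inputs are rare.
    long sum_valid(const int* v, std::size_t n) {
        long total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (UNLIKELY(v[i] < 0))
                continue;        // rare path: the taken branch, off the hot path
            total += v[i];       // common path: falls straight through
        }
        return total;
    }
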
mov   RAX,[RBX+16]
add   RBX,16
cmp   RAX,0
je    IsNull
mov   [RBX-16],RCX
mov   RCX,[RDX+0]
mov   RAX,[RAX+8]

Clock 0 – Instruction 0: mov RAX,[RBX+16]
   Load RAX from memory. Assume a cache miss – 300
   cycles to load.
   The instruction starts, and dispatch continues...

Clock 0 – Instruction 1: add RBX,16
   This instruction writes RBX, which conflicts with the
   read in instruction 0.
   Rename this instance of RBX and continue…

Clock 0 – Instruction 2: cmp RAX,0
   The value of RAX is not available yet; cannot calculate
   the value of the Flags register.
   Queue up behind instruction 0…

Clock 0 – Instruction 3: je IsNull
   The Flags register is still not available. Predict that
   this branch is not taken.
   Assuming 4-wide dispatch, the instruction issue limit
   is reached.

Clock 1 – Instruction 4: mov [RBX-16],RCX
   The store is speculative, so its result is kept in the
   Store Buffer. Also, RBX might not be available yet
   (from instruction 1.)
   The Load/Store Unit is tied up from now on; no more
   memory ops can issue in this cycle.

Clock 2 – Instruction 5: mov RCX,[RDX+0]
   Had to wait for the L/S Unit. Assume this is another
   (and unrelated) cache miss. We now have 2 overlapping
   cache misses. The L/S Unit is busy again.

Clock 3 – Instruction 6: mov RAX,[RAX+8]
   RAX is not ready yet (300-cycle latency, remember?!)
   This load cannot even start until instruction 0 is done.

Clock 301 – Instruction 2: cmp RAX,0
   At clock 300 (or 301,) RAX is finally ready.
   Do the comparison and update the Flags register.

Clock 301 – Instruction 6: mov RAX,[RAX+8]
   Issue this load too. Assume a cache hit (finally!) The
   result will be available at clock 304.

Clock 302 – Instruction 3: je IsNull
   Now the Flags register is ready. Check the prediction.
   Assume the prediction was correct.

Clock 302 – Instruction 4: mov [RBX-16],RCX
   This speculative store can now be committed to
   memory (or cache, actually.)

Clock 302 – Instruction 5: mov RCX,[RDX+0]
   At clock 302, the result of this load arrives.

Clock 305 – Instruction 6: mov RAX,[RAX+8]
   The result arrived at clock 304; the instruction is
   retired at 305.

To summarize,
   • In 4 clocks, we started 7 ops and 2 cache misses.
   • Retired 7 ops in 306 cycles.
   • Cache misses totally dominate performance.
   • The only real benefit came from being able to have
     2 overlapping cache misses!

So what was the point of all these gimmicks? To get to
the next cache miss as early as possible.

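The difference between serialized and overlapping misses is easy to feel. In this C++ sketch of mine, the first loop chases a random permutation, so each load’s address depends on the previous load and the misses happen one at a time; the second loop reads the same data with addresses known up front, so the CPU can keep many misses in flight:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;  // 128 MB of indices, far beyond cache
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});
        std::shuffle(next.begin(), next.end(), std::mt19937_64{42});

        // Dependent loads: each address comes from the previous load,
        // so cache misses are serialized.
        auto t0 = std::chrono::steady_clock::now();
        std::size_t pos = 0;
        for (std::size_t i = 0; i < n; ++i) pos = next[pos];
        auto t1 = std::chrono::steady_clock::now();

        // Independent loads: misses can overlap (and prefetching helps too).
        std::size_t sum = 0;
        for (std::size_t i = 0; i < n; ++i) sum += next[i];
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("chase: %lld ms   sum: %lld ms   (%zu %zu)\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
            pos, sum);
    }
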
   Main memory is slow; S.L.O.W.
   Very slow
   Painfully slow
   And it especially has very bad (high) latency
   But all is not lost! Many (most) references to
    memory have high temporal and spatial locality.
   So we use a small amount of very fast memory to
    keep recently-accessed or likely-to-be-accessed
    chunks of main memory close to CPU.

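A C++ sketch of mine showing that locality at work: both loops below sum the same 64 MB grid, but the row-major walk uses all 16 ints of each 64-byte cache line before moving on, while the column-major walk touches a new line on every single access and typically runs several times slower:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t rows = 4096, cols = 4096;
        std::vector<int> grid(rows * cols, 1);  // 64 MB, larger than any cache

        // Row-major: consecutive accesses stay inside one cache line.
        auto t0 = std::chrono::steady_clock::now();
        long sum1 = 0;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                sum1 += grid[r * cols + c];
        auto t1 = std::chrono::steady_clock::now();

        // Column-major: every access lands on a different cache line.
        long sum2 = 0;
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                sum2 += grid[r * cols + c];
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("row-major: %lld ms   column-major: %lld ms   (%ld %ld)\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
            sum1, sum2);
    }
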
   Caches typically come in several levels (3 these days.)
   Each lower level is several times smaller, but
    several times faster than the level above.
   CPU can only see the L1 cache, each level only
    sees the level above, and only the highest
    level can communicate with main memory.
   Data is transferred between memory and
    cache in units of fixed size, called a cache line.
    The most common size today is 64 bytes.
   When any memory byte is needed, its place in
    the cache is calculated;
   CPU asks the cache;
   If there, the cache returns the data;
   If not, the data is pulled in from memory;
   If the calculated cache line is occupied by data
    with a different tag, that data is evicted;
   If the line is dirty (modified,) it is written back
    to memory first.

 [Diagram: main memory divided into blocks the size of a
  cache line, mapping onto the much smaller cache; each cache
  block also holds metadata like the tag (address) and some
  flags.]

   In this basic model, if the CPU periodically
    accesses memory addresses that differ by a
    multiple of the cache size, they will constantly
    evict each other, and most cache accesses
    will be misses. This is called cache thrashing.
   An application can innocently and very easily
    trigger this (see the sketch below.)


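A hedged C++ sketch of triggering it on purpose. The cache geometry is an assumption (32 KiB, 8-way, 64-byte lines, typical for an L1 data cache); with it, addresses 4 KiB apart map to the same set, so cycling through 16 such addresses keeps evicting lines that are about to be needed again, while the same 16 accesses at a slightly different stride spread across sets and stay resident:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Assumed L1d geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets,
    // so addresses that differ by 64 * 64 = 4096 bytes share a set.
    int main() {
        const std::size_t k = 16;                          // lines > ways
        const std::size_t conflict = 4096 / sizeof(int);   // same set each time
        const std::size_t friendly = conflict + 64 / sizeof(int); // sets differ
        std::vector<int> data(k * friendly + 1, 1);

        auto run = [&](const char* name, std::size_t stride) {
            long sum = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (int rep = 0; rep < 2000000; ++rep)
                for (std::size_t i = 0; i < k; ++i)
                    sum += data[i * stride];
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %lld ms (sum %ld)\n", name,
                (long long)std::chrono::duration_cast<
                    std::chrono::milliseconds>(t1 - t0).count(), sum);
        };
        run("conflicting stride", conflict);  // 16 lines fight over 8 ways
        run("friendly stride   ", friendly);
    }
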
   To alleviate this problem, each cache block is
    turned into an associative memory that can
    house more than one cache line.
   Each cache block holds several cache lines (2, 4,
    8 or more,) and the tag is still used to look up
    the line requested by the CPU within the block.
   When a new line comes in from memory, an
    LRU (or similar) policy is used to evict only the
    least-likely-to-be-needed line.
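The lookup arithmetic itself is simple. A sketch of mine with an assumed 32 KiB, 8-way, 64-byte-line cache, showing how an address splits into line offset, set index and tag, and why addresses 4 KiB apart land in the same set (tying back to the thrashing example above):

    #include <cstdint>
    #include <cstdio>

    // Assumed geometry: 32 KiB, 8-way set-associative, 64-byte lines.
    const std::uint64_t kLine = 64;
    const std::uint64_t kWays = 8;
    const std::uint64_t kSets = (32 * 1024) / (kLine * kWays);  // 64 sets

    std::uint64_t set_of(std::uint64_t addr) { return (addr / kLine) % kSets; }
    std::uint64_t tag_of(std::uint64_t addr) { return addr / (kLine * kSets); }

    int main() {
        std::uint64_t a = 0x12340;
        std::uint64_t b = a + kLine * kSets;  // 4 KiB away
        // Same set, different tags: a and b compete for that set's 8 ways.
        std::printf("a: set %llu, tag 0x%llx\n",
            (unsigned long long)set_of(a), (unsigned long long)tag_of(a));
        std::printf("b: set %llu, tag 0x%llx\n",
            (unsigned long long)set_of(b), (unsigned long long)tag_of(b));
    }
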
   References:
     Patterson & Hennessy – Computer Organization and Design
     Intel – Intel 64 and IA-32 Architectures Software
      Developer’s Manual, vols. 1, 2 and 3
     Click & Goetz – A Crash Course in Modern Hardware
     Agner Fog – The Microarchitecture of Intel, AMD and VIA
      CPUs
     Drepper – What Every Programmer Should Know About
      Memory

