Modern CPUs and Caches - A Starting Point for Programmers



A short and very cursory look at some of the features that make modern (x86) CPUs "modern".
I wished to include more examples, time comparisons and more detailed information, but the time allotted to the presentation barely allowed even this.
This was the first time I was presenting the subject, so expect much roughness around the edges.
Also, if you are even remotely interested in modern CPUs and caches and whatnot, don't look at this; Google for Cliff Click's excellent talk "A Crash Course in Modern Hardware".



  1. Yaser Zhian, Fanafzar Game Studio. IGDI, Workshop 07, January 2nd, 2013.
  2. Some notes about the subject:
     • CPUs and their gimmicks
     • Caches and their importance
     • How the CPU and OS handle memory logically
  3. • These are very complex subjects: expect very few details and much simplification
     • These are very complicated subjects: expect much generalization and omission
     • No time: even a full course would be hilariously insufficient
     • Not an expert: sorry! Can't help much.
     • Just a pile of loosely related stuff
  4. • Pressure for performance
     • Backwards compatibility
     • Cost/power/etc.
     • The ridiculous "numbers game"
     • Law of diminishing returns
     • Latency vs. throughput
  5. You can always solve your bandwidth (throughput) problems with money, but it is rarely so for lag (latency).
     Relative rates of improvement, latency vs. bandwidth (from David Patterson's keynote, HPEC 2004):
     • CPU, 80286 till Pentium 4: 21x vs. 2250x
     • Ethernet, 10Mb till 10Gb: 16x vs. 1000x
     • Disk, 3600 till 15000rpm: 8x vs. 143x
     • DRAM, plain till DDR: 4x vs. 120x
  6. At the simplest level, the von Neumann model stipulates:
     • Program is data, and is stored in memory along with data (departing from Turing's model)
     • Program is executed sequentially
     This is not the way computers function anymore...
     • The abstraction is still used for thinking about programs
     • But it's leaky as heck!
     "Not Your Father's von Neumann Machine!"
  7. • Speed of light: can't send and receive signals to and from all parts of the die in one cycle anymore
     • Power: more transistors lead to more power, which leads to much more heat
     • Memory: the CPU isn't even close to being the bottleneck anymore. "All your base are belong to" memory
     • Complexity: adding more transistors for more sophisticated operation won't give much of a speedup (e.g. doubling the transistors might give 2%)
  8. • The x86 family was introduced with the 8086 in 1978
     • Today, new members are still fully binary backward-compatible with that puny machine (5 MHz clock, 20-bit addressing, 16-bit registers)
     • It had very few registers
     • It had segmented memory addressing (joy!)
     • It had many complex instructions and several addressing modes
  9. • 1982 (80286): Protected mode, MMU
     • 1985 (80386): 32-bit ISA, paging
     • 1989 (80486): Pipelining, cache, integrated FPU
     • 1993 (Pentium): Superscalar, 64-bit bus, MMX
     • 1995 (Pentium Pro): µ-ops, out-of-order execution, register renaming, speculative execution
     • 1997 (K6-2, PIII): 3DNow!/SSE
     • 2003 (Opteron): 64-bit ISA
     • 2006 (Core 2): Multi-core
  10. Registers were expanded from (all 16-bit, not really general-purpose):
      • AX, BX, CX, DX
      • SI, DI, BP, SP
      • CS, DS, ES, SS, Flags, IP
      to:
      • 16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI, R8-R15) plus RIP, Flags and others
      • 16 x 128-bit XMM regs. (XMM0-...), or 16 x 256-bit YMM regs. (YMM0-...)
      • More than a thousand logically different instructions (the usual, plus string processing, cryptography, CRC, complex numbers, etc.)
  11. The fetch-decode-execute-retire cycle. Strategies for more performance:
      • More complex instructions, doing more in hardware (CISCing things up)
      • Faster CPU clock rates (the free lunch)
      • Instruction-level parallelism (SIMD + gimmicks)
      • Adding cores (the free lunch is over!)
      And then, there are the gimmicks...
  12. The gimmicks:
      • Pipelining
      • µ-ops
      • Superscalar pipelines
      • Out-of-order execution
      • Speculative execution
      • Register renaming
      • Branch prediction
      • Prefetching
      • Store buffer
      • Trace cache
      • ...
  13. Classic sequential execution:
      • The lengths of instruction executions vary a lot (5-10x is usual; several orders of magnitude also happen)
      [Diagram: instructions 1 through 4 executing one after another]
  14. It's really more like this for the CPU:
      • Instructions may have many sub-parts, and they engage different parts of the CPU
      [Diagram: each of instructions 1-4 split into Fetch, Decode, Execute and Retire sub-steps]
  15. So why not do this:
      • This is called "pipelining"
      • It increases throughput (significantly)
      • It doesn't decrease latency for single instructions
      [Diagram: the F/D/E/R stages of instructions 1-4 overlapped, one stage apart]
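The throughput-vs-latency point on this slide can be made concrete with a toy model. This is a sketch of my own (a hypothetical 4-stage pipeline with one-cycle stages and no stalls; the function names are not from the talk):

```python
def sequential_cycles(n_instructions, n_stages=4):
    """Classic model: each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=4):
    """Pipelined: after the pipeline fills, one instruction completes per cycle."""
    return n_stages + (n_instructions - 1)

# A single instruction still takes 4 cycles either way: latency is unchanged.
print(sequential_cycles(1), pipelined_cycles(1))        # 4 4
# But 1000 instructions finish almost 4x sooner: throughput is up ~4x.
print(sequential_cycles(1000), pipelined_cycles(1000))  # 4000 1003
```

The gap between the two functions is exactly the slide's claim: pipelining helps throughput, not single-instruction latency.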
  16. But it has its own share of problems: hazards, stalls, flushing, etc.
      • Execution of i2 depends on the result of i1
      • After i2, we jump, and i3, i4, ... are flushed out
      [Diagram: pipelined execution of the following, with i3 and i4 flushed:]
        add EAX,120
        jmp [EAX]
        mov [4*EBX+42],EDX
        add ECX,[EAX]
  17. Instructions are broken up into simple, orthogonal µ-ops:
      • mov EAX,EDX might generate only one µ-op
      • mov EAX,[EDX] might generate two:
        1. µld tmp0,[EDX]
        2. µmov EAX,tmp0
      • add [EAX],EDX probably generates three:
        1. µld tmp0,[EAX]
        2. µadd tmp0,EDX
        3. µst [EAX],tmp0
  18. The CPU then gets two layers:
      • One that breaks up operations into µ-ops
      • One that executes µ-ops
      The part that executes µ-ops can be simpler (more RISCy) and therefore faster. More complex instructions can be supported without (much) complicating the CPU, and the pipelining (and other gimmicks) can happen at the µ-op level.
  19. CPUs that issue (or retire) more than one instruction per cycle are called superscalar:
      • Can be thought of as a pipeline with more than one line
      • Simplest form: an integer pipe plus a floating-point pipe
      • These days, CPUs do 4 or more
      • Obviously requires more of each type of operational unit in the CPU
  20. To keep your pipeline from stalling as much as possible, issue the next instructions even if you can't start the current one, but of course only if there are no hazards (dependencies) and there are operational units available.
        add RAX,RAX
        add RAX,RBX
        add RCX,RDX   ; can be (and is) started before the previous instruction
  21. This obviously also applies at the µ-op level:
        mov RAX,[mem0]
        mul RAX,42
        add RAX,[mem1]   ; fetching mem1 starts long before the result of the multiply becomes available
        push RAX         ; push RAX is really "sub RSP,8" and then "mov [RSP],RAX"
        call Func        ; call needs RSP too, so it waits only for the subtraction, not the store, before starting
  22. Consider this:
        mov RAX,[mem0]
        mul RAX,42
        mov [mem1],RAX
        mov RAX,[mem2]
        add RAX,7
        mov [mem3],RAX
      Logically, the two halves are totally separate. However, the reuse of RAX will stall the pipeline.
  23. Modern CPUs have a lot of temporary, unnamed registers at their disposal. They will detect the logical independence and use one of those in the second block instead of RAX, tracking which register is which, where. In effect, they are renaming another register to RAX. There might not even be a real RAX!
  24. This is, for once, simpler than it might seem! Every time a register is assigned to, a new temporary register is used in its stead. Consider this:
        mov RAX,[cached]
        mov RBX,[uncached]
        add RBX,RAX
        mul RAX,42       ; renaming means the mul won't clobber the RAX that the
                         ; add (still waiting on the [uncached] load) needs, so we
                         ; can do the multiply and reach the first store much sooner
        mov [mem0],RAX
        mov [mem1],RBX
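The "new temporary register on every write" rule is simple enough to sketch in a few lines. This is an illustrative toy (a rename table over an unbounded pool of physical registers; real CPUs have a finite pool and free entries at retirement, which is omitted here):

```python
class Renamer:
    """Toy register renamer: every architectural write gets a fresh physical reg."""
    def __init__(self):
        self.table = {}        # architectural register -> current physical register
        self.count = 0

    def read(self, reg):
        return self.table[reg]           # reads go through the rename table

    def write(self, reg):
        phys = "p%d" % self.count        # allocate a fresh physical register, so
        self.count += 1                  # older in-flight readers keep their value
        self.table[reg] = phys
        return phys

r = Renamer()
r.write("RAX")        # mov RAX,[cached]    -> RAX lives in p0
r.write("RBX")        # mov RBX,[uncached]  -> RBX lives in p1
src = r.read("RAX")   # add RBX,RAX reads p0 ...
r.write("RBX")        # ... and writes a new RBX, p2
dst = r.write("RAX")  # mul RAX,42 writes p3; p0 is untouched, so the pending
                      # add still gets the old RAX value it needs
print(src, dst)       # p0 p3
```

Because the mul writes `p3` rather than overwriting `p0`, the waiting add and the multiply can proceed independently, which is exactly the speedup the slide describes.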
  25. The CPU always depends on knowing where the next instruction is, so it can go ahead and work on it. That's why branches in code are anathema to modern, deep pipelines and all the gimmicks they pull. If only the CPU could somehow guess where the target of each branch is going to be... That's where branch prediction comes in.
  26. So the CPU guesses the target of a jump (if it doesn't know for sure) and continues to speculatively execute instructions from there. For a conditional jump, the CPU must also predict whether the branch is taken or not. If the CPU is right, the pipeline flows smoothly; if not, the pipeline must be flushed, and much time and resource is wasted on the misprediction.
  27. In this code:
        cmp RAX,0
        jne [RBX]
      both the target and whether the jump happens must be predicted; the above can effectively jump anywhere! But usually branches are closer to this:
        cmp RAX,0
        jne somewhere_specific
      which can have only two possible targets.
  28. In a simple form, when a branch is executed, its target is stored in a table called the BTB (Branch Target Buffer). When that branch is encountered again, the target address is predicted to be the value read from the BTB. As you might guess, this doesn't work in many situations (e.g. an alternating branch). Also, the size of the BTB is limited, so the CPU will forget the last target of some jumps.
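For the taken/not-taken side of this simplest scheme, a dictionary can stand in for the BTB. This sketch (illustrative only, not how any real BTB is indexed) predicts that each branch repeats its last outcome, and shows exactly the alternating-branch failure the slide mentions:

```python
def last_outcome_accuracy(outcomes):
    """Predict each branch does what it did last time (first time: not taken)."""
    btb = {}
    hits = 0
    for taken in outcomes:
        predicted = btb.get("branch", False)   # look up the remembered outcome
        hits += (predicted == taken)
        btb["branch"] = taken                  # remember the new outcome
    return hits / len(outcomes)

# A steadily-taken branch (e.g. a loop) is predicted well after the first miss...
print(last_outcome_accuracy([True] * 10))         # 0.9
# ...but an alternating branch is mispredicted every single time:
print(last_outcome_accuracy([True, False] * 50))  # 0.0
```

Every prediction equals the previous outcome, which for an alternating branch is always wrong: the motivation for the saturating counters on the next slide.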
  29. A simple expansion on the previous idea is to keep a saturating counter along with each entry of the BTB. For example, with a 2-bit counter:
      • The branch is predicted not taken if the counter is 0 or 1
      • The branch is predicted taken if the counter is 2 or 3
      • Each time the branch is taken the counter is incremented, and vice versa
      [Diagram: the four states Strongly Not Taken, Weakly Not Taken, Weakly Taken and Strongly Taken, moving one state toward Taken on each taken branch and one state back on each not-taken branch]
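The 2-bit counter is easy to simulate. A toy sketch (my own, not modeled on any particular CPU):

```python
def two_bit_accuracy(outcomes, counter=0):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    hits = 0
    for taken in outcomes:
        predicted = counter >= 2
        hits += (predicted == taken)
        # Saturate at 0 and 3 instead of wrapping around.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return hits / len(outcomes)

# A loop branch (taken 99 times, then falls out once) is predicted very well:
print(two_bit_accuracy([True] * 99 + [False]))       # 0.97
# An alternating branch starting from state 00: 50% mispredictions...
print(two_bit_accuracy([True, False] * 50, counter=0))  # 0.5
# ...and from state 01 with the branch taken first: 100% mispredictions!
print(two_bit_accuracy([True, False] * 50, counter=1))  # 0.0
```

The last two calls reproduce the pathological alternating-branch numbers discussed on the following slide.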
  30. But this behaves very badly in common situations. For an alternating branch:
      • If the counter starts at 00 or 11, it will mispredict 50% of the time
      • If the counter starts at 01 and the branch is taken the first time, it will mispredict 100% of the time!
      As an improvement, we can store the history of the last N occurrences of the branch in the BTB, and use 2^N counters, one for each of the possible history patterns.
  31. For N=4 and 2-bit counters, each BTB entry holds a 4-bit branch history (e.g. 0010) that selects one of 16 counters, each yielding a prediction (0 or 1). This is an extremely cool method of doing branch prediction!
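This two-level idea can be sketched by extending the counter simulation above: keep the last N outcomes as a bit pattern and give each pattern its own 2-bit counter. Again an illustrative toy, not a model of any shipping predictor:

```python
def two_level_accuracy(outcomes, history_bits=4):
    """Last N outcomes form a pattern; each pattern indexes its own 2-bit counter."""
    counters = [0] * (2 ** history_bits)   # one saturating counter per pattern
    history = 0                            # the last N outcomes, as bits
    mask = (1 << history_bits) - 1
    hits = 0
    for taken in outcomes:
        c = counters[history]
        predicted = c >= 2
        hits += (predicted == taken)
        counters[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        history = ((history << 1) | taken) & mask   # shift in the new outcome
    return hits / len(outcomes)

# The alternating branch that defeats a single 2-bit counter is learned quickly:
# patterns 1010 and 0101 each train their own counter, and after a short warm-up
# every prediction is correct.
print(two_level_accuracy([True, False] * 100))  # 0.98
```

Only the first few warm-up occurrences mispredict; after that, the two alternating history patterns each have a correctly trained counter.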
  32. Some predictions are simpler:
      • For each ret instruction, the target is somewhere on the stack (pushed earlier). Modern CPUs keep track of return addresses in an internal return stack buffer: each time a call is executed, an entry is added, and it is later used for the return address.
      • On a cold encounter (a.k.a. static prediction), a branch is sometimes predicted to fall through if it goes forward, and to be taken if it goes backward.
  33. The best general advice is to arrange your code so that the most common path for branches is "not taken". This improves the effectiveness of code prefetching and the trace cache. Branch prediction, register renaming and speculative execution work extremely well together.
  34. A worked example:
        mov RAX,[RBX+16]
        add RBX,16
        cmp RAX,0
        je IsNull
        mov [RBX-16],RCX
        mov RCX,[RDX+0]
        mov RAX,[RAX+8]
  35. Clock 0 – Instruction 0 (mov RAX,[RBX+16]): load RAX from memory. Assume a cache miss: 300 cycles to load. The instruction starts and is dispatched.
  36. Clock 0 – Instruction 1 (add RBX,16): this instruction writes RBX, which conflicts with the read in instruction 0. Rename this instance of RBX and continue...
  37. Clock 0 – Instruction 2 (cmp RAX,0): the value of RAX is not available yet; cannot calculate the value of the Flags register. Queue up behind instruction 0...
  38. Clock 0 – Instruction 3 (je IsNull): the Flags register is still not available. Predict that this branch is not taken. Assuming 4-wide dispatch, the instruction issue limit is reached for this cycle.
  39. Clock 1 – Instruction 4 (mov [RBX-16],RCX): the store is speculative, so its result is kept in the Store Buffer. Also, RBX might not be available yet (from instruction 1). The Load/Store Unit is tied up from now on; no more memory ops can issue in this cycle.
  40. Clock 2 – Instruction 5 (mov RCX,[RDX+0]): had to wait for the L/S Unit. Assume this is another (and unrelated) cache miss; we now have 2 overlapping cache misses. The L/S Unit is busy again.
  41. Clock 3 – Instruction 6 (mov RAX,[RAX+8]): RAX is not ready yet (300-cycle latency, remember?!) This load cannot even start until instruction 0 is done.
  42. Clock 301 – Instruction 2 (cmp RAX,0): at clock 300 (or 301), RAX is finally ready. Do the comparison and update the Flags register.
  43. Clock 301 – Instruction 6 (mov RAX,[RAX+8]): issue this load too. Assume a cache hit (finally!); the result will be available at clock 304.
  44. Clock 302 – Instruction 3 (je IsNull): now the Flags register is ready; check the prediction. Assume the prediction was correct.
  45. Clock 302 – Instruction 4 (mov [RBX-16],RCX): this speculative store can actually be committed to memory (or cache, actually).
  46. Clock 302 – Instruction 5 (mov RCX,[RDX+0]): at clock 302, the result of this load arrives.
  47. Clock 305 – Instruction 6 (mov RAX,[RAX+8]): the result arrived at clock 304; the instruction retires at 305.
  48. To summarize:
      • In 4 clocks, we started 7 ops and 2 cache misses
      • We retired 7 ops in 306 cycles
      • Cache misses totally dominate everything else
      • The only real benefit came from being able to have 2 overlapping cache misses!
  49. So what is all this machinery really for? To get to the next cache miss as early as possible.
  50. Main memory is slow. S.L.O.W. Very slow. Painfully slow. And it especially has very bad (high) latency. But all is not lost! Many (most) references to memory have high temporal and address locality, so we use a small amount of very fast memory to keep recently-accessed or likely-to-be-accessed chunks of main memory close to the CPU.
  51. Caches typically come in several levels (3 these days). Each lower level is several times smaller, but several times faster, than the level above it. The CPU can see only the L1 cache, each level sees only the level above it, and only the highest level communicates with main memory. Data is transferred between memory and cache in fixed-size units called cache lines; the most common line size today is 64 bytes.
  52. When any memory byte is needed, its place in the cache is calculated and the CPU asks the cache:
      • If the line is there, the cache returns the data
      • If not, the data is pulled in from memory
      • If the calculated cache line is occupied by data with a different tag, that data is evicted; if the evicted line is dirty (modified), it is written back to memory first
      [Diagram: main memory divided into blocks the size of a cache line; each cache block also holds metadata such as the tag (address) and some flags]
  53. In this basic model, if the CPU periodically accesses memory addresses that differ by a multiple of the cache size, they will constantly evict each other and most cache accesses will be misses. This is called cache thrashing, and an application can innocently and very easily trigger it.
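Thrashing is easy to demonstrate with a toy direct-mapped cache simulator. This sketch assumes an illustrative 32 KiB cache with 64-byte lines (the sizes are mine, not from the talk):

```python
def direct_mapped_misses(addresses, cache_bytes=32768, line_bytes=64):
    """Direct-mapped cache: each memory line maps to exactly one cache slot."""
    n_lines = cache_bytes // line_bytes
    slots = [None] * n_lines              # slot index -> line currently cached
    misses = 0
    for addr in addresses:
        line = addr // line_bytes         # which memory line this byte is in
        slot = line % n_lines             # the one slot that line may occupy
        if slots[slot] != line:
            misses += 1
            slots[slot] = line            # evict whatever was there
    return misses

# Two addresses exactly one cache size (32 KiB) apart map to the same slot
# and evict each other on every single access:
trace = [addr for _ in range(100) for addr in (0, 32768)]
print(direct_mapped_misses(trace))        # 200 -- every access misses
# Shift one address by a single line and the conflict vanishes:
trace2 = [addr for _ in range(100) for addr in (0, 32768 + 64)]
print(direct_mapped_misses(trace2))       # 2 -- only the two cold misses
```

Note how innocently the pathological case arises: two arrays whose distance happens to be a multiple of the cache size is all it takes.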
  54. To alleviate this problem, each cache slot is turned into an associative memory that can house more than one cache line. Each such set holds several cache lines (2, 4, 8 or more) and still uses the tag to look up the line requested by the CPU. When a new line comes in from memory, an LRU (or similar) policy is used to evict only the least-likely-to-be-needed line.
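Extending the toy simulator above to a set-associative cache with LRU eviction shows how associativity absorbs the conflict (again, the sizes and 2-way choice are illustrative assumptions):

```python
from collections import OrderedDict

def set_assoc_misses(addresses, cache_bytes=32768, line_bytes=64, ways=2):
    """Set-associative cache with LRU eviction inside each set."""
    n_sets = cache_bytes // line_bytes // ways
    sets = [OrderedDict() for _ in range(n_sets)]   # lines of a set, in LRU order
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        s = sets[line % n_sets]
        if line in s:
            s.move_to_end(line)                     # mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)               # evict the least recently used
            s[line] = None
    return misses

# The thrashing trace from the direct-mapped example: both conflicting lines now
# fit in the two ways of one set, so only the two cold misses remain.
trace = [addr for _ in range(100) for addr in (0, 32768)]
print(set_assoc_misses(trace))   # 2
```

Of course, a 2-way cache can still thrash with three or more conflicting lines per set; more ways push the problem further out, which is why 4-way and 8-way designs are common.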
  55. References:
      • Patterson & Hennessy – Computer Organization and Design
      • Intel 64 and IA-32 Architectures Software Developer's Manual – vols. 1, 2 and 3
      • Click & Goetz – A Crash Course in Modern Hardware
      • Agner Fog – The Microarchitecture of Intel, AMD and VIA CPUs
      • Drepper – What Every Programmer Should Know About Memory