Modern CPUs and Caches - A Starting Point for Programmers

A short and very cursory look at some of the features that make modern (x86) CPUs "modern".
I wished to include more examples, time comparisons and more detailed information, but the time allotted to the presentation barely allowed even this.
This was the first time I was presenting the subject, so expect much roughness around the edges.
Also, if you are even remotely interested in modern CPUs and caches and whatnot, don't look at this; Google for Cliff Click's excellent talk "A Crash Course in Modern Hardware".



  1. Yaser Zhian, Fanafzar Game Studio. IGDI, Workshop 07, January 2nd, 2013. http://yaserzt.com/blog/
  2.  • Some notes about the subject
      • CPUs and their gimmicks
      • Caches and their importance
      • How the CPU and OS handle memory logically
  3.  • These are very complex subjects: expect very few details and much simplification
      • These are very complicated subjects: expect much generalization and omission
      • No time: even a full course would be hilariously insufficient
      • Not an expert: sorry! Can’t help much. Just a pile of loosely related stuff
  4.  • Pressure for performance
      • Backwards compatibility
      • Cost/power/etc.
      • The ridiculous “numbers game”
      • Law of diminishing returns
      • Latency vs. throughput
  5.  • You can always solve your bandwidth (throughput) problems with money, but it is rarely so for lag (latency).
      • Relative rates of improvement, latency vs. bandwidth (from David Patterson’s keynote, HPEC 2004):
        ▪ CPU, 80286 to Pentium 4: 21x vs. 2250x
        ▪ Ethernet, 10Mb to 10Gb: 16x vs. 1000x
        ▪ Disk, 3600 to 15000 rpm: 8x vs. 143x
        ▪ DRAM, plain to DDR: 4x vs. 120x
  6.  • At the simplest level, the von Neumann model stipulates:
        ▪ The program is data, and is stored in memory along with data (departing from Turing’s model)
        ▪ The program is executed sequentially
      • Not the way computers function anymore...
        ▪ The abstraction is still used for thinking about programs
        ▪ But it’s leaky as heck!
      • “Not Your Fathers’ von Neumann Machine!”
  7.  • Speed of light: you can’t send and receive signals to and from all parts of the die in one cycle anymore
      • Power: more transistors lead to more power, which leads to much more heat
      • Memory: the CPU isn’t even close to being the bottleneck anymore. “All your base are belong to” memory
      • Complexity: adding more transistors for more sophisticated operation won’t give much of a speedup (e.g. doubling the transistors might give 2%)
  8.  • The family was introduced with the 8086 in 1978
      • Today, new members are still fully binary backward-compatible with that puny machine (5 MHz clock, 20-bit addressing, 16-bit registers)
      • It had very few registers
      • It had segmented memory addressing (joy!)
      • It had many complex instructions and several addressing modes
  9.  • 1982 (80286): protected mode, MMU
      • 1985 (80386): 32-bit ISA, paging
      • 1989 (80486): pipelining, caches, integrated FPU
      • 1993 (Pentium): superscalar, 64-bit bus, MMX
      • 1995 (Pentium Pro): µ-ops, out-of-order execution, register renaming, speculative execution
      • 1998–99 (K6-2, PIII): 3DNow!/SSE
      • 2003 (Opteron): 64-bit ISA
      • 2006 (Core 2): multi-core
  10. • Registers were expanded from (all 16-bit, not really general purpose):
        ▪ AX, BX, CX, DX
        ▪ SI, DI, BP, SP
        ▪ CS, DS, ES, SS, Flags, IP
      • to:
        ▪ 16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI, R8–R15), plus RIP, Flags and others
        ▪ 16 x 128-bit XMM registers (XMM0–...), or 16 x 256-bit YMM registers (YMM0–...)
      • More than a thousand logically different instructions (the usual, plus string processing, cryptography, CRC, complex numbers, etc.)
  11. • The Fetch-Decode-Execute-Retire cycle
      • Strategies for more performance:
        ▪ More complex instructions, doing more in hardware (CISCing things up)
        ▪ Faster CPU clock rates (the free lunch)
        ▪ Instruction-level parallelism (SIMD + gimmicks)
        ▪ Adding cores (the free lunch is over!)
      • And then, there are gimmicks...
  12. • Pipelining
      • µ-ops
      • Superscalar pipelines
      • Out-of-order execution
      • Speculative execution
      • Register renaming
      • Branch prediction
      • Prefetching
      • Store buffers
      • Trace caches
      • ...
  13. Classic sequential execution:
      • The lengths of instruction executions vary a lot (5–10x is usual; several orders of magnitude also happen)
      [Diagram: Instruction 1, Instruction 2, Instruction 3 and Instruction 4 executing strictly one after another]
  14. It’s really more like this for the CPU:
      • Instructions have many sub-parts, and they engage different parts of the CPU
      [Diagram: each instruction i split into Fetch/Decode/Execute/Retire stages Fi, Di, Ei, Ri]
  15. So why not do this:
      • This is called “pipelining”
      • It increases throughput (significantly)
      • It doesn’t decrease the latency of a single instruction
      [Diagram: the F/D/E/R stages of instructions 1–4 overlapped in time, one stage apart]
  16. But it has its own share of problems: hazards, stalls, flushing, etc.
      • The execution of i2 depends on the result of i1
      • After i2 we jump, so i3, i4, ... are flushed out
      [Diagram: the pipelined execution of add EAX,120 / jmp [EAX] / mov [4*EBX+42],EDX / add ECX,[EAX]]
  17. Instructions are broken up into simple, orthogonal µ-ops:
      • mov EAX,EDX might generate only one µ-op
      • mov EAX,[EDX] might generate two:
        1. µld tmp0,[EDX]
        2. µmov EAX,tmp0
      • add [EAX],EDX probably generates three:
        1. µld tmp0,[EAX]
        2. µadd tmp0,EDX
        3. µst [EAX],tmp0
  18. The CPU then gets two layers: the one that breaks up operations into µ-ops, and the one that executes µ-ops. The part that executes µ-ops can be simpler (more RISCy) and therefore faster, more complex instructions can be supported without (much) complicating the CPU, and the pipelining (and other gimmicks) can happen at the µ-op level.
  19. • CPUs that issue (or retire) more than one instruction per cycle are called superscalar
      • Can be thought of as a pipeline with more than one lane
      • Simplest form: an integer pipe plus a floating-point pipe
      • These days, CPUs do 4 or more
      • Obviously requires more of each type of execution unit in the CPU
  20. To prevent your pipeline from stalling as much as possible, issue the next instructions even if you can’t start the current one. But of course, only if there are no hazards (dependencies) and there are execution units available:
      add RAX,RAX
      add RAX,RBX
      add RCX,RDX    (this can be, and is, started before the previous instruction)
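      To see this independence pay off in ordinary code, here is a minimal C++ sketch (not from the slides; the array size is arbitrary). A single accumulator forms one serial dependency chain; four accumulators give the out-of-order core independent additions it can keep in flight at the same time:

        #include <chrono>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        int main() {
            std::vector<std::uint64_t> v(1 << 24, 1);
            using clk = std::chrono::steady_clock;
            auto us = [](clk::time_point a, clk::time_point b) {
                return (long long)std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
            };

            // One accumulator: every add depends on the result of the previous one.
            auto t0 = clk::now();
            std::uint64_t s = 0;
            for (std::uint64_t x : v) s += x;
            auto t1 = clk::now();

            // Four accumulators: four independent dependency chains.
            std::uint64_t a = 0, b = 0, c = 0, d = 0;
            for (std::size_t i = 0; i + 4 <= v.size(); i += 4) {
                a += v[i]; b += v[i + 1]; c += v[i + 2]; d += v[i + 3];
            }
            auto t2 = clk::now();

            std::printf("1 chain : %lld us (sum=%llu)\n", us(t0, t1), (unsigned long long)s);
            std::printf("4 chains: %lld us (sum=%llu)\n", us(t1, t2),
                        (unsigned long long)(a + b + c + d));
        }

      Compile with optimizations, but treat the numbers as a rough illustration; an auto-vectorizing compiler may rewrite either loop and shrink the gap.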
  21. This obviously also applies at the µ-op level:
      mov RAX,[mem0]
      mul RAX,42
      add RAX,[mem1]
      push RAX
      call Func
      Fetching mem1 is started long before the result of the multiply becomes available. Pushing RAX is really sub RSP,8 followed by mov [RSP],RAX; since the call instruction needs RSP too, it only has to wait for the subtraction, not for the store to finish, before it can start.
  22. Consider this:
      mov RAX,[mem0]
      mul RAX,42
      mov [mem1],RAX
      mov RAX,[mem2]
      add RAX,7
      mov [mem3],RAX
      Logically, the two halves are totally separate. However, the reuse of RAX will stall the pipeline.
  23. Modern CPUs have a lot of temporary, unnamed registers at their disposal. They will detect the logical independence and use one of those in the second block instead of RAX, keeping track of which register is which, where. In effect, they rename another register to RAX. There might not even be a real RAX!
  24. This is, for once, simpler than it might seem! Every time a register is assigned to, a new temporary register is used in its stead. Consider this:
      mov RAX,[cached]
      mov RBX,[uncached]
      add RBX,RAX
      mul RAX,42         (a rename happens here)
      mov [mem0],RAX
      mov [mem1],RBX
      Renaming on the mul means that it won’t clobber the RAX which the add needs (and the add is waiting on the load of [uncached]), so we can do the multiply and reach the first store much sooner.
  25. The CPU always depends on knowing where the next instruction is, so it can go ahead and work on it. That’s why branches in code are anathema to modern, deep pipelines and all the gimmicks they pull. If only the CPU could somehow guess where the target of each branch is going to be... That’s where branch prediction comes in.
  26. So the CPU guesses the target of a jump (if it doesn’t know it for sure) and continues to speculatively execute instructions from there. For a conditional jump, the CPU must also predict whether the branch is taken or not. If the CPU is right, the pipeline flows smoothly; if not, the pipeline must be flushed, and much time and many resources are wasted on the misprediction.
  27. In this code:
      cmp RAX,0
      jne [RBX]
      both the target and whether the jump happens must be predicted; the above can effectively jump anywhere! But usually branches are closer to this:
      cmp RAX,0
      jne somewhere_specific
      which can only have two possible targets.
  28. In a simple form, when a branch is executed, its target is stored in a table called the BTB (Branch Target Buffer). When that branch is encountered again, the target address is predicted to be the value read from the BTB. As you might guess, this doesn’t work in many situations (e.g. an alternating branch). Also, the size of the BTB is limited, so the CPU will forget the last target of some jumps.
  29. A simple extension of the previous idea is to keep a saturating counter along with each entry of the BTB. For example, with a 2-bit counter:
      • The branch is predicted not taken if the counter is 0 or 1.
      • The branch is predicted taken if the counter is 2 or 3.
      • Each time the branch is taken the counter is incremented, and each time it is not taken it is decremented.
      [State diagram: Strongly Not Taken <-> Weakly Not Taken <-> Weakly Taken <-> Strongly Taken, moving up on T and down on NT]
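      This textbook scheme is easy to simulate. A minimal C++ sketch (not from the slides; the access pattern is made up):

        #include <cstdio>

        // A 2-bit saturating counter for one branch.
        // 0 = strongly not-taken, 1 = weakly not-taken,
        // 2 = weakly taken,       3 = strongly taken.
        struct TwoBitPredictor {
            int counter = 1;  // arbitrary starting state

            bool predict() const { return counter >= 2; }

            void update(bool taken) {
                if (taken) { if (counter < 3) ++counter; }
                else       { if (counter > 0) --counter; }
            }
        };

        int main() {
            TwoBitPredictor p;
            int mispredictions = 0, total = 0;
            // A loop-like branch: taken 7 times, then not taken once, repeated.
            for (int rep = 0; rep < 100; ++rep) {
                for (int i = 0; i < 8; ++i) {
                    bool taken = (i != 7);
                    if (p.predict() != taken) ++mispredictions;
                    p.update(taken);
                    ++total;
                }
            }
            std::printf("mispredicted %d of %d branches\n", mispredictions, total);
        }

      In steady state this mispredicts only the loop exit, once per 8 executions: the “strongly taken” state absorbs the single not-taken outcome.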
  30. But this behaves very badly in common situations. For an alternating branch:
      • If the counter starts at 00 or 11, it will mispredict 50% of the time.
      • If the counter starts at 01 and the branch is taken the first time, it will mispredict 100% of the time!
      As an improvement, we can store the history of the last N occurrences of the branch in the BTB, and use 2^N counters, one for each possible history pattern.
  31. For N=4 and 2-bit counters, each BTB entry holds a 4-bit branch history (e.g. 0010) that indexes a table of sixteen 2-bit counters, each producing a prediction (0 or 1) for its own history pattern. This is an extremely cool method of doing branch prediction!
  32. Some predictions are simpler:
      • For each ret instruction, the target is somewhere on the stack (pushed earlier). Modern CPUs keep track of return addresses in an internal return stack buffer: each time a call is executed, an entry is added, which is later used to predict the return address.
      • On a cold encounter (a.k.a. static prediction), a branch is sometimes predicted to:
        ▪ fall through, if it goes forward;
        ▪ be taken, if it goes backward.
  33. The best general advice is to arrange your code so that the most common path for branches is “not taken”. This improves the effectiveness of code prefetching and the trace cache. Branch prediction, register renaming and speculative execution work extremely well together, as the walkthrough starting on slide 34 shows.
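      To get a feel for what mispredictions cost in ordinary code, here is a minimal C++ sketch of the classic sorted-vs-unsorted experiment (not from the slides; the array size and threshold are arbitrary). The branch is identical in both runs; only its predictability changes:

        #include <algorithm>
        #include <chrono>
        #include <cstdint>
        #include <cstdio>
        #include <random>
        #include <vector>

        // Sum the elements that are >= 128. On random data this branch is
        // essentially a coin flip; on sorted data it is almost perfectly predictable.
        static long long sum_big(const std::vector<std::uint8_t>& v) {
            long long s = 0;
            for (std::uint8_t x : v)
                if (x >= 128) s += x;
            return s;
        }

        int main() {
            std::vector<std::uint8_t> v(1 << 22);
            std::mt19937 rng(42);
            for (auto& x : v) x = static_cast<std::uint8_t>(rng());

            using clk = std::chrono::steady_clock;
            auto run = [&](const char* label) {
                auto t0 = clk::now();
                long long s = sum_big(v);
                auto t1 = clk::now();
                std::printf("%s: %lld us (sum=%lld)\n", label,
                    (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(), s);
            };

            run("unsorted");
            std::sort(v.begin(), v.end());
            run("sorted  ");
        }

      One caveat: at high optimization levels the compiler may turn the if into a branchless conditional move, which erases the difference; that is itself a common mitigation for unpredictable branches.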
  34. A worked example (the next slides step through its execution):
      mov RAX,[RBX+16]
      add RBX,16
      cmp RAX,0
      je IsNull
      mov [RBX-16],RCX
      mov RCX,[RDX+0]
      mov RAX,[RAX+8]
  35. Clock 0 – Instruction 0: mov RAX,[RBX+16]
      Load RAX from memory. Assume a cache miss: 300 cycles to load. The instruction starts, and dispatch continues...
  36. Clock 0 – Instruction 1: add RBX,16
      This instruction writes RBX, which conflicts with the read in instruction 0. Rename this instance of RBX and continue...
  37. Clock 0 – Instruction 2: cmp RAX,0
      The value of RAX is not available yet, so the value of the Flags register cannot be calculated. Queue up behind instruction 0...
  38. Clock 0 – Instruction 3: je IsNull
      The Flags register is still not available. Predict that this branch is not taken. Assuming 4-wide dispatch, the instruction issue limit for this cycle is reached.
  39. Clock 1 – Instruction 4: mov [RBX-16],RCX
      The store is speculative, so its result is kept in the store buffer. Also, RBX might not be available yet (from instruction 1). The load/store unit is tied up from now on; no more memory ops can issue in this cycle.
  40. Clock 2 – Instruction 5: mov RCX,[RDX+0]
      Had to wait for the load/store unit. Assume this is another (and unrelated) cache miss; we now have 2 overlapping cache misses. The load/store unit is busy again.
  41. Clock 3 – Instruction 6: mov RAX,[RAX+8]
      RAX is not ready yet (300-cycle latency, remember?!). This load cannot even start until instruction 0 is done.
  42. Clock 301 – Instruction 2: cmp RAX,0
      At clock 300 (or 301), RAX is finally ready. Do the comparison and update the Flags register.
  43. Clock 301 – Instruction 6: mov RAX,[RAX+8]
      Issue this load too. Assume a cache hit (finally!); the result will be available at clock 304.
  44. Clock 302 – Instruction 3: je IsNull
      Now the Flags register is ready. Check the prediction; assume the prediction was correct.
  45. Clock 302 – Instruction 4: mov [RBX-16],RCX
      This speculative store can now actually be committed to memory (or rather, to the cache).
  46. Clock 302 – Instruction 5: mov RCX,[RDX+0]
      At clock 302, the result of this load arrives.
  47. Clock 305 – Instruction 6: mov RAX,[RAX+8]
      The result arrived at clock 304; the instruction is retired at 305.
  48. To summarize:
      • In 4 clocks, we started 7 ops and 2 cache misses.
      • We retired 7 ops in 306 cycles.
      • Cache misses totally dominate performance.
      • The only real benefit came from being able to have 2 overlapping cache misses!
  49. The point of all these gimmicks: to get to the next cache miss as early as possible.
  50. Main memory is slow. S.L.O.W. Very slow. Painfully slow. And it especially has very bad (high) latency. But all is not lost! Many (most) references to memory have high temporal and spatial locality, so we use a small amount of very fast memory to keep recently-accessed (or likely-to-be-accessed) chunks of main memory close to the CPU.
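      Locality is easy to observe. A minimal C++ sketch (the matrix size is arbitrary): row-by-row traversal walks along cache lines, while column-by-column traversal touches a new line on almost every access:

        #include <chrono>
        #include <cstdio>
        #include <vector>

        int main() {
            constexpr int N = 4096;  // 4096x4096 ints = 64 MiB
            std::vector<int> m(static_cast<std::size_t>(N) * N, 1);
            using clk = std::chrono::steady_clock;
            auto ms = [](clk::time_point a, clk::time_point b) {
                return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
            };

            // Row-major traversal: consecutive accesses stay within one cache line.
            auto t0 = clk::now();
            long long s1 = 0;
            for (int r = 0; r < N; ++r)
                for (int c = 0; c < N; ++c)
                    s1 += m[static_cast<std::size_t>(r) * N + c];
            auto t1 = clk::now();

            // Column-major traversal: every access jumps N*sizeof(int) bytes ahead,
            // landing on a different cache line each time.
            long long s2 = 0;
            for (int c = 0; c < N; ++c)
                for (int r = 0; r < N; ++r)
                    s2 += m[static_cast<std::size_t>(r) * N + c];
            auto t2 = clk::now();

            std::printf("row-major   : %lld ms (sum=%lld)\n", ms(t0, t1), s1);
            std::printf("column-major: %lld ms (sum=%lld)\n", ms(t1, t2), s2);
        }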
  51. Caches typically come in several levels (3 these days). Each lower level is several times smaller, but also several times faster, than the level above it. The CPU can only see the L1 cache, each level only sees the level above it, and only the highest level communicates with main memory. Data is transferred between memory and cache in fixed-size units called cache lines; the most common line size today is 64 bytes.
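      Because transfers happen a whole line at a time, reading one byte out of every 64 costs nearly as much as reading every byte, once the data is too big to cache. A minimal sketch (buffer size and strides are illustrative):

        #include <chrono>
        #include <cstdio>
        #include <vector>

        int main() {
            constexpr std::size_t kSize = 256u * 1024 * 1024;  // far larger than any cache
            std::vector<char> buf(kSize, 1);
            using clk = std::chrono::steady_clock;

            // Strides up to one line (64 bytes) reduce the work 64-fold,
            // but not the memory traffic: every line still gets fetched.
            for (std::size_t stride : {1, 16, 64}) {
                auto t0 = clk::now();
                long long sum = 0;
                for (std::size_t i = 0; i < kSize; i += stride) sum += buf[i];
                auto t1 = clk::now();
                std::printf("stride %2zu: %lld ms (sum=%lld)\n", stride,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(), sum);
            }
        }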
  52. When any memory byte is needed, its place in the cache is calculated and the CPU asks the cache. If the data is there, the cache returns it; if not, the data is pulled in from memory. If the calculated cache line is occupied by data with a different tag, that data is evicted; if that line is dirty (modified), it is written back to memory first.
      [Diagram: main memory divided into blocks the size of a cache line, mapping onto a much smaller cache; each cache block also holds metadata, like the tag (address) and some flags]
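      In this basic (direct-mapped) model, the “place in cache” calculation is plain integer arithmetic. A sketch with a hypothetical geometry (32 KiB cache, 64-byte lines):

        #include <cstdint>
        #include <cstdio>

        constexpr std::uint64_t kLineSize  = 64;
        constexpr std::uint64_t kCacheSize = 32 * 1024;
        constexpr std::uint64_t kNumLines  = kCacheSize / kLineSize;  // 512 lines

        // An address picks exactly one cache line: drop the offset-within-line
        // bits, take the rest modulo the number of lines; the leftover high
        // bits form the tag that is stored alongside the data.
        static void locate(std::uint64_t addr) {
            std::uint64_t offset = addr % kLineSize;
            std::uint64_t index  = (addr / kLineSize) % kNumLines;
            std::uint64_t tag    = (addr / kLineSize) / kNumLines;
            std::printf("addr 0x%llx -> tag 0x%llx, line %llu, offset %llu\n",
                        (unsigned long long)addr, (unsigned long long)tag,
                        (unsigned long long)index, (unsigned long long)offset);
        }

        int main() {
            locate(0x12345);
            locate(0x12345 + kCacheSize);  // same line index, different tag: these evict each other
        }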
  53. In this basic model, if the CPU repeatedly accesses memory addresses that differ by a multiple of the cache size, they will constantly evict each other, and most cache accesses will be misses. This is called cache thrashing, and an application can innocently and very easily trigger it; see the sketch below.
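      A minimal C++ sketch of triggering it (the stride and counts are illustrative; the exact behavior depends on the real cache geometry). Addresses a large power of two apart collide on the same set, while offsetting each by one cache line spreads them out:

        #include <chrono>
        #include <cstdio>
        #include <vector>

        // Touch kWays addresses that are `stride` bytes apart, over and over.
        // When the stride is a large power of two, they all map to the same
        // cache set and keep evicting one another.
        static long long walk(const std::vector<char>& buf, std::size_t stride, int iters) {
            constexpr int kWays = 16;
            long long sum = 0;
            for (int i = 0; i < iters; ++i)
                for (int w = 0; w < kWays; ++w)
                    sum += buf[static_cast<std::size_t>(w) * stride + (i & 63)];
            return sum;
        }

        int main() {
            constexpr std::size_t kMiB = 1024 * 1024;
            std::vector<char> buf(64 * kMiB, 1);
            using clk = std::chrono::steady_clock;

            // 1 MiB apart: same set, conflict misses. 1 MiB + 64: one line off, no conflict.
            for (std::size_t stride : {1 * kMiB, 1 * kMiB + 64}) {
                auto t0 = clk::now();
                long long s = walk(buf, stride, 2000000);
                auto t1 = clk::now();
                std::printf("stride %8zu: %lld ms (sum=%lld)\n", stride,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(), s);
            }
        }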
  54. To alleviate this problem, each cache block is turned into an associative memory that can house more than one cache line. Each set holds several lines (2, 4, 8 or more), and the tag is still used to look up the line requested by the CPU within the set. When a new line comes in from memory, an LRU (or similar) policy is used to evict only the least-likely-to-be-needed line; a toy simulator follows.
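      A minimal C++ sketch of a set-associative cache with LRU eviction (the geometry and access pattern are made up for illustration):

        #include <cstdint>
        #include <cstdio>
        #include <list>
        #include <vector>

        class Cache {
        public:
            Cache(std::size_t lines, std::size_t ways, std::size_t line_size)
                : ways_(ways), line_size_(line_size), sets_(lines / ways) {}

            // Returns true on a hit; on a miss, pulls the line in, evicting the LRU line if full.
            bool access(std::uint64_t addr) {
                std::uint64_t line = addr / line_size_;
                auto& set = sets_[line % sets_.size()];
                for (auto it = set.begin(); it != set.end(); ++it) {
                    if (*it == line) {               // hit: move to front (most recently used)
                        set.erase(it);
                        set.push_front(line);
                        return true;
                    }
                }
                if (set.size() == ways_) set.pop_back();  // miss: evict least recently used
                set.push_front(line);
                return false;
            }

        private:
            std::size_t ways_, line_size_;
            std::vector<std::list<std::uint64_t>> sets_;  // front = MRU, back = LRU
        };

        int main() {
            Cache direct(512, 1, 64);  // direct-mapped: 1 way
            Cache assoc(512, 8, 64);   // 8-way set-associative, same total size

            // Two address streams that differ by exactly the cache size (32 KiB).
            const std::uint64_t bases[] = {0, 512 * 64};
            int hits_direct = 0, hits_assoc = 0;
            for (int i = 0; i < 1000; ++i) {
                for (std::uint64_t base : bases) {
                    hits_direct += direct.access(base);
                    hits_assoc  += assoc.access(base);
                }
            }
            std::printf("direct-mapped hits: %d / 2000\n", hits_direct);
            std::printf("8-way assoc hits  : %d / 2000\n", hits_assoc);
        }

      With the same total size, the direct-mapped configuration misses on every access to the two colliding addresses, while the 8-way one keeps both lines resident after the first pass.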
  55. References:
      • Patterson & Hennessy, Computer Organization and Design
      • Intel 64 and IA-32 Architectures Software Developer’s Manual, vols. 1–3
      • Click & Goetz, A Crash Course in Modern Hardware
      • Agner Fog, The Microarchitecture of Intel, AMD and VIA CPUs
      • Drepper, What Every Programmer Should Know About Memory
