Modern CPUs and Caches - A Starting Point for Programmers
Yaser Zhian
Fanafzar Game Studio
IGDI, Workshop 07, January 2nd, 2013
   Some notes about the subject
   CPUs and their gimmicks
   Caches and their importance
   How CPU and OS handle memory logically




   These are very complex subjects
     Expect very few details and much simplification
   These are very complicated subjects
     Expect much generalization and omission
   No time
     Even a full course would be hilariously insufficient
   Not an expert
     Sorry! Can’t help much.
   Just a pile of loosely related stuff
   Pressure for performance
   Backwards compatibility
   Cost/power/etc.
   The ridiculous “numbers game”
   Law of diminishing returns
   Latency vs. Throughput



   You can always solve your bandwidth
    (throughput) problems with money, but it is
    rarely so for lag (latency.)
   Relative rates of improvement, latency vs.
    bandwidth (from David Patterson’s keynote, HPEC 2004):
     CPU, 80286 till Pentium 4: 21x vs. 2250x
     Ethernet, 10Mb till 10Gb: 16x vs. 1000x
     Disk, 3600 till 15000rpm: 8x vs. 143x
     DRAM, plain till DDR: 4x vs. 120x
   At the simplest level, the von Neumann
    model stipulates:
     Program is data and is stored in memory along
      with data (departing from Turing’s model)
     Program is executed sequentially
   Not the way computers function anymore…
     Abstraction still used for thinking about programs
     But it’s leaky as heck!
   “Not Your Father’s von Neumann Machine!”
   Speed of Light: can’t send and receive signals to
    and from all parts of the die in a cycle anymore
   Power: more transistors lead to more power,
    which leads to much more heat
   Memory: the CPU isn’t even close to the
    bottleneck anymore. “All your base are belong
    to” memory
   Complexity: adding more transistors for more
    sophisticated operation won’t give much of a
    speedup (e.g. doubling transistors might give
    2%.)

   Family introduced with 8086 in 1978
   Today, new members are still fully binary
    backward-compatible with that puny machine
    (5MHz clock, 20-bit addressing, 16-bit regs.)
   It had very few registers
   It had segmented memory addressing (joy!)
   It had many complex instructions and several
    addressing modes

   1982 (80286): Protected mode, MMU
   1985 (80386): 32-bit ISA, Paging
   1989 (80486): Pipelining, Cache, Integrated FPU
   1993 (Pentium): Superscalar, 64-bit bus, MMX
   1995 (P-Pro): μ-ops, OoO Exec., Register
    Renaming, Speculative Exec.
   1998/99 (K6-2, PIII): 3DNow!/SSE
   2003 (Opteron): 64-bit ISA
   2006 (Core 2): Multi-core
   Registers got expanded from (all 16-bit, not really
    general purpose)
     AX, BX, CX, DX
     SI, DI, BP, SP
     CS, DS, ES, SS, Flags, IP
   To
     16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI,
      R8-R15) plus RIP and Flags and others
     16 x 128-bit XMM regs. (XMM0-...)
         ▪ Or 16 x 256-bit YMM regs. (YMM0-...)
     More than a thousand logically different instructions (the
      usual, plus string processing, cryptography, CRC, complex
      numbers, etc.)
   The Fetch-Decode-Execute-Retire Cycle
   Strategies for more performance:
     More complex instructions, doing more in
      hardware (CISCing things up)
     Faster CPU clock rates (the free lunch)
     Instruction-Level Parallelism (SIMD + gimmicks)
     Adding cores (free lunch is over!)
   And then, there are gimmicks…

   Pipelining
   µ-ops
   Superscalar Pipelines
   Out-of-order Execution
   Speculative Execution
   Register Renaming
   Branch Prediction
   Prefetching
   Store Buffer
   Trace Cache
   …

Classic sequential execution:
   Lengths of instruction executions vary a lot (5-10x
     is usual; several orders of magnitude also happen.)

 [Diagram: instructions 1 through 4 executing strictly one
  after another, each finishing before the next one starts.]

It’s really more like this for the CPU:
   Instructions may have many sub-parts, and they
     engage different parts of the CPU

 [Diagram: each instruction split into Fetch (F), Decode (D),
  Execute (E) and Retire (R) stages, still running strictly one
  instruction after another.]

So why not do this:
   This is called “pipelining”
   It increases throughput (significantly)
   Doesn’t decrease latency for single instructions

 [Diagram: the F/D/E/R stages of instructions 1 through 4
  overlapped, each instruction starting one stage behind the
  previous one.]

But it has its own share of problems
   Hazards, stalls, flushing, etc.
   Execution of i2 depends on the result of i1
   After i2, we jump and then i3, i4, … are flushed out

     add EAX,120
     jmp [EAX]
     mov [4*EBX+42],EDX
     add ECX,[EAX]

 [Diagram: i2’s Execute stage stalls until i1 produces EAX;
  once the jump resolves, the partially executed i3 and i4 are
  flushed out of the pipeline.]

   Instructions are broken up into simple,
    orthogonal µ-ops
     mov EAX,EDX might generate only one µ-op
     mov EAX,[EDX] might generate two:
      1. µld tmp0,[EDX]
      2. µmov EAX,tmp0
     add [EAX],EDX probably generates three:
      1. µld tmp0,[EAX]
      2. µadd tmp0,EDX
      3. µst [EAX],tmp0

   The CPU, then, gets two layers:
     The one that breaks up operations into µ-ops
     The one that executes µ-ops
   The part that executes µ-ops can be simpler
    (more RISCy) and therefore faster.
   More complex instructions can be supported
    without (much) complicating the CPU
   The pipelining (and other gimmicks) can
    happen at the µ-op level
   CPUs that issue (or retire) more than one
    instruction per cycle are called Superscalar
   Can be thought of as a pipeline with more
    than one line
   Simplest form: integer pipe plus floating-point
    pipe
   These days, CPUs do 4 or more
   Obviously requires more of each type of
    operational unit in the CPU
   To keep your pipeline from stalling as much as
    possible, issue the next instructions even if you
    can’t start the current one.
   But of course, only if there are no hazards
    (dependencies) and there are operational
    units available.
    add RAX,RAX
    add RAX,RBX
    add RCX,RDX     (this one can be, and is, started
                     before the previous instruction)

   This obviously also applies at the µ-op level:

    mov RAX,[mem0]
    mul RAX,42
    add RAX,[mem1]   (fetching [mem1] starts long before the
                      result of the multiply becomes available)

    push RAX
    call Func        (pushing RAX is sub RSP,8 and then
                      mov [RSP],RAX; since the call instruction
                      needs RSP too, it only waits for the
                      subtraction, not the store, before starting)

   Consider this:
    mov RAX,[mem0]
    mul RAX,42
    mov [mem1],RAX
    mov RAX,[mem2]
    add RAX,7
    mov [mem3],RAX
   Logically, the two parts are totally separate.
   However, the use of RAX will stall the pipeline.
   Modern CPUs have a lot of temporary,
    unnamed registers at their disposal.
   They will detect the logical independence,
    and will use one of those in the second block
    instead of RAX.
   And they will track which reg. is which, where.
   In effect, they are renaming another register
    to RAX.
   There might not even be a real RAX!
   This is, for once, simpler than it might seem!
   Every time a register is assigned to, a new
    temporary register is used in its stead.
   Consider this:

    mov RAX,[cached]
    mov RBX,[uncached]
    add RBX,RAX
    mul RAX,42        (rename happens here)
    mov [mem0],RAX
    mov [mem1],RBX

   Renaming on the mul means that it won’t clobber RAX
    (which we need for the add, which is itself waiting on the
    load of [uncached]), so we can do the multiply and reach
    the first store much sooner.

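A minimal C++ sketch (mine, not from the deck) of what this renaming machinery buys you: both loops below do the same number of multiply-adds, but the first is one long dependency chain, while the second is four independent chains that the CPU can keep in flight at once. On a typical out-of-order x86 the second version runs several times faster.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    using std::uint64_t;

    // One long dependency chain: every step needs the previous result,
    // so the multiplies cannot overlap no matter how many units exist.
    static uint64_t serial_chain(uint64_t n) {
        uint64_t acc = 1;
        for (uint64_t i = 0; i < n; ++i)
            acc = acc * 3 + 1;
        return acc;
    }

    // Four independent chains: renaming gives each its own physical
    // register, so several multiplies can be in flight per cycle.
    static uint64_t independent_chains(uint64_t n) {
        uint64_t a = 1, b = 1, c = 1, d = 1;
        for (uint64_t i = 0; i < n; i += 4) {
            a = a * 3 + 1;
            b = b * 3 + 1;
            c = c * 3 + 1;
            d = d * 3 + 1;
        }
        return a ^ b ^ c ^ d;
    }

    int main() {
        const uint64_t n = 400000000;  // same op count for both versions
        auto t0 = std::chrono::steady_clock::now();
        uint64_t r1 = serial_chain(n);
        auto t1 = std::chrono::steady_clock::now();
        uint64_t r2 = independent_chains(n);
        auto t2 = std::chrono::steady_clock::now();
        using ms = std::chrono::milliseconds;
        std::printf("serial: %lld ms  independent: %lld ms  (%llu %llu)\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
            (unsigned long long)r1, (unsigned long long)r2);
    }
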
   The CPU always depends on knowing where
    the next instruction is, so it can go ahead and
    work on it.
   That’s why branches in code are anathema to
    modern, deep pipelines and all the gimmicks
    they pull.
   If only the CPU could somehow guess where
    the target of each branch is going to be…
   That’s where branch prediction comes in.
   So the CPU guesses the target of a jump (if it
    doesn’t know for sure,) and continues to
    speculatively execute instructions from there.
   For a conditional jump, the CPU must also
    predict whether the branch is taken or not.
   If the CPU is right, the pipeline flows
    smoothly. If not, the pipeline must be flushed,
    and much time and many resources are wasted on a
    misprediction.
   In this code:
    cmp RAX,0
    jne [RBX]
    both the target and whether the jump happens
    or not must be predicted.
   The above can effectively jump anywhere!
   But usually branches are closer to this:
    cmp RAX,0
    jne somewhere_specific
   Which can only have two possible targets.

   In a simple form, when a branch is executed,
    its target is stored in a table called the BTB (or
    Branch Target Buffer.) When that branch is
    encountered again, the target address is
    predicted to be the value read from the BTB.
   As you might guess, this doesn’t work for
    many situations (e.g. alternating branch.)
   Also, the size of the BTB is limited, so the CPU
    will forget about the last target of some jumps.
 A simple expansion on the previous idea is to use a
  saturating counter along with each entry of the BTB.
 For example, with a 2-bit counter,
     Branch is predicted not to be taken if the counter is 0 or 1.
     The branch is predicted to be taken if the counter is 2 or 3.
     Each time it is taken, counter is incremented, and vice versa.
 [Diagram: the four counter states – Strongly Not Taken,
  Weakly Not Taken, Weakly Taken, Strongly Taken – with a
  taken branch (T) moving one state toward Strongly Taken
  and a not-taken branch (NT) moving one state the other
  way, saturating at both ends.]

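A tiny C++ model of such a counter (my sketch, not from the deck); feeding it the alternating taken/not-taken pattern discussed below reproduces the worst case:

    #include <cstdio>

    // 2-bit saturating counter: states 0..3.
    // 0 or 1 => predict not taken; 2 or 3 => predict taken.
    struct TwoBitCounter {
        int state = 1;  // start at "weakly not taken" (binary 01)
        bool predict() const { return state >= 2; }
        void update(bool taken) {
            if (taken) { if (state < 3) ++state; }
            else       { if (state > 0) --state; }
        }
    };

    int main() {
        TwoBitCounter c;
        int miss = 0;
        const int runs = 1000;
        for (int i = 0; i < runs; ++i) {
            bool taken = (i % 2 == 0);  // alternating: T, NT, T, NT, ...
            if (c.predict() != taken) ++miss;
            c.update(taken);
        }
        // Starting at 01 with a taken branch first: 100% mispredicted.
        std::printf("mispredicted %d of %d\n", miss, runs);
    }
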
   But this behaves very badly in common situations.
   For an alternating branch,
     If the counter starts in 00 or 11, it will mispredict 50%.
     If the counter starts in 01, and the first time the branch
      is taken, it will mispredict 100%!
   As an improvement, we can store the history of
    the last N occurrences of the branch in the BTB,
    and use 2^N counters, one for each of the
    possible history patterns.

   For N=4 and 2-bit counters, we’ll have:
     This is an extremely cool method of doing branch
      prediction!

 [Diagram: the 4-bit branch history (e.g. 0010) indexes a
  table of sixteen 2-bit counters; the selected counter
  supplies the prediction (0 or 1).]

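Here is a hedged C++ sketch (mine) of that scheme: the last N outcomes select one of 2^N 2-bit counters. On the alternating branch that defeated the single counter, the two recurring history patterns (…0101 and …1010) each train their own counter, so after a short warm-up the mispredictions stop:

    #include <cstdio>

    struct HistoryPredictor {
        static const unsigned N = 4;        // bits of branch history
        unsigned history = 0;               // last N outcomes, newest in bit 0
        int counters[1u << N] = {};         // one 2-bit counter per pattern
        bool predict() const { return counters[history] >= 2; }
        void update(bool taken) {
            int& c = counters[history];
            if (taken) { if (c < 3) ++c; }
            else       { if (c > 0) --c; }
            history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << N) - 1);
        }
    };

    int main() {
        HistoryPredictor p;
        int miss = 0;
        const int runs = 1000;
        for (int i = 0; i < runs; ++i) {
            bool taken = (i % 2 == 0);      // the same alternating branch
            if (p.predict() != taken) ++miss;
            p.update(taken);
        }
        std::printf("mispredicted %d of %d\n", miss, runs);  // only warm-up misses
    }
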
   Some predictions are simpler:
     For each ret instruction, the target is somewhere
      on the stack (pushed before.) Modern CPUs keep
      track of return addresses in an internal return
       stack buffer. Each time a call is executed, an
       entry is pushed there, and the matching ret uses
       it to predict the return address.
     On a cold encounter (a.k.a. static prediction) a
      branch is sometimes predicted to
      ▪ fall through if it goes forward.
      ▪ be taken if it goes backward.

   Best general advice is to arrange your code so
    that the most common path for branches is
    “not taken” (see the sketch below.) This improves
    the effectiveness of code prefetching and the trace cache.
   Branch prediction, register renaming and
    speculative execution work extremely well
    together.


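One concrete way to follow this advice in GCC or Clang is __builtin_expect, which tells the compiler which way a branch usually goes so it can lay the common path out as straight-line, fall-through code. A sketch under the assumption that negative inputs are rare; the function and names are made up for illustration (C++20 offers [[likely]]/[[unlikely]] attributes for the same purpose):

    #include <cstddef>

    // Evaluates to x, but tells GCC/Clang that x is expected to be false.
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    // Hypothetical hot loop: bad (negative) inputs are rare.
    long sum_valid(const int* v, std::size_t n) {
        long total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (UNLIKELY(v[i] < 0))
                continue;        // rare path: the taken branch, off the hot path
            total += v[i];       // common path: falls straight through
        }
        return total;
    }
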
mov   RAX,[RBX+16]
add   RBX,16
cmp   RAX,0
je    IsNull
mov   [RBX-16],RCX
mov   RCX,[RDX+0]
mov   RAX,[RAX+8]

Clock 0 – Instruction 0: mov RAX,[RBX+16]
   Load RAX from memory. Assume a cache miss – 300
   cycles to load.
   The instruction starts, and dispatch continues...

Clock 0 – Instruction 1: add RBX,16
   This instruction writes RBX, which conflicts with the
   read in instruction 0.
   Rename this instance of RBX and continue…

Clock 0 – Instruction 2: cmp RAX,0
   The value of RAX is not available yet; cannot calculate
   the value of the Flags register.
   Queue up behind instruction 0…

Clock 0 – Instruction 3: je IsNull
   The Flags register is still not available. Predict that
   this branch is not taken.
   Assuming 4-wide dispatch, the instruction issue limit
   is reached.

Clock 1 – Instruction 4: mov [RBX-16],RCX
   The store is speculative, so its result is kept in the
   Store Buffer. Also, RBX might not be available yet
   (from instruction 1.)
   The Load/Store Unit is tied up from now on; no more
   memory ops can issue in this cycle.

Clock 2 – Instruction 5: mov RCX,[RDX+0]
   Had to wait for the L/S Unit. Assume this is another
   (and unrelated) cache miss. We now have 2 overlapping
   cache misses. The L/S Unit is busy again.

Clock 3 – Instruction 6: mov RAX,[RAX+8]
   RAX is not ready yet (300-cycle latency, remember?!)
   This load cannot even start until instruction 0 is done.

Clock 301 – Instruction 2: cmp RAX,0
   At clock 300 (or 301,) RAX is finally ready.
   Do the comparison and update the Flags register.

Clock 301 – Instruction 6: mov RAX,[RAX+8]
   Issue this load too. Assume a cache hit (finally!) The
   result will be available at clock 304.

Clock 302 – Instruction 3: je IsNull
   Now the Flags register is ready. Check the prediction.
   Assume the prediction was correct.

Clock 302 – Instruction 4: mov [RBX-16],RCX
   This speculative store can now be committed to
   memory (or cache, actually.)

Clock 302 – Instruction 5: mov RCX,[RDX+0]
   At clock 302, the result of this load arrives.

Clock 305 – Instruction 6: mov RAX,[RAX+8]
   The result arrived at clock 304; the instruction is
   retired at 305.

To summarize,
   • In 4 clocks, we started 7 ops and 2 cache misses.
   • Retired 7 ops in 306 cycles.
   • Cache misses totally dominate performance.
   • The only real benefit came from being able to have
     2 overlapping cache misses!

So what was the point of all these gimmicks? To get to
the next cache miss as early as possible.

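The difference between serialized and overlapping misses is easy to feel. In this C++ sketch of mine, the first loop chases a random permutation, so each load’s address depends on the previous load and the misses happen one at a time; the second loop reads the same data with addresses known up front, so the CPU can keep many misses in flight:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;  // 128 MB of indices, far beyond cache
        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});
        std::shuffle(next.begin(), next.end(), std::mt19937_64{42});

        // Dependent loads: each address comes from the previous load,
        // so cache misses are serialized.
        auto t0 = std::chrono::steady_clock::now();
        std::size_t pos = 0;
        for (std::size_t i = 0; i < n; ++i) pos = next[pos];
        auto t1 = std::chrono::steady_clock::now();

        // Independent loads: misses can overlap (and prefetching helps too).
        std::size_t sum = 0;
        for (std::size_t i = 0; i < n; ++i) sum += next[i];
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("chase: %lld ms   sum: %lld ms   (%zu %zu)\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
            pos, sum);
    }
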
   Main memory is slow; S.L.O.W.
   Very slow
   Painfully slow
   And it especially has very bad (high) latency
   But all is not lost! Many (most) references to
    memory have high temporal and spatial locality.
   So we use a small amount of very fast memory to
    keep recently-accessed or likely-to-be-accessed
    chunks of main memory close to CPU.

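A C++ sketch of mine showing that locality at work: both loops below sum the same 64 MB grid, but the row-major walk uses all 16 ints of each 64-byte cache line before moving on, while the column-major walk touches a new line on every single access and typically runs several times slower:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t rows = 4096, cols = 4096;
        std::vector<int> grid(rows * cols, 1);  // 64 MB, larger than any cache

        // Row-major: consecutive accesses stay inside one cache line.
        auto t0 = std::chrono::steady_clock::now();
        long sum1 = 0;
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                sum1 += grid[r * cols + c];
        auto t1 = std::chrono::steady_clock::now();

        // Column-major: every access lands on a different cache line.
        long sum2 = 0;
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                sum2 += grid[r * cols + c];
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("row-major: %lld ms   column-major: %lld ms   (%ld %ld)\n",
            (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
            sum1, sum2);
    }
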
   Caches typically come in several levels (3 these days.)
   Each lower level is several times smaller, but
    several times faster than the level above.
   CPU can only see the L1 cache, each level only
    sees the level above, and only the highest
    level can communicate with main memory.
   Data is transferred between memory and
    cache in units of fixed size, called a cache line.
    The most common size today is 64 bytes.
   When any memory byte is needed, its place in
    the cache is calculated;
   CPU asks the cache;
   If there, the cache returns the data;
   If not, the data is pulled in from memory;
   If the calculated cache line is occupied by data
    with a different tag, that data is evicted;
   If the line is dirty (modified,) it is written back
    to memory first.

 [Diagram: main memory divided into blocks the size of a
  cache line, mapping onto the much smaller cache; each cache
  block also holds metadata like the tag (address) and some
  flags.]

   In this basic model, if the CPU periodically
    accesses memory addresses that differ by a
    multiple of the cache size, they will constantly
    evict each other, and most cache accesses
    will be misses. This is called cache thrashing.
   An application can innocently and very easily
    trigger this (see the sketch below.)


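A hedged C++ sketch of triggering it on purpose. The cache geometry is an assumption (32 KiB, 8-way, 64-byte lines, typical for an L1 data cache); with it, addresses 4 KiB apart map to the same set, so cycling through 16 such addresses keeps evicting lines that are about to be needed again, while the same 16 accesses at a slightly different stride spread across sets and stay resident:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Assumed L1d geometry: 32 KiB / (8 ways * 64-byte lines) = 64 sets,
    // so addresses that differ by 64 * 64 = 4096 bytes share a set.
    int main() {
        const std::size_t k = 16;                          // lines > ways
        const std::size_t conflict = 4096 / sizeof(int);   // same set each time
        const std::size_t friendly = conflict + 64 / sizeof(int); // sets differ
        std::vector<int> data(k * friendly + 1, 1);

        auto run = [&](const char* name, std::size_t stride) {
            long sum = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (int rep = 0; rep < 2000000; ++rep)
                for (std::size_t i = 0; i < k; ++i)
                    sum += data[i * stride];
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %lld ms (sum %ld)\n", name,
                (long long)std::chrono::duration_cast<
                    std::chrono::milliseconds>(t1 - t0).count(), sum);
        };
        run("conflicting stride", conflict);  // 16 lines fight over 8 ways
        run("friendly stride   ", friendly);
    }
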
   To alleviate this problem, each cache block is
    turned into an associative memory that can
    house more than one cache line.
   Each cache block holds several cache lines (2, 4,
    8 or more,) and the tag is still used to look up
    the line requested by the CPU within the block.
   When a new line comes in from memory, an
    LRU (or similar) policy is used to evict only the
    least-likely-to-be-needed line.
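The lookup arithmetic itself is simple. A sketch of mine with an assumed 32 KiB, 8-way, 64-byte-line cache, showing how an address splits into line offset, set index and tag, and why addresses 4 KiB apart land in the same set (tying back to the thrashing example above):

    #include <cstdint>
    #include <cstdio>

    // Assumed geometry: 32 KiB, 8-way set-associative, 64-byte lines.
    const std::uint64_t kLine = 64;
    const std::uint64_t kWays = 8;
    const std::uint64_t kSets = (32 * 1024) / (kLine * kWays);  // 64 sets

    std::uint64_t set_of(std::uint64_t addr) { return (addr / kLine) % kSets; }
    std::uint64_t tag_of(std::uint64_t addr) { return addr / (kLine * kSets); }

    int main() {
        std::uint64_t a = 0x12340;
        std::uint64_t b = a + kLine * kSets;  // 4 KiB away
        // Same set, different tags: a and b compete for that set's 8 ways.
        std::printf("a: set %llu, tag 0x%llx\n",
            (unsigned long long)set_of(a), (unsigned long long)tag_of(a));
        std::printf("b: set %llu, tag 0x%llx\n",
            (unsigned long long)set_of(b), (unsigned long long)tag_of(b));
    }
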
   References:
     Patterson & Hennessy – Computer Organization and Design
     Intel – Intel 64 and IA-32 Architectures Software
      Developer’s Manual, vols. 1, 2 and 3
     Click & Goetz – A Crash Course in Modern Hardware
     Agner Fog – The Microarchitecture of Intel, AMD and VIA
      CPUs
     Drepper – What Every Programmer Should Know About
      Memory

