TDT 4260 – lecture 1 – 2011

• Course introduction
   – course goals
   – staff
   – contents
   – evaluation
   – web, ITSL
• Textbook
   – Computer Architecture, A Quantitative Approach, Fourth Edition
      • by John Hennessy & David Patterson (HP 90 – 96 – 03 – 06)
• Today: Introduction (Chapter 1)
   – Partly covered

Course goal

• To get a general and deep understanding of the organization of modern computers and the motivation for different computer architectures. Give a basis for understanding research themes within the field.
• High level
• Mostly HW and low-level SW
• HW/SW interplay
• Parallelism
• Principles, not details
   – inspire to learn more




Contents

• Computer architecture fundamentals, trends, measuring performance, quantitative principles. Instruction set architectures and the role of compilers. Instruction-level parallelism, thread-level parallelism, VLIW.
• Memory hierarchy design, cache. Multiprocessors, shared memory architectures, vector processors, NTNU/Notur supercomputers, distributed shared memory, synchronization, multithreading.
• Interconnection networks, topologies
• Multicores, homogeneous and heterogeneous, principles and product examples
• Green computing (introduction)
• Miniproject - prefetching

TDT-4260 / DT8803

• Recommended background
   – Course TDT4160 Computer Fundamentals, or equivalent.
• http://www.idi.ntnu.no/emner/tdt4260/
   – And It's Learning
• Friday 1215-1400
   – And/or some Thursdays 1015-1200
   – 12 lectures planned
   – some exceptions may occur
• Evaluation
   – Obligatory exercise (counts 20%). Written exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit examination, the examination form may change from written to oral.




Lecture plan (subject to change)

 Date and lecturer       Topic
 1:  14 Jan (LN, AI)     Introduction, Chapter 1 / Alex: PfJudge
 2:  21 Jan (IB)         Pipelining, Appendix A; ILP, Chapter 2
 3:  28 Jan (IB)         ILP, Chapter 2; TLP, Chapter 3
 4:  4 Feb (LN)          Multiprocessors, Chapter 4
 5:  11 Feb (MG?)        Prefetching + Energy Micro guest lecture
 6:  18 Feb (LN)         Multiprocessors continued
 7:  25 Feb (IB)         Piranha CMP + Interconnection networks
 8:  4 Mar (IB)          Memory and cache, cache coherence (Chap. 5)
 9:  11 Mar (LN)         Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
 10: 18 Mar (IB)         Memory consistency (4.6) + more on memory
 11: 25 Mar (JA, AI)     (1) Kongull and other NTNU and NOTUR supercomputers  (2) Green computing
 12: 1 Apr (IB/LN)       Wrap up lecture, remaining stuff
 13: 8 Apr               Slack – no lecture planned

EMECS, new European Master's Course in Embedded Computing Systems




Preliminary reading list, subject to change!!!

• Chap. 1: Fundamentals, sections 1.1 - 1.12 (pages 2-54)
• Chap. 2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11 - 2.12 (pages 138-141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course)
• Chap. 3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5 - 3.8 (pages 172-185).
• Chap. 4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10
• Chap. 5: Memory hierarchy, sections 5.1 - 5.3 (pages 288-315).
• App. A: section A.1 (Expected to be repetition from other courses)
• Appendix E, interconnection networks, pages E2-E14, E20-E25, E29-E37 and E45-E51.
• App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F-44 - F-45)
• Data prefetch mechanisms (ACM Computing Survey)
• Piranha (To be announced)
• Multicores (New book chapter) (To be announced)
• (App. D: embedded systems?) See our new course TDT4258 Microcontroller system design

People involved

Lasse Natvig
Course responsible, lecturer
lasse@idi.ntnu.no

Ian Bratt
Lecturer (also at Tilera.com)
ianbra@idi.ntnu.no

Alexandru Iordan
Teaching assistant (also PhD student)
iordan@idi.ntnu.no

http://www.idi.ntnu.no/people/




research.idi.ntnu.no/multicore

A few highlights:
- Green computing, 2 x PhD + master students
- Multicore memory systems, 3 x PhD theses
- Multicore programming and parallel computing
- Cooperation with industry

Prefetching – pfjudge




”Computational computer architecture”

• Computational science and engineering (CSE)
   – Computational X, X = comp.arch.
• Simulates new multicore architectures
   – Last level, shared cache fairness (PhD-student M. Jahre)
   – Bandwidth aware prefetching (PhD-student M. Grannæs)
• Complex cycle-accurate simulators
   – 80 000 lines C++, 20 000 lines Python
   – Open source, Linux-based
• Design space exploration (DSE)
   – one dimension for each arch. parameter
   – DSE sample point = specific multicore configuration
   – performance of a selected set of configurations evaluated by simulating the execution of a set of workloads

Experiment Infrastructure

• Stallo compute cluster
   – 60 Teraflop/s peak
   – 5632 processing cores
   – 12 TB total memory
   – 128 TB centralized disk
   – Weighs 16 tons
• Multi-core research
   – About 60 CPU years allocated per year to our projects
   – Typical research paper uses 5 to 12 CPU years for simulation (extensive, detailed design space exploration)




The End of Moore’s law for single-core microprocessors

But Moore’s law still holds for FPGA, memory and multicore processors

Motivational background

• Why multicores
   – in all market segments from mobile phones to supercomputers
• The ”end” of Moore’s law
• The power wall
• The memory wall
• The bandwidth problem
• ILP limitations
• The complexity wall




Energy & Heat Problems

• Large power consumption
   – Costly
   – Heat problems
   – Restricted battery operation time
• Google ”Open House Trondheim 2006”
   – ”Performance/Watt is the only flat trend line”

The Memory Wall

[Figure: relative performance since 1980 (”Moore’s Law”): CPU improves about 60%/year, DRAM about 9%/year, so the processor-memory gap grows about 50%/year]

• The Processor Memory Gap
• Consequence: deeper memory hierarchies
   – P – Registers – L1 cache – L2 cache – L3 cache – Memory - - -
   – Complicates understanding of performance
      • cache usage has an increasing influence on performance (see the sketch below)
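To make that last point concrete, here is a minimal C sketch using the textbook's average memory access time (AMAT) formula for a two-level cache. The latencies and miss rates are assumed, illustrative values, not figures from the lecture.

```c
#include <stdio.h>

/* Average memory access time (AMAT) for a two-level cache hierarchy:
 * AMAT = hit_L1 + miss_L1 * (hit_L2 + miss_L2 * memory_penalty).
 * All latencies are in CPU cycles; the numbers are illustrative only. */
int main(void) {
    const double hit_l1 = 1.0, hit_l2 = 10.0, mem_penalty = 200.0;
    const double miss_l1 = 0.05;                  /* 5% of accesses miss in L1 */
    const double miss_l2[] = { 0.1, 0.3, 0.5 };   /* local L2 miss rates       */

    for (int i = 0; i < 3; i++) {
        double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2[i] * mem_penalty);
        printf("L2 miss rate %.0f%% -> AMAT = %.2f cycles\n",
               100.0 * miss_l2[i], amat);
    }
    return 0;
}
```

With a 200-cycle memory penalty, moving the L2 miss rate from 10% to 50% takes the average access time from 2.5 to 6.5 cycles, which is why the deeper hierarchy complicates performance reasoning.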




The I/O pin or Bandwidth problem

• # I/O signaling pins
   – limited by physical technology
   – speeds have not increased at the same rate as processor clock rates
• Projections
   – from ITRS (International Technology Roadmap for Semiconductors)

[Huh, Burger and Keckler 2001]

The limitations of ILP (Instruction Level Parallelism) in Applications

[Figure: fraction of total cycles (%) vs. number of instructions issued, and speedup vs. instructions issued per cycle]




Reduced Increase in Clock Frequency

Solution: Multicore architectures (also called Chip Multi-processors - CMP)

• More power-efficient
   – Two cores with clock frequency f/2 can potentially achieve the same speed as one at frequency f with 50% reduction in total energy consumption [Olukotun & Hammond 2005] (see the sketch below)
• Exploits Thread Level Parallelism (TLP)
   – in addition to ILP
   – requires multiprogramming or parallel programming
• Opens new possibilities for architectural innovations
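A minimal sketch of the arithmetic behind that bullet, assuming the standard dynamic-power model P ≈ C·V²·f and that the supply voltage can be lowered together with the clock. How much energy is saved depends on how far V can be reduced; none of the numbers below are taken from Olukotun & Hammond.

```c
#include <stdio.h>

/* Illustrative sketch (not from the lecture): dynamic power of CMOS logic
 * is roughly P = C * V^2 * f.  Compare one core at frequency f with two
 * cores at f/2, assuming the parallel workload scales perfectly and the
 * supply voltage can be lowered by the factor v_scale at the lower clock. */
static double dynamic_power(double c, double v, double f) {
    return c * v * v * f;
}

int main(void) {
    const double C = 1.0, V = 1.0, F = 1.0;     /* normalised units */
    double p_single = dynamic_power(C, V, F);   /* one core at f    */

    for (double v_scale = 1.0; v_scale >= 0.69; v_scale -= 0.05) {
        double p_dual = 2.0 * dynamic_power(C, V * v_scale, F / 2.0);
        /* Same work per second in both cases, so the power ratio is
         * also the energy ratio for a fixed task. */
        printf("V scaled to %.2f: dual-core energy = %.0f%% of single-core\n",
               v_scale, 100.0 * p_dual / p_single);
    }
    return 0;
}
```

Scaling V to roughly 0.7 of the original supply gives about the 50% total energy reduction quoted on the slide; with no voltage scaling the two-core design only matches, rather than beats, the single core's energy.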




Why heterogeneous multicores?

• Specialized HW is faster than general HW
   – Math co-processor
   – GPU, DSP, etc…
• Benefits of customization
   – Similar to ASIC vs. general purpose programmable HW
• Amdahl’s law
   – Parallel speedup limited by serial fraction
      • 1 super-core

[Figure: Cell BE processor]

CPU – GPU – convergence (Performance – Programmability)

• Processors: Larrabee, Fermi, …
• Languages: CUDA, OpenCL, …




Parallel processing – conflicting goals

The P6-model: Parallel Processing challenges: Performance, Portability, Programmability and Power efficiency

[Figure: Performance, Portability, Programmability and Power efficiency as conflicting goals]

• Examples:
   – Performance tuning may reduce portability
      • E.g. data structures adapted to cache block size
   – New languages for higher programmability may reduce performance and increase power consumption

Multicore programming challenges

• Instability, diversity, conflicting goals … what to do?
• What kind of parallel programming?
   – Homogeneous vs. heterogeneous
   – DSL vs. general languages
   – Memory locality
• What to teach?
   – Teaching should be founded on active research
• Two layers of programmers
   – The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]
      • Krste Asanovic presentation at ACACES Summer School 2007
   – 1) Programmability layer (Productivity layer) (80 - 90%)
      • ”Joe the programmer”
   – 2) Performance layer (Efficiency layer) (10 - 20%)
• Both layers involved in HPC
• Programmability an issue also at the performance layer




Parallel Computing Laboratory, U.C. Berkeley (Slide adapted from Dave Patterson)

Easy to write correct programs that run efficiently on manycore

[Par Lab software stack: applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) built on design patterns/motifs and a Composition & Coordination Language (C&CL); a productivity layer (C&CL compiler/interpreter, parallel libraries, parallel frameworks) over an efficiency layer (efficiency languages, sketching, autotuners, schedulers, communication & synchronization primitives, efficiency language compilers); legacy code, OS libraries & services, legacy OS and hypervisor on Multicore/GPGPU and RAMP Manycore hardware; diagnosing power/performance cuts across all layers]

Classes of computers

• Servers
   – storage servers
   – compute servers (supercomputers)
   – web servers
   – high availability
   – scalability
   – throughput oriented (response time of less importance)
• Desktop (price 3000 NOK – 50 000 NOK)
   – the largest market
   – price/performance focus
   – latency oriented (response time)
• Embedded systems
   – the fastest growing market (”everywhere”)
   – TDT 4258 Microcontroller system design
   – ATMEL, Nordic Semic., ARM, EM, ++




Falanx (Mali)
ARM Norway

Borgar, FXI Technologies

”An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and applications eco-systems (i.e. build an ARM based SoC, put it in a memory card, connect it to the web - and voila, you got iPhone for the masses).”

• http://www.fxitech.com/
   – ”Headquartered in Trondheim
      • But also an office in Silicon Valley …”




Trends

• For technology, costs, use
• Help predicting the future
• Product development time
   – 2-3 years
   – design for the next technology
   – Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach

• What really matters is the functioning of the complete system
   – hardware, runtime system, compiler, operating system, and application
   – In networking, this is called the “End to End argument”
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
   – E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions




Computer Architecture is Design and Analysis

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: iterative loop of Design, Analysis and Creativity, with cost/performance analysis sorting good, mediocre and bad ideas]

TDT4260 Course Focus

Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st Century

[Figure: Computer Architecture (organization, hardware/software boundary) at the centre of Technology, Parallelism, Programming Languages, Applications, Interface Design (ISA), Compilers, Operating Systems, Measurement & Evaluation, and History]




Holistic approach, e.g., to programmability

• Parallel & concurrent programming
• Operating System & system software
• Multicore, interconnect, memory

Moore’s Law: 2X transistors / “year”

• “Cramming More Components onto Integrated Circuits”
   – Gordon Moore, Electronics, 1965
• # of transistors / cost-effective integrated circuit double every N months (12 ≤ N ≤ 24)
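A small sketch of what that doubling rule implies over a decade; the ten-year horizon and the three values of N are example inputs only, not figures from the lecture.

```c
#include <stdio.h>
#include <math.h>   /* link with -lm */

/* Growth implied by "transistor count doubles every N months":
 * after m months the count has grown by a factor of 2^(m/N). */
int main(void) {
    const double months = 120.0;                 /* a ten-year span */
    const int doubling_period[] = { 12, 18, 24 };

    for (int i = 0; i < 3; i++) {
        double factor = pow(2.0, months / doubling_period[i]);
        printf("N = %2d months -> %.0fx more transistors in 10 years\n",
               doubling_period[i], factor);
    }
    return 0;
}
```

At the often-quoted 18-month doubling period this works out to roughly a 100x increase per decade.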




Tracking Technology Performance Trends

• 4 critical implementation technologies:
   – Disks,
   – Memory,
   – Network,
   – Processors
• Compare for Bandwidth vs. Latency improvements in performance over time
• Bandwidth: number of events per unit time
   – E.g., M bits/second over network, M bytes/second from disk
• Latency: elapsed time for a single event
   – E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)

[Figure: relative bandwidth improvement vs. relative latency improvement (log-log) for Processor, Network, Memory and Disk; CPU high, Memory low (“Memory Wall”); the reference line marks latency improvement = bandwidth improvement]

• Performance Milestones
   – Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
   – Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
   – Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
   – Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
   (Processor latency = typical # of pipeline stages * time per clock cycle)




COST and COTS

• Cost
   – to produce one unit
   – include (development cost / # sold units)
   – benefit of large volume
• COTS
   – commodity off the shelf

Speedup / Superlinear speedup?

• General definition:
   Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time
   – Speedup_fixed problem (p processors) = Time (1 processor) / Time (p processors)
• Note: use best sequential algorithm in the uni-processor solution, not the parallel algorithm with p = 1 (see the sketch below)
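A minimal sketch of the fixed-problem-size definition above. The timings are invented measurements used only to show the calculation; Time(1) is meant to come from the best sequential program, not the parallel program run with p = 1.

```c
#include <stdio.h>

/* Speedup for a fixed problem size:
 * Speedup(p) = Time(1 processor) / Time(p processors). */
int main(void) {
    const double time_seq   = 120.0;                  /* seconds, best sequential */
    const int    procs[]    = { 2, 4, 8 };
    const double time_par[] = { 63.0, 34.0, 19.5 };   /* seconds on p processors  */

    for (int i = 0; i < 3; i++) {
        double speedup = time_seq / time_par[i];
        printf("p = %d: speedup = %.2f (linear would be %d)\n",
               procs[i], speedup, procs[i]);
    }
    return 0;
}
```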




Amdahl’s Law (1967) (fixed problem size)

• “If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s”
• Total work in computation
   – serial fraction s
   – parallel fraction p
   – s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
   = (s + p) / [s + (p/n)]
   = 1 / [s + (1-s)/n]
   = n / [1 + (n-1)s]
• ”pessimistic and famous”

Gustafson’s “law” (1987) (scaled problem size, fixed execution time)

• Total execution time on parallel computer with n processors is fixed
   – serial fraction s’
   – parallel fraction p’
   – s’ + p’ = 1 (100%)
• S’(n) = Time’(1) / Time’(n)
   = (s’ + p’n) / (s’ + p’)
   = s’ + p’n = s’ + (1-s’)n
   = n + (1-n)s’
• Reevaluating Amdahl's law, John L. Gustafson, CACM May 1988, pp 532-533. ”Not a new law, but Amdahl’s law with changed assumptions”

(The two formulas are compared in the sketch below.)
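A small sketch comparing the two formulas above for an assumed 5% serial fraction; the value of s and the processor counts are examples only.

```c
#include <stdio.h>

/* Amdahl's law (fixed problem size):   S(n)  = 1 / (s + (1 - s)/n)
 * Gustafson's law (scaled problem):    S'(n) = s' + (1 - s') * n
 * s is the serial fraction of the sequential run, s' the serial fraction
 * of the (fixed) time on the parallel machine. */
int main(void) {
    const double s = 0.05, s_scaled = 0.05;   /* 5% serial fraction */
    const int n_values[] = { 4, 16, 64, 256 };

    for (int i = 0; i < 4; i++) {
        int n = n_values[i];
        double amdahl    = 1.0 / (s + (1.0 - s) / n);
        double gustafson = s_scaled + (1.0 - s_scaled) * n;
        printf("n = %3d: Amdahl %.1f, Gustafson %.1f\n", n, amdahl, gustafson);
    }
    return 0;
}
```

Amdahl's speedup saturates towards 1/s = 20 no matter how many processors are added, while Gustafson's scaled speedup keeps growing because the parallel part of the problem grows with n.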




How the serial fraction limits speedup

• Amdahl’s law
• Work hard to reduce the serial part of the application
   – remember IO
   – think different (than traditionally or sequentially)

[Figure: speedup vs. number of processors for different values of the serial fraction]








TDT4260 Computer architecture
Mini-project

PhD candidate Alexandru Ciprian Iordan
Department of Computer and Information Science (IDI)




    What is it…? How much…?
    • The mini-project is the exercise part of the TDT4260 course

    • This year the students will need to develop and
      evaluate a PREFETCHER

    • The mini-project accounts for 20 % of the final grade
      in TDT4260
          • 80 % for report
          • 20 % for oral presentation




    What will you work with…

    • Modified version of M5 (for development and
      evaluation)

    • Computing time on Kongull cluster (for
      benchmarking)

    • More at: http://dm-ark.idi.ntnu.no/
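For orientation, here is a minimal stride-prefetcher sketch in plain C. It is not the interface of the modified M5 simulator used in the mini-project (see the project page above for that); issue_prefetch() and the table layout are assumptions made for this illustration only. The idea is to track the stride of each load instruction (PC) and request the next cache block ahead of time.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 256

typedef struct {
    uint64_t pc;          /* address of the load instruction  */
    uint64_t last_addr;   /* last data address it touched     */
    int64_t  stride;      /* last observed stride             */
    bool     valid;
} entry_t;

static entry_t table[TABLE_SIZE];

/* Stand-in for the simulator call that queues a prefetch request. */
static void issue_prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called for every memory access, with the load's PC and data address. */
static void prefetcher_access(uint64_t pc, uint64_t addr) {
    entry_t *e = &table[pc % TABLE_SIZE];

    if (e->valid && e->pc == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->stride)
            issue_prefetch(addr + (uint64_t)stride);  /* stride seen twice */
        e->stride = stride;
    } else {
        e->pc = pc; e->stride = 0; e->valid = true;
    }
    e->last_addr = addr;
}

int main(void) {
    /* One load (PC 0x400100) streaming through an array with stride 64. */
    for (uint64_t i = 0; i < 6; i++)
        prefetcher_access(0x400100, 0x10000 + i * 64);
    return 0;
}
```

Design choices such as table size, prefetch degree and when to trigger a prefetch are the kind of parameters the mini-project report can explore and evaluate.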




    M5
    • Initially developed by the University of Michigan

    • Enjoys a large community of users and developers

    • Flexible object-oriented architecture

    • Has support for 3 ISAs: Alpha, SPARC and MIPS




    Team work…

    • You need to work in groups of 2-4 students



    • Grade is based on written paper AND oral presentation (choose your best speaker)




    Time Schedule and Deadlines




              More on It’s learning




    Web page presentation
TDT 4260
App A.1, Chap 2
Instruction Level Parallelism

Contents

• Instruction level parallelism              Chap 2
• Pipelining (repetition)                    App A
   ▫ Basic 5-step pipeline
• Dependencies and hazards                   Chap 2.1
   ▫ Data, name, control, structural
• Compiler techniques for ILP                Chap 2.2
• (Static prediction                         Chap 2.3)
   ▫ Read this on your own
• Project introduction
Instruction level parallelism (ILP)

• A program is a sequence of instructions, typically written to be executed one after the other
• Poor usage of CPU resources! (Why?)
• Better: Execute instructions in parallel
   ▫ 1: Pipeline
        Partial overlap of instruction execution
   ▫ 2: Multiple issue
        Total overlap of instruction execution
• Today: Pipelining

Pipelining (1/3)




Pipelining (2/3)

• Multiple different stages executed in parallel
   ▫ Laundry in 4 different stages
   ▫ Wash / Dry / Fold / Store
• Assumptions:
   ▫ Task can be split into stages
   ▫ Storage of temporary data
   ▫ Stages synchronized
   ▫ Next operation known before last finished?

Pipelining (3/3)

• Good Utilization: All stages are ALWAYS in use
   ▫ Washing, drying, folding, ...
   ▫ Great usage of resources!
• Common technique, used everywhere
   ▫ Manufacturing, CPUs, etc
• Ideal: time_stage = time_instruction / stages (see the sketch below)
   ▫ But stages are not perfectly balanced
   ▫ But transfer between stages takes time
   ▫ But pipeline may have to be emptied
   ▫ ...
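A minimal sketch of the ideal-vs-real point above, with five assumed (unbalanced) stage latencies and a latch overhead; none of the numbers come from the lecture.

```c
#include <stdio.h>

/* With k stages the ideal cycle time is time_instruction / k, but the real
 * cycle time is set by the slowest stage plus latch (register) overhead,
 * and the pipeline also needs k - 1 cycles to fill. */
int main(void) {
    const double stage_ns[] = { 10.0, 8.0, 12.0, 9.0, 11.0 };  /* 5 unbalanced stages */
    const double latch_ns   = 1.0;                             /* transfer overhead   */
    const int    k = 5, n_instr = 1000;

    double unpipelined = 0.0, slowest = 0.0;
    for (int i = 0; i < k; i++) {
        unpipelined += stage_ns[i];
        if (stage_ns[i] > slowest) slowest = stage_ns[i];
    }

    double cycle  = slowest + latch_ns;            /* real clock period        */
    double t_seq  = n_instr * unpipelined;         /* no pipelining            */
    double t_pipe = (n_instr + k - 1) * cycle;     /* fill, then one per cycle */

    printf("ideal speedup:  %d\n", k);
    printf("actual speedup: %.2f\n", t_seq / t_pipe);
    return 0;
}
```

The unbalanced stages and the latch delay alone pull the speedup from the ideal 5 down to below 4, before hazards and pipeline flushes are even considered.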
Example: MIPS64 (1/2)

• RISC
• Load/store
• Few instruction formats
• Fixed instruction length
• 64-bit
   ▫ DADD = 64 bits ADD
   ▫ LD = 64 bits L(oad)
• 32 registers (R0 = 0)
• EA = offset(Register)

• Pipeline
   ▫ IF: Instruction fetch
   ▫ ID: Instruction decode / register fetch
   ▫ EX: Execute / effective address (EA)
   ▫ MEM: Memory access
   ▫ WB: Write back (reg)

Example: MIPS64 (2/2)

[Figure: classic 5-stage pipeline timing diagram over clock cycles 1-7; successive instructions each pass through Ifetch, Reg, ALU, DMem and Reg, offset by one cycle]

Big Picture:

• What are some real world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?

Big Picture (continued):

• Computer Architecture is the study of design tradeoffs!!!!
• There is no “philosophy of architecture” and no “perfect architecture”. This is engineering, not science.
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?




Improve speedup?
• Why not perfect speedup?
  ▫ Sequential programs
  ▫ One instruction dependent on another
  ▫ Not enough CPU resources
• What can be done?
  ▫ Forwarding (HW)
  ▫ Scheduling (SW / HW), see the scheduling sketch below
  ▫ Prediction (SW / HW)
• Both hardware (dynamic) and compiler (static) can help

Dependencies and hazards
• Dependencies
  ▫ Parallel instructions can be executed in parallel
  ▫ Dependent instructions are not parallel
      I1: DADD R1, R2, R3
      I2: DSUB R4, R1, R5
  ▫ A property of the instructions
• Hazards
  ▫ A situation where a dependency causes an instruction to give a wrong result
  ▫ A property of the pipeline
  ▫ Not all dependencies give hazards
      Dependencies must be close enough in the instruction stream to cause a hazard
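A hedged illustration of the software-scheduling point above (not from the slides): the compiler can move an independent statement between a load and its use, so the dependency is no longer close enough in the instruction stream to cause a stall. The function and variable names are made up for the example.

/* Illustration only (names are hypothetical): the dependent use of 'a'
 * immediately follows its load, which would stall a simple pipeline. */
int unscheduled(int *p, int *q) {
    int a = *p;          /* load */
    int b = a + 1;       /* uses 'a' right away -> load-use hazard */
    int c = *q;          /* independent load */
    return b + c;
}

/* The compiler can hoist the independent load of 'c' between the load
 * of 'a' and its use, hiding the load latency. */
int scheduled(int *p, int *q) {
    int a = *p;          /* load */
    int c = *q;          /* independent work fills the delay */
    int b = a + 1;       /* 'a' is now available without stalling */
    return b + c;
}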
Dependencies
• (True) data dependencies
  ▫ One instruction reads what an earlier one has written
• Name dependencies
  ▫ Two instructions use the same register / memory location
  ▫ But no flow of data between them
  ▫ Two types: anti- and output dependencies
• Control dependencies
  ▫ Instructions dependent on the result of a branch
• Again: independent of the pipeline implementation

Hazards
• Data hazards
  ▫ Overlap would give a different result than sequential execution
  ▫ RAW / WAW / WAR
• Control hazards
  ▫ Branches
  ▫ Ex: started executing the wrong instruction
• Structural hazards
  ▫ The pipeline does not support this combination of instructions
  ▫ Ex: a register file with one port, two stages want to read
Data dependency – Hazard?
(Figure A.6, Page A-16)
[Pipeline diagram: add r1,r2,r3 writes r1, while the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9 and xor r10,r1,r11 all read r1 while the add is still in the pipeline.]

Data Hazards (1/3)
• Read After Write (RAW)
  InstrJ tries to read operand before InstrI writes it
      I: add r1,r2,r3
      J: sub r4,r1,r3
• Caused by a true data dependency
• This hazard results from an actual need for communication.
Data Hazards (2/3)
• Write After Read (WAR)
  InstrJ writes operand before InstrI reads it
      I: sub r4,r1,r3
      J: add r1,r2,r3
• Caused by an anti-dependency
  This results from reuse of the name “r1”
• Can’t happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Reads are always in stage 2, and
  ▫ Writes are always in stage 5

Data Hazards (3/3)
• Write After Write (WAW)
  InstrJ writes operand before InstrI writes it.
      I: sub r1,r4,r3
      J: add r1,r2,r3
• Caused by an output dependency
• Can’t happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Writes are always in stage 5
• WAR and WAW can occur in more complicated pipes
Forwarding
(Figure A.7, Page A-18)
[Pipeline diagram (IF, ID/RF, EX, MEM, WB): the result of add r1,r2,r3 is forwarded from its EX/MEM pipeline registers directly to the EX stage of the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9 and xor r10,r1,r11.]

Can all data hazards be solved via forwarding?
[Pipeline diagram: Ld r1,r2 followed by add r4,r1,r3, and r6,r1,r7, or r8,r1,r9 and xor r10,r1,r11; the loaded value is available only after the MEM stage, while the add needs it at the start of its EX stage.]
Structural Hazards (Memory Port)
(Figure A.4, Page A-14)
[Pipeline diagram over clock cycles 1-7: a Load followed by Instr 1-4; with a single memory port, the Load’s DMem access and a later instruction’s Ifetch need the memory in the same cycle.]

Hazards, Bubbles (similar to Figure A.5, Page A-15)
[Pipeline diagram over clock cycles 1-7: Load, Instr 1, Ld r1,r2, then a stall row of bubbles, then Add r1,r1,r1.]
How do you “bubble” the pipe? How can we avoid this hazard?
Control hazards (1/2)
• Sequential execution is predictable, (conditional) branches are not
• May have fetched instructions that should not be executed
• Simple solution (figure): stall the pipeline (bubble)
  ▫ Performance loss depends on the number of branches in the program and on the pipeline implementation
  ▫ Branch penalty (see the CPI sketch below)
[Figure: the possibly wrong instruction fetched after the branch is discarded and replaced by the correct instruction.]

Control hazards (2/2)
• What can be done?
  ▫ Always stall (previous slide)
      Also called freezing or flushing the pipeline
  ▫ Assume no branch (= assume sequential execution)
      Must not change state before the branch instruction completes
  ▫ Assume branch taken
      Only smart if the target address is ready early
  ▫ Delayed branch
      Execute a different instruction while the branch is evaluated
  Static techniques (fixed rule or compiler)
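To make the branch penalty concrete, here is a small C calculation (not from the slides; the 20% branch frequency and 1-cycle penalty are assumed numbers) of the effective CPI when every branch stalls the pipeline.

#include <stdio.h>

/* Sketch with assumed numbers: effective CPI = base CPI
 * + branch_frequency * branch_penalty (stall cycles per branch). */
int main(void) {
    double base_cpi = 1.0;        /* ideal pipelined CPI           */
    double branch_freq = 0.20;    /* assumed: 20% of instructions  */
    double branch_penalty = 1.0;  /* assumed: 1 bubble per branch  */

    double cpi = base_cpi + branch_freq * branch_penalty;
    printf("effective CPI = %.2f; branch stalls are %.0f%% of all cycles\n",
           cpi, 100.0 * branch_freq * branch_penalty / cpi);
    return 0;
}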
Example
• Assume branch conditionals are evaluated in the EX stage and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• If we assume the branch is not taken, how many bubbles result from an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to improve the stall penalty?

Dynamic scheduling
• So far: static scheduling
  ▫ Instructions are executed in program order
  ▫ Any reordering is done by the compiler
• Dynamic scheduling
  ▫ The CPU reorders instructions to get a more optimal order
      Fewer hazards, fewer stalls, ...
  ▫ Must preserve the order of operations where reordering could change the result
  ▫ Covered by TDT 4255 Hardware Design
Compiler techniques for ILP
• For a given pipeline and degree of superscalarity
  ▫ How can these be best utilized?
  ▫ As few stalls from hazards as possible
• Dynamic scheduling
  ▫ Tomasulo’s algorithm etc. (TDT4255)
  ▫ Makes the CPU much more complicated
• What can be done by the compiler?
  ▫ Has ”ages” to spend, but less knowledge
  ▫ Static scheduling, but what else?

Example
Source code:
for (i = 1000; i > 0; i = i - 1)
  x[i] = x[i] + s;

Notice:
• Lots of dependencies
• No dependencies between iterations
• High loop overhead
    Loop unrolling

MIPS:
Loop: L.D      F0,0(R1)       ; F0 = x[i]
      ADD.D    F4,F0,F2       ; F2 = s
      S.D      F4,0(R1)       ; Store x[i] + s
      DADDUI   R1,R1,#-8      ; x[i] is 8 bytes
      BNE      R1,R2,Loop     ; R1 = R2?
Static scheduling
Original (with stalls):              Scheduled:
Loop: L.D      F0,0(R1)              Loop: L.D      F0,0(R1)
      stall                                DADDUI   R1,R1,#-8
      ADD.D    F4,F0,F2                    ADD.D    F4,F0,F2
      stall                                stall
      stall                                stall
      S.D      F4,0(R1)                    S.D      F4,8(R1)
      DADDUI   R1,R1,#-8                   BNE      R1,R2,Loop
      stall
      BNE      R1,R2,Loop
Result: from 9 cycles per iteration to 7
(Delays from the table in Figure 2.2)

Loop unrolling
Original:                            Unrolled (4 iterations):
Loop: L.D      F0,0(R1)              Loop: L.D      F0,0(R1)
      ADD.D    F4,F0,F2                    ADD.D    F4,F0,F2
      S.D      F4,0(R1)                    S.D      F4,0(R1)
      DADDUI   R1,R1,#-8                   L.D      F6,-8(R1)
      BNE      R1,R2,Loop                  ADD.D    F8,F6,F2
                                           S.D      F8,-8(R1)
                                           L.D      F10,-16(R1)
                                           ADD.D    F12,F10,F2
                                           S.D      F12,-16(R1)
                                           L.D      F14,-24(R1)
                                           ADD.D    F16,F14,F2
                                           S.D      F16,-24(R1)
                                           DADDUI   R1,R1,#-32
                                           BNE      R1,R2,Loop
• Reduced loop overhead
• Requires the number of iterations to be divisible by n (here n = 4)
• Register renaming
• Offsets have changed
• Stalls not shown
Unrolled (4 iterations):             Unrolled + scheduled:
Loop: L.D      F0,0(R1)              Loop: L.D      F0,0(R1)
      ADD.D    F4,F0,F2                    L.D      F6,-8(R1)
      S.D      F4,0(R1)                    L.D      F10,-16(R1)
      L.D      F6,-8(R1)                   L.D      F14,-24(R1)
      ADD.D    F8,F6,F2                    ADD.D    F4,F0,F2
      S.D      F8,-8(R1)                   ADD.D    F8,F6,F2
      L.D      F10,-16(R1)                 ADD.D    F12,F10,F2
      ADD.D    F12,F10,F2                  ADD.D    F16,F14,F2
      S.D      F12,-16(R1)                 S.D      F4,0(R1)
      L.D      F14,-24(R1)                 S.D      F8,-8(R1)
      ADD.D    F16,F14,F2                  DADDUI   R1,R1,#-32
      S.D      F16,-24(R1)                 S.D      F12,16(R1)
      DADDUI   R1,R1,#-32                  S.D      F16,8(R1)
      BNE      R1,R2,Loop                  BNE      R1,R2,Loop
Avoids stall after: L.D (1), ADD.D (2), DADDUI (1)
(The last two store offsets become +16 and +8 because R1 has already been decremented by 32.)

Loop unrolling: Summary
• Original code        9 cycles per element
• Scheduling           7 cycles per element
• Loop unrolling       6.75 cycles per element
  ▫ Unrolled 4 iterations
• Combination          3.5 cycles per element
  ▫ Avoids stalls entirely (14 instructions for 4 elements)
Compiler reduced execution time by 61% ((9 - 3.5) / 9 ≈ 0.61)
Loop unrolling in practice
• We do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops (see the C sketch below):
  ▫ the 1st executes (n mod k) times and has a body that is the original loop
  ▫ the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
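A minimal C sketch (not from the slides) of this prologue-plus-unrolled-loop scheme for the running example x[i] = x[i] + s, assuming a 0-based array, an unknown trip count n and an unroll factor k = 4.

/* Sketch of unrolling with unknown trip count n (unroll factor k = 4):
 * a prologue loop runs the leftover n % k iterations with the original
 * body, then the main loop runs the unrolled body n / k times. */
void add_scalar(double *x, double s, int n) {
    int i = 0;

    /* 1st loop: executes n % 4 times with the original body */
    for (; i < n % 4; i++)
        x[i] = x[i] + s;

    /* 2nd loop: unrolled body, iterates n / 4 times */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}

For large n almost all iterations run in the unrolled loop, as the slide notes.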
Review
• Name real-world examples of pipelining
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?

TDT 4260
Chap 2, Chap 3
Instruction Level Parallelism (cont)
Contents
• Very Large Instruction Word          Chap 2.7
  ▫ IA-64 and EPIC
• Instruction fetching                 Chap 2.9
• Limits to ILP                        Chap 3.1/2
• Multi-threading                      Chap 3.5

Getting CPI below 1
• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
1. Statically-scheduled superscalar processors
   • In-order execution
   • Varying number of instructions issued (compiler)
2. Dynamically-scheduled superscalar processors
   • Out-of-order execution
   • Varying number of instructions issued (CPU)
3. VLIW (very long instruction word) processors
   • In-order execution
   • Fixed number of instructions issued
VLIW: Very Large Instruction Word (1/2)
• Each VLIW has explicit coding for multiple operations
  ▫ Several instructions combined into packets (see the bundle sketch below)
  ▫ Possibly with parallelism indicated
• Tradeoff: instruction space for simple decoding
  ▫ Room for many operations
  ▫ Independent operations => execute in parallel
  ▫ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch

VLIW: Very Large Instruction Word (2/2)
• Assume 2 load/store, 2 FP, 1 int/branch
  ▫ VLIW with 0-5 operations
  ▫ Why 0?
• Important to avoid empty instruction slots
  ▫ Loop unrolling
  ▫ Local scheduling
  ▫ Global scheduling
      Scheduling across branches
• Difficult to find all dependencies in advance
  ▫ Solution 1: block on memory accesses
  ▫ Solution 2: the CPU detects some dependencies
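As a hedged illustration (not the IA-64 encoding or any real product format): one “very long instruction” can be pictured as a packet of operation slots, one per functional unit, for the assumed 2 load/store + 2 FP + 1 int/branch machine; empty slots hold explicit no-ops.

/* Illustration only - not a real VLIW encoding. One "very long
 * instruction" is a packet with one slot per functional unit; the
 * compiler fills independent operations, or NOPs for empty slots. */
typedef struct {
    unsigned opcode;       /* NOP when the slot is empty */
    unsigned dest, src1, src2;
} Operation;

typedef struct {
    Operation mem[2];      /* 2 load/store slots    */
    Operation fp[2];       /* 2 FP slots            */
    Operation int_br;      /* 1 integer/branch slot */
} VLIWBundle;              /* all five ops issue in the same cycle */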
Loop Unrolling in VLIW
Recall: the unrolled loop that minimizes stalls for the scalar pipeline
(Source code: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
Register mapping: s = F2, i = R1)

Loop: L.D      F0,0(R1)
      L.D      F6,-8(R1)
      L.D      F10,-16(R1)
      L.D      F14,-24(R1)
      ADD.D    F4,F0,F2
      ADD.D    F8,F6,F2
      ADD.D    F12,F10,F2
      ADD.D    F16,F14,F2
      S.D      F4,0(R1)
      S.D      F8,-8(R1)
      DADDUI   R1,R1,#-32
      S.D      F12,16(R1)
      S.D      F16,8(R1)
      BNE      R1,R2,Loop

VLIW schedule (2 memory, 2 FP and 1 int/branch slot per instruction):

Clock | Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2    | Int. op / branch
  1   | L.D F0,0(R1)       | L.D F6,-8(R1)      |                   |                   |
  2   | L.D F10,-16(R1)    | L.D F14,-24(R1)    |                   |                   |
  3   | L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2    | ADD.D F8,F6,F2    |
  4   | L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2  | ADD.D F16,F14,F2  |
  5   |                    |                    | ADD.D F20,F18,F2  | ADD.D F24,F22,F2  |
  6   | S.D 0(R1),F4       | S.D -8(R1),F8      | ADD.D F28,F26,F2  |                   |
  7   | S.D -16(R1),F12    | S.D -24(R1),F16    |                   |                   |
  8   | S.D -32(R1),F20    | S.D -40(R1),F24    |                   |                   | DSUBUI R1,R1,#48
  9   | S.D -0(R1),F28     |                    |                   |                   | BNEZ R1,LOOP

Unrolled 7 iterations to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: need more registers in VLIW (15 vs. 6 in SS)
Problems with 1st Generation VLIW
• Increase in code size
  ▫ Loop unrolling
  ▫ Partially empty VLIWs
• Operated in lock-step; no hazard detection HW
  ▫ A stall in any functional unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
  ▫ The compiler might predict functional units, but caches are hard to predict
  ▫ Modern VLIWs are “interlocked” (identify dependences between bundles and stall)
• Binary code compatibility
  ▫ Strict VLIW => different numbers of functional units and unit latencies require different versions of the code

VLIW Tradeoffs
• Advantages
  ▫ “Simpler” hardware because the HW does not have to identify independent instructions.
• Disadvantages
  ▫ Relies on a smart compiler
  ▫ Code incompatibility between generations
  ▫ There are limits to what the compiler can do (can’t move loads above branches, can’t move loads above stores)
• Common uses
  ▫ Embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue.
IA-64 and EPIC
• 64-bit instruction set architecture
  ▫ Not a CPU, but an architecture
  ▫ Itanium and Itanium 2 are CPUs based on IA-64
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
• Uses EPIC: Explicitly Parallel Instruction Computing
• A departure from the x86 architecture
• Meant to achieve out-of-order performance with in-order HW + compiler smarts
  ▫ Stop bits to help with code density
  ▫ Support for control speculation (moving loads above branches)
  ▫ Support for data speculation (moving loads above stores)
      Details in Appendix G.6

Instruction bundle (VLIW)
[Figure: IA-64 instruction bundle format.]
Functional units and template
• Functional units:
  ▫ I (Integer), M (Integer + Memory), F (FP), B (Branch), L + X (64-bit operands + special instructions)
• Template field:
  ▫ Maps instructions to functional units
  ▫ Indicates stops: limitations to ILP

Code example (1/2)
[Figure: IA-64 code example.]
Code example (2/2)
[Figure: IA-64 code example, continued.]

Control Speculation
• Can the compiler schedule an independent load above a branch?
      Bne R1, R2, TARGET
      Ld R3, R4(0)
• What are the problems?
• EPIC provides speculative loads
      Ld.s R3, R4(0)
      Bne R1, R2, TARGET
      Check R4(0)
Data Speculation
• Can the compiler schedule an independent load above a store?
      St R5, R6(0)
      Ld R3, R4(0)
• What are the problems?
• EPIC provides “advanced loads” and an ALAT (Advanced Load Address Table)
      Ld.a R3, R4(0)    creates an entry in the ALAT
      St R5, R6(0)      looks up the ALAT; on a match, jump to fixup code

EPIC Conclusions
• The goal of EPIC was to maintain the advantages of VLIW, but achieve the performance of out-of-order.
• Results:
  ▫ Complicated bundling rules save some space, but make the hardware more complicated
  ▫ Add special hardware and instructions for scheduling loads above stores and branches (new complicated hardware)
  ▫ Add special hardware to remove branch penalties (predication)
  ▫ The end result is a machine as complicated as an out-of-order, but now also requiring a super-sophisticated compiler.
Instruction fetching
• Want to issue >1 instruction every cycle
• This means fetching >1 instruction
  ▫ E.g., 4-8 instructions fetched every cycle
• Several problems
  ▫ Bandwidth / latency
  ▫ Determining which instructions
      Jumps
      Branches
• Integrated instruction fetch unit

Branch Target Buffer (BTB)
• Predicts the next instruction address and sends it out before decoding the instruction
• The PC of the branch is sent to the BTB
• When a match is found, the Predicted PC is returned
• If the branch is predicted taken, instruction fetch continues at the Predicted PC
  (see the lookup sketch below)
Branch Target Buffer (BTB): possible optimizations?
(Same BTB operation as on the previous slide.)

Return Address Predictor
• A small buffer of return addresses acts as a stack (see the stack sketch below)
• Caches the most recent return addresses
• Call ⇒ push a return address on the stack
• Return ⇒ pop an address off the stack & predict it as the new PC
[Figure: misprediction frequency (0-70%) vs. number of return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl and vortex.]
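A minimal C sketch (not from the slides) of the return address stack just described: calls push the return address, returns pop it and predict it as the new PC. The 16-entry size is taken from the largest buffer in the figure.

#include <stdint.h>

/* Sketch of a return address stack with a fixed number of entries.
 * Overflow simply wraps and overwrites the oldest entries. */
#define RAS_ENTRIES 16

static uint64_t ras[RAS_ENTRIES];
static unsigned ras_top;          /* index of the next free slot */

/* Call: push the return address (the instruction after the call). */
void ras_push(uint64_t return_pc) {
    ras[ras_top % RAS_ENTRIES] = return_pc;
    ras_top++;
}

/* Return: pop an address off the stack and predict it as the new PC.
 * Popping an empty stack yields a stale value, i.e. a misprediction. */
uint64_t ras_pop(void) {
    if (ras_top > 0)
        ras_top--;
    return ras[ras_top % RAS_ENTRIES];
}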




Integrated Instruction Fetch Units
• Recent designs have implemented the fetch stage as a separate, autonomous unit
  ▫ Multiple issue in one simple pipeline stage is too complex
• An integrated fetch unit provides:
  ▫ Branch prediction
  ▫ Instruction prefetch
  ▫ Instruction memory access and buffering

Limits to ILP (Chapter 3)
• Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
• However, such advances, when coupled with realistic hardware, are unlikely to overcome these limits in the near future
• How much ILP is available using existing mechanisms with increasing HW budgets?
Ideal HW Model
1. Register renaming – infinite virtual registers
     All register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
     2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses known & a load can be moved before a store provided the addresses are not equal
     1 & 4 eliminate all but RAW
5. Perfect caches; 1-cycle latency for all instructions; unlimited instructions issued per clock cycle

Upper Limit to ILP: Ideal Machine (Figure 3.1)
[Bar chart, instructions per clock: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1 (Integer: 18 - 60, FP: 75 - 150).]
Instruction window
• The ideal HW would need to see the entire program
• Obviously not practical
  ▫ Register dependency checking scales quadratically with window size
• Window: the set of instructions examined for simultaneous execution
• How does the size of the window affect IPC?
  ▫ Too small a window => can’t see whole loops
  ▫ Too large a window => hard to implement

More Realistic HW: Window Impact (Figure 3.2)
[Bar chart, instructions per clock for window sizes infinite, 2048, 512, 128 and 32, for gcc, espresso, li, fpppp, doduc and tomcatv (Integer: 8 - 63, FP: 9 - 150).]
Thread Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel (see the threads sketch below)
• Use multiple instruction streams to improve:
  1. Throughput of computers that run many programs
  2. Execution time of a single application implemented as a multi-threaded program (parallel program)

Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
  ▫ Must duplicate independent state of each thread, e.g., a separate copy of the register file, PC and page table
  ▫ Memory shared through virtual memory mechanisms
  ▫ HW for fast thread switch; much faster than a full process switch ≈ 100s to 1000s of clocks
• When to switch?
  ▫ Alternate instructions per thread (fine grain)
  ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
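As a hedged software-level illustration of TLP (not from the slides): two POSIX threads each add s to their own half of an array; the threads are independent, so a multithreaded processor can overlap their execution. The names and sizes are made up for the example.

#include <pthread.h>
#include <stdio.h>

/* Illustration of thread-level parallelism: two independent threads
 * each add a constant to their own half of the array. */
#define N 1000
static double x[N];
static const double s = 3.0;

typedef struct { int lo, hi; } Range;

static void *add_range(void *arg) {
    Range *r = (Range *)arg;
    for (int i = r->lo; i < r->hi; i++)
        x[i] = x[i] + s;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    Range a = {0, N / 2}, b = {N / 2, N};

    pthread_create(&t1, NULL, add_range, &a);
    pthread_create(&t2, NULL, add_range, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("x[0] = %.1f, x[N-1] = %.1f\n", x[0], x[N - 1]);
    return 0;
}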
Fine-Grained Multithreading
• Switches between threads on each instruction
  ▫ Multiple threads interleaved
• Usually in round-robin fashion, skipping stalled threads
• The CPU must be able to switch threads every clock
• Hides both short and long stalls
  ▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads
  ▫ A thread ready to execute without stalls is still delayed by instructions from other threads
• Used on Sun’s Niagara

Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
  ▫ No need for very fast thread switching
  ▫ Doesn’t slow down a thread, since it switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  ▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
  ▫ The new thread must fill the pipeline before instructions can complete
• => Better for reducing the penalty of high-cost stalls, where pipeline refill << stall time
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Can a high-ILP processor also exploit TLP?
  ▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)
  ▫ Intel: Hyper-Threading

Simultaneous Multi-threading
[Figure: issue slots per cycle (cycles 1-9) across 8 units, comparing one thread vs. two threads. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.]
Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  ▫ Large set of virtual registers
      Virtual = not all visible at the ISA level
      Register renaming
  ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
  ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading; slots are filled by Threads 1-5 or left as idle slots.]
Design Challenges in SMT
• SMT makes sense only with fine-grained
  implementation
 ▫ How to reduce the impact on single thread performance?
 ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
 ▫ Instruction issue - more candidate instructions need to
   be considered
 ▫ Instruction completion - choosing which instructions to
   commit may be challenging
• Ensuring that cache and TLB conflicts generated
  by SMT do not degrade performance
TDT 4260 – lecture 4 – 2011
• Contents
  – Computer architecture introduction
      • Trends
      • Moore's law
      • Amdahl's law
      • Gustafson's law
  – Why multiprocessor?                         Chap 4.1
      • Taxonomy
      • Memory architecture
      • Communication
  – Cache coherence                             Chap 4.2
      • The problem
      • Snooping protocols

Updated lecture plan pr. 4/2
 Date and lecturer      Topic
 1: 14 Jan (LN, AI)     Introduction, Chapter 1 / Alex: PfJudge
 2: 21 Jan (IB)         Pipelining, Appendix A; ILP, Chapter 2
 3: 3 Feb (IB)          ILP, Chapter 2; TLP, Chapter 3
 4: 4 Feb (LN)          Multiprocessors, Chapter 4
 5: 11 Feb (MG)         Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
 6: 18 Feb (LN)         Multiprocessors continued
 7: 24 Feb (IB)         Memory and cache, cache coherence (Chap. 5)
 8: 4 Mar (IB)          Piranha CMP + Interconnection networks
 9: 11 Mar (LN)         Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
 10: 18 Mar (IB)        Memory consistency (4.6) + more on memory
 11: 25 Mar (JA, AI)    (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
 12: 7 Apr (IB/LN)      Wrap up lecture, remaining stuff
 13: 8 Apr              Slack – no lecture planned


                      1                                       Lasse Natvig                        2                                                             Lasse Natvig




Trends
• For technology, costs, use
• Help predicting the future
• Product development time
  – 2-3 years
  – → design for the next technology
  – Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system
  – hardware, runtime system, compiler, operating system, and application
  – In networking, this is called the "End to End argument"
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
  – E.g., original RISC projects replaced complex instructions with a compiler + simple instructions


                      3                                       Lasse Natvig                        4                                                             Lasse Natvig




Computer Architecture is Design and Analysis
Architecture is an iterative process:
• Searching the huge space of possible designs
• At all levels of computer systems
[Figure: design cycle – Design, Analysis, Creativity and Cost/Performance Analysis – sorting ideas into Good, Mediocre and Bad Ideas]

TDT4260 Course Focus
Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st Century
[Figure: Computer Architecture (Organization, Hardware/Software Boundary) at the centre of Technology, Parallelism, Programming Languages, Applications, Interface Design (ISA), Compilers, Operating Systems, Measurement & Evaluation, and History]
                      5                                       Lasse Natvig                        6                                                             Lasse Natvig




Holistic approach
• e.g., to programmability combined with performance
• Layers (top to bottom):
  – Energy aware task pool implementation
      (NTNU-principle: teaching based on research; example: PhD-project of Alexandru Iordan: TBP (Wool, TBB))
  – Parallel & concurrent programming
  – Operating System & system software
  – Multicore, interconnect, memory
      (Multicore memory systems: Dybdahl-PhD, Grannæs-PhD, Jahre-PhD, M5-sim, pfJudge)

Moore's Law: 2X transistors / "year"
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• # of transistors / cost-effective integrated circuit double every N months (12 ≤ N ≤ 24)
  (see the worked example below)


                     7                                                          Lasse Natvig                             8                                                                Lasse Natvig
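As a rough worked example of the doubling rule above (my own numbers, not from the slide), the growth over a decade is

\[ 2^{120/N}: \quad N = 12 \Rightarrow 2^{10} = 1024\times, \qquad N = 18 \Rightarrow 2^{6.7} \approx 100\times, \qquad N = 24 \Rightarrow 2^{5} = 32\times \]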




Tracking Technology Performance Trends
• 4 critical implementation technologies:
  – Disks,
  – Memory,
  – Network,
  – Processors
• Compare improvements in performance over time for Bandwidth vs. Latency
• Bandwidth: number of events per unit time
  – E.g., M bits/second over network, M bytes/second from disk
• Latency: elapsed time for a single event
  – E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)
[Figure: log-log plot of relative bandwidth improvement (10-10000) vs. relative latency improvement (1-100) for Processor, Network, Memory and Disk; all lie well above the line "latency improvement = bandwidth improvement". CPU high, memory low ("Memory Wall").]
• Performance Milestones
  – Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
  – Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  – Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  – Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  – (Processor latency = typical # of pipeline stages * time per clock cycle)


                     9                                                          Lasse Natvig                             10                                                               Lasse Natvig




COST and COTS
• Cost
  – to produce one unit
  – include (development cost / # sold units)
  – benefit of large volume
• COTS
  – commodity off the shelf
      • much better performance/price per component
      • strong influence on the selection of components for building supercomputers for more than 20 years

Speedup
• General definition:
  Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
  – Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uni-processor solution, not the parallel algorithm with p = 1
• Superlinear speedup?
                     11                                                         Lasse Natvig                             12                                                               Lasse Natvig




Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in computation
  – serial fraction s
  – parallel fraction p
  – s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
       = (s + p) / [s + (p/n)]
       = 1 / [s + (1-s)/n]
       = n / [1 + (n-1)s]
• "pessimistic and famous"

Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on parallel computer with n processors is fixed
  – serial fraction s'
  – parallel fraction p'
  – s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
        = (s' + p'n) / (s' + p')
        = s' + p'n = s' + (1-s')n
        = n + (1-n)s'
• Reevaluating Amdahl's law, John L. Gustafson, CACM May 1988, pp 532-533. "Not a new law, but Amdahl's law with changed assumptions"
• (The two formulas are compared numerically in the sketch after this slide)
               13                                                Lasse Natvig                    14                                          Lasse Natvig
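Not part of the slides: a small, self-contained C sketch that evaluates both speedup formulas for n = 100 processors and a few example serial fractions (the fractions are arbitrary illustrative values).

/* Amdahl vs. Gustafson speedup for n = 100 processors (illustrative values).
 * Build and run: cc speedup.c -o speedup && ./speedup
 */
#include <stdio.h>

/* Amdahl (fixed problem size): S(n) = 1 / (s + (1 - s)/n) */
static double amdahl(double s, double n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

/* Gustafson (scaled problem size): S'(n) = s' + (1 - s') * n */
static double gustafson(double s, double n)
{
    return s + (1.0 - s) * n;
}

int main(void)
{
    const double n = 100.0;                      /* number of processors */
    const double serial[] = { 0.01, 0.05, 0.10, 0.25 };

    printf("serial fraction   Amdahl S(100)   Gustafson S'(100)\n");
    for (int i = 0; i < 4; i++)
        printf("     %.2f          %8.2f         %8.2f\n",
               serial[i], amdahl(serial[i], n), gustafson(serial[i], n));
    return 0;
}

For s = 0.05 this gives about 16.8 under Amdahl's fixed-size assumption but 95.05 under Gustafson's scaled-size assumption, which is exactly the contrast the two formulas are meant to show.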




How the serial fraction limits speedup
• Amdahl's law
  [Figure: speedup vs. number of processors for several values of the serial fraction]
• Work hard to reduce the serial part of the application
  – remember IO
  – think different (than traditionally or sequentially)

Single/ILP → Multi/TLP
• Uniprocessor trends
  – Getting too complex
  – Speed of light
  – Diminishing returns from ILP
• Multiprocessor
  – Focus in the textbook: 4-32 CPUs
  – Increased performance through parallelism
  – Multichip
  – Multicore ((Single) Chip Multiprocessors – CMP)
  – Cost effective
• Right balance of ILP and TLP is unclear today
  – Desktop vs. server?

               15                                                Lasse Natvig                    16                                          Lasse Natvig




Other Factors – Multiprocessors
• Growth in data-intensive applications
  – Databases, file servers, multimedia, …
• Growing interest in servers, server performance
• Increasing desktop performance less important
  – Outside of graphics
• Improved understanding of how to use multiprocessors effectively
  – Especially in servers with significant natural TLP
• Advantage of leveraging design investment by replication
  – Rather than unique design
• Power/cooling issues → multicore

Multiprocessor – Taxonomy
• Flynn's taxonomy (1966, 1972)
  – Taxonomy = classification
  – Widely used, but perhaps a bit coarse
• Single Instruction Single Data (SISD)
  – Common uniprocessor
• Single Instruction Multiple Data (SIMD)
  – " = Data Level Parallelism (DLP)"
• Multiple Instruction Single Data (MISD)
  – Not implemented?
  – Pipeline / Stream processing / GPU ?
• Multiple Instruction Multiple Data (MIMD)
  – Used today
  – " = Thread Level Parallelism (TLP)"
               17                                                Lasse Natvig                    18                                          Lasse Natvig




Flynn's taxonomy (1/2)
Single/Multiple Instruction/Data Stream
[Figure: block diagrams of SISD (uniprocessor), SIMD w/distributed memory, and MIMD w/shared memory]

Flynn's taxonomy (2/2), MISD
Single/Multiple Instruction/Data Stream
[Figure: block diagram of MISD (software pipeline)]


             19                                               Lasse Natvig                     20                                                             Lasse Natvig




Advantages to MIMD
• Flexibility
  – High single-user performance, multiple programs, multiple threads
  – High multiple-user performance
  – Combination
• Built using commercial off-the-shelf (COTS) components
  – 2 x Uniprocessor = Multi-CPU
  – 2 x Uniprocessor core on a single chip = Multicore

MIMD: Memory architecture
[Figure: Centralized Memory – processors P1..Pn, each with a cache ($), share memory modules through an interconnection network (IN); Distributed Memory – each processor has its own cache and local memory, and the nodes communicate through the IN]



             21                                               Lasse Natvig                     22                                                             Lasse Natvig




Centralized Memory Multiprocessor
• Also called
  – Symmetric Multiprocessors (SMPs)
  – Uniform Memory Access (UMA) architecture
• Shared memory becomes the bottleneck
• Large caches → a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and by using many memory banks
• Scaling beyond that is hard

Distributed (Shared) Memory Multiprocessor
• Pro: Cost-effective way to scale memory bandwidth
  – If most accesses are to local memory
• Pro: Reduces latency of local memory accesses
• Con: Communication becomes more complex
• Pro/Con: Possible to change software to take advantage of memory that is close, but this can also make SW less portable
  – Non-Uniform Memory Access (NUMA)


             23                                               Lasse Natvig                     24                                                             Lasse Natvig




MP (MIMD), cluster of SMPs
[Figure: two SMP nodes, each with processors, caches, a node interconnection network, memory and I/O, joined by a cluster interconnection network]
• Combination of centralized and distributed
• Like an early version of the kongull-cluster

Distributed memory
1. Shared address space
   • Logically shared, physically distributed
   • Distributed Shared Memory (DSM)
   • NUMA architecture
   [Conceptual model: processors P ... P connected through a network to a single memory M]
2. Separate address spaces
   • Every P-M module is a separate computer
   • Multicomputer
   • Clusters
   • Not a focus in this course
   [Implementation: P+M nodes connected by a network]
                       25                                                                          Lasse Natvig                      26                                                     Lasse Natvig




Communication models
• Shared memory
  – Centralized or Distributed Shared Memory
  – Communication using LOAD/STORE
  – Coordinated using traditional OS methods
      • Semaphores, monitors, etc.
  – Busy-wait more acceptable than for uniprocessors
• Message passing
  – Using send (put) and receive (get)
      • Asynchronous / Synchronous
  – Libraries, standards
      • …, PVM, MPI, …

Limits to parallelism
• We need separate processes and threads!
  – Can't split one thread among CPUs/cores
• Parallel algorithms needed
  – Separate field
  – Some problems are inherently serial
      • P-complete problems
          – Part of parallel complexity theory
      • See minicourse TDT6 - Heterogeneous and green computing
      • http://www.idi.ntnu.no/emner/tdt4260/tdt6
• Amdahl's law
  – Serial fraction of code limits speedup
  – Example: a speedup of 80 with 100 processors requires that at most 0.25% of the time is spent on serial code (worked check below)


                       27                                                                          Lasse Natvig                      28                                                     Lasse Natvig
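A quick check of the 0.25% example above, using Amdahl's formula from slide 13 (my own arithmetic):

\[ S(n) = \frac{n}{1 + (n-1)s} \;\Rightarrow\; 80 = \frac{100}{1 + 99s} \;\Rightarrow\; 1 + 99s = 1.25 \;\Rightarrow\; s = \frac{0.25}{99} \approx 0.0025 = 0.25\% \]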




SMP: Cache Coherence Problem
[Figure: P1, P2 and P3 with private caches share a memory that initially holds u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again (cache hit), (5) P2 reads u (miss).]
• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
  – correct value (if write-through caches)
  – old value (if write-back caches)
• Unacceptable to programs, and frequent!

Enforcing coherence
• Separate caches make multiple copies frequent
  – Migration
      • Moved from shared memory to local cache
      • Speeds up access, reduces memory bandwidth requirements
  – Replication
      • Several local copies when an item is read by several processors
      • Speeds up access, reduces memory contention
• Need coherence protocols to track shared data
  – Directory based
      • Status in shared location (Chap. 4.4)
  – (Bus) snooping
      • Each cache maintains local status
      • All caches monitor the broadcast medium
      • Write invalidate / Write update
                       29                                                                          Lasse Natvig                      30                                                     Lasse Natvig




Snooping: Write invalidate
• Several reads or one write: No change
• Writes require exclusive access
• Writes to shared data: All other cache copies invalidated
  – Invalidate command and address broadcasted
  – All caches listen (snoop) and invalidate if necessary
• Read miss:
  – Write-Through: Memory always up to date
  – Write-Back: Caches listen and any exclusive copy is put on the bus

Snooping: Write update
• Also called write broadcast
• Must know which cache blocks are shared
• Usually Write-Through
  – Write to shared data: Broadcast, all caches listen and update their copy (if any)
  – Read miss: Main memory is up to date
                31                                                     Lasse Natvig                32                                                 Lasse Natvig




Snooping: Invalidate vs. Update
• Repeated writes to the same address (no reads) require several updates, but only one invalidate
• Invalidates are done at cache block level, while updates are done on individual words
• Delay from when a word is written until it can be read is shorter for updates
• Invalidate most common
  – Less bus traffic
  – Less memory traffic
  – Bus and memory bandwidth are the typical bottleneck

An Example Snoopy Protocol
• Invalidation protocol, write-back cache
• Each cache block is in one state (see the state-machine sketch after this slide)
  – Shared: Clean in all caches and up-to-date in memory, the block can be read
  – Exclusive: One cache has the only copy, it is writeable and dirty
  – Invalid: The block contains no data




                33                                                     Lasse Natvig                34                                                 Lasse Natvig
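The following is a minimal, illustrative C model of the three-state write-invalidate protocol described above. It is my own simplification (one cache block, an atomic bus, and a dirty copy written back when another cache misses), not the complete protocol FSM from the textbook.

/* Single-block sketch of a 3-state (Invalid/Shared/Exclusive) write-invalidate
 * snooping protocol. Simplifications: one block, an atomic bus, and a dirty
 * (Exclusive) copy is written back to memory when another cache misses.
 */
#include <stdio.h>

#define NCACHES 3

typedef enum { INVALID, SHARED, EXCLUSIVE } State;

static State state[NCACHES];        /* all blocks start out Invalid */
static int   cached[NCACHES];
static int   memory = 0;            /* block x initially holds 0, as in the example */

static const char *name(State s)
{
    return s == INVALID ? "invalid" : s == SHARED ? "shared" : "exclusive";
}

static int cache_read(int p)
{
    if (state[p] == INVALID) {                    /* read miss goes on the bus */
        for (int q = 0; q < NCACHES; q++)
            if (state[q] == EXCLUSIVE) {          /* snoop hit on a dirty copy */
                memory = cached[q];               /* write it back ...          */
                state[q] = SHARED;                /* ... and downgrade to shared */
            }
        cached[p] = memory;
        state[p] = SHARED;
    }
    return cached[p];                             /* otherwise: read hit */
}

static void cache_write(int p, int value)
{
    for (int q = 0; q < NCACHES; q++)             /* broadcast invalidate */
        if (q != p)
            state[q] = INVALID;
    cached[p] = value;
    state[p] = EXCLUSIVE;                         /* only copy, dirty */
}

int main(void)
{
    printf("P0 reads x = %d\n", cache_read(0));   /* P0: shared                */
    printf("P2 reads x = %d\n", cache_read(2));   /* P0, P2: shared            */
    cache_write(0, 1);                            /* P0: exclusive, P2: invalid */
    printf("P2 reads x = %d\n", cache_read(2));   /* forces write-back, both shared */
    for (int p = 0; p < NCACHES; p++)
        printf("P%d: %s\n", p, name(state[p]));
    return 0;
}

The sequence in main follows the six-step walkthrough on the next slides: P0 reads x, P2 reads x, and P0 then writes x = 1, invalidating the other copy.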




Snooping: Invalidation protocol (1/6)
[Figure: Processors 0..N-1 with private caches on an interconnection network; main memory holds x = 0. Processor 0 issues "read x", which causes a read miss on the network.]

Snooping: Invalidation protocol (2/6)
[Figure: Memory supplies the block; Processor 0 now caches x = 0 in state shared.]
                35                                                     Lasse Natvig                36                                                 Lasse Natvig




Snooping: Invalidation protocol (3/6)
[Figure: Processor 2 issues "read x", which causes a read miss on the network.]

Snooping: Invalidation protocol (4/6)
[Figure: Memory supplies the block; Processor 0 and Processor 2 now both cache x = 0 in state shared.]
              37                                                     Lasse Natvig                 38                                              Lasse Natvig




Snooping: Invalidation protocol (5/6)
[Figure: Processor 0 issues "write x" and broadcasts an invalidate on the network while both Processor 0 and Processor 2 hold x = 0 in state shared.]

Snooping: Invalidation protocol (6/6)
[Figure: Processor 2's copy is invalidated; Processor 0 now holds x = 1 in state exclusive, while main memory still holds the stale value x = 0.]
              39                                                     Lasse Natvig                 40                                              Lasse Natvig




Prefetching

              Marius Grannæs

              Feb 11th, 2011




www.ntnu.no                    M. Grannæs, Prefetching
2

    About Me

          • PhD from NTNU in Computer Architecture in 2010
          • “Reducing Memory Latency by Improving Resource Utilization”
          • Supervised by Lasse Natvig
          • Now working for Energy Micro
          • Working on energy profiling, caching and prefetching
          • Software development




www.ntnu.no                                                  M. Grannæs, Prefetching
3

    About Energy Micro
          • Fabless semiconductor company
          • Founded in 2007 by ex-chipcon founders
          • 50 employees
          • Offices around the world
          • Designing the world's most energy friendly microcontrollers
          • Today: EFM32 Gecko
          • Next Friday: EFM32 Tiny Gecko (cache)
          • May(ish): EFM32 Giant Gecko (cache + prefetching)
          • Ambition: 1% marketshare...
          • of a $30 bn market.



www.ntnu.no                                                   M. Grannæs, Prefetching
4

    What is Prefetching?



       Prefetching
       Prefetching is a technique for predicting future memory accesses and
       fetching the data into the cache before it is referenced




www.ntnu.no                                                    M. Grannæs, Prefetching
5

    The Memory Wall
       [Figure: relative performance (log scale, 1 to 100,000) vs. year, 1980-2010; CPU performance grows far faster than memory performance, and the gap keeps widening]

       W. Wulf and S. McKee, "Hitting the Memory Wall: Implications of
       the Obvious"
www.ntnu.no                                                                 M. Grannæs, Prefetching
6

    A Useful Analogy
          • An Intel Core i7 can execute 147600 Million Instructions per
            second.
          • ⇒ A carpenter can hammer one nail per second.




          • DDR3-1600 RAM can perform 65 Million transfers per second.
          • ⇒ The carpenter must wait 38 minutes per nail (see the arithmetic below).



www.ntnu.no                                                    M. Grannæs, Prefetching
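The 38-minute figure follows directly from the two rates above (my arithmetic, not on the slide):

\[ \frac{147600 \cdot 10^{6}\ \text{instructions/s}}{65 \cdot 10^{6}\ \text{transfers/s}} \approx 2270\ \text{instructions per transfer} \;\Rightarrow\; 2270\ \text{s} \approx 38\ \text{minutes at one nail per second.} \]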
7

    Solution




       Solution outline:
        1 You bring an entire box of nails.
        2 Keep the box close to the carpenter
www.ntnu.no                                     M. Grannæs, Prefetching
8

    Analysis: Carpenting
       How long (on average) does it take to get one nail?

       Nail latency

                         LNail = LBox + pBox   is empty   · (LShop + LTraffic )


                LNail Time to get one nail.
                LBox Time to check and fetch one nail from the box.
        pBox is empty   Probability that the box you have is empty.
               LShop Time to go to the shop (38 minutes).
               LTraffic Time lost due to traffic.

www.ntnu.no                                                                      M. Grannæs, Prefetching
9

    Solution: (For computers)




          • Faster, but smaller memory closer to the processor.
          • Temporal locality
              • If you needed X in the past, you are probably going to need X
                in the near future.
          • Spatial locality
              • If you need X , you probably need X + 1
       ⇒ If you need X, put it in the cache, along with everything else
       close to it (cache line)


www.ntnu.no                                                       M. Grannæs, Prefetching
10

     Analysis: Caches
        System latency

                 LSystem = LCache + pMiss · (LMain Memory + LCongestion )


              LSystem Total system latency.
              LCache Latency of the cache.
                 pMiss Probability of a cache miss.
       LMain Memory Main memory latency.
          LCongestion Latency due to main memory congestion.




www.ntnu.no                                                         M. Grannæs, Prefetching
11

     DRAM in perspective
          • “Incredibly slow” DRAM has a response time of 15.37 ns.
          • Speed of light is 3 · 10^8 m/s.
          • Physical distance from processor to DRAM chips is typically
              20 cm.

                        (2 · 20 · 10^-3 m) / (3 · 10^8 m/s) = 0.13 ns                      (1)

          • Just 2 orders of magnitude!
          • Intel Core i7 - 147600 Million Instructions per second.
          • Ultimate laptop - 5 · 10^50 operations per second/kg.
            Lloyd, Seth, "Ultimate physical limits to computation"



www.ntnu.no                                                      M. Grannæs, Prefetching
12

     When does caching not work?
       The four Cs:
         • Cold/Compulsory:
              • The data has not been referenced before
          • Capacity
              • The data has been referenced before, but has been thrown out,
                because of the limited size of the cache.
          • Conflict
              • The data has been thrown out of a set-associative cache
                because it would not fit in the set.
          • Coherence
              • Another processor (in a multi-processor/core environment) has
                invalidated the cache line.
       We can buy our way out of Capacity and Conflict misses, but not
       Cold or Coherence misses!


www.ntnu.no                                                     M. Grannæs, Prefetching
13

     Cache Sizes
       [Figure: on-chip cache size (kB, log scale from 1 to 10,000) vs. year, 1985-2010, for Intel processors from the 80486DX and Pentium through Pentium Pro, Pentium II, Pentium 4, Core 2 and Core i7; cache sizes grow steadily over the period]




www.ntnu.no                                                                                                   M. Grannæs, Prefetching
14

     Core i7 (Lynnfield) - 2009




www.ntnu.no                      M. Grannæs, Prefetching
15

     Pentium M - 2003




www.ntnu.no             M. Grannæs, Prefetching
16

     Prefetching

       Prefetching increases cache performance by predicting which data will be
       needed and fetching it into the cache before it is referenced. We need to
       know:
          • What to prefetch?
          • When to prefetch?
          • Where to put the data?
          • How do we prefetch? (Mechanism)




www.ntnu.no                                                    M. Grannæs, Prefetching
17

     Prefetching Terminology

       Good Prefetch
       A prefetch is classified as Good if the prefetched block is
       referenced by the application before it is replaced.

       Bad Prefetch
       A prefetch is classified as Bad if the prefetched block is not
       referenced by the application before it is replaced.




www.ntnu.no                                                     M. Grannæs, Prefetching
18

     Accuracy


       The accuracy of a given prefetch algorithm that yields G good
       prefetches and B bad prefetches is calculated as:

       Accuracy = G / (G + B)




www.ntnu.no                                                  M. Grannæs, Prefetching
19

     Coverage


       If a conventional cache has M misses without using any prefetch
       algorithm, the coverage of a given prefetch algorithm that yields G
       good prefetches and B bad prefetches is calculated as:

       Coverage = G / M




www.ntnu.no                                                    M. Grannæs, Prefetching
20

     Prefetching
       System Latency

                 L_system = L_cache + p_miss · (L_main_memory + L_congestion)


          • If a prefetch is good:
               • p_miss is lowered
               • ⇒ L_system decreases
          • If a prefetch is bad:
               • p_miss becomes higher because useful data might be replaced
               • L_congestion becomes higher because of useless traffic
               • ⇒ L_system increases
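
       As a rough illustration of the formula above (all numbers are invented
       for the example, not taken from the lecture), a small C program:

          /* Illustrates L_system = L_cache + p_miss * (L_main_memory + L_congestion).
             All latencies and miss probabilities below are assumed example values. */
          #include <stdio.h>

          static double l_system(double l_cache, double p_miss,
                                 double l_mem, double l_congestion)
          {
              return l_cache + p_miss * (l_mem + l_congestion);
          }

          int main(void)
          {
              printf("baseline:       %.1f cycles\n", l_system(3, 0.050, 200,  0));
              printf("good prefetch:  %.1f cycles\n", l_system(3, 0.025, 200,  0));
              printf("bad prefetch:   %.1f cycles\n", l_system(3, 0.060, 200, 40));
              return 0;
          }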




www.ntnu.no                                                         M. Grannæs, Prefetching
21

     Prefetching Techniques
       Types of prefetching:
         • Software
              •   Special instructions.
              •   Most modern high performance processors have them.
              •   Very flexible.
              •   Can be good at pointer chasing.
              •   Requires compiler or programmer effort.
              •   Processor executes prefetches instead of computation.
              •   Static (performed at compile-time).
          • Hardware
          • Hybrid




www.ntnu.no                                                       M. Grannæs, Prefetching
21

     Prefetching Techniques
       Types of prefetching:
          • Software
          • Hardware
              • Dedicated hardware analyzes memory references.
              • Most modern high performance processors have them.
              • Fixed functionality.
              • Requires no effort by the programmer or compiler.
              • Off-loads prefetching to hardware.
              • Dynamic (performed at run-time)
          • Hybrid




www.ntnu.no                                                   M. Grannæs, Prefetching
21

     Prefetching Techniques

       Types of prefetching:
          • Software
          • Hardware
          • Hybrid
              • Dedicated hardware unit.
              • Hardware unit programmed by software.
              • Some effort required by the programmer or compiler.




www.ntnu.no                                                      M. Grannæs, Prefetching
22

     Software Prefetching
               for (i = 0; i < 10000; i++) {
                   acc += data[i];
               }

                      MOV    r1, #0          ;   acc = 0
                      MOV    r0, #0          ;   i = 0
               Label: LOAD   r2, r0(#data)   ;   cache miss! (~400 cycles)
                      ADD    r1, r2          ;   acc += data[i]
                      INC    r0              ;   i++
                      CMP    r0, #10000      ;   i < 10000
                      BL     Label           ;   branch if less




www.ntnu.no                                               M. Grannæs, Prefetching
23

     Software Prefetching II
               for (i = 0; i < 10000; i++) {
                   acc += data[i];
               }

       Simple optimization using __builtin_prefetch():
               for (i = 0; i < 10000; i++) {
                   __builtin_prefetch(&data[i + 10]);
                   acc += data[i];
               }

       Why add 10 (and not 1)?
       Prefetch distance: the memory latency is much larger than the computation
       latency of one iteration, so we have to fetch several iterations ahead
       (see the sketch below).
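
       One way to pick the distance is to cover the memory latency with enough
       loop iterations. A minimal sketch; the two latency constants are assumed
       example numbers, not measurements:

          /* Prefetch distance ≈ memory latency / work per iteration (rounded up).
             MEM_LATENCY_CYCLES and CYCLES_PER_ITERATION are assumptions. */
          #define MEM_LATENCY_CYCLES   400
          #define CYCLES_PER_ITERATION  40
          #define PREFETCH_DISTANCE \
              ((MEM_LATENCY_CYCLES + CYCLES_PER_ITERATION - 1) / CYCLES_PER_ITERATION)

          static int sum(const int *data, int n)
          {
              int acc = 0;
              for (int i = 0; i < n; i++) {
                  if (i + PREFETCH_DISTANCE < n)   /* do not prefetch past the array */
                      __builtin_prefetch(&data[i + PREFETCH_DISTANCE]);
                  acc += data[i];
              }
              return acc;
          }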


www.ntnu.no                                                                     M. Grannæs, Prefetching
24

     Software Prefetching III
               for (i = 0; i < 10000; i++) {
                   __builtin_prefetch(&data[i + 10]);
                   acc += data[i];
               }

       Note:
         • data[0] → data[9] will not be prefetched.
         • data[10000] → data[10009] will be prefetched, but not used.

                 Accuracy = G / (G + B) = 9990 / 10000 = 0.999 = 99.9%

                 Coverage = G / M = 9990 / 10000 = 0.999 = 99.9%



www.ntnu.no                                                                     M. Grannæs, Prefetching
25

     Complex Software
               for (i = 0; i < 10000; i++) {
                   __builtin_prefetch(&data[i + 10]);
                   if (someFunction(i) == True) {
                       acc += data[i];
                   }
               }

       Does prefetching pay off in this case?
         • How many times is someFunction(i) true?
         • How much memory bus traffic does someFunction(i) generate?
         • Does power matter?
       We have to profile the program to know!

www.ntnu.no                                                                     M. Grannæs, Prefetching
26

     Dynamic Data Structures I

           typedef struct node {
               int          data;
               struct node *next;
           } node_t;

               while ((node = node->next) != NULL) {
                   acc += node->data;
               }




www.ntnu.no                                                       M. Grannæs, Prefetching
27

     Dynamic Data Structures II

           typedef struct node {
               int          data;
               struct node *next;
               struct node *jump;
           } node_t;

               while ((node = node->next) != NULL) {
                   __builtin_prefetch(node->jump);
                   acc += node->data;
               }
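
       The slide assumes the jump pointers are already in place. One way to set
       them up is a second pass over the list; this is only a sketch, and the
       distance of 4 nodes is an arbitrary assumption:

          /* Make each node's jump pointer point JUMP_DIST nodes ahead, so that
             __builtin_prefetch(node->jump) fetches data needed a few iterations
             from now. JUMP_DIST = 4 is an example value. */
          #define JUMP_DIST 4

          static void build_jump_pointers(node_t *head)
          {
              node_t *ahead = head;
              for (int i = 0; i < JUMP_DIST && ahead != NULL; i++)
                  ahead = ahead->next;

              for (node_t *n = head; n != NULL; n = n->next) {
                  n->jump = ahead;            /* NULL near the end of the list */
                  if (ahead != NULL)
                      ahead = ahead->next;
              }
          }

       Prefetching a NULL pointer is harmless on common architectures; the
       prefetch is simply dropped.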




www.ntnu.no                                                           M. Grannæs, Prefetching
28

     Hardware Prefetching
       Software prefetching:
          • Needs programmer (or compiler) effort to implement
          • Prefetch instructions displace computation
          • Compile-time (static)
          • Very flexible
       Hardware prefetching:
          • No programmer effort
          • Does not displace compute instructions
          • Run-time (dynamic)
          • Not flexible



www.ntnu.no                                          M. Grannæs, Prefetching
29

     Sequential Prefetching

       The simplest prefetcher, but surprisingly effective due to spatial
       locality.

       Sequential Prefetching
       Miss on address X ⇒ fetch X+n, X+n+1, ..., X+n+j

       n  Prefetch distance
       j  Prefetch degree
       Collectively known as prefetch aggressiveness.
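
       A minimal software model of the idea; issue_prefetch() is a hypothetical
       hook into the memory system, not part of any real simulator API:

          /* Sequential prefetching: on a miss to block x, issue j prefetches
             starting n blocks ahead (X+n, X+n+1, ...). */
          typedef unsigned long block_addr_t;

          static void issue_prefetch(block_addr_t block)
          {
              (void)block;                  /* hook into the memory system */
          }

          static void on_cache_miss(block_addr_t x, int n, int j)
          {
              for (int k = 0; k < j; k++)
                  issue_prefetch(x + n + k);
          }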




www.ntnu.no                                                     M. Grannæs, Prefetching
30

     Sequential Prefetching II
       [Figure: speedup (1 to 5) of sequential prefetching on the benchmarks
       libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                M. Grannæs, Prefetching
31

     Reference Prediction Tables
       Tien-Fu Chen and Jean-Loup Baer (1995)
          • Builds upon sequential prefetching, stride directed prefetching.
          • Observation: Non-unit strides in many applications
              • 2, 4, 6, 8, 10 (stride 2)
          • Observation: Each load instruction has a distinct access
              pattern
       Reference Prediction Tables (RPT):
           • Table indexed by the load instruction
          • Simple state machine
          • Store a single delta of history.
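
       A much-simplified sketch of the table and its state machine (table size,
       the hashing by PC and the issue_prefetch() stub are assumptions; the real
       Chen & Baer design has more states):

          #include <stdint.h>

          #define RPT_SIZE 64

          enum rpt_state { RPT_INIT, RPT_TRAIN, RPT_PREFETCH };

          struct rpt_entry {
              uint64_t       pc;          /* load instruction that owns the entry */
              uint64_t       last_addr;   /* last address accessed by that load   */
              int64_t        delta;       /* single delta of history              */
              enum rpt_state state;
          };

          static struct rpt_entry rpt[RPT_SIZE];

          static void issue_prefetch(uint64_t addr) { (void)addr; /* memory hook */ }

          static void rpt_access(uint64_t pc, uint64_t addr)
          {
              struct rpt_entry *e = &rpt[pc % RPT_SIZE];

              if (e->pc != pc) {                       /* new load: (re)initialize */
                  e->pc = pc; e->last_addr = addr; e->delta = 0; e->state = RPT_INIT;
                  return;
              }

              int64_t delta = (int64_t)(addr - e->last_addr);
              if (e->state != RPT_INIT && delta == e->delta) {
                  e->state = RPT_PREFETCH;             /* stride seen twice: prefetch */
                  issue_prefetch(addr + delta);
              } else {
                  e->state = RPT_TRAIN;                /* remember the new candidate stride */
                  e->delta = delta;
              }
              e->last_addr = addr;
          }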




www.ntnu.no                                                     M. Grannæs, Prefetching
32

     Reference Prediction Tables
              Cache Miss:




                  PC        Last Addr.   Delta   State




                Initial           Training       Prefetch




www.ntnu.no                                         M. Grannæs, Prefetching
33

     Reference Prediction Tables
        Cache Miss:
                    1



                PC      Last Addr.   Delta   State




              Initial         Training       Prefetch




www.ntnu.no                                     M. Grannæs, Prefetching
34

     Reference Prediction Tables
        Cache Miss:
                    1

              100          1              --    Init
                PC      Last Addr.   Delta     State




              Initial          Training        Prefetch




www.ntnu.no                                       M. Grannæs, Prefetching
35

     Reference Prediction Tables
        Cache Miss:
                    1 3

              100          3          2       Train
                PC      Last Addr.   Delta    State




              Initial          Training            Prefetch




www.ntnu.no                                           M. Grannæs, Prefetching
36

     Reference Prediction Tables
        Cache Miss:
                    1 3 5

              100          5          2       Prefetch
                PC      Last Addr.   Delta    State




              Initial          Training           Prefetch




www.ntnu.no                                          M. Grannæs, Prefetching
37

     Reference Prediction Tables
       [Figure: speedup (1 to 5) of Sequential vs. RPT prefetching on libquantum,
       milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                M. Grannæs, Prefetching
38

     Global History Buffer

       K. Nesbit, A. Dhodapkar and J. Smith (2004)
          • Observation: Predicting more complex patterns requires more
              history
          • Observation: A lot of history in the RPT is very old
       Program Counter/Delta Correlation (PC/DC)
          • Store all misses in a FIFO called Global History Buffer (GHB)
          • Linked list of all misses from one load instruction
          • Traversing linked list gives a history for that load
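
       A sketch of the two structures (sizes, the PC hash and the handling of
       FIFO wrap-around are all simplifications/assumptions):

          #include <stdint.h>

          #define GHB_SIZE 256
          #define IT_SIZE   64

          struct ghb_entry {
              uint64_t addr;    /* miss address */
              int      prev;    /* index of the previous miss from the same PC, or -1 */
          };

          static struct ghb_entry ghb[GHB_SIZE];
          static int ghb_head;               /* next FIFO slot to overwrite */
          static int index_table[IT_SIZE];   /* PC hash -> most recent GHB index, or -1 */

          static void ghb_init(void)
          {
              for (int i = 0; i < IT_SIZE;  i++) index_table[i] = -1;
              for (int i = 0; i < GHB_SIZE; i++) ghb[i].prev    = -1;
          }

          static void ghb_insert(uint64_t pc, uint64_t addr)
          {
              int slot = ghb_head;
              ghb_head = (ghb_head + 1) % GHB_SIZE;

              ghb[slot].addr = addr;
              ghb[slot].prev = index_table[pc % IT_SIZE];  /* chain to previous miss */
              index_table[pc % IT_SIZE] = slot;
          }

          /* Walk the per-PC chain and return up to n deltas, most recent first.
             (A real GHB must also detect links that point at overwritten entries.) */
          static int ghb_deltas(uint64_t pc, int64_t *deltas, int n)
          {
              int count = 0;
              int i = index_table[pc % IT_SIZE];
              while (i >= 0 && ghb[i].prev >= 0 && count < n) {
                  deltas[count++] = (int64_t)(ghb[i].addr - ghb[ghb[i].prev].addr);
                  i = ghb[i].prev;
              }
              return count;
          }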




www.ntnu.no                                                        M. Grannæs, Prefetching
39

     Global History Buffer
       [Figure: an Index Table (PC → pointer) and the Global History Buffer
       (address, pointer). The GHB holds the miss addresses 1 and 3, and the
       entry for PC 100 points at the most recent of them. The Delta Buffer is
       still empty.]
www.ntnu.no                                         M. Grannæs, Prefetching
40

     Global History Buffer
       [Figure: a new miss to address 5 from PC 100 is inserted at the head of
       the GHB and chained to the earlier misses 3 and 1 for the same PC. The
       Delta Buffer is still empty.]
www.ntnu.no                                         M. Grannæs, Prefetching
41

     Global History Buffer
       [Figure: walking the per-PC chain yields the miss history 5, 3, 1; the
       first delta, 5 − 3 = 2, is written to the Delta Buffer.]
www.ntnu.no                                           M. Grannæs, Prefetching
46

     Global History Buffer
       [Figure: continuing the walk gives the second delta, 3 − 1 = 2, so the
       Delta Buffer now holds the deltas 2, 2.]
www.ntnu.no                                           M. Grannæs, Prefetching
47

     Delta Correlation


          • In the previous example, the delta buffer only contained two
              values (2,2).
          • Thus it is easy to guess that the next delta is also 2.
          • We can then prefetch: Current address + Delta = 5 + 2 = 7

       What if the pattern is repeating, but not regular?
       1, 2, 3, 4, 5, 1, 2, 3, 4, 5
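
       A sketch of the matching step (history length and the prediction limit
       are assumptions): find the most recent earlier occurrence of the last two
       deltas and replay what followed it.

          /* hist[0] is the oldest delta, hist[n-1] the newest.
             Returns the number of predicted deltas written to pred[]. */
          static int predict_next_deltas(const long *hist, int n,
                                         long *pred, int max_pred)
          {
              if (n < 3)
                  return 0;

              long d1 = hist[n - 2], d2 = hist[n - 1];     /* correlation key */

              for (int i = n - 3; i >= 1; i--) {           /* search backwards */
                  if (hist[i - 1] == d1 && hist[i] == d2) {
                      int count = 0;
                      for (int j = i + 1; j < n && count < max_pred; j++)
                          pred[count++] = hist[j];         /* replay what followed */
                      return count;
                  }
              }
              return 0;                                    /* no match: no prediction */
          }

       For the delta history 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 above, the last pair
       (4, 5) also occurred five deltas earlier, so the predicted deltas are
       1, 2, 3, 4, 5.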




www.ntnu.no                                                      M. Grannæs, Prefetching
48

     Delta Correlation


               Address stream:   10   11   13   16   17   19   22   23   25
               Deltas:              1    2    3    1    2    3    1    2

               The most recent delta pair (1, 2) is matched against its previous
               occurrence in the stream, and the deltas that followed it (3, 1, 2)
               become the prediction.




www.ntnu.no                                                                   M. Grannæs, Prefetching
49

     PC/DC
       [Figure: speedup (1 to 5) of Sequential, RPT and PC/DC prefetching on
       libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                M. Grannæs, Prefetching
58

     Data Prefetching Championships

          • Organized by JILP
          • Held in conjunction with HPCA’09
           • Modelled on the earlier branch prediction championships
          • Everyone uses the same API (six function calls)
          • Same set of benchmarks
          • Third party evaluates performance
          • 20+ prefetchers submitted

       http://www.jilp.org/dpc/




www.ntnu.no                                                   M. Grannæs, Prefetching
59

     Delta Correlating Prediction Tables
          • Our submission to DPC-1
          • Observation: GHB pointer chasing is expensive.
          • Observation: History doesn’t really get old.
          • Observation: History would reach a steady state.
          • Observation: Deltas are typically small, while the address
            space is large.
          • Table indexed by the PC of the load
          • Each entry holds the history of the load in the form of deltas.
          • Delta Correlation
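
       A sketch of a table entry and its update (table and buffer sizes are
       assumptions; the delta-correlation step is the one sketched earlier):

          #include <stdint.h>

          #define DCPT_ENTRIES 128
          #define DCPT_DELTAS    6

          struct dcpt_entry {
              uint64_t pc;
              uint64_t last_addr;
              uint64_t last_prefetch;        /* last address issued as a prefetch */
              int64_t  delta[DCPT_DELTAS];   /* circular buffer of recent deltas  */
              int      head;                 /* next position to write            */
          };

          static struct dcpt_entry dcpt[DCPT_ENTRIES];

          static void dcpt_access(uint64_t pc, uint64_t addr)
          {
              struct dcpt_entry *e = &dcpt[pc % DCPT_ENTRIES];

              if (e->pc != pc) {                         /* PC conflict: take over the entry */
                  e->pc = pc;
                  e->last_addr = addr;
                  e->last_prefetch = 0;
                  e->head = 0;
                  for (int i = 0; i < DCPT_DELTAS; i++)
                      e->delta[i] = 0;
                  return;
              }

              e->delta[e->head] = (int64_t)(addr - e->last_addr);
              e->head = (e->head + 1) % DCPT_DELTAS;
              e->last_addr = addr;

              /* Then: run delta correlation over e->delta[], issue the predicted
                 addresses, and record the last one in e->last_prefetch so the
                 same prefetches are not re-issued on the next access. */
          }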




www.ntnu.no                                                     M. Grannæs, Prefetching
60

     Delta Correlating Prefetch Tables




              PC   Last Addr.   Last Pref.   D   D   D   D   D     D     Ptr




www.ntnu.no                                                      M. Grannæs, Prefetching
61

     Delta Correlating Prefetch Tables


              10




                   100      10            -        -   -   -   -   -     -     -

                   PC    Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                            M. Grannæs, Prefetching
62

     Delta Correlating Prefetch Tables


              10     11




                   100       10            -        -   -   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
63

     Delta Correlating Prefetch Tables


              10     11




                   100       10            -        1   -   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
64

     Delta Correlating Prefetch Tables


              10     11




                   100       11            -        1   -   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
66

     Delta Correlating Prefetch Tables


              10     11   13




                   100         13          -        1   2   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
67

     Delta Correlating Prefetch Tables


              10     11   13    16




                   100         16          -        1   2   3   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
68

     Delta Correlating Prefetch Tables


              10     11   13    16     17   19       22




                   100         22           -             1   2   3   1   2    3      -

                   PC     Last Addr.    Last Pref.    D       D   D   D   D    D     Ptr




www.ntnu.no                                                                   M. Grannæs, Prefetching
69

     Delta Correlating Prefetch Tables
            [Figure: speedup (1 to 5) of Sequential, RPT, PC/DC and DCPT
            prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                     M. Grannæs, Prefetching
70

     DPC-1 Results
          1   Access Map Pattern Matching
          2   Global History Buffer - Local Delta Buffer
          3   Prefetching based on a Differential Finite Context Machine
          4   Delta Correlating Prediction Tables



       What did the winning entries do differently?
          • AMPM - Massive reordering to expose more patterns.
          • GHB-LDB and PDFCM - Prefetch into the L1.




www.ntnu.no                                                     M. Grannæs, Prefetching
71

     Access Map Pattern Matching
          •   Winning entry by Ishii et al.
          •   Divides memory into hot zones
           •   Each zone is tracked using a 2-bit vector
          •   Examines each zone for constant strides
          •   Ignores temporal information




       Lesson
       Modern processors and compilers reorder loads, so the temporal order of
       misses can be misleading.

www.ntnu.no                                                  M. Grannæs, Prefetching
72

     Global History Buffer - Local Delta Buffer

          • Second place by Dimitrov et al.
          • Somewhat similar to DCPT
          • Improves PC/DC prefetching by including global correlation
          • Most common stride
          • Prefetches directly into the L1

       Lesson
       Prefetching into the L1 gives an extra performance boost.
       Using the most common stride helps.




www.ntnu.no                                                  M. Grannæs, Prefetching
73
     Prefetching based on a Differential Finite
     Context Machine
          • Third place by Ramos et al.
          • Table with the most recent history for each load.
          • A hash of the history is computed and used to look up into a
              table containing the predicted stride
          • Repeat process to increase prefetching degree/distance
          • Separate prefetcher for L1

       Lesson
       Feedback to adjust prefetching degree/prefetching distance
       Prefetch into the L1



www.ntnu.no                                                     M. Grannæs, Prefetching
74

     Improving DCPT


       Partial Matching
       Technique for handling reordering, common strides, etc.

       L1 Hoisting
       Technique for handling L1 prefetching




www.ntnu.no                                                     M. Grannæs, Prefetching
75

     Partial Matching

          • AMPM ignores all temporal information
          • Reordering the delta history is very expensive:
              reordering 5 accesses gives 5! = 120 possibilities
          • Solution: Reduce spatial resolution by ignoring low bits

       Example delta stream
       8, 9, 10, 8, 10, 9
       ⇒ (ignore the lower 2 bits)
       8, 8, 8, 8, 8, 8
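
       A tiny sketch of the comparison with the low bits masked off (the 2-bit
       mask follows the example; everything else is an assumption):

          #define DELTA_MASK (~(long)0x3)      /* ignore the lower 2 bits */

          static int deltas_match(long a, long b)
          {
              return (a & DELTA_MASK) == (b & DELTA_MASK);
          }

       With this comparison the stream 8, 9, 10, 8, 10, 9 matches a constant
       stride of 8.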




www.ntnu.no                                                    M. Grannæs, Prefetching
76

     L1 Hoisting

           • All three top entries had mechanisms for prefetching into the L1
           • Problem: Pollution
           • Solution: Use the same highly accurate mechanism to
               prefetch into the L1.
           • In the steady state, only the last predicted delta will be used.
           • All other deltas have already been prefetched and are either in
               the L2 or on their way.
           • Hoist the first delta from the L2 to the L1 to increase
               performance.




www.ntnu.no                                                      M. Grannæs, Prefetching
77

     L1 Hoisting II


       Example delta stream
       2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3

       Steady state
       Prefetch the last delta into L2
       Hoist the first delta into L1
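
       A sketch of the steady-state issue step; the two issue_* functions are
       hypothetical hooks into the L1 and L2, and the predicted deltas come from
       the delta-correlation step:

          static void issue_prefetch_l2(unsigned long addr) { (void)addr; /* L2 hook */ }
          static void issue_prefetch_l1(unsigned long addr) { (void)addr; /* L1 hook */ }

          static void steady_state_prefetch(unsigned long addr,
                                            const long *pred, int n)
          {
              if (n == 0)
                  return;

              unsigned long last = addr;
              for (int i = 0; i < n; i++)
                  last += pred[i];
              issue_prefetch_l2(last);             /* only the last prediction is new to L2 */

              issue_prefetch_l1(addr + pred[0]);   /* hoist the next address into L1 */
          }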




www.ntnu.no                                       M. Grannæs, Prefetching
78

     DCPT-P

            [Figure: speedup (0 to 7) of DCPT-P, AMPM, GHB-LDB, PDFCM, RPT and
            PC/DC on milc, GemsFDTD, libquantum, leslie3d, lbm and sphinx3.]




www.ntnu.no                                                                                     M. Grannæs, Prefetching
79

     Interaction with the memory controller


          • So far we’ve talked about what to prefetch (address)
          • When and how to prefetch are equally important
          • Modern DRAM is complex
          • Modern DRAM controllers are even more complex
          • Bandwidth limited




www.ntnu.no                                                   M. Grannæs, Prefetching
80

     Modern DRAM

          • Can have multiple independent memory controllers
          • Can have multiple channels per controller
          • Typically multiple banks
          • Each bank contains several pages (rows) of data (typically 1k–8k)
          • An accessed page is placed in a single page buffer
          • Access time to the page buffer is much lower than a full access




www.ntnu.no                                                    M. Grannæs, Prefetching
81

     The 3D structure of modern DRAM




www.ntnu.no                        M. Grannæs, Prefetching
82

     Example



       Suppose a processor requires data at locations X1 and X2 that are
       located on the same page at times T1 and T2 .
       There are two separate outcomes:




www.ntnu.no                                                  M. Grannæs, Prefetching
87

     Case 1:
       The requests occur at roughly the same time:
          1   Read 1 (T1 ) enters the memory controller
          2   The page is opened
          3   Read 2 (T2 ) enters the memory controller
          4   Data X1 is returned from DRAM
          5   Data X2 is returned from DRAM
          6   The page is closed
       Although there are two separate reads, the page is only opened
       once.




www.ntnu.no                                                 M. Grannæs, Prefetching
88

     Case 2:
       The requests are separated in time:
         1 Read 1 (T1 ) enters the memory controller
         2 The page is opened
         3 Data X1 is returned from DRAM
         4 The page is closed
         5 Read 2 (T2 ) enters the memory controller
         6 The page is opened again
         7 Data X2 is returned from DRAM
         8 The page is closed
       The page is opened and closed twice. By prefetching X2 while the page is
       still open we can increase performance, both by reducing latency and by
       increasing memory throughput.


www.ntnu.no                                              M. Grannæs, Prefetching
89

     When does prefetching pay off?
       The break-even point:


       Prefetching Accuracy · Cost of Prefetching = Cost of Single Read


       What is the cost of prefetching?
         • Application dependent
        • Less than the cost of a single read, because:
              •   Able to utilize open pages
              •   Reduce latency
              •   Increase throughput
              •   Multiple banks
              •   Lower latency


www.ntnu.no                                                M. Grannæs, Prefetching
90

     Performance vs. Accuracy
            [Figure: prefetcher accuracy (0–100%) plotted against IPC improvement
            (−40% to +60%) for Sequential, Scheduled Region, CZone/Delta Correlation
            and Reference Prediction Tables prefetching, with a threshold line.]




www.ntnu.no                                                                           M. Grannæs, Prefetching
91

     Q&A



       Thank you for listening!




www.ntnu.no                       M. Grannæs, Prefetching
TDT 4260 – lecture 17/2

• Contents
   – Cache coherence                        Chap 4.2
      • Repetition
      • Snooping protocols
• SMP performance                           Chap 4.3
   – Cache performance
• Directory based cache coherence           Chap 4.4
• Synchronization                           Chap 4.5
• UltraSPARC T1 (Niagara)                   Chap 4.8

                   1                                                           Lasse Natvig

Updated lecture plan pr. 17/2

Date and lecturer        Topic
1: 14 Jan (LN, AI)       Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB)           Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB)            ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN)            Multiprocessors, Chapter 4
5: 11 Feb (MG)           Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN, MJ)       Multiprocessors continued // Writing a comp.arch. paper (relevant for miniproject, by MJ)
7: 24 Feb (IB)           Memory and cache, cache coherence (Chap. 5)
8: 3 Mar (IB)            Piranha CMP + Interconnection networks
9: 11 Mar (LN)           Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
10: 18 Mar (IB)          Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI)      (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN)        Wrap up lecture, remaining stuff
13: 8 Apr                Slack – no lecture planned

                   2                                                           Lasse Natvig




Miniproject groups, updates?

Rank        Prefetcher                 Group                      Score
1           rpt64k4_pf                 Farfetched                 1.089
2           rpt_prefetcher_rpt_seq     L2Detour                   1.072
3           teeest                     Group 6                    1.000

                   3                                                           Lasse Natvig

IDI Open, a challenge for you?

• http://events.idi.ntnu.no/open11/
• 2 April, programming contest, informal, fun, pizza,
  coke (?), party (?), 100–150 people, mostly
  students, low threshold
• Teams: 3 persons, one PC, Java, C/C++ ?
• Problems: Some simple, some tricky
• Our team ”DM-gruppas beskjedne venner” is
  challenging you students!
   – And we will challenge some of the ICT companies in
     Trondheim

                   4                                                           Lasse Natvig




SMP: Cache Coherence Problem

   [Figure: three processors P1, P2, P3 with private caches and a shared
   memory holding u:5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes
   u = 7, (4) P1 reads u again (cache hit), (5) P2 reads u (cache miss).]

• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
   – correct value (if write-through caches)
   – old value (if write-back caches)
• Unacceptable to programs, and frequent!

                   5                                                           Lasse Natvig

Enforcing coherence (recap)

• Separate caches speed up access
   – Migration
      • Moved from shared memory to local cache
   – Replication
      • Several local copies when item is read by several
• Need coherence protocols to track shared data
   – (Bus) snooping
      • Each cache maintains local status
      • All caches monitor broadcast medium
      • Write invalidate / Write update

                   6                                                           Lasse Natvig
State Machine (1/3)

State machine for CPU requests, for each cache block
(states: Invalid, Shared (read only), Exclusive (read/write)):

• Invalid   – CPU read miss: place read miss on bus → Shared
            – CPU write: place write miss on bus → Exclusive
• Shared    – CPU read hit: no action
            – CPU read miss: place read miss on bus (stays Shared)
            – CPU write (miss ⇒ write miss on bus, hit ⇒ invalidate on bus) → Exclusive
• Exclusive – CPU read hit / CPU write hit: no action
            – CPU read miss: write back block, place read miss on bus → Shared
            – CPU write miss: write back cache block, place write miss on bus (stays Exclusive)

                   7                                                           Lasse Natvig

State Machine (2/3)

State machine for bus requests, for each cache block:

• Shared    – write miss / invalidate for this block → Invalid
• Exclusive – write miss for this block: write back block (abort memory access) → Invalid
            – read miss for this block: write back block (abort memory access) → Shared

                   8                                                           Lasse Natvig
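
A much-simplified sketch in C of the snooping (MSI) protocol in the two state
machines above; bus messages are only named in comments, not modelled:

   #include <stdbool.h>

   enum coh_state { INVALID, SHARED, EXCLUSIVE };

   /* CPU-side requests: returns the new state of the block. */
   static enum coh_state cpu_request(enum coh_state s, bool is_write, bool hit)
   {
       if (!is_write) {
           /* read miss: place read miss on bus (write back first if EXCLUSIVE) */
           return hit ? s : SHARED;
       }
       /* write: place write miss on the bus, or an invalidate on a SHARED hit */
       return EXCLUSIVE;
   }

   /* Bus-side requests: another cache's miss to this block is observed. */
   static enum coh_state bus_request(enum coh_state s, bool other_is_write)
   {
       if (s == EXCLUSIVE) {
           /* write back the block and abort the memory access */
       }
       if (other_is_write)
           return INVALID;                    /* write miss / invalidate for this block */
       return (s == EXCLUSIVE) ? SHARED : s;  /* read miss: downgrade EXCLUSIVE */
   }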




State Machine (3/3)

• Combined state machine, for each cache block, for both CPU requests and
  bus requests – the transitions of (1/3) and (2/3) in one diagram:
   – Invalid → Shared on a CPU read miss (place read miss on bus)
   – Invalid → Exclusive on a CPU write (place write miss on bus)
   – Shared → Exclusive on a CPU write (write miss on bus; invalidate on a hit)
   – Shared / Exclusive → Invalid on a write miss or invalidate from the bus
     (Exclusive first writes the block back, aborting the memory access)
   – Exclusive → Shared on a CPU read miss or a read miss seen on the bus
     (write back the block)
   – CPU read hit / CPU write hit in Exclusive, and CPU read hit in Shared:
     no state change

                   9                                                           Lasse Natvig

Directory based cache coherence (1/2)

• Large MP systems, lots of CPUs
• Distributed memory preferable
   – Increases memory bandwidth
• Snooping bus with broadcast?
   – A single bus becomes a bottleneck
   – Other ways of communicating needed
      • With these, broadcasting is hard/expensive
   – Can avoid broadcast if we know exactly which caches
     have a copy ⇒ Directory

                  10                                                           Lasse Natvig




Directory based cache coherence (2/2)
• Directory knows which blocks are in which cache and their state
• Directory can be partitioned and distributed
• Typical states:
   – Shared
   – Uncached
   – Modified
• Protocol based on messages
• Invalidate and update sent only where needed
   – Avoids broadcast, reduces traffic
(Fig 4.19)

SMP performance (shared memory)
• Focus on cache performance
• 3 types of cache misses in uniprocessor (3 C's)
   – Capacity     (too small for working set)
   – Compulsory   (cold-start)
   – Conflict     (placement strategy)
• Multiprocessors also give coherence misses
   – True sharing
      • Misses because of sharing of data
   – False sharing
      • Misses because of invalidates that would not have happened with cache block size = one word
        (a C sketch of false sharing follows below)
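A minimal C sketch of false sharing (my own illustration, not from the slides): two threads update
different counters that happen to sit in the same cache block, so each write invalidates the block in
the other core's cache even though no data is logically shared. The 64-byte block size used for the
padding is an assumption.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 50000000UL

    /* Two counters in the same cache block: writes by one thread invalidate
     * the block in the other thread's cache (false sharing). */
    struct { long a, b; } shared_block;

    /* Padded version: each counter gets its own (assumed) 64-byte block. */
    struct { long a; char pad[56]; long b; } padded_block;

    static void *bump(void *arg) {
        long *p = (long *)arg;
        for (unsigned long i = 0; i < ITERS; i++) (*p)++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        /* Swap in &padded_block.a / &padded_block.b to compare running times. */
        pthread_create(&t1, NULL, bump, &shared_block.a);
        pthread_create(&t2, NULL, bump, &shared_block.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", shared_block.a, shared_block.b);
        return 0;
    }

With the padded version, both threads keep their counter's block in the Exclusive state and the
coherence misses disappear, which is exactly the "block size = one word" argument above.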
Example: L3 cache size (fig 4.11)
• AlphaServer 4100
   – 4 x Alpha @ 300 MHz
   – L1: 8 KB I + 8 KB D
   – L2: 96 KB
   – L3: off-chip, 2 MB
• Figure: normalized execution time for L3 cache sizes of 1, 2, 4 and 8 MB, broken down into
  instruction execution, L2/L3 cache access, memory access, PAL code and idle time

Example: L3 cache size (fig 4.12)
• Figure: memory cycles per instruction (0 to 3.25) for cache sizes of 1, 2, 4 and 8 MB, broken down
  into instruction, capacity/conflict, cold, false sharing and true sharing misses




Example: Increasing parallelism (fig 4.13)
• Figure: memory cycles per instruction for processor counts of 1, 2, 4, 6 and 8, broken down into
  instruction, conflict/capacity, cold, false sharing and true sharing misses

Example: Increased block size (fig 4.14)
• Figure: misses per 1,000 instructions for block sizes of 32, 64, 128 and 256 bytes, broken down into
  instruction, capacity/conflict, cold, false sharing and true sharing misses







How to Write a Computer Architecture Paper
TDT4260 Computer Architecture
18. February 2011
Magnus Jahre

2nd Branch Prediction Championship
• International competition similar to our prefetching exercise system
• Task: Implement your best possible branch predictor and write a paper about it
• Submission deadline: 15. April 2011
• More info: http://www.jilp.org/jwac-2/








How does pfJudge work?
• Each submitted file is one Kongull job
   – Contains 12 M5 instances since there are 12 CPUs per node
   – Each M5 instance runs a different SPEC 2000 benchmark
• The Kongull job is added to the job queue
   – Status "Running" can mean running or queued, be patient
   – Running a job can take a long time depending on load
   – Kongull is usually able to empty the queue during the night
• We can give you a regular user account on Kongull
   – Remember that Kongull is a shared resource!
   – Always calculate the expected CPU-hour demand of your experiment before submitting

Storage Estimation
• We impose a storage limit of 8 KB on your prefetchers
   – This limit is not checked by the exercise system
• This is realistic: hardware components are usually designed with an area budget in mind
• Estimating storage is simple
   – Table-based prefetcher: add up the bits used in each entry and multiply by the number of entries
     (a worked sketch follows below)
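A small C sketch of the storage estimate described above, for a hypothetical table-based prefetcher.
All field widths and the entry count are made-up example values, not the exercise's actual parameters.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical table-based prefetcher; every width below is an example value. */
        int entries        = 256;  /* number of table entries   */
        int tag_bits       = 16;   /* partial tag per entry     */
        int last_addr_bits = 32;   /* last address seen         */
        int stride_bits    = 12;   /* signed stride             */
        int state_bits     = 2;    /* 2-bit confidence counter  */

        int bits_per_entry = tag_bits + last_addr_bits + stride_bits + state_bits;
        long total_bits    = (long)bits_per_entry * entries;

        printf("%d bits/entry * %d entries = %ld bits = %ld bytes (budget: 8 KB = %d bytes)\n",
               bits_per_entry, entries, total_bits, total_bits / 8, 8 * 1024);
        return 0;
    }

For these example values the table needs 62 bits/entry * 256 entries = 1984 bytes, well inside the
8 KB budget; the same arithmetic applies to whatever fields your own prefetcher actually stores.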








HOW TO USE A SIMULATOR

Research Workflow
• Figure: the research workflow shown as a cycle, including "Evaluate Solution on Compute Cluster" and
  ending with "Receive PhD (get a real job)"












Why simulate?
• Model of a system
   – Model the interesting parts with high accuracy
   – Model the rest of the system with sufficient accuracy
• "All models are wrong but some are useful" (G. Box, 1979)
• The model does not necessarily have a one-to-one correspondence with the actual hardware
   – Try to model behavior
   – Simplify your code wherever possible

Know your model
• You need to figure out which system is being modeled!
• Pfsys helps you get started, but to draw conclusions from your work you need to understand what you
  are modeling








HOW TO WRITE A PAPER

Find Your Story
• A good computer architecture paper tells a story
   – All good stories have a bad guy: the problem
   – All good stories have a hero: the scheme
• Writing a good paper is all about finding and identifying your story
• Note that this story has to be told within the strict structure of a scientific article








Paper Format
• You will be pressed for space
• Try to say things as precisely as possible
   – Your first write-up can be as much as 3x the page limit and it's still easy (possible) to get it
     under the limit
• Think about your plots/figures
   – A good plot/figure gives a lot of information
   – Is this figure the best way of conveying this idea?
   – Is this plot the best way for visualizing this data?
   – Plots/figures need to be area efficient (but readable!)

Typical Paper Outline
• Abstract
• Introduction
• Background/Related Work
• The Scheme (substitute with a descriptive title)
• Methodology
• Results
• Discussion
• Conclusion (with optional further work)












Abstract
• An experienced reader should be able to understand exactly what you have done from only reading the
  abstract
   – This is different from a summary
• Should be short; the maximum varies from 150 to 200 words
• Should include a description of the problem, the solution and the main results
• Typically the last thing you write

Introduction
• Introduces the larger research area that the paper is a part of
• Introduces the problem at hand
• Explains the scheme
• Level of abstraction: "20 000 feet"








Related Work
• Reference the work that other researchers have done that is related to your scheme
• Should be complete (i.e. contain all relevant work)
   – Remember: you define the scope of your work
• Can be split into two sections: Background and Related Work
   – Background is an informative introduction to the field (often section 2)
   – Related work is a very dense section that includes all relevant references (often section n-1)

The Scheme
• Explain your scheme in detail
   – Choose an informative title
• Trick: Add an informative figure that helps explain your scheme
• If your scheme is complex, an informative example may be in order








Methodology
• Explains your experimental setup
• Should answer the following questions:
   – Which simulator did you use?
   – How have you extended the simulator?
   – Which parameters did you use for your simulations? (aim: reproducibility)
   – Which benchmarks did you use?
   – Why did you choose these benchmarks?
• Important: should be realistic
• If you are unsure about a parameter, run a simulation to check its impact

Results
• Show that your scheme works
• Compare to other schemes that do the same thing
   – Hopefully you are better, but you need to compare anyway
• Trick: "Oracle Scheme"
   – Uses "perfect" information to create an upper bound on the performance of a class of schemes
   – Prefetching: Best case is that all L2 accesses are hits
• Sensitivity analysis
   – Check the impact of model assumptions on your scheme












Discussion
• Only include this if you need it
• Can be used if:
   – You have weaknesses in your model that you have not accounted for
   – You tested improvements to your scheme that did not give good enough results to be included in
     "The Scheme" section

Conclusion
• Repeat the main results of your work
• Remember that the abstract, introduction and conclusion are usually read before the rest of the paper
• Can include Further Work:
   – Things you thought about doing that you did not have time to do








     Thank You




     Visit our website:
     http://research.idi.ntnu.no/multicore/




TDT 4260
Chap 5
TLP & Memory Hierarchy

Review on ILP
• What is ILP?
• Let the compiler find the ILP
   ▫ Advantages?
   ▫ Disadvantages?
• Let the HW find the ILP
   ▫ Advantages?
   ▫ Disadvantages?




Contents
• Multi-threading                        Chap 3.5
• Memory hierarchy                       Chap 5.1
   ▫ 6 basic cache optimizations
• 11 advanced cache optimizations        Chap 5.2

Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
   ▫ Must duplicate independent state of each thread, e.g., a separate copy of register file, PC and
     page table
   ▫ Memory shared through virtual memory mechanisms
   ▫ HW for fast thread switch; much faster than full process switch ≈ 100s to 1000s of clocks
• When to switch?
   ▫ Alternate instruction per thread (fine grain)
   ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)




Fine-Grained Multithreading
• Switches between threads on each instruction
   ▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads (sketch below)
• CPU must be able to switch threads every clock
• Hides both short and long stalls
   ▫ Other threads executed when one thread stalls
• But slows down execution of individual threads
   ▫ Thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun's Niagara

Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
   ▫ No need for very fast thread-switching
   ▫ Doesn't slow down thread, since switches only when thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
   ▫ Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or
     frozen
   ▫ New thread must fill pipeline before instructions can complete
• => Better for reducing penalty of high cost stalls, where pipeline refill << stall time
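A rough C sketch (my own illustration, not from the slides) of the fine-grained selection policy:
each cycle, pick the next thread in round-robin order, skipping threads that are currently stalled.

    #include <stdbool.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Round-robin thread selection that skips stalled threads (fine-grained MT).
     * Returns the chosen thread id, or -1 if every thread is stalled this cycle. */
    static int pick_thread(const bool stalled[NTHREADS], int last) {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (!stalled[t])
                return t;
        }
        return -1;            /* all threads stalled: the issue slot stays idle */
    }

    int main(void) {
        bool stalled[NTHREADS] = { false, true, false, false }; /* thread 1 waits on a miss */
        int last = 0;
        for (int cycle = 0; cycle < 6; cycle++) {
            int t = pick_thread(stalled, last);
            printf("cycle %d: issue from thread %d\n", cycle, t);
            if (t >= 0) last = t;
        }
        return 0;
    }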
Simultaneous Multi-threading: Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a system
• Can a high-ILP processor also exploit TLP?
   ▫ Functional units often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)
   ▫ Intel: Hyper-Threading
• Figure: issue slots per cycle (cycles 1-9) for one thread vs. two threads on an 8-unit machine
  (M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes)




Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
   ▫ Large set of virtual registers
      Virtual = not all visible at ISA level
      Register renaming
   ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
   ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each
     thread

Multi-threaded categories
• Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained,
  Multiprocessing and Simultaneous Multithreading, showing threads 1-5 and idle slots




Design Challenges in SMT
• SMT makes sense only with fine-grained
  implementation
  ▫ How to reduce the impact on single thread performance?
  ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
  ▫ Instruction issue - more candidate instructions need to
    be considered
  ▫ Instruction completion - choosing which instructions to
    commit may be challenging
• Ensuring that cache and TLB conflicts generated
  by SMT do not degrade performance
Why memory hierarchy? (fig 5.2)
• Figure: performance of processor vs. memory, 1980-2010 (log scale); the processor-memory performance
  gap keeps growing

Why memory hierarchy?
• Principle of Locality
   ▫ Spatial Locality
      Addresses near each other are likely referenced close together in time
   ▫ Temporal Locality
      The same address is likely to be reused in the near future
• Idea: Store recently used elements in fast memories close to the processor (example below)
   ▫ Managed by software or hardware?
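As a concrete illustration of both kinds of locality (my own example, not from the slides): summing a
matrix row by row touches addresses that are adjacent in memory (spatial locality), while reusing the
same accumulator every iteration is temporal locality.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void) {
        double sum = 0.0;      /* 'sum' is reused every iteration: temporal locality */

        /* Row-major traversal: consecutive accesses fall in the same cache block
         * (spatial locality). Swapping the two loops strides N*8 bytes per access
         * and misses far more often, even though it computes the same sum. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }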




Memory hierarchy
• We want large, fast and cheap at the same time
• Figure: processor (control + datapath) connected to a hierarchy of successively larger memories;
  speed from fastest to slowest, capacity from smallest to largest, cost from most expensive to
  cheapest

Cache block placement
• Block 12 placed in a cache with 8 cache lines:
   ▫ Fully associative: block 12 can go anywhere
   ▫ Direct mapped: block 12 can go only into block 4 (12 mod 8)
   ▫ Set associative: block 12 can go anywhere in set 0 (12 mod 4)
• Figure: the 8 cache block frames, the 4 sets (set 0-3), and block 12 within the 32-block memory
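The placement rules above reduce to a modulo computation; a minimal C sketch mirroring the block-12
example (the 2-way organisation is the assumption behind the 4 sets):

    #include <stdio.h>

    int main(void) {
        int block  = 12;     /* memory block address from the example */
        int frames = 8;      /* cache holds 8 blocks                  */

        /* Direct mapped: one frame per set, index = block mod #frames. */
        printf("direct mapped   -> frame %d\n", block % frames);          /* 12 mod 8 = 4 */

        /* 2-way set associative: 8 frames / 2 ways = 4 sets, index = block mod #sets. */
        int sets = frames / 2;
        printf("2-way set assoc -> set %d (either way)\n", block % sets); /* 12 mod 4 = 0 */

        /* Fully associative: a single set, so block 12 can go into any frame. */
        return 0;
    }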




Cache performance
Average access time = Hit time + Miss rate * Miss penalty   (sketch below)
• Miss rate alone is not an accurate measure
• Cache performance is important for CPU perf.
   • More important with higher clock rate
• Cache design can also affect instructions that don't access memory!
   • Example: A set associative L1 cache on the critical path requires extra logic which will increase
     the clock cycle time
   • Trade off: Additional hits vs. cycle time reduction

6 Basic Cache Optimizations
Reducing Hit Time
1. Giving Reads Priority over Writes
      Writes in write-buffer can be handled after a newer read if not causing dependency problems
2. Avoiding Address Translation during Cache Indexing
      E.g. use Virtual Memory page offset to index the cache
Reducing Miss Penalty
3. Multilevel Caches
      Both small and fast (L1) and large (& slower) (L2)
Reducing Miss Rate
4. Larger Block size (Compulsory misses)
5. Larger Cache size (Capacity misses)
6. Higher Associativity (Conflict misses)
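A minimal C sketch of the average access time formula from the "Cache performance" slide above; the
hit time, miss rate and miss penalty are made-up illustration values.

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty. */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Assumed example values: 1-cycle hit, 5% miss rate, 100-cycle miss penalty. */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 100.0));  /* 1 + 0.05*100 = 6 cycles */
        return 0;
    }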
1: Giving Reads Priority over Writes
• Caches typically use a write buffer
   ▫ CPU writes to cache and write buffer
   ▫ Cache controller transfers from buffer to RAM
   ▫ Write buffer usually FIFO with N elements
   ▫ Works well as long as buffer does not fill faster than it can be emptied
• Figure: Processor connected to Cache and to DRAM, with a Write Buffer between processor and DRAM
• Optimization
   ▫ Handle read misses before write buffer writes
   ▫ Must check for conflicts with write buffer first (sketch below)

Virtual memory
• Processes use a large virtual memory
• Virtual addresses are dynamically mapped to physical addresses using HW & SW
• Page, page frame, page fault, translation lookaside buffer (TLB) etc.
• Figure: virtual address spaces of process 1 and process 2 (addresses 0 to 2^n - 1) mapped by address
  translation to physical addresses (0 to 2^m - 1)
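A rough C sketch of the conflict check in optimization 1 (my own illustration): before a read miss is
allowed to bypass buffered writes, the write buffer is searched for an entry with the same block
address. The buffer layout and the 64-byte block size are assumptions.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 8
    #define BLOCK_BITS 6        /* assumed 64-byte blocks */

    struct wb_entry { bool valid; uint64_t block_addr; uint64_t data[8]; };
    static struct wb_entry write_buffer[WB_ENTRIES];

    /* A read miss may be serviced ahead of buffered writes only if no buffered
     * write targets the same block; otherwise that entry must be forwarded or drained. */
    static bool read_conflicts_with_write_buffer(uint64_t addr) {
        uint64_t block = addr >> BLOCK_BITS;
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].block_addr == block)
                return true;
        return false;
    }

    int main(void) {
        write_buffer[0] = (struct wb_entry){ .valid = true, .block_addr = 0x1000 >> BLOCK_BITS };
        printf("conflict at 0x1000: %d\n", read_conflicts_with_write_buffer(0x1000));
        printf("conflict at 0x2000: %d\n", read_conflicts_with_write_buffer(0x2000));
        return 0;
    }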




2: Avoiding Address Translation during Cache Indexing
• Virtual cache: Use virtual addresses in caches
   ▫ Saves time on translation VA -> PA
   ▫ Disadvantages
      Must flush cache on process switch
         Can be avoided by including PID in tag
      Alias problem: OS and a process can have two VAs pointing to the same PA
• Compromise: "virtually indexed, physically tagged"
   ▫ Use page offset to index cache
   ▫ The same for VA and PA
   ▫ At the same time as data is read from cache, the VA -> PA translation is done for the tag
   ▫ Tag comparison using PA
   ▫ But: Page size restricts cache size (sketch below)
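The last point can be made concrete: in a virtually indexed, physically tagged cache the index and
block-offset bits must come from the untranslated page offset, so cache size is at most page size
times associativity. A small C sketch with assumed numbers (not from the slides):

    #include <stdio.h>

    int main(void) {
        /* Assumed example parameters. */
        long page_size = 4 * 1024;   /* 4 KB pages: 12 untranslated offset bits */
        int  ways      = 8;          /* 8-way set associative L1                */

        /* Each way can be at most one page, since the index must fit in the page offset. */
        long max_cache = page_size * ways;
        printf("max VIPT cache size = %ld KB\n", max_cache / 1024);  /* 32 KB */
        return 0;
    }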




3: Multilevel Caches (1/2)
• Make cache faster to keep up with CPU, or larger to reduce misses?
• Why not both?
• Multilevel caches
      Small and fast L1
      Large (and cheaper) L2

3: Multilevel Caches (2/2)
Average access time = L1 Hit time + L1 Miss rate *
                      (L2 Hit time + L2 Miss rate * L2 Miss penalty)   (sketch below)
• Local miss rate
   ▫ #cache misses / #cache accesses
• Global miss rate
   ▫ #cache misses / #CPU memory accesses
• L1 cache speed affects CPU clock rate
• L2 cache speed affects only L1 miss penalty
   ▫ Can use more complex mapping for L2
   ▫ L2 can be large
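A minimal C sketch of the two-level formula above, together with the local vs. global L2 miss rate
distinction; all numbers are assumed illustration values.

    #include <stdio.h>

    int main(void) {
        /* Assumed example values. */
        double l1_hit = 1.0,  l1_miss_rate = 0.05;
        double l2_hit = 10.0, l2_local_miss_rate = 0.40, l2_miss_penalty = 200.0;

        /* Average access time = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * L2 miss penalty) */
        double amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_miss_penalty);

        /* Global L2 miss rate = L2 misses / CPU memory accesses
         *                     = L1 miss rate * L2 local miss rate */
        double l2_global = l1_miss_rate * l2_local_miss_rate;

        printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n", amat, 100.0 * l2_global);
        /* 1 + 0.05 * (10 + 0.4*200) = 5.5 cycles; global miss rate = 2% */
        return 0;
    }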
4: Larger Block size
• Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes 1K to 256K; compulsory misses
  fall with larger blocks, while conflict and capacity misses grow for the smaller caches, so block
  size is a trade-off
• 32 and 64 byte blocks are common

5: Larger Cache size
• Simple method
• Square-root Rule (quadrupling the size of the cache will halve the miss rate), sketch below
• Disadvantages
   ▫ Longer hit time
   ▫ Higher cost
• Mostly used for L2/L3 caches
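A small C sketch of the square-root rule from "5: Larger Cache size": if quadrupling the cache halves
the miss rate, the miss rate scales roughly as 1/sqrt(cache size). The 64 KB / 4% starting point is an
assumed example.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double base_size = 64;    /* KB, assumed example         */
        double base_miss = 0.04;  /* 4% miss rate, assumed       */

        /* Square-root rule: miss_rate(new) = miss_rate(base) * sqrt(base_size / new_size),
         * so quadrupling the size halves the miss rate. */
        for (double size = base_size; size <= 1024; size *= 4)
            printf("%6.0f KB -> %.2f%% misses\n",
                   size, 100.0 * base_miss * sqrt(base_size / size));
        return 0;
    }

For the assumed values this prints 4% at 64 KB, 2% at 256 KB and 1% at 1 MB, which is the pattern the
rule describes (and why the technique is mostly used for the large L2/L3 caches).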




6: Higher Associativity
• Lower miss rate
• Disadvantages
   ▫ Can increase hit time
   ▫ Higher cost
• 8-way has similar performance to fully associative

11 Advanced Cache Optimizations
Reducing hit time
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth
4. Pipelined caches
5. Non-blocking caches
6. Multibanked caches
Reducing Miss Penalty
7. Critical word first
8. Merging write buffers
Reducing Miss Rate
9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism
10. Hardware prefetching
11. Compiler prefetching




1: Small and simple caches
 • Comparing the address against the tag memory takes time
 • ⇒ A small cache can help hit time
   ▫ E.g., the L1 caches stayed the same size for 3 generations of AMD
     microprocessors: K6, Athlon, and Opteron
   ▫ Keeping the L2 cache small enough to fit on chip with the processor also
     avoids the time penalty of going off chip
 • Simple ⇒ direct mapped
   ▫ The tag check can be overlapped with data transmission, since there is no
     choice of way
 • Access time estimate for 90 nm using the CACTI 4.0 model
   ▫ Median ratios of access time relative to the direct-mapped caches are 1.32,
     1.39, and 1.43 for 2-way, 4-way, and 8-way caches
 [Figure: access time (ns) versus cache size (16 KB - 1 MB) for 1-way, 2-way,
  4-way and 8-way caches]

2: Way prediction
 • Extra bits are kept in the cache to predict which way (block) in a set the
   next access will hit
   ▫ The tag of the predicted way can be retrieved early for comparison
   ▫ Achieves a fast hit even with just one comparator
   ▫ Several extra cycles are needed to check the other blocks when the
     prediction misses
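A minimal sketch of the way-prediction idea (all structures, names and latencies
below are illustrative assumptions, not the hardware described in the lecture):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Sketch of way prediction for a 2-way set-associative cache.
  struct Line { bool valid = false; std::uint64_t tag = 0; };

  struct Set {
      Line way[2];
      int predicted_way = 0;               // the extra prediction bits kept per set
  };

  class WayPredictedCache {
  public:
      explicit WayPredictedCache(std::size_t num_sets) : sets_(num_sets) {}

      // Returns an assumed cycle count: 1 on a correctly predicted hit,
      // a couple of extra cycles when the other way must still be checked.
      int access(std::uint64_t addr) {
          std::uint64_t block = addr / kBlockBytes;
          std::size_t index   = block % sets_.size();
          std::uint64_t tag   = block / sets_.size();
          Set& s = sets_[index];

          int p = s.predicted_way;                              // probe predicted way first
          if (s.way[p].valid && s.way[p].tag == tag) return 1;  // fast hit, one comparator

          int o = p ^ 1;                                        // then check the other way
          if (s.way[o].valid && s.way[o].tag == tag) {
              s.predicted_way = o;                              // update the prediction bits
              return 1 + kExtraCheckCycles;                     // slower hit on misprediction
          }
          return kMissCycles;                                   // miss (refill not modelled)
      }

  private:
      static constexpr std::uint64_t kBlockBytes = 64;          // assumed block size
      static constexpr int kExtraCheckCycles = 2;               // assumed
      static constexpr int kMissCycles = 20;                    // assumed
      std::vector<Set> sets_;
  };

The point of the sketch is that the common case reads and compares only one tag,
which is what makes the hit as fast as in a direct-mapped cache.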
3: Trace caches
 • It is increasingly hard to feed modern superscalar processors with enough
   instructions
 • Trace cache
   ▫ Stores dynamic instruction sequences rather than "bytes of data"
   ▫ An instruction sequence may include branches
      Branch prediction is integrated with the cache
   ▫ Complex and relatively little used
   ▫ Used in the Pentium 4: the trace cache stores up to 12K micro-ops decoded
     from x86 instructions (also saves decode time)

4: Pipelined caches
 • Pipeline technology applied to cache lookups
   ▫ Several lookups in progress at once
   ▫ Results in a faster cycle time
   ▫ Examples: Pentium (1 cycle), Pentium III (2 cycles), Pentium 4 (4 cycles)
   ▫ L1: increases the number of pipeline stages needed to execute an instruction
   ▫ L2/L3: increases throughput
      Nearly for free, since the hit latency is on the order of 10-20 processor
      cycles and caches are easy to pipeline




5: Non-blocking caches (1/2)
 • A non-blocking (lockup-free) cache allows the data cache to continue to supply
   cache hits during a miss
 • "Hit under miss" reduces the effective miss penalty by working during a miss
   instead of ignoring CPU requests
 • "Hit under multiple miss" or "miss under miss" may further lower the effective
   miss penalty by overlapping multiple misses
   ▫ Requires that the lower-level memory can service multiple concurrent misses
   ▫ Significantly increases the complexity of the cache controller, since there
     can be multiple outstanding memory accesses
   ▫ The Pentium Pro allows 4 outstanding memory misses

5: Non-Blocking Cache Implementation
 • The cache can handle as many concurrent misses as there are MSHRs
 • The cache must block when all valid bits (V) are set
 • Very common
   MSHR = Miss information/Status Holding Register
   MHA  = Miss Handling Architecture
   DMHA = Dynamic Miss Handling Architecture
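A toy sketch of an MSHR file (field names and sizes are assumptions for
illustration, not a real cache controller):

  #include <array>
  #include <cstddef>
  #include <cstdint>

  // One MSHR tracks one outstanding miss; a real entry also records which
  // words and destination registers to forward when the refill arrives.
  struct MSHR {
      bool valid = false;              // the "V" bit
      std::uint64_t block_addr = 0;    // which block the miss is for
  };

  template <std::size_t N>
  class MSHRFile {
  public:
      // On a miss: returns true if the miss can be tracked (or merged with an
      // already-pending miss to the same block); false means all V bits are
      // set and the cache must block.
      bool handle_miss(std::uint64_t block_addr) {
          for (auto& m : mshrs_)
              if (m.valid && m.block_addr == block_addr) return true;   // merge
          for (auto& m : mshrs_)
              if (!m.valid) { m.valid = true; m.block_addr = block_addr; return true; }
          return false;                                                 // cache blocks
      }

      // When the refill for block_addr returns, free the entry.
      void complete(std::uint64_t block_addr) {
          for (auto& m : mshrs_)
              if (m.valid && m.block_addr == block_addr) m.valid = false;
      }

  private:
      std::array<MSHR, N> mshrs_;
  };

With, say, MSHRFile<4>, the cache can overlap up to four misses, matching the
"as many concurrent misses as there are MSHRs" statement above.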




5: Non-blocking Cache Performance

6: Multibanked caches
 • Divide the cache into independent banks that can support simultaneous accesses
   ▫ E.g., the T1 ("Niagara") L2 has 4 banks
 • Banking works best when accesses naturally spread themselves across the banks
   ⇒ the mapping of addresses to banks affects the behavior of the memory system
 • A simple mapping that works well is "sequential interleaving"
   ▫ Spread block addresses sequentially across banks
   ▫ E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0,
     bank 1 has all blocks whose address modulo 4 is 1, and so on
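Sequential interleaving is just a modulo on the block address; a small sketch
(64-byte blocks and 4 banks are assumptions chosen for illustration):

  #include <cstdint>

  constexpr std::uint64_t kBlockSize = 64;   // assumed cache block size
  constexpr std::uint64_t kNumBanks  = 4;    // assumed number of banks

  std::uint64_t bank_for(std::uint64_t byte_addr) {
      std::uint64_t block_addr = byte_addr / kBlockSize;   // which cache block
      return block_addr % kNumBanks;        // bank 0 holds blocks ≡ 0 (mod 4), etc.
  }

  // bank_for(0) == 0, bank_for(64) == 1, bank_for(128) == 2, bank_for(256) == 0:
  // consecutive blocks land in different banks, so streaming accesses use all banks.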
7: Critical word first
 • Don't wait for the full block before restarting the CPU
 • Early restart: as soon as the requested word of the block arrives, send it to
   the CPU and let the CPU continue execution
 • Critical word first: request the missed word first from memory and send it to
   the CPU as soon as it arrives; let the CPU continue execution while filling
   the rest of the words in the block
   ▫ Long blocks are more popular today ⇒ critical word first is widely used

8: Merging write buffers
 • A write buffer allows the processor to continue while waiting to write to memory
   ▫ If the buffer contains modified blocks, the addresses can be checked to see
     whether the address of the new data matches the address of a valid write
     buffer entry
   ▫ If so, the new data are combined with that entry
 • Multiword writes are more efficient to memory
 • The Sun T1 (Niagara) processor, among many others, uses write merging
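A sketch of the merging idea (entry layout and sizes are assumptions; stores are
assumed not to cross a block boundary):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  constexpr std::size_t kBlockSize = 64;   // assumed block size

  struct WriteBufferEntry {
      bool valid = false;
      std::uint64_t block_addr = 0;        // block-aligned address
      std::uint8_t data[kBlockSize] = {};
      std::uint64_t byte_valid = 0;        // one bit per byte written
  };

  class MergingWriteBuffer {
  public:
      explicit MergingWriteBuffer(std::size_t entries) : buf_(entries) {}

      // Returns false if the buffer is full and the processor must stall.
      bool write(std::uint64_t addr, const void* src, std::size_t len) {
          std::uint64_t block = addr & ~std::uint64_t(kBlockSize - 1);
          std::size_t   off   = addr &  std::uint64_t(kBlockSize - 1);
          for (auto& e : buf_)             // merge into an existing entry if possible
              if (e.valid && e.block_addr == block) return fill(e, off, src, len);
          for (auto& e : buf_)             // otherwise allocate a new entry
              if (!e.valid) { e.valid = true; e.block_addr = block; return fill(e, off, src, len); }
          return false;                    // buffer full
      }

  private:
      static bool fill(WriteBufferEntry& e, std::size_t off, const void* src, std::size_t len) {
          std::memcpy(e.data + off, src, len);
          for (std::size_t i = 0; i < len; ++i) e.byte_valid |= (1ull << (off + i));
          return true;
      }
      std::vector<WriteBufferEntry> buf_;
  };

Merging means a burst of small stores to the same block drains to memory as one
multiword write instead of many single-word writes.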




9: Compiler optimizations
 • Instruction order can often be changed without affecting correctness
   ▫ May reduce conflict misses
   ▫ Profiling may help the compiler
 • The compiler generates instructions grouped in basic blocks
   ▫ If the start of a basic block is aligned to a cache block, misses will be
     reduced
   ▫ Important for larger cache block sizes
 • Data is even easier to move
   ▫ Lots of different compiler optimizations

10: Hardware prefetching
 • Prefetching relies on having extra memory bandwidth that can be used without
   penalty
 • Instruction prefetching
   ▫ Typically, the CPU fetches 2 blocks on a miss: the requested block and the
     next consecutive block
   ▫ The requested block is placed in the instruction cache when it returns, and
     the prefetched block is placed in an instruction stream buffer
 • Data prefetching
   ▫ The Pentium 4 can prefetch data into the L2 cache from up to 8 streams
   ▫ Prefetching is invoked after 2 successive L2 cache misses to a page
 [Figure: performance improvement from hardware prefetching on a Pentium 4 for
  SPECint2000 and SPECfp2000 benchmarks; the speedups for the 11 benchmarks shown
  range from 1.16 to 1.97]
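A minimal sketch of next-line instruction prefetching into a stream buffer
(buffer depth and policy details are assumptions, not the Pentium 4 mechanism;
blocks are identified by block number):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <deque>

  class NextLinePrefetcher {
  public:
      // Called on an instruction-cache miss for block 'b'.  Returns true if the
      // block was already waiting in the stream buffer (the prefetch was useful).
      bool on_icache_miss(std::uint64_t b) {
          auto it = std::find(stream_.begin(), stream_.end(), b);
          bool hit = (it != stream_.end());
          if (hit) stream_.erase(it);      // block moves into the I-cache (not modelled)
          prefetch(b + 1);                 // always prefetch the next consecutive block
          return hit;
      }

  private:
      void prefetch(std::uint64_t b) {
          if (stream_.size() == kDepth) stream_.pop_front();
          stream_.push_back(b);            // a real buffer issues the memory read here
      }
      static constexpr std::size_t kDepth = 4;   // assumed stream buffer depth
      std::deque<std::uint64_t> stream_;
  };

Sequential code streams hit in the buffer almost every time, which is why this
simple scheme already captures much of the benefit for instruction fetch.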




11: Compiler prefetching
 • Data prefetch
   ▫ Register prefetch: load data into a register (HP PA-RISC loads)
   ▫ Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
   ▫ Special prefetching instructions cannot cause faults; a form of speculative
     execution
 • Issuing prefetch instructions takes time
   ▫ Is the cost of issuing the prefetches < the savings from reduced misses?
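As an illustration (not from the lecture), a compiler or programmer can insert
cache-prefetch instructions ahead of the loads that will need the data; the
sketch below uses the GCC/Clang __builtin_prefetch intrinsic with an assumed
prefetch distance:

  #include <cstddef>

  // Sum an array while prefetching ahead.  The distance of 16 elements is an
  // assumption; it should roughly cover the memory latency divided by the time
  // per loop iteration.
  double sum(const double* a, std::size_t n) {
      constexpr std::size_t kDist = 16;
      double s = 0.0;
      for (std::size_t i = 0; i < n; ++i) {
          if (i + kDist < n)
              __builtin_prefetch(&a[i + kDist]);   // non-faulting cache prefetch
          s += a[i];
      }
      return s;
  }

Each prefetch still occupies an issue slot, which is exactly the cost/benefit
question raised in the last bullet above.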
Cache Coherency
 • Consider the following case: I have two processors that are sharing address X
 • Both cores read address X
 • Address X is brought from memory into the caches of both processors
 • Now one of the processors writes to address X and changes the value
 • What happens? How does the other processor get notified that address X has
   changed?

Two types of cache coherence schemes
 • Snooping
   ▫ Broadcast writes, so all copies in all caches will be properly invalidated
     or updated
 • Directory
   ▫ In a structure, keep track of which cores are caching each address
   ▫ When a write occurs, query the directory and properly handle any other
     cached copies
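A toy sketch of the directory idea (not the protocol of any specific machine in
the lecture; at most 32 cores assumed so a 32-bit sharer vector suffices):

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  class Directory {
  public:
      explicit Directory(int num_cores) : num_cores_(num_cores) {}

      void on_read(std::uint64_t block, int core) {
          sharers_[block] |= (1u << core);          // record the new sharer
      }

      // Returns the cores whose copies must be invalidated before the write.
      std::vector<int> on_write(std::uint64_t block, int writer) {
          std::vector<int> to_invalidate;
          std::uint32_t& bits = sharers_[block];
          for (int c = 0; c < num_cores_; ++c)
              if (c != writer && (bits & (1u << c))) to_invalidate.push_back(c);
          bits = (1u << writer);                    // writer becomes the only holder
          return to_invalidate;
      }

  private:
      int num_cores_;
      std::unordered_map<std::uint64_t, std::uint32_t> sharers_;
  };

A snooping scheme reaches the same result without the directory structure: the
write is broadcast on the shared bus and every cache checks its own tags.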
TDT 4260
Appendix E
Interconnection Networks

Contents
 • Introduction                       App E.1
 • Two devices                        App E.2
 • Multiple devices                   App E.3
 • Topology                           App E.4
 • Routing, arbitration, switching    App E.5

Conceptual overview

Motivation
 • Basic network technology assumed known
 • Motivation
   ▫ Increased importance
      System-to-system connections
      Intra-system connections
   ▫ Increased demands
      Bandwidth, latency, reliability, ...
   ▫ Vital part of system design




Types of networks
 • Number of devices and distance
   ▫ OCN - On-chip network: functional units, register files, caches, ...
     Also known as: Network on Chip (NoC)
   ▫ SAN - System/storage area network: multiprocessors and multicomputers, storage
   ▫ LAN - Local area network
   ▫ WAN - Wide area network
 • Trend: switches replace buses

E.2: Connecting two devices
 [Figure: two devices connected by a dedicated link; the destination is implicit]
Software to Send and Receive
 • SW send steps
   1: Application copies data to an OS buffer
   2: OS calculates the checksum, starts a timer
   3: OS sends the data to the network interface HW and says start
 • SW receive steps
   3: OS copies the data from the network interface HW to an OS buffer
   2: OS calculates the checksum; if it matches, send an ACK; if not, delete the
      message (the sender resends when its timer expires)
   1: If OK, OS copies the data to the user address space and signals the
      application to continue
 • The sequence of steps the SW follows is the protocol

Network media
 • Twisted pair: copper, 1 mm thick, twisted to avoid the antenna effect (telephone)
 • Coaxial cable: copper core, insulator, braided outer conductor, plastic
   covering; used by cable companies: high BW, good noise immunity
 • Fiber optics: 3 parts - the cable (silica), a light source (LED or laser
   diode) and a light detector (photodiode); relies on total internal reflection
   ▫ Multimode fiber disperses the light (LED); single mode carries a single
     wave (laser)




Basic Network Structure and Functions
 • Media and form factor
   [Figure: media type versus distance - metal layers and printed circuit boards
    for OCNs (~0.01 m), InfiniBand, Ethernet and Myrinet connectors for SANs
    (~1-10 m), Cat5E twisted pair and coaxial cables for LANs (~100 m), fiber
    optics for WANs (>1,000 m)]

Packet latency
 [Figure: timeline with sender overhead (processor busy), time of flight,
  transmission time (size/bandwidth) and receiver overhead (processor busy);
  transport latency versus total latency]

 Total latency = Sender overhead + Time of flight
                 + Message size / Bandwidth + Receiver overhead
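A quick worked example (the numbers are assumptions chosen only to illustrate
the formula): with 1 µs sender overhead, 0.5 µs time of flight, a 1000-byte
message over a 1 Gbit/s (125 MB/s) link, and 1 µs receiver overhead:

  Total latency = 1 µs + 0.5 µs + 1000 B / 125 MB/s + 1 µs
                = 1 + 0.5 + 8 + 1 = 10.5 µs

For small messages the fixed overheads dominate; link bandwidth only matters
once messages get large.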




E.3: Connecting multiple devices (1/3)
 • New issues
   ▫ Topology - what paths are possible for packets?
   ▫ Routing - which of the possible paths are allowable (valid) for packets?
   ▫ Arbitration - when are paths available for packets?
   ▫ Switching - how are paths allocated to packets?
 [Figure: shared media (Ethernet) - nodes on a common bus - versus switched
  media (CM-5, ATM) - nodes connected through a switch]

E.3: Connecting multiple devices (2/3)
 • Two types of topology
   ▫ Shared media
   ▫ Switched media
 • Shared media (bus)
   ▫ Arbitration
      Carrier sensing
      Collision detection
   ▫ Routing is simple
      Only one possible path
E.3: Connecting multiple devices (3/3)
 • Switched media
   ▫ "Point-to-point" connections
   ▫ Routing for each packet
   ▫ Arbitration for each connection
 • Comparison
   ▫ Much higher aggregate BW in a switched network than in a shared media network
   ▫ Shared media is cheaper
   ▫ Distributed arbitration is simpler for switched media

E.4: Interconnection Topologies
 • One switch or bus can connect only a limited number of devices
   ▫ Complexity, cost, technology, ...
 • Interconnected switches are needed for larger networks
 • Topology: connection structure
   ▫ What paths are possible for packets?
   ▫ All pairs of devices must have path(s) available
 • A network is partitioned by a set of links if their removal disconnects the graph
   ▫ Bisection bandwidth
   ▫ Important for performance




Crossbar
 • Common topology for connecting CPUs and I/O units
 • Also used for interconnecting CPUs
 • Fast and expensive (O(N²) switch points)
 • Non-blocking
 [Figure: crossbar connecting processors/caches and I/O units to memory modules]

Omega network
 • Built from 2x2 switches, each able to route straight, crossover, upper
   broadcast or lower broadcast
 • Example of a multistage network
 • Usually log2 n stages for n inputs - O(N log N) switches
 • Can block
 [Figure: 8x8 omega network connecting sources 000-111 to destinations 000-111]
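One way to route in such a multistage network is destination-tag (self-)routing:
each stage looks at one bit of the destination address. The sketch below assumes
an 8-input network with three stages, and that bit 0 selects the upper output
and bit 1 the lower one (switch numbering details vary between drawings):

  #include <cstdio>

  // Print the path a packet takes through a 3-stage 8x8 omega network.
  void route(unsigned src, unsigned dst) {
      std::printf("packet %u -> %u:", src, dst);
      for (int stage = 0; stage < 3; ++stage) {
          unsigned bit = (dst >> (2 - stage)) & 1u;   // MSB first
          std::printf("  stage %d: %s output", stage, bit ? "lower" : "upper");
      }
      std::printf("\n");
  }

  int main() {
      route(0b010, 0b110);   // one example path through the three stages
      return 0;
  }

Because each switch output is shared by several source/destination pairs, two
packets may need the same output at the same stage, which is why the network
can block.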




Linear Arrays and Rings
 • Distributed switched networks
 • Node = switch + 1..n end nodes
   [Figure: each switch has a processor with cache, memory, memory controller
    and network interface attached, plus an external I/O port]
 • Linear array = 1D grid
 • 2D grid
 • A torus adds wrap-around connections
 • CRAY machines use a 3D torus

Trees
 • Diameter and average distance are logarithmic
   ▫ k-ary tree, height d = log_k N
   ▫ address = d-vector of radix-k coordinates describing the path down from the root
 • Fixed number of connections per node (i.e. fixed degree)
 • Bisection bandwidth = 1 near the root
E.5: Routing, Arbitration, Switching
 • Routing
   ▫ Which of the possible paths are allowable for packets?
   ▫ The set of operations needed to compute a valid path
   ▫ Executed at source, intermediate, or even at destination nodes
 • Arbitration
   ▫ When are paths available for packets?
   ▫ Resolves packets requesting the same resources at the same time
   ▫ For every arbitration, there is a winner and possibly many losers
      Losers are buffered (lossless) or dropped on overflow (lossy)
 • Switching
   ▫ How are paths allocated to packets?
   ▫ The winning packet (from arbitration) proceeds towards its destination
   ▫ Paths can be established one fragment at a time or in their entirety

Routing
 • Shared media
   ▫ Broadcast to everyone
 • Switched media needs real routing. Options:
   ▫ Source-based routing: the message specifies the path to the destination
     (changes of direction)
   ▫ Virtual circuit: a circuit is established from source to destination, and
     the message picks the circuit to follow
   ▫ Destination-based routing: the message specifies the destination; the
     switch must pick the path
      Deterministic: always follow the same path
      Adaptive: pick different paths to avoid congestion or failures
      Randomized routing: pick between several good paths to balance network load




Routing mechanism
 • Need to select an output port for each input packet
   ▫ And fast...
 • Simple arithmetic in regular topologies
   ▫ Ex: ∆x, ∆y routing in a grid (first ∆x, then ∆y)
      west (-x)    ∆x < 0
      east (+x)    ∆x > 0
      south (-y)   ∆x = 0, ∆y < 0
      north (+y)   ∆x = 0, ∆y > 0
 • Unidirectional links are sufficient for a torus (+x, +y)
 • Dimension-order routing
   ▫ Reduce the relative address of each dimension in order, to avoid deadlock

Deadlock
 • How can it arise?
   ▫ Necessary conditions:
      shared resources
      incrementally allocated
      non-preemptible
 • How do you handle it?
   ▫ Constrain how channel resources are allocated (deadlock avoidance)
   ▫ Add a mechanism that detects likely deadlocks and fixes them (deadlock recovery)
 [Figure: 4x4 grid of routers (TRC (0,0) ... TRC (3,3)) with a cyclic dependency
  among blocked packets]
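The ∆x, ∆y table above translates directly into code; a minimal sketch of
dimension-order (XY) routing in a 2D grid (port names are assumptions):

  // Route first in x until the x offset is zero, then in y.
  enum class Port { West, East, South, North, Local };

  Port route_xy(int cur_x, int cur_y, int dst_x, int dst_y) {
      int dx = dst_x - cur_x;
      int dy = dst_y - cur_y;
      if (dx < 0) return Port::West;     // ∆x < 0
      if (dx > 0) return Port::East;     // ∆x > 0
      if (dy < 0) return Port::South;    // ∆x = 0, ∆y < 0
      if (dy > 0) return Port::North;    // ∆x = 0, ∆y > 0
      return Port::Local;                // arrived at the destination
  }

Completing the x dimension before starting on y imposes a fixed order on the
channels a packet can request, which is why dimension-order routing avoids
deadlock in a mesh.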




Arbitration (1/2)
 • Several simultaneous requests to a shared resource
 • Ideal: maximize the usage of network resources
 • Problem: starvation
   ▫ Fairness needed
 • Figure: two-phase arbitration
   ▫ Request, Grant
   ▫ Poor usage

Arbitration (2/2)
 • Three phases
 • Multiple requests
 • Better usage
 • But: increased latency
Switching
 • Allocating paths for packets
 • Two techniques:
   ▫ Circuit switching (connection-oriented)
      A communication channel is allocated before the first packet is sent
      Packet headers don't need routing info
      Wastes bandwidth
   ▫ Packet switching (connectionless)
      Each packet is handled independently
      Can't guarantee response time
      Two types - next slide

Store & Forward vs Cut-Through Routing
 [Figure: a 4-flit packet traversing several hops; with store & forward routing
  the whole packet is received at every switch before it is forwarded, while
  with cut-through routing the flits are forwarded as soon as the header has
  been routed, so the packet is pipelined along the path]
 • Cut-through (on blocking)
   ▫ Virtual cut-through (spools the rest of the packet into a buffer)
   ▫ Wormhole (buffers only a few flits, leaves the tail along the route)
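A rough comparison under simplifying assumptions (no contention, packet of L
bytes, link bandwidth B, h hops, per-hop routing time ignored): store & forward
takes about h × L/B, because the whole packet must arrive at every switch
before it moves on, while cut-through takes about L/B plus a small per-hop
delay, since the flits are pipelined along the path. With, say, L/B = 1 µs and
h = 4, that is roughly 4 µs versus a little over 1 µs.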
Piranha:
Designing a Scalable CMP-based System for Commercial Workloads

Luiz André Barroso
Western Research Laboratory

April 27, 2001                      Asilomar Microcomputer Workshop

What is Piranha?
 • A scalable shared-memory architecture based on chip multiprocessing (CMP)
   and targeted at commercial workloads
 • A research prototype under development by Compaq Research and Compaq NonStop
   Hardware Development Group
 • A departure from ever-increasing processor complexity and system
   design/verification cycles
Importance of Commercial Applications
 Worldwide server customer spending (IDC 1999):
   Infrastructure 29%, Business processing 22%, Decision support 14%,
   Software development 14%, Collaborative 12%, Scientific & engineering 6%,
   Other 3%

 • Total server market size in 1999: ~$55-60B
   - technical applications: less than $6B
   - commercial applications: ~$40B
Price Structure of Servers
 • IBM eServer 680 (220 KtpmC; $43/tpmC)
   § 24 CPUs
   § 96GB DRAM, 18 TB Disk
   § $9M price tag
 • Compaq ProLiant ML570 (32 KtpmC; $12/tpmC)
   § 4 CPUs
   § 8GB DRAM, 2TB Disk
   § $240K price tag

 [Figure: normalized breakdown of HW cost into Base, CPU, DRAM and I/O for the
  two systems]

 Price per component:
   System                     $/CPU      $/MB DRAM   $/GB Disk
   IBM eServer 680            $65,417    $9          $359
   Compaq ProLiant ML570      $6,048     $4          $64

 - Storage prices dominate (50%-70% in customer installations)
 - Software maintenance/management costs are even higher (up to $100M)
 - The price of expensive CPUs/memory system is amortized
Outline
 • Importance of Commercial Workloads
 • Commercial Workload Requirements
 • Trends in Processor Design
 • Piranha
 • Design Methodology
 • Summary
Studies of Commercial Workloads
 • Collaboration with Kourosh Gharachorloo (Compaq WRL)
   - ISCA'98: Memory System Characterization of Commercial Workloads
     (with E. Bugnion)
   - ISCA'98: An Analysis of Database Workload Performance on Simultaneous
     Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
   - ASPLOS'98: Performance of Database Workloads on Shared-Memory Systems with
     Out-of-Order Processors (with P. Ranganathan and S. Adve)
   - HPCA'00: Impact of Chip-Level Integration on Performance of OLTP Workloads
     (with A. Nowatzyk and B. Verghese)
   - ISCA'01: Code Layout Optimizations for Transaction Processing Workloads
     (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)

Studies of Commercial Workloads: summary
 • The memory system is the main bottleneck
   - astronomically high CPI
   - dominated by memory stall times
   - instruction stalls as important as data stalls
   - fast/large L2 caches are critical
 • Very poor instruction-level parallelism (ILP)
   - frequent hard-to-predict branches
   - large L1 miss ratios
   - Ld-Ld dependencies
   - disappointing gains from wide-issue out-of-order techniques!
Outline
 • Importance of Commercial Workloads
 • Commercial Workload Requirements
 • Trends in Processor Design
 • Piranha
 • Design Methodology
 • Summary
Increasing Complexity of Processor Designs
 • Pushing the limits of instruction-level parallelism
   - multiple instruction issue
   - speculative out-of-order (OOO) execution
 • Driven by applications such as SPEC
 • Increasing design time and team size

   Processor     Year      Transistor count   Design team   Design time   Verification team
   (SGI MIPS)    shipped   (millions)         size          (months)      size (% of total)
   R2000         1985       0.10               20           15            15%
   R4000         1991       1.40               55           24            20%
   R10000        1996       6.80              >100          36           >35%

                              courtesy: John Hennessy, IEEE Computer, 32(8)

 • Yielding diminishing returns in performance
Exploiting Higher Levels of Integration
 [Figure: Alpha 21364 - a 1GHz 21264 CPU core with 64KB I$ and 64KB D$, 1.5MB
  L2$, memory controllers, coherence engine and network interface integrated on
  a single chip; multiple 21364 chips connect directly to each other, to memory
  (M) and to I/O (IO) in a 2D network]
 • Lower latency, higher bandwidth
 • Reuse of the existing CPU core addresses complexity issues
 • Incrementally scalable, glueless multiprocessing
Exploiting Parallelism in Commercial Apps
 • Simultaneous Multithreading (SMT)
   [Figure: one wide core whose issue slots are filled over time by instructions
    from threads 1-4]
   Example: Alpha 21464
 • Chip Multiprocessing (CMP)
   [Figure: two CPUs, each with private I$ and D$, sharing an L2$, memory
    controllers, coherence logic, network and I/O on one chip]
   Example: IBM Power4
 • SMT is superior in single-thread performance
 • CMP addresses complexity by using simpler cores
Outline
 • Importance of Commercial Workloads
 • Commercial Workload Requirements
 • Trends in Processor Design
 • Piranha
   - Architecture
   - Performance
 • Design Methodology
 • Summary
Piranha Project
 • Explore chip multiprocessing for scalable servers
 • Focus on parallel commercial workloads
 • Small team, modest investment, short design time
 • Address complexity by using:
   - simple processor cores
   - standard ASIC methodology

          Give up on ILP, embrace TLP
Piranha Team Members
 Research                              NonStop Hardware Development ASIC Design Center
   - Luiz André Barroso (WRL)            - Tom Heynemann
   - Kourosh Gharachorloo (WRL)          - Dan Joyce
   - David Lowell (WRL)                  - Harland Maxwell
   - Joel McCormack (WRL)                - Harold Miller
   - Mosur Ravishankar (WRL)             - Sanjay Singh
   - Rob Stets (WRL)                     - Scott Smith
   - Yuan Yu (SRC)                       - Jeff Sprouse
                                         - ... several contractors

 Former contributors: Robert McNamara, Basem Nayfeh, Andreas Nowatzyk,
 Joan Pendleton, Shaz Qadeer, Brian Robinson, Barton Sano, Daniel Scales,
 Ben Verghese
Piranha Processing Node
 [Figure: a single chip with 8 Alpha CPUs, each with its own I$ and D$, 8 L2$
  banks, 8 memory controllers, the home and remote protocol engines (HE, RE)
  and a router, all connected through the intra-chip switch (ICS)]
 • Alpha core: 1-issue, in-order, 500MHz
 • L1 caches: I & D, 64KB, 2-way
 • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
 • L2 cache: shared, 1MB, 8-way
 • Memory controller (MC): RDRAM, 12.8GB/sec
 • Protocol engines (HE & RE): µprogrammed, 1K µinstructions, even/odd interleaving
 • System interconnect: 4-port crossbar router, topology independent,
   32GB/sec total bandwidth
 • Single chip
Piranha I/O Node
 [Figure: I/O node chip with a router (2 links @ 8GB/s), one Alpha CPU with I$
  and D$, home and remote protocol engines (HE, RE), an L2$, a memory controller
  and a PCI-X interface, connected by the ICS]
 • The I/O node is a full-fledged member of the system interconnect
   - its CPU is indistinguishable from Processing Node CPUs
   - it participates in the global coherence protocol
Example Configuration
 [Figure: several processing nodes (P) and two I/O nodes (P-I/O) connected in an
  arbitrary topology]
 • Arbitrary topologies
 • Match the ratio of Processing to I/O nodes to the application requirements
L2 Cache and Intra-Node Coherence
 • No inclusion between the L1s and the L2 cache
   - total L1 capacity equals L2 capacity
   - L2 misses go directly to L1
   - L2 is filled by L1 replacements
 • L2 keeps track of all lines in the chip
   - sends invalidates and forwards
   - orchestrates L1-to-L2 write-backs to maximize on-chip memory utilization
   - cooperates with the protocol engines to enforce system-wide coherence
Inter-Node Coherence Protocol
 • 'Stealing' ECC bits for the memory directory
   [Figure: with wider ECC granularity fewer check bits are needed per data bit -
    8x(64+8), 4x(128+9+7), 2x(256+10+22), 1x(512+11+53); the freed bits hold
    directory state]
 • Directory (2b state + 40b sharing info)
   [Figure: two fields, each 2b state + 20b info on sharers]
 • Dual representation: limited pointer + coarse vector
 • "Cruise Missile" Invalidations (CMI)
   - limit fan-out/fan-in serialization with the coarse vector (CV)
 • Several new protocol optimizations
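The general idea behind a dual limited-pointer/coarse-vector representation can
be sketched as below; the field widths, pointer count and node grouping are
assumptions for illustration only, not the Piranha encoding:

  #include <cstdint>
  #include <vector>

  constexpr int kPointerCount  = 3;    // assumed: a few exact node pointers
  constexpr int kNodesPerGroup = 32;   // assumed coarse-vector grouping

  struct SharingInfo {
      bool coarse = false;
      std::vector<std::uint16_t> pointers;   // exact node IDs (limited pointers)
      std::uint64_t group_bits = 0;          // coarse vector, one bit per node group

      void add_sharer(std::uint16_t node) {
          if (!coarse) {
              if ((int)pointers.size() < kPointerCount) { pointers.push_back(node); return; }
              // Too many sharers for exact pointers: switch to the coarse vector.
              coarse = true;
              for (std::uint16_t p : pointers) group_bits |= 1ull << (p / kNodesPerGroup);
              pointers.clear();
          }
          group_bits |= 1ull << (node / kNodesPerGroup);
      }
  };

With the coarse vector, an invalidation must be sent to every node in each
marked group, which is the fan-out that the "cruise missile" invalidations are
designed to limit.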
Simulated Architectures
 [Table not extracted; the configurations compared on the following slides are
  P1 (one 500MHz, 1-issue Piranha core), INO (1GHz, 1-issue in-order),
  OOO (1GHz, 4-issue out-of-order) and P8 (a single-chip Piranha with eight
  500MHz, 1-issue cores)]

Single-Chip Piranha Performance
 Normalized execution time (lower is better; OOO = 100), broken down into CPU,
 L2 hit and L2 miss time:

            P1          INO        OOO        P8
            500MHz      1GHz       1GHz       500MHz
            1-issue     1-issue    4-issue    1-issue
   OLTP     233         145        100         34
   DSS      350         191        100         44

 • Piranha's performance margin is about 3x for OLTP and 2.2x for DSS
 • Piranha has more outstanding misses ⇒ it better utilizes the memory system
Single-Chip Performance (Cont.)
 [Figure 1: speedup versus number of cores (1-8) at 500 MHz, 1-issue - close to
  linear]
 [Figure 2: normalized breakdown of L1 misses (%) into L2 Hit, L2 Fwd and
  L2 Miss for P1, P2, P4 and P8 (500 MHz, 1-issue)]
 • Near-linear scalability
   - low memory latencies
   - effectiveness of the highly associative L2 and non-inclusive caching
Potential of a Full-Custom Piranha
[figure: normalized execution time (CPU / L2 Hit / L2 Miss) for OOO (1 GHz, 4-issue), P8 (500 MHz, 1-issue) and P8F (1.25 GHz, 1-issue) on OLTP and DSS; OOO = 100 in both, P8 = 34 / 43, P8F = 20 / 19]
l 5x margin over OOO for OLTP and DSS
l Full-custom design benefits substantially from boost in core speed

Outline
l Importance   of Commercial Workloads
l Commercial   Workload Requirements
l Trends   in Processor Design
l Piranha

l Design   Methodology
l Summary
Managing Complexity in the Architecture
l Use   of many simpler logic modules
   –   shorter design
   –   easier verification
   –   only short wires*
   –   faster synthesis
   –   simpler chip-level layout
l Simplify   intra-chip communication
   – all traffic goes through ICS (no backdoors)
l Use of microprogrammed protocol engines
l Adoption of large VM pages
l Implement sub-set of Alpha ISA
   – no VAX floating point, no multimedia instructions, etc.
Methodology Challenges
l Isolated   sub-module testing
   – need to create robust bus functional models (BFM)
   – sub-modules’ behavior highly inter-dependent
   – not feasible with a small team

l System-level    (integrated) testing
   –   much easier to create tests
   –   only one BFM at the processor interface
   –   simpler to assert correct operation
   –   Verilog simulation is too slow for comprehensive testing
Our Approach:
l Design   in stylized C++ (synthesizable RTL level)
   – use mostly system-level, semi-random testing
   – simulations in C++ (faster & cheaper than Verilog)
        § simulation speed ~1000 clocks/second
   – employ directed tests to fill test coverage gaps
l Automatic    C++ to Verilog translation
   –   single design database
   –   reduce translation errors
   –   faster turnaround of design changes
   –   risk: untested methodology
l Using   industry-standard synthesis tools
l IBM   ASIC process (Cu11)
Piranha Methodology: Overview

[diagram: C++ RTL Models –(CLevel)→ Verilog Models → Physical Design;
 cxx compiles the C++ models into PS1, and PS1V co-simulates the C++ and Verilog versions]

• C++ RTL Models: cycle accurate and “synthesizeable”
• PS1: Fast (C++) Logic Simulator
• Verilog Models: machine translated from C++ models
• Physical Design: leverages industry-standard Verilog-based tools
• PS1V: can “co-simulate” C++ and Verilog module versions and check correspondence
• cxx: C++ compiler
• CLevel: C++-to-Verilog Translator
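
For flavour, a minimal sketch of the modeling discipline, written here in C for consistency with the other examples (the real Piranha flow used stylized, machine-translatable C++, which this does not reproduce): explicit registered state plus one function evaluated per clock edge, so the model is cycle accurate and maps naturally onto RTL.

  #include <stdint.h>
  #include <stdio.h>

  struct counter_rtl {
      uint8_t count_q;                    /* registered state ("flip-flops")      */
      uint8_t enable_in;                  /* input port, sampled every cycle      */
  };

  /* next-state logic + register update, called once per simulated clock edge */
  void counter_clock_edge(struct counter_rtl *m)
  {
      if (m->enable_in)
          m->count_q = (uint8_t)(m->count_q + 1);
  }

  int main(void)
  {
      struct counter_rtl m = { .count_q = 0, .enable_in = 1 };
      for (int cycle = 0; cycle < 5; cycle++) {           /* drive 5 clocks       */
          counter_clock_edge(&m);
          printf("cycle %d: count = %u\n", cycle, m.count_q);
      }
      return 0;
  }
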
Summary
l CMP    architectures are inevitable in the near future
l Piranha   investigates an extreme point in CMP design
   – many simple cores
l Piranha has a large architectural advantage over complex
 single-core designs (> 3x) for database applications
l Piranha   methodology enables faster design turnaround
l Key   to Piranha is application focus:
   – One-size-fits-all solutions may soon be infeasible
Reference
l Papers   on commercial workload performance & Piranha
   research.compaq.com/wrl/projects/Database
TDT 4260 – lecture 11/3 - 2011
• Miniproject status, update, presentation
• Synchronization, Textbook Chap 4.5
  – And a short note on BSP (with excellent timing …)
• Short presentation of NUTS, NTNU Test Satellite System
  http://nuts.iet.ntnu.no/
• UltraSPARC T1 (Niagara), Chap 4.8
• And more on multicores

Miniproject – after the first deadline

  Implementing 1           Comparison of 2 or            Improving on
  existing prefetcher      more existing prefetchers     existing prefetcher
  ---------------------    --------------------------    -------------------------------
  Sequential prefetcher    RPT and DCPT                  Improving sequential prefetcher
  RPT prefetcher           Sequential (tagged or         Improving DCPT
                           adaptive), RPT and DCPT

Miniproject – after the first deadline
• Feedback
  – RPT and DCPT are a popular choice; the report should properly motivate
    each group's choice of prefetcher (the motivation should not be:
    “The code was easily available”)
  – Several groups work on similar methods
     • “find your story”
  – too much focus on getting the highest result in the PfJudge ranking;
    as stated in section 2.3 of the guidelines, the miniproject will be
    evaluated based on the following criteria:
     • good use of language
     • clarity of the problem statement
     • overall document structure
     • depth of understanding for the field of prefetching
     • quality of presentation

Miniproject presentations
• Friday 15/4 at 1415-1700 (max)
• OK for all?
  – No … we are working on finding a time schedule that is OK for all

IDI Open, a challenge for you?

Synchronization
• Important concept
  – Synchronize access to shared resources
  – Order events from cooperating processes correctly
• Smaller MP systems
  – Implemented by uninterrupted instruction(s) atomically accessing a value
  – Requires special hardware support
  – Simplifies construction of OS / parallel apps
• Larger MP systems → Appendix H (not in course)

Atomic exchange (swap)
• Swaps value in register for value in memory
  – Mem = 0 means not locked, Mem = 1 means locked
  – How does this work?
     • Register <= 1 ; Processor wants to lock
     • Exchange(Register, Mem)
  – If Register = 0 → Success
     • Mem was = 0 → Was unlocked
     • Mem is now = 1 → Now locked
  – If Register = 1 → Fail
     • Mem was = 1 → Was locked
     • Mem is now = 1 → Still locked
• Exchange must be atomic!

Implementing atomic exchange (1/2)
• One alternative: Load Linked (LL) and Store Conditional (SC)
  – Used in sequence
     • If memory location accessed by LL changes, SC fails
     • If context switch between LL and SC, SC fails
  – Implemented using a special link register
     • Contains address used in LL
     • Reset if matching cache block is invalidated or if we get an interrupt
     • SC checks if link register contains the same address. If so, we have
       atomic execution of LL & SC

Implementing atomic exchange (2/2)
• Example code EXCH (R4, 0(R1)):

  try:  MOV   R3, R4       ; mov exchange value
        LL    R2, 0(R1)    ; load linked
        SC    R3, 0(R1)    ; store conditional
        BEQZ  R3, try      ; branch if SC failed
        MOV   R4, R2       ; put load value in R4

• This can now be used to implement e.g. spin locks

           DADDUI  R2, R0, #1     ; R0 always = 0
  lockit:  EXCH    R2, 0(R1)      ; atomic exchange
           BNEZ    R2, lockit     ; already locked?

Barrier sync. in BSP
• The BSP-model
  – Leslie G. Valiant, A bridging model for parallel computation, [CACM 1990]
  – Computations organised in supersteps
  – Algorithms adapt to compute platform represented through 4 parameters
  – Helps the combination of portability & performance
  http://www.seas.harvard.edu/news-events/press-releases/valiant_turing
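
The same spin-lock idea as the EXCH loop above, as a minimal sketch in C11 (an illustration, not code from the textbook): atomic_exchange is the portable equivalent of the atomic exchange primitive, and on an LL/SC machine the compiler lowers it to a retry loop much like the assembly shown.

  #include <stdatomic.h>

  typedef atomic_int spinlock_t;          /* 0 = unlocked, 1 = locked            */

  void lock(spinlock_t *l)
  {
      /* keep swapping in 1 until the old value comes back as 0 */
      while (atomic_exchange(l, 1) != 0)
          ;                               /* spin: the lock was already held     */
  }

  void unlock(spinlock_t *l)
  {
      atomic_store(l, 0);                 /* release the lock                    */
  }
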




Multicore
• Important and early example: UltraSPARC T1
• Motivation (See lecture 1)
  – In all market segments from mobile phones to supercomputers
  – End of Moore's law for single-core
  – The power wall
  – The memory wall
  – The bandwidth problem
  – ILP limitations
  – The complexity wall

Why multicores?

Chip Multithreading
                                                               Opportunities and challenges
                                                               • Paper by Spracklen & Abraham, HPCA-11 (2005)
                                                                 [SA05]
                                                               • CMT processors = Chip Multi-Threaded processors
                                                               • A spectrum of processor architectures
                                                                  – Uni-processors with SMT (one core)
                                                                   – (pure) Chip Multiprocessors (CMP) (one thread per core)
                                                                  – Combination of SMT and CMP (They call it CMT)
                                                                      • Best suited to server workloads (with high TLP)








Offchip Bandwidth
• A bottleneck
• Bandwidth increasing, but also latency [Patt04]
• Need more than 100 in-flight requests to fully utilize the available bandwidth

Sharing processor resources
• SMT
  – Hardware strand
     • ”HW for storing the state of a thread of execution”
     • Several strands can share resources within the core, such as execution
       resources
        – This improves utilization of processor resources
        – Reduces applications' sensitivity to off-chip misses
           • Switch between threads can be very efficient
• (pure) CMP
  – Multiple cores can share chip resources such as memory controller,
    off-chip bandwidth and L2 cache
  – No sharing of HW resources between strands within core
• Combination (CMT)




1st generation CMT
• 2 cores per chip
• Cores derived from earlier uniprocessor designs
• Cores do not share any resources, except off-chip data paths
• Examples: Sun’s Gemini, Sun’s UltraSPARC IV (Jaguar), AMD dual-core Opteron,
  Intel dual-core Itanium (Montecito), Intel dual-core Xeon (Paxville, server)

2nd generation CMT
• 2 or more cores per chip
• Cores still derived from earlier uniprocessor designs
• Cores now share the L2 cache
  – Speeds inter-core communication
  – Advantageous as most commercial applications have significant instruction
    footprints
• Examples: Sun’s UltraSPARC IV+, IBM’s Power 4/5

3rd generation CMT
 • CMT processors are best
   designed from the
   ground-up, optimized for a
   CMT design point
    – Lower power consumption
 • Multiple cores per chip
 • Examples:
    – Sun’s Niagara (T1)
        • 8 cores, each is 4-way SMT
        • Each core single-issue, short
          pipeline
        • Shared 3MB L2-cache
    – IBM’s Power-5
        • 2 cores, each 2-way SMT

Multicore generations (?)




CMT/Multicore design space
• Number of cores
  – Multiple simple or few complex?
     • Recent paper of Hill & Marty …
        – See http://www.youtube.com/watch?v=KfgWmQpzD74
  – Heterogeneous cores
     • Serial fraction of parallel application
        – Remember Amdahl’s law
     • One powerful core for single-threaded applications
• Resource sharing
  – L2 cache! (and L3)
     • (Terminology: LL = Last Level cache)
  – Floating point units
  – New more expensive resources (amortized over multiple cores)
     • Shadow tags, more advanced cache techniques, HW accelerators,
       cryptographic and OS functions (e.g. memcopy), XML parsing, compression
        – Your innovation !!!

CMT/Multicore challenges
• Multiple threads (strands) share resources
  – Maximize overall performance
     • Good resource utilization
     • Avoid ”starvation” (units without work to do)
  – Cores must be ”good neighbours”
     • Fairness, research by Magnus Jahre
     • See http://research.idi.ntnu.no/multicore/pub
• Prefetching
  – Aggressive prefetching is OK in a single-thread system since the entire
    system is idle on a miss
  – CMT/Multicore requires more careful prefetching
     • Prefetch operation may take resources used by other threads
  – See research by Marius Grannæs (same link as above)
• Speculative operations
  – OK if using idle resources (delay until resource is idle)
  – More careful (just as prefetching) / seldom power efficient
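
The serial-fraction point above is just Amdahl's law (the standard formula, not from the slide): speedup = 1 / ((1 - f) + f/n) for parallel fraction f on n cores. With f = 0.95 and n = 16: 1 / (0.05 + 0.95/16) = 1 / 0.109 ≈ 9.1, which is why one powerful core for the serial part can matter more than a few extra simple cores.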




UltraSPARC T1 (“Niagara”)
• Target: Commercial server applications
  – High thread level parallelism (TLP)
     • Large numbers of parallel client requests
  – Low instruction level parallelism (ILP)
     • High cache miss rates
     • Many unpredictable branches
• Power, cooling, and space are major concerns for data centers
• Metric: (Performance / Watt) / Sq. Ft.
• Approach: Multicore, Fine-grain multithreading, Simple pipeline, Small L1
  caches, Shared L2

T1 processor – ”logical” overview
[figure: T1 block diagram; 1.2 GHz at 72W typical, 79W peak power consumption]

T1 Architecture
• Also ships with 6 or 4 cores

T1 pipeline / 4 threads
• Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
• Shared units:
  – L1 cache, L2 cache
  – TLB
  – Exec. units
  – pipe registers
• Separate units:
  – PC
  – instruction buffer
  – reg file
  – store buffer
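
A minimal sketch of the fine-grain multithreading idea behind the thread-select (S) stage, not Sun's implementation: each cycle the core picks the next ready thread round-robin and skips threads that are stalled (e.g. waiting on a miss).

  #include <stdbool.h>

  #define THREADS 4

  struct core {
      bool ready[THREADS];                /* false while waiting on a miss etc.   */
      int  last;                          /* thread issued in the previous cycle  */
  };

  /* return the thread to issue from this cycle, or -1 if none is ready */
  int thread_select(struct core *c)
  {
      for (int i = 1; i <= THREADS; i++) {
          int t = (c->last + i) % THREADS;      /* rotate the starting point      */
          if (c->ready[t]) {
              c->last = t;
              return t;
          }
      }
      return -1;                                /* every thread is "not ready"    */
  }
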





Miss Rates: L2 Cache Size, Block Size (fig. 4.27)
[figure: T1 L2 miss rate (0–2.5%) for TPC-C and SPECJBB, for L2 configurations
 of 1.5 MB, 3 MB and 6 MB with 32B and 64B blocks]

Miss Latency: L2 Cache Size, Block Size (fig. 4.28)
[figure: T1 L2 miss latency (0–200) for TPC-C and SPECJBB, for the same L2
 configurations of 1.5 MB, 3 MB and 6 MB with 32B and 64B blocks]

                                                                                                                Average thread status (fig 4.30)
CPI Breakdown of Performance

Benchmark     Per-thread CPI   Per-core CPI   Effective CPI   Effective IPC
                                                for 8 cores     for 8 cores
TPC-C                   7.20           1.80            0.23             4.4
SPECJBB                 5.60           1.40            0.18             5.7
SPECWeb99               6.60           1.65            0.21             4.8
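
How the columns relate (a quick check, assuming the T1's 4 threads per core and 8 cores): per-core CPI ≈ per-thread CPI / 4, effective CPI for 8 cores ≈ per-core CPI / 8, and effective IPC is the reciprocal. For TPC-C: 7.20 / 4 = 1.80; 1.80 / 8 = 0.225 ≈ 0.23; 1 / 0.225 ≈ 4.4.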








Not Ready Breakdown (fig 4.31)
[figure: fraction of cycles a thread is not ready (0–100%) for TPC-C, SPECJBB and
 SPECWeb99, broken into L1 I miss, L1 D miss, L2 miss, pipeline delay and Other]
• Other = ?
  – TPC-C - store buffer full is largest contributor
  – SPEC-JBB - atomic instructions are largest contributor
  – SPECWeb99 - both factors contribute

Performance Relative to Pentium D
[figure: performance relative to Pentium D (0–6.5) for +Power5, Opteron and Sun T1
 on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05 and TPC-like]

Performance/mm2, Performance/Watt
[figure: efficiency normalized to Pentium D (0–5.5) for +Power5, Opteron and
 Sun T1, shown as SPECIntRate, SPECFPRate, SPECJBB05 and TPC-C per mm^2 and
 per Watt]

Cache Coherency
     And
 Memory Models
Review
● Does pipelining help instruction latency?
● Does pipelining help instruction throughput?

● What is Instruction Level Parallelism?

● What are the advantages of OoO machines?

● What are the disadvantages of OoO machines?

● What are the advantages of VLIW?

● What are the disadvantages of VLIW?

● What is an example of Data Spatial Locality?

● What is an example of Data Temporal Locality?

● What is an example of Instruction Spatial Locality?

● What is an example of Instruction Temporal Locality?

● What is a TLB?

● What is a packet switched network?
Memory Models (Memory Consistency)

  Memory Model: The system supports a given model if
  operations on memory follow specific rules. The data
  consistency model specifies a contract between
  programmer and system, wherein the system guarantees
  that if the programmer follows the rules, memory will be
  consistent and the results of memory operations will be
  predictable.
Memory Models (Memory Consistency)

  Memory Model: The system supports a given model if
  operations on memory follow specific rules. The data
  consistency model specifies a contract between
  programmer and system, wherein the system guarantees
  that if the programmer follows the rules, memory will be
  consistent and the results of memory operations will be
  predictable.


                    Huh??????
Sequential Consistency?
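
A hedged illustration (not from the slides) of the kind of question a memory model answers: the classic "store buffering" test, written with pthreads. The program is deliberately racy; the point is which (r1, r2) outcomes a given model allows.

  #include <pthread.h>
  #include <stdio.h>

  /* deliberately racy: which (r1, r2) results may a run produce? */
  int x = 0, y = 0;
  int r1, r2;

  void *t0(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
  void *t1(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

  int main(void)
  {
      pthread_t a, b;
      pthread_create(&a, NULL, t0, NULL);
      pthread_create(&b, NULL, t1, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      /* Sequential consistency: some interleaving of the four accesses must put
       * one store before the other thread's load, so r1 == 0 && r2 == 0 is
       * impossible. Weaker models (and real hardware with store buffers, or an
       * optimizing compiler) do allow the 0/0 outcome. */
      printf("r1=%d r2=%d\n", r1, r2);
      return 0;
  }
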
Simple Case
●
    Consider a simple two processor system

                          Memory


                         Interconnect


                      CPU 0        CPU 1

●   The two processors are coherent
    ● Programs running in parallel may communicate via

      memory addresses
    ● Special hardware is required in order to enable

      communication via memory addresses.
    ● Shared memory addresses are the standard form of

      communication for parallel programming
Simple Case
●
    CPU 0 wants to send a data word to CPU 1

                          Memory


                         Interconnect


                      CPU 0        CPU 1




●   What does the code look like ???
Simple Case
●
    CPU 0 wants to send a data word to CPU 1

                          Memory


                         Interconnect


                      CPU 0        CPU 1




●   What does the code look like ???

    ● Code on CPU0 writes a value to an address
    ● Code on CPU1 reads the address to get the new value
Simple Case
int shared_flag = 0;
int shared_value = 0;
                                       Memory
void sender_thread()
{
   shared_value = 42;                 Interconnect
   shared_flag = 1;
}                                  CPU 0        CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                                                      Memory
void sender_thread()
{
   shared_value = 42;                                Interconnect
   shared_flag = 1;
}                                               CPU 0          CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                           Sender writes to           Memory
void sender_thread()       the shared data,
                           then sets a
{                          shared data flag
   shared_value = 42;      that the receiver         Interconnect
   shared_flag = 1;        is polling
}                                               CPU 0          CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                           Sender writes to             Memory
void sender_thread()       the shared data,
                           then sets a
{                          shared data flag
   shared_value = 42;      that the receiver           Interconnect
   shared_flag = 1;        is polling
}                                                 CPU 0           CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }            Receiver is polling on the flag. When
                                           the flag is no longer zero, the
   int new_value = shared_value;           receiver reads the shared_value and
   printf("%i\n", new_value);              prints it out.
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                           Sender writes to             Memory
void sender_thread()       the shared data,
                           then sets a
{                          shared data flag
   shared_value = 42;      that the receiver           Interconnect
   shared_flag = 1;        is polling
}                                                 CPU 0           CPU 1
               Any Problems???
void receiver_thread()
{
   while (shared_flag == 0) { }            Receiver is polling on the flag. When
                                           the flag is no longer zero, the
   int new_value = shared_value;           receiver reads the shared_value and
   printf("%i\n", new_value);              prints it out.
}
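
One possible fix, as a minimal sketch assuming C11 atomics and pthreads (names mirror the slide, but this is not the course's code): making the flag an atomic with release/acquire ordering guarantees that the write of shared_value is visible before the flag is seen as set.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  int shared_value = 0;
  atomic_int shared_flag = 0;

  void *sender_thread(void *arg)
  {
      (void)arg;
      shared_value = 42;                               /* ordinary write           */
      atomic_store_explicit(&shared_flag, 1,
                            memory_order_release);     /* publish it               */
      return NULL;
  }

  void *receiver_thread(void *arg)
  {
      (void)arg;
      while (atomic_load_explicit(&shared_flag,
                                  memory_order_acquire) == 0)
          ;                                            /* poll the flag            */
      printf("%i\n", shared_value);                    /* guaranteed to print 42   */
      return NULL;
  }

  int main(void)
  {
      pthread_t s, r;
      pthread_create(&r, NULL, receiver_thread, NULL);
      pthread_create(&s, NULL, sender_thread, NULL);
      pthread_join(s, NULL);
      pthread_join(r, NULL);
      return 0;
  }
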
Simple CMP Cache Coherency

Directory   Directory   Directory   Directory   ●
                                                 Four core machine supporting
                                                cache coherency
L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                ●Each core has a local L1 Data
                                                and Instruction cache.
               Interconnect                     ●The L2 cache is shared
                                                amongst all cores, and
                                                physically distributed into 4
   L1          L1          L1          L1
                                                disparate banks
  CPU 0       CPU 1       CPU 2       CPU 3
                                                ●The interconnect sends
                                                memory requests and
                                                responses back and forth
                                                between the caches
The Coherency Problem
  Directory   Directory   Directory   Directory

  L2 Bank     L2 Bank     L2 Bank     L2 Bank



                 Interconnect


     L1          L1          L1          L1

    CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X
The Coherency Problem
        Directory   Directory   Directory   Directory
                                                        ●
                                                            Misses in Cache
        L2 Bank     L2 Bank     L2 Bank     L2 Bank



                       Interconnect

Miss!
           L1          L1          L1          L1

          CPU 0       CPU 1       CPU 2       CPU 3


        Ld R1,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●
                                                      Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                  ● Goes to “home”
                                                  l2 (home often
                                                  determined by
                 Interconnect                     hash of address)

     L1          L1          L1          L1

    CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X
To
The Coherency Problem                                            Memory


  Directory   Directory   Directory   Directory
                                                  ●
                                                      Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                  ● Goes to “home”
                                                  l2 (home often
                                                  determined by
                 Interconnect                     hash of address)

                                                  ●If miss at home
     L1          L1          L1          L1       L2, read data from
    CPU 0       CPU 1       CPU 2       CPU 3      memory


  Ld R1,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●
                                                      Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                  ● Goes to “home”
                                                  l2 (home often
                                                  determined by
                 Interconnect                     hash of address)

                                                  ●If miss at home
     L1          L1          L1          L1       L2, read data from
    CPU 0       CPU 1       CPU 2       CPU 3      memory

                                                  ●Deposit data in
  Ld R1,X                                         both home L2 and
                                                  Local L1
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                   ●
                                                       Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                   ● Goes to “home”
                                                   l2 (home often
                                                   determined by
                 Interconnect                      hash of address)

                                                   ●If miss at home
     L1          L1          L1          L1        L2, read data from
    CPU 0       CPU 1       CPU 2       CPU 3       memory

                                                   ●Deposit data in
  Ld R1,X                                          both home L2 and
                                                   Local L1


                       Mem(X) is now in both the
                       L2 and ONE L1 cache
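
A hedged sketch (invented helper names, not the actual hardware) of the home-L2 side of the read-miss path just walked through: fetch from memory if the home bank also misses, fill the requesting L1, and record the new sharer in the directory.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct dir_entry {
      bool    valid;                      /* line present in this home L2 bank    */
      uint8_t sharers;                    /* bit i set => CPU i's L1 holds a copy */
  };

  /* stand-ins for the hardware actions in the walkthrough */
  void fetch_from_memory(uint64_t line) { printf("fetch %llx from memory\n", (unsigned long long)line); }
  void fill_l1(int cpu, uint64_t line)  { printf("fill CPU %d's L1 with %llx\n", cpu, (unsigned long long)line); }

  void dir_handle_read_miss(struct dir_entry *d, int requester, uint64_t line)
  {
      if (!d->valid) {                          /* miss at the home L2 as well    */
          fetch_from_memory(line);              /* read the block from memory     */
          d->valid   = true;
          d->sharers = 0;
      }
      fill_l1(requester, line);                 /* data into the requesting L1    */
      d->sharers |= (uint8_t)(1u << requester); /* remember the new sharer        */
  }
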
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     same address



                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                     ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank        same address

                                                     ●    Miss in L1

                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3      Miss!


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     same address

                                                  ●   Miss in L1

                 Interconnect
                                                  ●   Sends request to L2

                                                  ●   Hits in L2
     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     same address

                                                  ●   Miss in L1

                 Interconnect
                                                  ●   Sends request to L2

                                                  ●   Hits in L2
     L1          L1          L1          L1       ●Data is placed in L1
   CPU 0       CPU 1       CPU 2       CPU 3      cache for CPU 3


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● CPU now STORES
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     to address X



                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X
                              What happens?????
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● CPU now STORES
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     to address X



                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X                 Special hardware is needed in
                              order to either update or
                              invalidate the data in CPU 3's
                              cache
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● For this example, we
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     will assume a
                                                  directory based
                                                  invalidate protocol,
                                                  with write-thru L1
                 Interconnect                     caches


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● Store updates the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     local L1 and writes-
                                                  thru to the L2


                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory    0, 3     Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1

   CPU 0       CPU 1      CPU 2      CPU 3


  Ld R1,X                           Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory    0, 3     Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1       ●The data in CPU3's
   CPU 0       CPU 1      CPU 2      CPU 3      cache is invalidated


  Ld R1,X                           Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory     0       Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1       ●The data in CPU3's
   CPU 0       CPU 1      CPU 2      CPU 3      cache is invalidated

                                                ●The L2 cache is
                                                updated with the new
  Ld R1,X                           Ld R2,X
                                                value


  Store R2, X
The Coherency Problem
  Directory   Directory     0       Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1       ●The data in CPU3's
   CPU 0       CPU 1      CPU 2      CPU 3      cache is invalidated

                                                ●The L2 cache is
                                                updated with the new
  Ld R1,X                           Ld R2,X
                                                value

                                                ● The system is now
  Store R2, X                                   “coherent”

                                                ● Note that CPU3 was
                                                removed from the
                                                directory
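
The sequence above (directory lookup at the home L2, invalidation of the other sharer, L2 update) can be summarized in C. This is only a sketch of a directory-based invalidate protocol with write-through L1 caches; the structure and helper names are invented for the example, not taken from a real design:

#include <stdint.h>

#define NUM_CPUS 4

/* One directory entry per L2 line: which L1 caches hold a copy. */
struct dir_entry {
    uint8_t sharers;     /* bit i set => CPU i's L1 has the line */
};

/* Stand-in hooks; in real hardware these are coherence messages. */
static void l1_invalidate(int cpu, uint64_t addr) { (void)cpu; (void)addr; }
static void l2_write(uint64_t addr, uint64_t data) { (void)addr; (void)data; }

/* Handle a write-through store arriving at the home L2 bank:
 * invalidate every other sharer, update the L2 copy, and keep
 * only the writer in the sharer list (as in the slides). */
static void home_l2_handle_store(struct dir_entry *e, int writer,
                                 uint64_t addr, uint64_t data)
{
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        if (cpu != writer && (e->sharers & (1u << cpu))) {
            l1_invalidate(cpu, addr);     /* e.g. CPU 3 loses its copy    */
            e->sharers &= ~(1u << cpu);   /* remove it from the directory */
        }
    }
    l2_write(addr, data);                 /* L2 gets the new value        */
    e->sharers |= (1u << writer);         /* writer still holds it in L1  */
}

int main(void)
{
    struct dir_entry x = { .sharers = (1u << 0) | (1u << 3) }; /* CPUs 0 and 3 share X */
    home_l2_handle_store(&x, /*writer=*/0, /*addr=*/0x1000, /*data=*/7);
    /* x.sharers now contains only CPU 0, matching the slides */
    return 0;
}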
Ordering
Directory   Directory   Directory   Directory
                                                ● Our protocol relies on
L2 Bank     L2 Bank     L2 Bank     L2 Bank     stores writing through
                                                to the L2 cache.


               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory
                                                ● Our protocol relies on
L2 Bank     L2 Bank     L2 Bank     L2 Bank     stores writing through
                                                to the L2 cache.

                                                ● If the stores are to
               Interconnect                     different addresses,
                                                there are multiple
                                                points within the
                                                system where the
   L1          L1          L1          L1
                                                stores may be
 CPU 0       CPU 1       CPU 2       CPU 3      reordered.


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory   Purple leaves the
                                                network first!
L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory   Stores are written
                                                to the shared L2
L2 Bank     L2 Bank     L2 Bank     L2 Bank     out-of-order
                                                (purple first, then
                                                red) !!!

               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory        Stores are written
                                                     to the shared L2
L2 Bank     L2 Bank     L2 Bank     L2 Bank          out-of-order
                                                     (purple first, then
                                                     red) !!!

               Interconnect


   L1          L1          L1          L1
                                                Interconnect is not
 CPU 0       CPU 1       CPU 2       CPU 3      the only cause for
                                                out-of-order!

Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank      L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X
                                Processor core may issue instructions
Store R2, Y                     out-of-order (remember out-of-order
                                machines??)
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect
                                                L2 pipeline may also
                                                reorder requests to
   L1          L1          L1          L1       different addresses

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
L2 Pipeline Ordering

 Retry Fifo       Resource
                  Allocation
                               L2 Tag   L2 Data   Coherence
                     And
                               Access   Access     Control
From Network       Conflict
                  Detection
L2 Pipeline Ordering

 Retry Fifo        Resource
                   Allocation
                                 L2 Tag   L2 Data   Coherence
                      And
                                 Access   Access     Control
From Network        Conflict
                   Detection




         Two Memory Requests
         arrive on the network
L2 Pipeline Ordering

 Retry Fifo        Resource
                   Allocation
                                 L2 Tag   L2 Data   Coherence
                      And
                                 Access   Access     Control
From Network        Conflict
                   Detection




         Requests Serviced in-
         order
L2 Pipeline Ordering

 Retry Fifo         Resource
                    Allocation
                                 L2 Tag   L2 Data   Coherence
                       And
                                 Access   Access     Control
From Network        Conflict     Conflict!
                   Detection




         Conflicts are sent to
         retry fifo
L2 Pipeline Ordering

 Retry Fifo         Resource
                    Allocation
                                 L2 Tag   L2 Data   Coherence
                       And
                                 Access   Access     Control
From Network         Conflict
                    Detection




         Network is given
         priority
L2 Pipeline Ordering

 Retry Fifo       Resource
                  Allocation
                               L2 Tag         L2 Data   Coherence
                     And
                               Access         Access     Control
From Network       Conflict
                  Detection




                   Requests are now
                   executing in a different
                   order!
L2 Pipeline Ordering

 Retry Fifo       Resource
                  Allocation
                               L2 Tag         L2 Data   Coherence
                     And
                               Access         Access     Control
From Network       Conflict
                  Detection




                   Requests are now
                   executing in a different
                   order!
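
A toy model of this reordering, assuming only the retry-FIFO behaviour shown in the slides (a conflicting request is parked while newer network requests are given priority); it is purely illustrative, no real pipeline is modelled:

#include <stdio.h>

int main(void)
{
    const char *network[] = { "StA", "StB", "StC" };  /* arrival order          */
    int conflicts[]       = { 1,     0,     0     };  /* StA hits a conflict    */
    const char *retry_fifo[3];
    int rhead = 0, rtail = 0;

    printf("service order:");
    for (int i = 0; i < 3; i++) {
        if (conflicts[i])
            retry_fifo[rtail++] = network[i];   /* park it, try again later */
        else
            printf(" %s", network[i]);          /* network is given priority */
    }
    while (rhead < rtail)
        printf(" %s", retry_fifo[rhead++]);     /* retried request runs last */
    printf("\n");                               /* prints: StB StC StA       */
    return 0;
}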
Simple Case (revisited)
int shared_flag = 0;
int shared_value = 0;
                                       Memory
void sender_thread()
{
   shared_value = 42;                 Interconnect
   shared_flag = 1;
}                                  CPU 0        CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case (revisited)
         Directory   Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank



                         Interconnect


            L1           L1         L1          L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 shared_flag = 1;   1                        new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            Receiver is
                                                             spinning on
                                                             “shared_flag”
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            “shared_value”
                                                             has reset value
                                                             of 0
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0         Store to shared
                                                          value writes-thru
                                                          L1
                   42 Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3        Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank
             0                                   0         Store to
                                                           “shared_flag”
                                                           writes thru L1
                 1   42 Interconnect


            L1           L1         L1          0
                                                L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 shared_flag = 1;   1                        new_value = shared_value;
Simple Case (revisited)
            3        Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank
             0                                   0
                                                           Both stores are
                                                           now sitting in
                 1   42 Interconnect                       the network


            L1           L1         L1          0
                                                L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 shared_flag = 1;   1                        new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0
                                                          Store to
                                                          “shared_flag” is
                   42 Interconnect                        first to leave the
                                                          network

            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0
                                                          1) “shared_flag”
                                                          is updated
                   42 Interconnect
                                                          2) Coherence
                                                          protocol
            L1           L1        L1          0
                                               L1         invalidates copy
                                                          in CPU3
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0


                   42 Interconnect


            L1           L1        L1          L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Receiver that is
             0
             1                                  0         polling now
                                                          misses in the
                                                          cache and sends
                   42 Interconnect                        request to L2!


            L1           L1        L1          L1       Miss!
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Response comes
             0
             1                                  0         back.

                                                          Flag is now set!
                   42 Interconnect
                                                          Time to read the
                                                          “shared_value”!
            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0

                                                          Note that the write
                   42 Interconnect                        to “shared_value”
                                                          is still sitting in the
                                                          network!
            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0


                   42 Interconnect


            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0


                   42 Interconnect


            L1           L1        L1         1
                                              L1 0

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0         Write of “42” to
                                                          “shared_value”
                                                          finally escapes
                   42 Interconnect                        the network, but
                                                          it is TOO LATE!

            L1           L1        L1         1
                                              L1 0

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory     3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0
                                                         Our code doesn't
                                                         always work!
                   42 Interconnect
                                                         WTF???

            L1           L1        L1         1
                                              L1 0

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value; 0
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank     The architecture needs
             0
             1                                  0       to expose ordering
                                                        properties to the
                                                        programmer, so that
                   42 Interconnect                      the programmer may
                                                        write correct code.

            L1           L1        L1         1
                                              L1 0      This is called the
                                                        “Memory Model”
          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Sequential Consistency
 Hardware GUARANTEES that all memory
operations appear in a single global order,
consistent with each thread's program order.

● Benefits
  ● Simplifies programming (our initial code

    would have worked)
● Costs

  ● Hard to implement micro-architecturally

  ● Can hurt performance

  ● Hard to verify
Weak Consistency
 Loads and stores to different addresses may
be re-ordered

● Benefits
  ● Much easier to implement and build

  ● Higher performing

  ● Easy to verify

● Costs

  ● More complicated for the programmer

  ● Requires special “ordering” instructions for

    synchronization
Instructions for Weak Memory
                 Models
●   Write Barrier
    ● Don't issue a write until all preceding writes have completed



●   Read Barrier
    ● Don't issue a read until all preceding reads have completed



●   Memory Barrier
    ● Don't issue a memory operation until all preceding memory

      operations have completed

Etc etc
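
In C, such barriers are normally reached through compiler or library primitives rather than raw instructions. A minimal sketch using C11 fences; the mapping to the write/read/full barriers above is approximate and depends on how the compiler lowers the fences on a weakly ordered machine. In real code the fences sit between the actual loads and stores they are meant to order:

#include <stdatomic.h>

void ordering_examples(void)
{
    /* Release fence: memory operations before the fence are ordered
     * before stores that follow it (the "write barrier" role). */
    atomic_thread_fence(memory_order_release);

    /* Acquire fence: loads before the fence are ordered before memory
     * operations that follow it (the "read barrier" role). */
    atomic_thread_fence(memory_order_acquire);

    /* Full barrier: all earlier memory operations are ordered before
     * all later ones. */
    atomic_thread_fence(memory_order_seq_cst);
}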
Simple Case (write barrier)
int shared_flag = 0;
int shared_value = 0;
                                       Memory
void sender_thread()
{
   shared_value = 42;                 Interconnect
   __write_barrier();
   shared_flag = 1;                CPU 0        CPU 1
}

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
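
For comparison, a sketch of the same producer/consumer pattern written with standard C11 atomics instead of the hypothetical __write_barrier(); a release store on the flag and an acquire load in the spin loop give the ordering the example needs. Compile with something like cc -std=c11 -pthread:

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int shared_flag = 0;
static int shared_value = 0;

static void *sender_thread(void *arg)
{
    (void)arg;
    shared_value = 42;
    /* Release store: the write of 42 is visible before the flag is seen. */
    atomic_store_explicit(&shared_flag, 1, memory_order_release);
    return NULL;
}

static void *receiver_thread(void *arg)
{
    (void)arg;
    /* Acquire load: once the flag is seen, the 42 is guaranteed visible. */
    while (atomic_load_explicit(&shared_flag, memory_order_acquire) == 0) { }
    printf("%i\n", shared_value);   /* always prints 42 */
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver_thread, NULL);
    pthread_create(&s, NULL, sender_thread, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}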
Simple Case (revisited)
         Directory   Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank



                         Interconnect


            L1           L1         L1          L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 __write_barrier();                          new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            Receiver is
                                                             spinning on
                                                             “shared_flag”
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;    1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            “shared_value”
                                                             has reset value
                                                             of 0
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0         Store to shared
                                                          value writes-thru
                                                          L1
                   42 Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
                     3       Directory   Directory   Directory

                  L2 Bank    L2 Bank     L2 Bank     L2 Bank
                      0                                  0         write_barrier
                                                                    prevents issue of
                                                                   “shared_flag = 1”
                            42 Interconnect                        until the
                                                                   “shared_value =
                                                                   42” is complete.
                     L1           L1        L1          0
                                                        L1         This is tracked via
                                                                   acknowledgments
                   CPU 0      CPU 1       CPU 2       CPU 3

          shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                           0
          __write_barrier();                         new_value = shared_value;
          shared_flag = 1;   1
Blocked
Simple Case (revisited)
                     3       Directory   Directory   Directory

                  L2 Bank    L2 Bank     L2 Bank     L2 Bank
                      0                                  42
                                                         0         Write eventually
                                                                   leaves network
                            42 Interconnect


                     L1           L1        L1          0
                                                        L1

                   CPU 0      CPU 1       CPU 2       CPU 3

          shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                           0
          __write_barrier();                         new_value = shared_value;
          shared_flag = 1;   1
Blocked
Simple Case (revisited)
                     3       Directory   Directory   Directory

                  L2 Bank    L2 Bank     L2 Bank     L2 Bank
                      0                                  42
                                                         0         Write is
                                                                   acknowledged
                                  Interconnect


                     L1           L1        L1          0
                                                        L1

                   CPU 0      CPU 1       CPU 2       CPU 3

          shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                           0
          __write_barrier();                         new_value = shared_value;
          shared_flag = 1;   1
  Still
Blocked
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  42
                                                0         Barrier is now
                                                          complete!
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3        Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank
             0                                  42
                                                 0         Store to
                                                           “shared_flag”
                                                           writes thru L1
                 1       Interconnect


            L1           L1         L1          0
                                                L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 __write_barrier();                          new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0
                                                          Store to
                                                          “shared_flag”
                         Interconnect                     leaves the
                                                          network

            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0
                                                          1) “shared_flag”
                                                          is updated
                         Interconnect
                                                          2) Coherence
                                                          protocol
            L1           L1        L1          0
                                               L1         invalidates copy
                                                          in CPU3
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;    1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1          L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Receiver that is
             0
             1                                 42
                                                0         polling now
                                                          misses in the
                                                          cache and sends
                         Interconnect                     request to L2!


            L1           L1        L1          L1       Miss!
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Response comes
             0
             1                                 42
                                                0         back.

                                                          Flag is now set!
                         Interconnect
                                                          Time to read the
                                                          “shared_value”!
            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Correct Code!!!
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1         1 0
                                              L1 42

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Correct Code!!!
             0
             1                                 42
                                                0


                         Interconnect                     What about
                                                          reads.....

            L1           L1        L1         1 0
                                              L1 42

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Weak or Strong?

●The academic community pushed hard for sequential
consistency:

“Multiprocessors Should Support Simple Memory Consistency
Models” Mark Hill, IEEE Computer, August 1998
Weak or Strong?

●The academic community pushed hard for sequential
consistency:

“Multiprocessors Should Support Simple Memory Consistency
Models” Mark Hill, IEEE Computer, August 1998

    WRONG!!!

 Most new architectures support relaxed memory models
(ARM, IA64, TILE, etc). Much easier to implement and verify.
Not a programming issue, because the complexity is hidden
behind a library, and 99.9% of programmers don't have to
worry about these issues!
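
As one example of the complexity being hidden behind a library: if the shared data is protected by an ordinary pthread mutex, the lock and unlock operations already contain the required barriers, so the programmer never issues them explicitly. A minimal sketch, with error handling omitted (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_value = 0;
static int shared_flag  = 0;

static void *sender(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* lock/unlock already contain   */
    shared_value = 42;              /* the needed ordering, so the   */
    shared_flag  = 1;               /* code adds no explicit barrier */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *receiver(void *arg)
{
    (void)arg;
    int done = 0, v = 0;
    while (!done) {
        pthread_mutex_lock(&lock);
        if (shared_flag) { v = shared_value; done = 1; }
        pthread_mutex_unlock(&lock);
    }
    printf("%i\n", v);              /* always prints 42 */
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}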
Break Problem
You are one of P recently arrested prisoners. The warden makes the
following announcement:

"You may meet together today and plan a strategy, but after today
you will be in isolated cells and have no communication with one
another. I have set up a "switch room" which contains a light switch,
which is either on or off. The switch is not connected to anything.
Every now and then, I will select one prisoner at random to enter the
"switch room". This prisoner may throw the switch (from on to off, or
vice-versa), or may leave the switch unchanged. Nobody else will
ever enter this room. Each prisoner will visit the switch room
arbitrarily often. More precisely, for any N, eventually each of you
will visit the switch room at least N times. At any time, any of you
may declare: "we have all visited the switch room at least once." If
the claim is correct, I will set you free. If the claim is incorrect, I will
feed all of you to the sharks."

Devise a winning strategy when you know that the initial state of the
switch is off. Hint: not all prisoners need to do the same thing.
1                                                         2




                                                              Introduction to Green Computing

                                                              • What do we mean by Green Computing?

                                                              • Why Green Computing?
     TDT4260
                                                              • Measuring “greenness”

     Introduction to Green Computing
                                                              • Research into energy consumption reduction
     Asymmetric multicore processors


                                       Alexandru Iordan




3                                                         4



    What do we mean by Green                                  What do we mean by Green
    Computing?                                                Computing?

                                                              The green computing movement is a multifaceted global
                                                              effort to reduce energy consumption and to promote
                                                              sustainable development in the IT world.
                                                              [Patrick Kurp, Green computing in Communications of
                                                              the ACM, 2008]




5                                                         6




    Why Green Computing?                                      Measuring “greenness”

        • Heat dissipation                                    • Non-standard metrics
          problems                                               –   Energy (Joules)
                                                                 –   Power (Watts)
                                                                 –   Energy-per-instructions ( Joules / No. instructions )
        • High energy bills                                      –   Energy-delayN-product ( Joules * secondsN )
                                                                 –   PerformanceN / Watt ( (No. instructions / second)N / Watt )

        • Growing environmental
                                                              • Standard metrics
          impact                                                 – Data centers: Power Usage Effectiveness metric (The Green Grid
                                                                   consortium)
                                                                 – Servers: ssj_ops / Watt metric (SPEC consortium)
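
A tiny worked example of the metrics above; all numbers are invented purely to show the arithmetic (compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Illustrative numbers only. */
    double joules  = 120.0;      /* energy consumed by a run */
    double seconds = 3.0;        /* execution time           */
    double insns   = 6.0e10;     /* instructions executed    */
    double watts   = joules / seconds;

    printf("Power              : %.1f W\n",      watts);
    printf("Energy/instruction : %.2e J\n",      joules / insns);
    printf("EDP  (N=1)         : %.1f J*s\n",    joules * seconds);
    printf("ED2P (N=2)         : %.1f J*s^2\n",  joules * pow(seconds, 2));
    printf("Perf^2 / Watt      : %.2e\n",        pow(insns / seconds, 2) / watts);
    return 0;
}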




8                                                                              9




     Research into energy consumption                                               Maximizing Power Efficiency with
     reduction                                                                      Asymmetric Multicore Systems
                                                                                    Fedorova et al., Communications of the ACM, 2009

                                                                                    • Outline

                                                                                       – Asymmetric multicore processors

                                                                                       – Scheduling for parallel and serial applications

                                                                                       – Scheduling for CPU- and memory-intensive applications




10                                                                             11




     Asymmetric multicore processors                                                Efficient utilization of AMPs
     • What makes a multicore asymmetric?
        – a few powerful cores (high clock freq., complex pipelines, OoO
          execution)                                                                • Efficient mapping of threads/workloads
        – many simple cores (low clock freq., simple pipeline, low power
          requirement)                                                                 – parallel applications
                                                                                           • serial part → complex cores
     • Homogeneous ISA AMP                                                                 • scalable parallel part → simple cores
        – the same binary code can run on both types of cores
                                                                                       – microarchitectural characteristics of workloads
                                                                                           • CPU intensive applications → complex cores
     • Heterogeneous ISA AMP                                                               • memory intensive applications → simple cores
        – code compiled separately for each type of core
        – examples: IBM Cell, Intel Larrabee




12                                                                             13




     Sequential vs. parallel characteristics                                        Parallelism-aware scheduling
     • Sequential programs                                                          • Goal: improve overall system efficiency (not the
        – high degree of ILP                                                          performance of a particular application)
        – can utilize features of a complex core (super-scalar pipeline, OoO
          execution, complex branch prediction)
                                                                                    • Idea: assign sequential applications/phases to run on
     • Parallel programs                                                              the complex cores
        – high number of parallel threads/tasks (compensates for low ILP and
          masks memory delays)
                                                                                    • Does NOT provide fairness
     • Having both complex and simple cores gives AMPs
       applicability for a wider range of applications




14                                                                    15




     Challenges of PA scheduling                                           “Heterogeneity”-aware scheduling

     • Detecting serial and parallel phases
        – limited scalability of threads can yield wrong solutions         • Goal: improve overall system efficiency

     • Thread migration overhead                                           • Idea:
        – migration across memory domains is expensive                        – CPU-intensive applications/phases → complex cores
        – scheduler must be topology aware                                    – memory-intensive applications/phases → simple cores


                                                                           • Inherently unfair
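
Both scheduling ideas reduce to a per-thread (or per-phase) core-type decision. A minimal sketch of that decision, not taken from the Fedorova et al. paper; the fields and the 0.5 threshold are made up for illustration:

#include <stdio.h>

enum core_type { COMPLEX_CORE, SIMPLE_CORE };

struct phase_info {
    int    runnable_threads;    /* 1 => serial phase                    */
    double mem_stall_fraction;  /* fraction of cycles stalled on memory */
};

/* Parallelism-aware + heterogeneity-aware placement in one rule:
 * serial, CPU-intensive phases go to the fast (complex) cores,
 * scalable or memory-bound phases go to the simple cores. */
static enum core_type pick_core(const struct phase_info *p)
{
    if (p->runnable_threads == 1 && p->mem_stall_fraction < 0.5)
        return COMPLEX_CORE;
    return SIMPLE_CORE;
}

int main(void)
{
    struct phase_info serial_cpu = { 1, 0.1 };
    struct phase_info parallel   = { 8, 0.2 };
    printf("%d %d\n", pick_core(&serial_cpu), pick_core(&parallel)); /* 0 1 */
    return 0;
}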




16                                                                    17




     Challenges of HA scheduling                                           Summary
                                                                           • Green Computing focuses on improving energy-
     • Classifying threads/phases as CPU- or memory-                         efficiency and sustainable development in the IT
       bound                                                                 world
        – two approaches presented: direct measurement and modeling
                                                                           • AMPs promise higher energy-efficiency than
     • Long execution time (direct measurement approach)                     symmetric processors
       or need for offline information (modeling approach)
                                                                           • Schedulers must be designed to take advantage of
                                                                             the asymmetric hardware




18                                                                    19




     References
     • Kirk W. Cameron, The road to greener IT pastures
       in IEEE Computer, 2009

     • Dan Herrick and Mark Ritschard, Greening your
       computing technology, the near and far
       perspectives in Proceedings of the 37th ACM
       SIGUCCS, 2009

     • Luiz A. Barroso, The price of performance in ACM
       Queue, 2005




2


                                                                                      Contents


                                                                                        1 Njord Power5+ hardware

                                                                                        2 Kongull AMD Istanbul hardware


                 NTNU HPC Infrastructure                                                3 Resource Managers
                 IBM AIX Power5+, CentOS AMD Istanbul
                                                                                        4 Documentation
                 Jørn Amundsen
                 IDI/NTNU IT
                 2011-03-25







3                                                                                4


     Power5+ hardware                                                                 Cache and memory

                                                                                               • 16 x 64-bit word cache lines (32 in L3)
                                                                                               • Hardware cache line prefetch on loads
              Cache and memory
                                                                                               • Reads from memory are written into L2
              Chip layout
                                                                                               • External L3, acts as a victim cache for L2
              System level                                                                     • L2 and L3 are shared between cores
               TOC
                                                                                               • L1 is write-through
                                                                                               • Cache coherence is maintained system-wide at L2 level
               • 4K page size by default, kernel supports 64K and 16M pages




5                                                                                                                                            6


     Chip design

     [Chip diagram: two power5+ cores on one chip. Each core has
      execution units (2 LSU, 2 FXU, 2 FPU, 1 BXU, 1 CRL), 64-bit
      registers (32 GPR, 32 FPR), decode & schedule logic, a 64K
      2-way L1 I-cache and a 32K 4-way L1 D-cache. The two cores
      share a 1.92M 10-way L2 cache, backed by an external 36M
      12-way L3 cache (35.2 GB/s). A switch fabric and memory
      controller connect the chip to 16-128GB of DDR2 main
      memory (25.6 GB/s).]

     SMT

            • In a typical application, the processor core might be idle 50-80% of
              the time, waiting for memory
            • An obvious solution would be to let another thread execute while our
              thread is waiting for memory
            • This is known as hyper-threading in the Intel/AMD world, and
              Simultaneous Multithreading (SMT) with IBM
            • SMT is supported in hardware throughout the processor core
            • SMT is more efficient than hyper-threading, with less context switch
              overhead
            • Power5 and 6 support 1 thread/core or SMT with 2 threads/core,
              while the latest Power7 supports 4 threads/core
            • SMT is enabled or disabled dynamically on a node with the
              (privileged) command smtctl


SMT (2)

• SMT is beneficial if you are doing a lot of memory references and your
  application performance is memory bound
• Enabling SMT doubles the number of MPI tasks per node, from 16 to 32.
  Requires your application to be sufficiently scalable.
• SMT is only available in user space with batch processing, by adding the
  structured comment string:
     #@ requirements = ( Feature == "SMT" )


Chip module packaging

• 4 chips and 4 L3 caches are HW integrated onto an MCM
• 90.25 cm², 89 layers of metal






The system level

• On a p575 system, a node is 2 MCMs / 8 chips / 16 1.9GHz cores
• The Njord system is
    - 2 x 16-way 32 GiB login nodes
    - 4 x 16-way 16 GiB I/O nodes (used with GPFS)
    - 186 x 16-way 32 GiB compute nodes
    - 6 x 16-way 128 GiB compute nodes
• GPFS parallel file system, 33 TiB fiber disks and 62 TiB SATA disks
• Interconnect
    - IBM Federation, a multistage crossbar network providing 2 GiB/s
      bidirectional bandwidth and 5 µs latency system-wide MPI performance


GPFS

• An important feature of an HPC system is the capability of moving large
  amounts of data from or to memory, across nodes, and from or to permanent
  storage
• In this respect a high-quality, high-performance global file system is
  essential
• GPFS is a robust parallel FS geared at high-BW I/O, used extensively in HPC
  and in the database industry
• Disk access is ≈ 1000 times slower than memory access, hence the key factors
  for performance are
    - spreading (striping) files across many disk units
    - using memory to cache files
    - hiding latencies in software






GPFS and parallel I/O (2)

• High transfer rates are achieved by distributing files in blocks, round
  robin, across a large number of disk units, up to thousands of disks
• On njord, the GPFS block size and stripe unit is 1 MB
• In addition to multiple disks servicing file I/O, multiple threads might
  read, write or update (R+W) a file simultaneously
• GPFS uses multiple I/O servers (4 dedicated nodes on njord), working in
  parallel for performance and maintaining file and file metadata consistency
• High performance comes at a cost. Although GPFS can handle directories with
  millions of files, it is usually best to use fewer and larger files, and to
  access files in larger chunks


File buffering

• The kernel does read-aheads and write-behinds of file blocks
• The kernel does heuristics on I/O to discover sequential and strided forward
  and backward reads
• The disadvantage is memory copying of all data
• Can be bypassed with DIRECT_IO, which can be useful with large (MB-sized)
  I/O, utilizing application I/O patterns

[Figure: data path from the user application through the application buffer
and the kernel's file system buffer to the disk subsystem]
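As an aside, a minimal sketch of bypassing the kernel file buffer
(Linux-flavoured, using O_DIRECT; the DIRECT_IO mentioned above is the
AIX/GPFS counterpart, and the file name and sizes below are made up):

#define _GNU_SOURCE 1      /* for O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
    const size_t chunk = 1 << 20;   /* 1 MB, matching the GPFS stripe unit */
    void *buf = 0;

    /* O_DIRECT requires a suitably aligned user buffer (here: 4 KiB). */
    if (posix_memalign(&buf, 4096, chunk) != 0)
        return 1;

    int fd = open("bigfile.dat", O_RDONLY | O_DIRECT);  /* bypass the page cache */
    if (fd < 0) { free(buf); return 1; }

    ssize_t n = read(fd, buf, chunk);   /* data goes straight into the user buffer */
    (void)n;

    close(fd);
    free(buf);
    return 0;
}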





AMD Istanbul hardware

• Cache and memory
• System level
• TOC


Cache and memory

• 6 x 128 KiB L1 cache
• 6 x 512 KiB L2 cache
• 1 x 6 MiB L3 cache
• 24 or 48 GiB DDR3 RAM






The system level

• A node is 2 chips / 12 2.4GHz cores
• The Kongull system is
    - 1 x 12-way 24 GiB login nodes
    - 4 x 12-way 24 GiB I/O nodes (used with GPFS)
    - 52 x 12-way 24 GiB compute nodes
    - 44 x 12-way 48 GiB compute nodes
• Nodes compute-0-0 – compute-0-39 and compute-1-0 – compute-1-11 are 24 GiB
  @ 800 MHz, while compute-1-12 – compute-1-15 and compute-2-0 – compute-2-39
  are 48 GiB @ 667 MHz bus frequency
• GPFS parallel file system, 73 TiB
• Interconnect
    - A fat tree implemented with HP ProCurve switches, 1 Gb/s from node to
      rack switch, then 10 Gb/s from the rack switch to the top-level switch.
      Bandwidth and latency are left as a programming exercise.


Resource Managers

• Resource Managers
• Njord classes
• Kongull queues
• TOC



Resource Managers

• Need efficient (and fair) utilization of the large pool of resources
• This is the domain of queueing (batch) systems or resource managers
• Administers the execution of (computational) jobs and provides resource
  accounting across users and accounts
• Includes distribution of parallel (OpenMP/MPI) threads/processes across
  physical cores and gang scheduling of parallel execution
• Jobs are Unix shell scripts with batch system keywords embedded within
  structured comments
• Both Njord and Kongull employ a series of queues (classes) administering
  various sets of possibly overlapping nodes with possibly different priorities
• IBM LoadLeveler on Njord, Torque (a development from OpenPBS) on Kongull


Njord job class overview

class      min-max   max nodes   max         description
           nodes     / job       runtime
forecast   1-180     180         unlimited   top priority class dedicated to forecast jobs
bigmem     1-6       4           7 days      high priority 115GB memory class
large      4-180     128         21 days     high priority class for jobs of 64 processors or more
normal     1-52      42          21 days     default class
express    1-186     4           1 hour      high priority class for debugging and test runs
small      1/2       1/2         14 days     low priority class for serial or small SMP jobs
optimist   1-186     48          unlimited   checkpoint-restart jobs



Njord job class overview (2)

• Forecast is the highest priority queue; it suspends everything else
• Beware: node memory (except bigmem) is split in 2, to guarantee available
  memory for forecast jobs
• A C-R job runs at the very lowest priority; any other job will terminate and
  requeue an optimist queue job if there are not enough available nodes
• Optimist class jobs need an internal checkpoint-restart mechanism
• AIX LoadLeveler imposes node job memory limits, e.g. jobs oversubscribing
  available node memory are aborted with an email


LoadLeveler sample jobscript

# @ job_name = hybrid_job
# @ account_no = ntnuXXX
# @ job_type = parallel
# @ node = 3
# @ tasks_per_node = 8
# @ class = normal
# @ ConsumableCpus(2) ConsumableMemory(1664mb)
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ queue

export OMP_NUM_THREADS=2
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER/test
if [ ! -d $w ]; then mkdir -p $w; fi
cd $w
$HOME/a.out
llq -w $LOADL_STEP_ID
exit 0






LoadLeveler sample C-R email (1/2)

Date: Mon, 21 Mar 2011 18:31:37 +0100
From: loadl@hpc.ntnu.no
To: joern@hpc.ntnu.no
Subject: z2rank_s_5

From: LoadLeveler

LoadLeveler Job Step: f05n02io.791345.0
Executable: /home/ntnu/joern/run/z2rank/logs/skipped/z2rank_s_5.job
Executable arguments:
State for machine: f14n06
LoadL_starter: The program, z2rank_s_5.job, exited normally and returned
an exit code of 0.

State for machine: f09n06
State for machine: f13n04
State for machine: f14n04
State for machine: f08n06
State for machine: f12n06
State for machine: f15n07
State for machine: f18n04


LoadLeveler sample C-R email (2/2)

This job step was dispatched to run 18 time(s).
This job step was rejected by Starter 0 time(s).
Submitted at: Mon Mar 21 10:02:56 2011
Started at: Mon Mar 21 18:16:59 2011
Exited at: Mon Mar 21 18:31:37 2011
             Real Time:   0 08:28:41
    Job Step User Time:  16 06:34:29
  Job Step System Time:   0 00:21:15
   Total Job Step Time:  16 06:55:44
     Starter User Time:   0 00:00:19
   Starter System Time:   0 00:00:09
    Total Starter Time:   0 00:00:28





Kongull job queue overview

class      min-max   max nodes   max        description
           nodes     / job       runtime
default    1-52      52          35 days    default queue except IPT, SFI IO and Sintef Petroleum
express    1-96      96          1 hour     high priority queue for debugging and test runs
bigmem     1-44      44          7 days     default queue for IPT, SFI IO and Sintef Petroleum
optimist   1-96      48          28 days    checkpoint-restart jobs

• Oversubscribing node physical memory crashes the node
• This might happen if you do not specify the following in your job script:
     #PBS -lnodes=1:ppn=12
• If all nodes are not reserved, the batch system will attempt to share nodes by default


Documentation

• Njord User Guide: http://docs.notur.no/ntnu/njord-ibm-power-5
• Notur load stats: http://www.notur.no/hardware/status/
• Kongull support wiki: http://hpc-support.idi.ntnu.no/
• Kongull load stats: http://kongull.hpc.ntnu.no/ganglia/


TDT4260 Computer Architecture
                              Mini-Project Guidelines

                                      Alexandru Ciprian Iordan
                                      iordan@idi.ntnu.no

                                            January 10, 2011


1 Introduction

The Mini-Project accounts for 20% of the final grade in TDT4260 Computer Architecture. Your task is
to develop and evaluate a prefetcher using the M5 simulator. M5 is currently one of the most popular
simulators for computer architecture research and has a rich feature set. Consequently, it is a very
complex piece of software. To make your task easier, we have created a simple interface to the memory
system that you can use to develop your prefetcher. Furthermore, you can evaluate your prefetchers by
submitting your code via a web interface. This web interface runs your code on the Kongull cluster with
the default simulator setup. It is also possible to experiment with other parameters, but then you will have
to run the simulator yourself. The web interface, the modified M5 simulator and more documentation
can be found at http://dm-ark.idi.ntnu.no/.
The Mini-Project is carried out in groups of 2 to 4 students. In some cases we will allow students to
work alone. You will be graded based on both a written paper and a short oral presentation.
Make sure you clearly cite the source of information, data and figures. Failure to do so is regarded as
cheating and is handled according to NTNU guidelines. If you have any questions, send an e-mail to
teaching assistant Alexandru Ciprian Iordan (iordan@idi.ntnu.no).


1.1     Mini-Project Goals

The Mini-Project has the following goals:
      • Many computer architecture topics are best analyzed by experiments and/or detailed studies. The
        Mini-Project should provide training in such exercises.
      • Writing about a topic often increases the understanding of it. Consequently, we require that the
        result of the Mini-Project is a scientific paper.


2 Practical Guidelines

2.1     Time Schedule and Deadlines

The Mini-Project schedule is shown in Table 1. If these deadlines collide with deadlines in other subjects,
we suggest that you consider handing in the Mini-Project earlier than the deadline. If you miss the final
deadline, this will reduce the maximum score you can be awarded.



Deadline                           Description
 Friday 21. January                 List of group members delivered to Alexandru Ciprian Ior-
                                    dan (iordan@idi.ntnu.no) by e-mail
 Friday 4. March                    Short status report and an outline of the final report delivered to
                                    Alexandru Ciprian Iordan (iordan@idi.ntnu.no) by e-mail
 Friday 8. April 12:00 (noon)       Final paper deadline. Deliver the paper through It’s Learning. De-
                                    tailed report layout requirements can be found in section 2.2.
 Week 15 (11. - 15. April)          Compulsory 10 minute oral presentations

                                       Table 1: Mini-Project Deadlines


2.2       Paper Layout

The paper must follow the IEEE Transactions style guidelines available here:
http://www.ieee.org/publications_standards/publications/authors/authors_
journals.html#sect2
Both Latex and Word templates are available, but we recommend that you use Latex. The paper must
use a maximum of 8 pages. Failure to comply with these requirements will reduce the maximum score
you can be awarded.
In addition, we will deduct points if:
      • The paper does not have a proper scientific structure. All reports must contain the following sec-
        tions: Abstract, Introduction, Related Work or Background, Prefetcher Description, Methodology,
        Results, Discussion and Conclusion. You may rename the “Prefetcher Description” section to a
        more descriptive title. Acknowledgements and Author biographies are optional.
      • Citations are not used correctly. If you use a figure that somebody else has made, a citation must appear in
        the figure text.
      • NTNU has acquired an automated system that checks for plagiarism. We may run this system on
        your papers so make sure you write all text yourself.


2.3       Evaluation

The Mini-Project accounts for 20% of the total grade in TDT4260 Computer Architecture. Within the
Mini-Project, the report counts 80% and the oral presentation 20%.
The report grade will be based on the following criteria:
      •   Language and use of figures
      •   Clarity of the problem statement
      •   Overall document structure
      •   Depth of understanding for the field of computer architecture
      •   Depth of understanding of the investigated problem
The oral presentation grade will be based on the following criteria:
      •   Presentation structure
      •   Quality and clarity of the slides
      •   Presentation style
      •   If you use more than the provided time, you will lose points.



M5 simulator system
TDT4260 Computer Architecture
    User documentation




   Last modified: November 23, 2010
Contents

1 Introduction
  1.1 Overview
  1.2 Chapter outlines

2 Installing and running M5
  2.1 Download
  2.2 Installation
      2.2.1 Linux
      2.2.2 VirtualBox disk image
  2.3 Build
  2.4 Run
      2.4.1 CPU2000 benchmark tests
      2.4.2 Running M5 with custom test programs
  2.5 Submitting the prefetcher for benchmarking

3 The prefetcher interface
  3.1 Memory model
  3.2 Interface specification
  3.3 Using the interface
      3.3.1 Example prefetcher

4 Statistics

5 Debugging the prefetcher
  5.1 m5.debug and trace flags
  5.2 GDB
  5.3 Valgrind
Chapter 1

Introduction

You are now going to write your own hardware prefetcher, using a modified
version of M5, an open-source hardware simulator system. This modified
version presents a simplified interface to M5’s cache, allowing you to con-
centrate on a specific part of the memory hierarchy: a prefetcher for the
second level (L2) cache.


1.1    Overview

This documentation covers the following:

   • Installing and running the simulator

   • Machine model and memory hierarchy

   • Prefetcher interface specification

   • Using the interface

   • Testing and debugging the prefetcher on your local machine

   • Submitting the prefetcher for benchmarking

   • Statistics


1.2    Chapter outlines

The first chapter gives a short introduction, and contains an outline of the
documentation.



The second chapter starts with the basics: how to install the M5 simulator.
There are two possible ways to install and use it. The first is as a stand-
alone VirtualBox disk-image, which requires the installation of VirtualBox.
This is the best option for those who use Windows as their operating system
of choice. For Linux enthusiasts, there is also the option of downloading a
tarball, and installing a few required software packages.
The chapter then continues to walk you through the necessary steps to
get M5 up and running: building from source, running with command-line
options that enable prefetching, running local benchmarks, compiling and
running custom test-programs, and finally, how to submit your prefetcher
for testing on a computing cluster.
The third chapter gives an overview of the simulated system, and de-
scribes its memory model. There is also a detailed specification of the
prefetcher interface, and tips on how to use it when writing your own
prefetcher. It includes a very simple example prefetcher with extensive com-
ments.
The fourth chapter contains definitions of the statistics used to quantita-
tively measure prefetchers.
The fifth chapter gives details on how to debug prefetchers using advanced
tools such as GDB and Valgrind, and how to use trace-flags to get detailed
debug printouts.




Chapter 2

Installing and running M5

2.1     Download

Download the modified M5 simulator from the PfJudgeβ website.


2.2     Installation

2.2.1   Linux

Software requirements (specific Debian/Ubuntu packages mentioned in paren-
theses):

   • g++ >= 3.4.6

   • Python and libpython >= 2.4 (python and python-dev)

   • Scons > 0.98.1 (scons)

   • SWIG >= 1.3.31 (swig)

   • zlib (zlib1g-dev)

   • m4 (m4)

To install all required packages in one go, issue instructions to apt-get:
sudo apt-get install g++ python-dev scons swig zlib1g-dev m4
The simulator framework comes packaged as a gzipped tarball. Start the ad-
venture by unpacking with tar xvzf framework.tar.gz. This will create
a directory named framework.



2.2.2    VirtualBox disk image

If you do not have convenient access to a Linux machine, you can download
a virtual machine with M5 preconfigured. You can run the virtual machine
with VirtualBox, which can be downloaded from http://www.virtualbox.org.
The virtual machine is available as a zip archive from the PfJudgeβ web-
site. After unpacking the archive, you can import the virtual machine into
VirtualBox by selecting “Import Appliance” in the file menu and opening
“Prefetcher framework.ovf”.


2.3     Build

M5 uses the scons build system: scons -j2 ./build/ALPHA_SE/m5.opt
builds the optimized version of the M5 binaries.
-j2 specifies that the build process should build two targets in parallel. This
is a useful option to cut down on compile time if your machine has several
processors or cores.
The included build script compile.sh encapsulates the necessary build com-
mands and options.


2.4     Run

Before running M5, it is necessary to specify the architecture and parameters
for the simulated system. This is a nontrivial task in itself. Fortunately
there is an easy way: use the included example python script for running
M5 in syscall emulation mode, m5/config/example/se.py. When using
a prefetcher with M5, this script needs some extra options, described in
Table 2.1.
For an overview of all possible options to se.py, do
        ./build/ALPHA_SE/m5.opt common/example/se.py --help
When combining all these options, the command line will look something
like this:
      ./build/ALPHA_SE/m5.opt common/example/se.py --detailed
--caches --l2cache --l2size=1MB --prefetcher=policy=proxy
--prefetcher=on_access=True
This command will run se.py with a default program, which prints out
“Hello, world!” and exits. To run something more complicated, use the



Option                              Description
 --detailed                          Detailed timing simulation
 --caches                            Use caches
 --l2cache                           Use level two cache
 --l2size=1MB                        Level two cache size
 --prefetcher=policy=proxy           Use the C-style prefetcher interface
 --prefetcher=on_access=True         Have the cache notify the prefetcher
                                     on all accesses, both hits and misses
 --cmd                               The program (an Alpha binary) to run

              Table 2.1: Basic se.py command line options.


--cmd option to specify another program. See subsection 2.4.2 about cross-
compiling binaries for the Alpha architecture. Another possibility is to run
a benchmark program, as described in the next section.


2.4.1     CPU2000 benchmark tests

The test_prefetcher.py script can be used to evaluate the performance of
your prefetcher against the SPEC CPU2000 benchmarks. It runs a selected
suite of CPU2000 tests with your prefetcher, and compares the results to
some reference prefetchers.
The per-test statistics that M5 generates are written to
output/<testname-prefetcher>/stats.txt. The statistics most relevant
for hardware prefetching are then filtered and aggregated to a stats.txt
file in the framework base directory.
See chapter 4 for an explanation of the reported statistics.
Since programs often do some initialization and setup on startup, a sample
from the start of a program run is unlikely to be representative for the whole
program. It is therefore desirable to begin the performance tests after the
program has been running for some time. To save simulation time, M5 can
resume a program state from a previously stored checkpoint. The prefetcher
framework comes with checkpoints for the CPU2000 benchmarks taken after
10^9 instructions.
It is often useful to run a specific test to reproduce a bug. To run the
CPU2000 tests outside of test_prefetcher.py, you will need to set the
M5_CPU2000 environment variable. If this is set incorrectly, M5 will give the
error message “Unable to find workload”. To export this as a shell variable,
do


export M5_CPU2000=lib/cpu2000
Near the top of test_prefetcher.py there is a commented-out call to
dry_run(). If this is uncommented, test_prefetcher.py will print the
command line it would use to run each test. This will typically look like
this:
      m5/build/ALPHA_SE/m5.opt --remote-gdb-port=0 -re
--outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
This uses some additional command line options; these are explained in
Table 2.2.

 Option                         Description
 --bench=ammp                   Run one of the SPEC CPU2000 benchmarks.
 --checkpoint-dir=lib/cp        The directory where program checkpoints are stored.
 --at-instruction               Restore at an instruction count.
 --checkpoint-restore=n         The instruction count to restore at.
 --standard-switch              Warm up caches with a simple CPU model,
                                then switch to an advanced model to gather statistics.
 --warmup-insts=n               Number of instructions to run warmup for.
 --max-inst=n                   Exit after running this number of instructions.

            Table 2.2: Advanced se.py command line options.


2.4.2     Running M5 with custom test programs

If you wish to run your self-written test programs with M5, it is necessary to
cross-compile them for the Alpha architecture. The easiest way to achieve
this is to download the precompiled compiler-binaries provided by crosstool
from the M5 website. Install the one that fits your host machine best (32
or 64 bit version). When cross-compiling your test program, you must use
the -static option to enforce static linkage.
To run the cross-compiled Alpha binary with M5, pass it to the script with
the --cmd option. Example:
      ./build/ALPHA_SE/m5.opt configs/example/se.py --detailed
--caches --l2cache --l2size=512kB --prefetcher=policy=proxy
--prefetcher=on_access=True --cmd /path/to/testprogram


2.5     Submitting the prefetcher for benchmarking

First of all, you need a user account on the PfJudgeβ web pages. The
teaching assistant in TDT4260 Computer Architecture will create one for
you. You must also be assigned to a group to submit prefetcher code or
view earlier submissions.
Sign in with your username and password, then click “Submit prefetcher”
in the menu. Select your prefetcher file, and optionally give the submission
a name. This is the name that will be shown in the highscore list, so choose
with care. If no name is given, it defaults to the name of the uploaded file.
If you check “Email on complete”, you will receive an email when the results
are ready. This could take some time, depending on the cluster’s current
workload.
When you click “Submit”, a job will be sent to the Kongull cluster, which
then compiles your prefetcher and runs it with a subset of the CPU2000
tests. You are then shown the “View submissions” page, with a list of all
your submissions, the most recent at the top.
When the prefetcher is uploaded, the status is “Uploaded”. As soon as it is
sent to the cluster, it changes to “Compiling”. If it compiles successfully, the
status will be “Running”. If your prefetcher does not compile, status will
be “Compile error”. Check “Compilation output” found under the detailed
view.
When the results are ready, status will be “Completed”, and a score will be
given. The highest scoring prefetcher for each group is listed on the highscore
list, found under “Top prefetchers” in the menu. Click on the prefetcher
name to go to a more detailed view, with per-test output and statistics.
If the prefetcher crashes on some or all tests, status will be “Runtime error”.
To locate the failed tests, check the detailed view. You can take a look at
the output from the failed tests by clicking on the “output” link found after
each test statistic.
To allow easier exploration of different prefetcher configurations, it is possi-
ble to submit several prefetchers at once, bundled into a zipped file. Each
.cc file in the archive is submitted independently for testing on the cluster.
The submission is named after the compressed source file, possibly prefixed
with the name specified in the submission form.
There is a limit of 50 prefetchers per archive.




Chapter 3

The prefetcher interface

3.1    Memory model
The simulated architecture is loosely based on the DEC Alpha Tsunami
system, specifically the Alpha 21264 microprocessor. This is a superscalar,
out-of-order (OoO) CPU which can reorder a large number of instructions,
and do speculative execution.
The L1 cache is split into a 32kB instruction cache and a 64kB data
cache. Each cache block is 64B. The L2 cache size is 1MB, also with a cache
block size of 64B. The L2 prefetcher is notified on every access to the L2
cache, both hits and misses. There is no prefetching for the L1 cache.
The memory bus runs at 400MHz, is 64 bits wide, and has a latency of 30ns.


3.2    Interface specification
The interface the prefetcher will use is defined in a header file located at
prefetcher/interface.hh. To use the prefetcher interface, you should
include interface.hh by putting the line #include "interface.hh" at
the top of your source file.

 #define                  Value      Description
 BLOCK_SIZE                  64      Size of cache blocks (cache lines) in bytes
 MAX_QUEUE_SIZE             100      Maximum number of pending prefetch requests
 MAX_PHYS_MEM_SIZE       2^28 − 1    The largest possible physical memory address

                       Table 3.1: Interface #defines.

NOTE: All interface functions that take an address as a parameter block-
align the address before issuing requests to the cache.

Function                                     Description
void prefetch_init(void)                     Called before any memory access to let the
                                             prefetcher initialize its data structures
void prefetch_access(AccessStat stat)        Notifies the prefetcher about a cache access
void prefetch_complete(Addr addr)            Notifies the prefetcher about a prefetch load
                                             that has just completed

            Table 3.2: Functions called by the simulator.




Function                                Description
void issue_prefetch(Addr addr)          Called by the prefetcher to initiate a prefetch
int get_prefetch_bit(Addr addr)         Is the prefetch bit set for addr?
int set_prefetch_bit(Addr addr)         Set the prefetch bit for addr
int clear_prefetch_bit(Addr addr)       Clear the prefetch bit for addr
int in_cache(Addr addr)                 Is addr currently in the L2 cache?
int in_mshr_queue(Addr addr)            Is there a prefetch request for addr in
                                        the MSHR (miss status holding register) queue?
int current_queue_size(void)            Returns the number of queued prefetch requests
void DPRINTF(trace, format, ...)        Macro to print debug information.
                                        trace is a trace flag (HWPrefetch),
                                        and format is a printf format string.

    Table 3.3: Functions callable from the user-defined prefetcher.




AccessStat member    Description
Addr pc              The address of the instruction that caused the access
                     (Program Counter)
Addr mem_addr        The memory address that was requested
Tick time            The simulator time cycle when the request was sent
int miss             Whether this demand access was a cache hit or miss

                  Table 3.4: AccessStat members.



The prefetcher must implement the three functions prefetch_init,
prefetch_access and prefetch_complete. The implementation may be
empty.
The function prefetch_init(void) is called at the start of the simulation
to allow the prefetcher to initialize any data structures it will need.
When the L2 cache is accessed by the CPU (through the L1 cache), the func-
tion void prefetch_access(AccessStat stat) is called with an argument
(AccessStat stat) that gives various information about the access.
When the prefetcher decides to issue a prefetch request, it should call
issue_prefetch(Addr addr), which queues up a prefetch request for the
block containing addr.
When a cache block that was requested by issue_prefetch arrives from
memory, prefetch_complete is called with the address of the completed
request as parameter.
Prefetches issued by issue_prefetch(Addr addr) go into a prefetch request
queue. The cache will issue requests from the queue when it is not fetching
data for the CPU. This queue has a fixed size (available as MAX_QUEUE_SIZE),
and when it gets full, the oldest entry is evicted. If you want to check the
current size of this queue, use the function current_queue_size(void).
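To illustrate (a sketch only, not part of the framework), these functions can
be combined to avoid queueing redundant requests; the helper name
maybe_issue_prefetch is made up for this example:

#include "interface.hh"

/*
 * Hypothetical helper: queue a prefetch for pf_addr only if the request
 * queue has room and the block is neither cached nor already in flight.
 */
static void maybe_issue_prefetch(Addr pf_addr)
{
    if (current_queue_size() < MAX_QUEUE_SIZE
        && !in_cache(pf_addr)
        && !in_mshr_queue(pf_addr)) {
        issue_prefetch(pf_addr);
    }
}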


3.3    Using the interface

Start by studying interface.hh. This is the only M5-specific header file
you need to include in your source file. You might want to include standard
header files for things like printing debug information and memory alloca-
tion. Have a look at the supplied example prefetcher (a very simple
sequential prefetcher) to see what it does.
If your prefetcher needs to initialize something, prefetch_init is the place
to do so. If not, just leave the implementation empty.
You will need to implement the prefetch_access function, which the cache
calls when accessed by the CPU. This function takes an argument,
AccessStat stat, which supplies information from the cache: the address
of the executing instruction that accessed the cache, what memory address
was accessed, the cycle tick number, and whether the access was a cache
miss. The block size is available as BLOCK_SIZE. Note that you probably
will not need all of this information for a specific prefetching algorithm.
If your algorithm decides to issue a prefetch request, it must call the
issue_prefetch function with the address to prefetch from as argument.
The cache block containing this address is then added to the prefetch request
queue. This queue has a fixed limit of MAX_QUEUE_SIZE pending prefetch re-
quests. Unless your prefetcher is using a high degree of prefetching, the
number of outstanding prefetches will stay well below this limit.
Every time the cache has loaded a block requested by the prefetcher,
prefetch_complete is called with the address of the loaded block.
Other functionality available through the interface are the functions for get-
ting, setting and clearing the prefetch bit. Each cache block has one such
tag bit. You are free to use this bit as you see fit in your algorithms. Note
that this bit is not automatically set when a block has been prefetched; it
has to be set manually by calling set_prefetch_bit. set_prefetch_bit on
an address that is not in cache has no effect, and get_prefetch_bit on an
address that is not in cache will always return false.
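As an illustration of the tag bit (a sketch under these assumptions, not part
of the official framework), a classic use is tagged sequential prefetching:
mark blocks when they arrive through a prefetch, and trigger the next prefetch
on the first demand hit to a marked block. It only uses the interface
functions described above:

#include "interface.hh"

void prefetch_init(void)
{
    /* Nothing to set up for this sketch. */
}

void prefetch_access(AccessStat stat)
{
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /* Trigger on a demand miss, or on the first hit to a block that was
     * itself brought in by a prefetch (its tag bit is still set). */
    if (stat.miss || get_prefetch_bit(stat.mem_addr)) {
        if (!stat.miss) {
            clear_prefetch_bit(stat.mem_addr);
        }
        if (!in_cache(pf_addr)) {
            issue_prefetch(pf_addr);
        }
    }
}

void prefetch_complete(Addr addr)
{
    /* The block is in cache at this point, so tagging it here takes effect. */
    set_prefetch_bit(addr);
}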
When you are ready to write code for your prefetching algorithm of choice,
put it in prefetcher/prefetcher.cc. When you have several prefetchers,
you may want to make prefetcher.cc a symlink.
The prefetcher is statically compiled into M5. After prefetcher.cc has
been changed, recompile with ./compile.sh. No options needed.




3.3.1   Example prefetcher

/*
 * A sample prefetcher which does sequential one-block lookahead.
 * This means that the prefetcher fetches the next block _after_ the one that
 * was just accessed. It also ignores requests to blocks already in the cache.
 */

#include "interface.hh"


void prefetch_init(void)
{
    /* Called before any calls to prefetch_access. */
    /* This is the place to initialize data structures. */

    DPRINTF(HWPrefetch, "Initialized sequential-on-access prefetcher\n");
}

void prefetch_access(AccessStat stat)
{
    /* pf_addr is now an address within the _next_ cache block */
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /*
     * Issue a prefetch request if a demand miss occurred,
     * and the block is not already in cache.
     */
    if (stat.miss && !in_cache(pf_addr)) {
        issue_prefetch(pf_addr);
    }
}

void prefetch_complete(Addr addr) {
    /*
     * Called when a block requested by the prefetcher has been loaded.
     */
}




Chapter 4

Statistics

This chapter gives an overview of the statistics by which your prefetcher is
measured and ranked.

IPC instructions per cycle. Since we are using a superscalar architecture,
    IPC rates > 1 are possible.

Speedup Speedup is a commonly used proxy for overall performance when
    running benchmark test suites.

        speedup = execution time (no prefetcher) / execution time (with prefetcher)
                = IPC (with prefetcher) / IPC (no prefetcher)
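    For example (made-up numbers): if a prefetcher raises the IPC of a
    benchmark from 0.80 to 0.92, the speedup is 0.92/0.80 = 1.15.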

Good prefetch The prefetched block is referenced by the application be-
    fore it is replaced.

Bad prefetch The prefetched block is replaced without being referenced.

Accuracy Accuracy measures the fraction of the issued prefetches that were
    useful.

        acc = good prefetches / total prefetches
Coverage How many of the potential candidates for prefetches were actu-
    ally identified by the prefetcher?
        cov = good prefetches / cache misses without prefetching

Identified Number of prefetches generated and queued by the prefetcher.




Issued Number of prefetches issued by the cache controller. This can
     be significantly less than the number of identified prefetches, due to
     duplicate prefetches already found in the prefetch queue, duplicate
     prefetches found in the MSHR queue, and prefetches dropped due to
     a full prefetch queue.

Misses Total number of L2 cache misses.

Degree of prefetching Number of blocks fetched from memory in a single
    prefetch request.

Harmonic mean A kind of average used to aggregate each benchmark
    speedup score into a final average speedup.

        Havg = n / (1/x1 + 1/x2 + ... + 1/xn) = n / Σi (1/xi)
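    For example (made-up numbers): speedups of 1.0 and 2.0 on two benchmarks
    give Havg = 2/(1/1.0 + 1/2.0) ≈ 1.33, noticeably lower than the arithmetic
    mean of 1.5.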




Chapter 5

Debugging the prefetcher

5.1    m5.debug and trace flags

When debugging M5 it is best to use binaries built with debugging support
(m5.debug), instead of the standard build (m5.opt). So let us start by
recompiling M5 to be better suited to debugging:
      scons -j2 ./build/ALPHA_SE/m5.debug.
To see in detail what's going on inside M5, one can enable trace
flags, which selectively enable output from specific parts of M5. The most
useful flag when debugging a prefetcher is HWPrefetch. Pass the option
--trace-flags=HWPrefetch to M5:
      ./build/ALPHA_SE/m5.debug --trace-flags=HWPrefetch [...]
Warning: this can produce a lot of output! It might be better to redirect
stdout to file when running with --trace-flags enabled.


5.2    GDB

The GNU Project Debugger gdb can be used to inspect the state of the
simulator while running, and to investigate the cause of a crash. Pass GDB
the executable you want to debug when starting it.
      gdb --args m5/build/ALPHA_SE/m5.debug --remote-gdb-port=0
-re --outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
You can then use the run command to start the executable.


Some useful GDB commands:
 run <args>        Restart the executable with the given command line arguments.
 run               Restart the executable with the same arguments as last time.
 where             Show stack trace.
 up                Move up stack trace.
 down              Move down stack frame.
 print <expr>      Print the value of an expression.
 help              Get help for commands.
 quit              Exit GDB.
GDB has many other useful features; for more information you can consult
the GDB User Manual at http://sourceware.org/gdb/current/onlinedocs/
gdb/.


5.3      Valgrind

Valgrind is a very useful tool for memory debugging and memory leak detec-
tion. If your prefetcher causes M5 to crash or behave strangely, it is useful
to run it under Valgrind and see if it reports any potential problems.
By default, M5 uses a custom memory allocator instead of malloc. This will
not work with Valgrind, since it replaces malloc with its own custom mem-
ory allocator. Fortunately, M5 can be recompiled with NO_FAST_ALLOC=True
to use normal malloc:
        scons NO_FAST_ALLOC=True ./m5/build/ALPHA_SE/m5.debug
To avoid spurious warnings by Valgrind, it can be fed a file with warning
suppressions. To run M5 under Valgrind, use
      valgrind --suppressions=lib/valgrind.suppressions
./m5/build/ALPHA_SE/m5.debug [...]
Note that everything runs much slower under Valgrind.






Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)

Course responsible: Professor Lasse Natvig
Quality assurance of the exam: PhD Jon Olav Hauglid
Contact person during exam: Magnus Jahre

Deadline for examination results: 23rd of June 2009.



                        EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
                        Tuesday 2nd of June 2009
                        Time: 0900 - 1300

Supporting materials: No written or handwritten examination support materials are permitted. A
specified, simple calculator is permitted.

Answering in short sentences makes it easier to cover all exercises within the duration of the exam. The
numbers in parentheses indicate the maximum score for each exercise. We recommend that you start
by reading through all the sub-questions before answering each exercise.

The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.

Exercise 1) Instruction level parallelism (Max 10 points)
a) (Max 5 points) What is the difference between (true) data dependencies and name
   dependencies? Which of the two presents the most serious problem? Explain why such
   dependencies will not always result in a data hazard.

    Solution sketch:
    True data dependency: One instruction reads what an earlier instruction has written (data flows
    between them) (RAW).
    Name dependency: Two instructions use the same register or memory location, but there is no
    flow of data between them. One instruction writes what an earlier instruction has read (WAR) or
    written (WAW) (no data flow).
    True data dependencies are the most serious problem, as name dependencies can be removed by
    register renaming. Also, many pipelines are designed so that name dependencies will not cause a
    hazard.
    A dependency between two instructions only results in a data hazard if the instructions are close
    enough together in the pipeline that overlapping their execution would change the order of
    accesses; whether a dependency becomes a hazard is therefore a property of the pipeline.
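
    As a small illustration (not part of the original solution sketch), the same kinds of
    dependences written with C++ variables standing in for registers:

        int b = 1, c = 2, e = 3, f = 4;
        int a, d;

        void dependence_example()
        {
            a = b + c;   // I1: writes a
            d = a + e;   // I2: reads a              -> true (RAW) dependence on I1
            e = f + 1;   // I3: writes e, read by I2 -> anti (WAR) dependence with I2
            a = f + 2;   // I4: writes a again       -> output (WAW) dependence with I1
        }

    Whether any of these actually cause a hazard depends on how closely the corresponding
    instructions overlap in the pipeline.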

b) (Max 5 points) Explain why loop unrolling can improve performance. Are there any potential
   downsides to using loop unrolling?

    Solution sketch:
    Loop unrolling can improve performance by reducing the loop overhead (e.g. loop overhead
    instructions executed once every 4th element rather than for each element). It also makes it
    possible for scheduling techniques to further improve the instruction order, since instructions
    from different elements (iterations) can now be interchanged. Downsides include increased code
    size, which may lead to more instruction cache misses, and an increased number of registers used.
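
    A brief sketch of the transformation in C++, mirroring the lecture example x[i] = x[i] + s
    (the function names are illustrative):

        // Original: loop overhead (decrement, test, branch) is paid for every element.
        void add_s(double *x, double s, int n)
        {
            for (int i = n; i > 0; i = i - 1)   // indexes x[n] down to x[1], as in the lectures
                x[i] = x[i] + s;
        }

        // Unrolled by 4: overhead paid once per 4 elements, and the four independent
        // statements give the scheduler more freedom. Assumes n is divisible by 4;
        // otherwise a clean-up loop is needed.
        void add_s_unrolled(double *x, double s, int n)
        {
            for (int i = n; i > 0; i = i - 4) {
                x[i]     = x[i]     + s;
                x[i - 1] = x[i - 1] + s;
                x[i - 2] = x[i - 2] + s;
                x[i - 3] = x[i - 3] + s;
            }
        }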



Exercise 2) Multithreading (Max 15 points)
a) (Max 5 points) What are the differences between fine-grained and coarse-grained
   multithreading?

    Solution sketch:
    Fine-grained: Switch between threads after each instruction. Coarse-grained: Switch on costly
    stalls (cache miss).

b) (Max 5 points) Can techniques for instruction level parallelism (ILP) and thread level parallelism
   (TLP) be used simultaneously? Why/why not?

    Solution sketch:
    ILP and TLP can be used simultaneously. TLP looks at parallelism between different threads,
    while ILP looks at parallelism inside a single instruction stream/thread.

c) (Max 5 points) Assume that you are asked to redesign a processor from single threaded to
   simultaneous multithreading (SMT). How would that change the requirements for the caches?
   (I.e., what would you look at to ensure that the caches would not degrade performance when
   moving to SMT)

    Solution sketch:
    Several threads executing at once will lead to increased cache traffic and more cache conflicts.
    Techniques that could help: Increased cache size, more cache ports/banks, higher associativity,
    non-blocking caches.

Exercise 3) Multiprocessors (Max 15 points)
a) (Max 5 points) Give a short example illustrating the cache coherence problem for
   multiprocessors.

    Solution sketch:
    See Figure 4.3 on page 206 of the textbook. (A reads X, B reads X, A stores X; B now has an
    inconsistent, stale value for X.)

b) (Max 5 points) Why does bus snooping scale badly with number of processors? Discuss how
   cache block size could influence the choice between write invalidate and write update.

    Solution sketch:
    Bus snooping relies on a common bus where information is broadcast. As the number of devices
    increases, this shared medium becomes a bottleneck.
    Invalidates are done at cache block level, while updates are done on individual words. False
    sharing coherence misses only appear when using write invalidate with block sizes larger than
    one word. So as the cache block size increases, the number of false sharing coherence misses
    increases, making write update increasingly appealing.
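
    As an illustration of false sharing (not part of the exam text), two threads that update
    different words which happen to lie in the same cache block; under write invalidate the block
    keeps bouncing between the two caches:

        #include <thread>

        struct Counters {
            long a;   // updated only by thread 1
            long b;   // updated only by thread 2, but shares a cache block with a
        };

        Counters counters;

        void worker_a() { for (int i = 0; i < 1000000; i++) counters.a++; }
        void worker_b() { for (int i = 0; i < 1000000; i++) counters.b++; }

        int main()
        {
            std::thread t1(worker_a), t2(worker_b);
            t1.join();
            t2.join();
            return 0;
        }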

c) (Max 5 points) What makes the architecture of UltraSPARC T1 (“Niagara”) different from most
   other processor architectures?

   Solution sketch:
   High focus on TLP, low focus on ILP. Poor single thread performance, but great multithread
   performance. Thread switch on any stall. Short pipeline, in-order, no branch prediction.


Exercise 4) Memory, vector processors and networks (Max 15 points)
a) (Max 5 points) Briefly describe 5 different optimizations of cache performance.

   Solution sketch:
    (1 point per optimization) 6 techniques are listed on page 291 in the textbook, 11 more in section
    5.2 on page 293.

b) (Max 5 points) What makes vector processors fast at executing a vector operation?

   Solution sketch:
    A vector operation can be executed with a single instruction, reducing code size and improving
    instruction cache utilization. Further, the single instruction has none of the loop overhead and
    control dependencies that a scalar processor would have. Hazard checks can also be done per
    vector rather than per element. A vector processor also contains deep pipelines especially
    designed for vector operations.

c) (Max 5 points) Discuss how the number of devices to be connected influences the choice of
   topology.

   Solution sketch:
   This is a classic example of performance vs. cost. Different topologies scale differently with
   respect to performance or cost as the number of devices grows. Crossbar scales performance
   well, but cost badly. Ring or bus scale performance badly, but cost well.


Exercise 5) Multicore architectures and programming (Max 25 points)
a) (Max 6 points) Explain briefly the research method called design space exploration (DSE). When
   doing DSE, explain how a cache sensitive application can be made processor bound, and how it
   can be made bandwidth bound.

   Solution sketch:
    (Lecture 10, slide 4) DSE is to try out different points in an n-dimensional space of possible
    designs, where n is the number of main design parameters, such as the number of cores, core type
    (in-order vs. out-of-order, etc.), cache size, etc. A cache sensitive application can be made
    processor bound by increasing the cache size, and it can be made bandwidth bound by decreasing it.

b) (Max 5 points) In connection with GPU-programming (shader programming), David Blythe uses
   the concept ”computational coherence”. Explain it briefly.

    Solution sketch: See lecture 10, slide 36 and, if needed, the paper.

c) (Max 8 points) Give an overview of the architecture of the Cell processor.

    Solution sketch:
    Not all details of the figure are expected, only the main elements.




    * One main processor (Power architecture, called PPE = Power Processing Element) – this acts as
    a host (master) processor. (Power architecture, 64-bit, in-order two-issue superscalar, SMT
    (simultaneous multithreading). Has a vector media extension (VMX) (Kahle figure 2).)
    * 8 identical SIMD processors (called SPE = Synergistic Processing Element); each of these
    consists of a processing element (SPU, Synergistic Processor Unit) and local storage (LS, 256 KB
    SRAM --- not a cache). On-chip memory controller + bus interface. (Can operate on integers in
    different formats: 8, 16 and 32 bit, and floating point numbers in 32 and 64 bit (64-bit floats in a
    later version).)
    * The interconnect is a ring bus (Element Interconnect Bus, EIB) that connects the PPE and the 8
    SPEs. Two unidirectional busses in each direction. Worst-case latency is half the ring distance;
    it can support up to three simultaneous transfers.
    * Highly programmable DMA controller.

d) (Max 6 points) The Cell design team made several design decisions that were motivated by a wish
   to make it easier to develop programs with predictable (more deterministic) processing time
   (performance). Describe two of these.

    Solution sketch:
    1) They discarded the out-of-order execution that is common in Power processors and developed
    a simpler in-order processor.
    2) The local store memory (LS) in the SPE processing elements does not use HW cache-coherency
    snooping protocols, which avoids the indeterminate nature of cache misses. The programmer
    handles memory in a more explicit way.
    3) The large number of registers (128) may also help make the processing more deterministic
    with respect to execution time.
    4) Extensive timers and counters (probably performance counters) that may be used by the
    SW/programmer to monitor/adjust/control performance.



                                   …---oooOOOooo---…




Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)

Contact person for questions regarding exam exercises:
Name: Lasse Natvig
Phone: 906 44 580

                         EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
                         Monday 26th of May 2008
                         Time: 0900 – 1300

                         Solution sketches in blue text
Supporting materials: No handwritten or printed materials are allowed; a simple, specified calculator is allowed.

Answering in short sentences makes it easier to cover all exercises within the duration of the exam. The numbers in
parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all
the sub-questions before answering each exercise.

The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.

Exercise 1) Parallel Architecture (Max 25 points)

a) (Max 5 points) The feature size of integrated circuits is now often 65 nanometres or smaller, and it is still
decreasing. Explain briefly how the number of transistors on a chip and the wire delay change with shrinking
feature size.
The number of transistors can be 4 times larger when the feature size is halved. However, the wire delay does not
improve (it scales poorly). (Page 17 in the textbook gives more details, but here we only ask for the main trends.)

b) (Max 5 points) In a cache coherent multiprocessor, the concepts migration and replication of shared data items
are central. Explain both concepts briefly and also how they influence the latency of access to shared data and the
bandwidth demand on the shared memory.
Migration means that data are moved to a place closer to the requesting/accessing unit. Replication simply means
storing several copies. Having a local copy in general means faster access and a reduced bandwidth demand on the
shared memory, and it is harmless to have several copies of read-only data. (Textbook page 207)

c) (Max 5 points) Explain briefly how a write buffer can be used in cache systems to increase performance.
Explain also what “write merging” is in this context.
The main purpose of the write buffer is to temporarily store data that are evicted from the cache so that new data
can reuse the cache space as fast as possible, i.e. to avoid waiting for the latency of the memory one level further
away from the processor. If several writes go to the same cache block (address), these writes can be combined
(write merging), resulting in reduced traffic towards the next memory level. (Textbook page 300)
 ((Also slides 11-6-3)). // Grading: 3 points for write-buffer understanding and 2 for write merging.

d) (Max 5 points) Sketch a figure that shows how a hypercube with 16 nodes is built by combining two smaller
hypercubes. Compare the hypercube topology with the 2-dimensional mesh topology with respect to connectivity
and node cost (number of links/ports per node).
(Figure E-14 c) A mesh has a fixed degree of connectivity and in general becomes slower when the number of
nodes is increased, since the average number of hops needed to reach another node increases. For a
hypercube it is the other way around: the connectivity increases for larger networks, so the communication time
does not increase much, but the node cost also increases. When going to a larger network, increasing the
dimension, every node must be extended with a new port, and this is a drawback when it comes to building
computers using such networks.

e) (Max 5 points) When messages are sent between nodes in a multiprocessor two possible strategies are source
routing and distributed routing. Explain the difference between these two.
For source routing, the entire routing path is precomputed by the source (possibly by table lookup) and placed in
the packet header. This usually consists of the output port or ports supplied for each switch along the
predetermined path from the source to the destination, which can be stripped off by the routing control
mechanism at each switch. An additional bit field can be included in the header to signify whether adaptive
routing is allowed (i.e., that any one of the supplied output ports can be used).
For distributed routing, the routing information usually consists of the destination address. This is used by the
routing control mechanism in each switch along the path to determine the next output port, either by computing it
using a finite-state machine or by looking it up in a local routing table (i.e., forwarding table). (Textbook page E-
48)

Exercise 2) Parallel processing (Max 15 points)

a) (Max 5 points) Explain briefly the main difference between a VLIW processor and a dynamically scheduled
superscalar processor. Include the role of the compiler in your explanation.
Parallel execution of several operations is scheduled (analysed and planned) at compile time and assembled into
very long/broad instructions for VLIW. (Such work done at compile time is often called static.) In a dynamically
scheduled superscalar processor, dependency and resource analysis are done at run time (dynamically) to find
opportunities to execute operations in parallel. (Textbook page 114 -> and VLIW paper)

b) (Max 5 points) What function has the vector mask register in a vector processor?
If you want to update just some subset of the elements in a vector register, i.e. to implement
IF A[i] != 0 THEN A[i] = A[i] – B[i] for (i=0..n) in a simple way, this can be done by setting the vector mask
register to 1 only for the elements with A[i] != 0. In this way, the vector instruction A = A - B can be performed
without testing every element explicitly.

c) (Max 5 points) Explain briefly the principle of vector chaining in vector processors.
The execution of instructions using several/different functional and memory pipelines can be chained together
directly or by using vector registers. The chaining forms one longer pipeline. (This is the technique of forwarding
(used in processors, as in Tomasulo's algorithm) extended to vector registers.) (Textbook F-23)
((Slides lecture 9, slide 20)) – should be checked


Exercise 3) Multicore processors (Max 20 points)

a) (Max 5 points) In the paper Chip Multithreading: Opportunities and Challenges, by Spracklen & Abraham is
the concept Chip Multithreaded processor (CMT) described. The authors describe three generations of CMT
processors. Describe each of these briefly. Make simple drawings if you like.
1st generation: typically 2 cores per chip, every core is a traditional processor core, no shared resources except
the off-chip bandwidth. 2nd generation: shared L2 cache, but still traditional processor cores. 3rd generation: as the
2nd generation, but the cores are now custom-made for use in a CMP and might also use simultaneous
multithreading (SMT). (This description is somewhat biased and colored by the background of the authors (at Sun
Microsystems), who were involved in the design of Niagara 1 and 2 (T1).)
// Fig. 1 in the paper, and slides // Was a sub-exercise in May 2007.

b) (Max 5 points) Outline the main architecture in SUN’s T1 (Niagara) multicore processor. Describe the
placement of L1 and L2 cache, as well as how the L1 caches are kept coherent.
Fig 4.24 on page 250 in the textbook shows 8 cores, each with its own L1 cache (described in the text), 4
L2 cache banks, each having a channel to external memory, 1 FPU unit, and a crossbar as interconnect. Coherence
is maintained by a directory associated with each L2 cache bank, which knows which L1 caches have a copy of data
in the L2 cache.
// Textbook pages 249-250, also the lecture

c) (Max 6 points) In the paper Exploring the Design Space of Future CMP’s the authors perform a design space
exploration where several main architectural parameters are varied assuming a fixed total chip area of 400mm2.
Outline the approach by explaining the following figure:




Technology independent area models – found empirically, – core area and cache area measured in cache byte
equivalents (CBE). Study the relative costs in area versus the associated performance gains --- maximize
performance per unit area for future technology generations. With smaller feature sizes, the available area for
cache banks and processing cores increases. Table 3 displays die area in terms of the cache-byte-equivalents
(CBE), and PIN and POUT columns show how many of each type of processor with 32KB separate L1 instruction
and data caches could be implemented on the chip if no L2 cache area were required. (PIN is a simple in-order-
execution processor, POUT is a larger out-of-order exec processor). And, for reference, Lambda-squared where
lambda is equal to one half of the feature size. The primary goal of this paper is to determine the best balance
between per-processor cache area, area consumed by different processor organizations, and the number of cores
on a single die.
Solution note: New exercise / medium difficulty / slides 1-6 and 2-3

d) (Max 4 points) Explain the authors' argument in the paper Exploring the Design Space of Future CMP's
that we may in the future have chips with useless area that performs no other function than as a placeholder
for pin area.
As applications become bandwidth bound, and global wire delays increase, an interesting scenario may arise. It is
likely that monolithic caches cannot be grown past a certain point in 50 or 35nm technologies, since the wire
delays will make them too slow. It is also likely that, given a ceiling on cache size, off-chip bandwidth will limit
the number of cores. Thus, there may be useless area on the chip which cannot be used for cache or processing
logic, and which performs no function other than as a placeholder for pin area. That area may be useful to use for
compression engines, or intelligent controllers to manage the caches and memory channels.
 (From lecture 8, slide 6 on page 4)

Exercise 4) Research prototypes (Max 20 points)

a) (Max 5 points) Sketch a figure of the main system structure of the Manchester Dataflow Machine (MDM).
Include the following units: Matching unit, Token Queue, IO switch, Instruction store, Overflow unit and
Processing unit. Show also how these are connected.
See figure 5 in the paper, and the slides. The Overflow Unit is coupled to the Matching Unit, in parallel.




[Figure: sketch of the MDM ring structure – Input/Output connect to the IO Switch, which feeds the Token Queue;
tokens flow from the Token Queue to the Matching Unit (with the Overflow Unit attached), on to the Instruction
Store, then to the Processing Unit (P0...P19), and back to the Switch.]

b) (Max 5 points) What was the function of the overflow unit in MDM and explain very briefly how it was
implemented.
If a token does not find its matching operand in the Matching Unit (MU), and there is no space in the MU to
store it (while waiting for the other operand), the operand is stored in the overflow store. This is a separate and
much slower subsystem with much larger storage capacity. It is composed of a separate overflow bus, memory and
a microcoded processor, in other words a SW solution. See also figure 7 in the paper.

c) (Max 5 points) In the paper The Stanford FLASH Multiprocessor by Kuskin et al., the FLASH computer is
described. FLASH is an abbreviation for FLexible Architecture for SHared memory. What kind of flexibility was
the main goal for the project?
Flexibility in the programming paradigm: the choice between distributed shared memory (DSM), i.e. cache coherent
shared memory, and message passing, but also other alternative ways of communication between the nodes could
be explored.

d) (Max 5 points) Outline the main architecture of a node in a FLASH system. What was the most central design
choice to achieve this flexibility?
Fig. 2.1 explains much of this.




Interconnection of PEs in a mesh. The most central design choice was the MAGIC unit, a specially designed
node controller. All memory accesses go through it, and it can for example implement a cache-coherence
protocol. Every node is identical. The whole computer has one single address space, but the memory is physically
distributed.
                                               ---oooOOOooo---

tdt4260

  • 1.
    TDT 4260 –lecture 1 – 2011 Course goal • Course introduction • To get a general and deep understanding of the – course goals organization of modern computers and the – staff motivation for different computer architectures. Give – contents a base for understanding of research themes within – evaluation the field. – web, ITSL • High level • Textbook • Mostly HW and low-level SW – Computer Architecture, A Quantitative Approach, Fourth • HW/SW interplay Edition • Parallelism • by John Hennessy & David Patterson (HP90 - 96 – 03) - 06 • Principles, not details • Today: Introduction (Chapter 1) – Partly covered  inspire to learn more 1 Lasse Natvig 2 Lasse Natvig Contents TDT-4260 / DT8803 • Recommended background • Computer architecture fundamentals, trends, measuring – Course TDT4160 Computer Fundamentals, or performance, quantitative principles. Instruction set equivalent. architectures and the role of compilers. Instruction-level • http://www.idi.ntnu.no/emner/tdt4260/ parallelism, thread-level parallelism, VLIW. – And Its Learning • Memory hierarchy design, cache. Multiprocessors, shared • Friday 1215-1400 memory architectures, vector processors, NTNU/Notur – And/or some Thursdays 1015-1200 supercomputers, supercomputers distributed shared memory memory, – 12 lectures planned synchronization, multithreading. – some exceptions may occur • Interconnection networks, topologies • Evaluation • Multicores,homogeneous and heterogeneous, principles and – Obligatory exercise (counts 20%). Written product examples exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit • Green computing (introduction) examination, the examination form may • Miniproject - prefetching change from written to oral. 3 Lasse Natvig 4 Lasse Natvig Lecture plan Subject to change EMECS, new European Master's Date  and lecturer  Topic Course in Embedded Computing Systems 1:  14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge 2:  21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2 3: 28 Jan (IB) ILP, Chapter 2; TLP, Chapter 3 4: 4 Feb (LN) Multiprocessors, Chapter 4  5: 11 Feb MG(?)) Prefetching + Energy Micro guest lecture 6: 18 Feb (LN) Multiprocessors continued  7: 25 Feb (IB) Piranha CMP + Interconnection networks  8: 4 Mar (IB) Memory and cache, cache coherence  (Chap. 5) 9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill Marty Amdahl  multicore ... Fedorova ... assymetric multicore ... 10: 18 Mar (IB) Memory consistency (4.6) + more on memory 11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers   (2) Green  computing 12: 1 Apr (IB/LN) Wrap up lecture, remaining stuff 13: 8 Apr  Slack – no lecture planned  5 Lasse Natvig 6 Lasse Natvig 1
  • 2.
    Preliminary reading list, subject to change!!! People involved • Chap.1: Fundamentals, sections 1.1 - 1.12 (pages 2-54) • Chap.2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), Lasse Natvig section 2.11 - 2.12 (pages 138 - 141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course) Course responsible, lecturer • Chap.3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154 -159), lasse@idi.ntnu.no section 3.5 - 3.8 (pages 172-185). • Chap.4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10 • Chap.5: Memory hierachy, section 5.1 - 5.3 (pages 288 - 315). Ian Bratt • App A: section A 1 (Expected to be repetition from other courses) A.1 Lecturer (Also t Til (Al at Tilera.com) ) • Appendix E, interconnection networks, pages E2-E14, E20-E25, E29-E37 ianbra@idi.ntnu.no and E45-E51. • App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F- 44 - F-45) Alexandru Iordan • Data prefetch mechanisms (ACM Computing Survey) Teaching assistant (Also PhD-student) • Piranha, (To be announced) iordan@idi.ntnu.no • Multicores (New bookchapter) (To be announced) • (App. D; embedded systems?)  see our new course TDT4258 Mikrokontroller systemdesign http://www.idi.ntnu.no/people/ 7 Lasse Natvig 8 Lasse Natvig research.idi.ntnu.no/multicore Prefetching ---pfjudge Some few highlights: - Green computing, 2xPhD + master students - Multicore memory systems, 3 x PhD theses - Multicore programming and parallel computing - Cooperation with industry 9 Lasse Natvig 10 Lasse Natvig ”Computational computer architecture” Experiment Infrastructure • Computational science and engineering (CSE) • Stallo compute cluster – Computational X, X = comp.arch. – 60 Teraflop/s peak • Simulates new multicore architectures – 5632 processing cores – Last level, shared cache fairness (PhD-student M. Jahre) – 12 TB total memory – Bandwidth aware prefetching (PhD-student M. Grannæs) – 128 TB centralized disk • Complex cycle-accurate simulators – Weighs 16 tons – 80 000 lines C++ 20 000 lines python C++, – Open source, Linux-based • Multi-core research • Design space exploration (DSE) – About 60 CPU years allocated per – one dimension for each arch. parameter year to our projects – DSE sample point = specific multicore configuration – Typical research paper uses 5 to – performance of a selected set of configurations evaluated by 12 CPU years for simulation simulating the execution of a set of workloads (extensive, detailed design space exploration) 11 Lasse Natvig 12 Lasse Natvig 2
  • 3.
    The End ofMoore’s law Motivational background for single-core microprocessors • Why multicores – in all market segments from mobile phones to supercomputers • The ”end” of Moores law • The power wall • The memory wall • The bandwith problem • ILP limitations • The complexity wall But Moore’s law still holds for FPGA, memory and multicore processors 13 Lasse Natvig 14 Lasse Natvig Energy & Heat Problems The Memory Wall 1000 • Large power “Moore’s Law” consumption 100 CPU 60%/year – Costly Performance P-M gap grows 50% / year – Heat problems 10 – Restricted battery DRAM operation time 9%/year 9%/ 1 • Google ”Open House Trondheim 1980 1990 2000 2006” • The Processor Memory Gap – ”Performance/Watt is the only flat • Consequence: deeper memory hierachies trend line” – P – Registers – L1 cache – L2 cache – L3 cache – Memory - - - – Complicates understanding of performance • cache usage has an increasing influence on performance 15 Lasse Natvig 16 Lasse Natvig The I/O pin or Bandwidth problem The limitations of ILP (Instruction Level Parallelism) • # I/O signaling pins in Applications – limited by physical tecnology 30 3   – speeds have not 2.5  25 increased at the same Fraction of total cycles (%) rate as processor clock 20 2 rates  dup Speed 1.5 • Projections 15 – from ITRS (International 10 1  Technology Roadmap for Semiconductors) 5 0.5 0 0 [Huh, Burger and Keckler 2001] 0 1 2 3 4 5 6+ 0 5 10 15 Number of instructions issued Instructions issued per cycle 17 Lasse Natvig 18 Lasse Natvig 3
  • 4.
    Reduced Increase inClock Frequency Solution: Multicore architectures (also called Chip Multi-processors - CMP) • More power-efficient – Two cores with clock frequency f/2 can potentially achieve the same speed as one at frequency f with 50% reduction in total energy consumption [Olukotun & Hammond 2005] • Exploits Thread Level Parallelism (TLP) – in addition to ILP – requires multiprogramming or parallel programming • Opens new possibilities for architectural innovations 19 Lasse Natvig 20 Lasse Natvig Why heterogeneous multicores? CPU – GPU – convergence • Specialized HW is (Performance – Programmability) Cell BE processor faster than general HW Processors: Larrabee, Fermi, … – Math co-processor Languages: CUDA, OpenCL, … – GPU, DSP, etc… • Benefits of customization – Similar to ASIC vs. general purpose programmable HW • Amdahl’s law – Parallel speedup limited by serial fraction •  1 super-core 21 Lasse Natvig 22 Lasse Natvig Parallel processing – conflicting Multicore programming challenges goals • Instability, diversity, conflicting goals … what to do? Performance • What kind of parallel programming? The P6-model: Parallel Processing – Homogeneous vs. heterogeneous challenges: Performance, Portability, – DSL vs. general languages Programmability and Power efficiency – Memory locality Portability • What to teach? – Teaching should be founded on active research Programmability Powerefficiency • Two layers of programmers y p g – The Landscape of Parallel Computing Research: A View from • Examples; Berkeley [Asan+06] – Performance tuning may reduce portability • Krste Asanovic presentation at ACACES Summerschool 2007 • Eg. Datastructures adapted to cache block size – 1) Programmability layer (Productivity layer) (80 - 90%) • ”Joe the programmer” – New languages for higher programmability may reduce performance and increase power consumption – 2) Performance layer (Efficiency layer) (10 - 20%) • Both layers involved in HPC • Programmability an issue also at the performance-layer 23 Lasse Natvig 24 Lasse Natvig 4
  • 5.
    Parallel Computing Laboratory,U.C. Berkeley, (Slide adapted from Dave Patterson ) Classes of computers Easy to write correct programs that run efficiently on manycore • Servers – storage servers Personal Image Hearing, Parallel – compute servers (supercomputers) Speech Health Retrieval Music Browser – web servers Design Patterns/Motifs – high availability Composition & Coordination Language (C&CL) – scalability – throughput oriented (response time of less importance) ormance C&CL Compiler/Interpreter • Desktop (price 3000 NOK – 50 000 NOK) – the largest market g Diagnosing Power/Perfo Parallel P ll l Libraries Parallel Frameworks – price/performance focus – latency oriented (response time) • Embedded systems Efficiency Languages Sketching – the fastest growing market (”everywhere”) Autotuners – TDT 4258 Microcontroller system design Legacy Communication & Synch. – ATMEL, Nordic Semic., ARM, EM, ++ Schedulers Code Primitives Efficiency Language Compilers OS Libraries & Services Legacy OS Hypervisor Multicore/GPGPU RAMP Manycore 25 Lasse Natvig 26 Lasse Natvig 25 Borgar  FXI Technologies Falanx (Mali) ARM ”An idependent compute platform to gather the Norway fragmented mobile space and thus help accelerate the prolifitation of content and applications eco- systems (I.e build an ARM based SoC, put it ,p in a memory card, connect it to the web- and voila, you got iPhone for the masses ).” • http://www.fxitech.com/ – ”Headquartered in Trondheim • But also an office in Silicon Valley …” 27 Lasse Natvig 28 Lasse Natvig Trends Comp. Arch. is an Integrated Approach • For technology, costs, use • What really matters is the functioning of the • Help predicting the future complete system • Product development time – hardware, runtime system, compiler, operating system, and – 2-3 years application –  design for the next technology – In networking, this is called the “End to End argument” • Computer architecture is not just about – Why should an architecture live longer than a product? transistors(not at all), individual instructions, or particular implementations – E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions 29 Lasse Natvig 30 Lasse Natvig 5
  • 6.
    Computer Architecture is Designand Analysis TDT4260 Course Focus Architecture is an iterative process: Understanding the design techniques, machine • Searching the space of possible designs Design • At all levels of computer systems structures, technology factors, evaluation Analysis methods that will determine the form of computers in 21st Century Technology Parallelism Programming Creativity C ti it Languages Applications Interface Design Cost / Computer Architecture: (ISA) Performance • Organization Analysis • Hardware/Software Boundary Compilers Good Ideas Operating Measurement & Systems Evaluation History Mediocre Ideas Bad Ideas 31 Lasse Natvig 32 Lasse Natvig Moore’s Law: 2X transistors / Holistic approach “year” e.g., to programmability Parallel & concurrent programming Operating System & system software Multicore, interconnect, memory • “Cramming More Components onto Integrated Circuits” – Gordon Moore, Electronics, 1965 • # of transistors / cost-effective integrated circuit double every N months (12 ≤ N ≤ 24) 33 Lasse Natvig 34 Lasse Natvig Tracking Technology Latency Lags Bandwidth (last ~20 years) Performance Trends 10000 • 4 critical implementation technologies: CPU high, • Performance Milestones Processor – Disks, Memory low • Processor: ‘286, ‘386, ‘486, Pentium, – Memory, (“Memory Pentium Pro, Pentium 4 (21x,2250x) Wall”) 1000 – Network, • Ethernet: 10Mb, 100Mb, 1000Mb, Network – Processors 10000 Mb/s (16x,1000x) Relative Memory • Compare for Bandwidth vs. Latency BW 100 Disk • Memory Module: 16bit plain DRAM, Page Mode DRAM 32b 64b SDRAM, P M d DRAM, 32b, 64b, SDRAM improvements in performance over time Improve ment DDR SDRAM (4x,120x) • Bandwidth: number of events per unit time • Disk : 3600, 5400, 7200, 10000, 15000 – E.g., M bits/second over network, M bytes / second from 10 RPM (8x, 143x) disk (Processor latency = typical # of pipeline-stages * time • Latency: elapsed time for a single event (Latency improvement = Bandwidth improvement) pr. clock-cycle) – E.g., one-way network delay in microseconds, 1 average disk access time in milliseconds 1 10 100 Relative Latency Improvement 35 Lasse Natvig 36 Lasse Natvig 6
  • 7.
    COST and COTS Speedup Superlinear speedup ? • Cost • General definition: Performance (p processors) – to produce one unit Speedup (p processors) = Performance (1 processor) – include (development cost / # sold units) – benefit of large volume • COTS • For a fixed problem size (input data set), – commodity off the shelf dit ff th h lf performance = 1/time – Speedup Time (1 processor) fixed problem (p processors) = Time (p processors) • Note: use best sequential algorithm in the uni-processor solution, not the parallel algorithm with p = 1 37 Lasse Natvig 38 Lasse Natvig Amdahl’s Law (1967) (fixed problem size) Gustafson’s “law” (1987) (scaled problem size, fixed execution time) • “If a fraction s of a (uniprocessor) • Total execution time on computation is inherently parallel computer with n serial, the speedup is at processors is fixed most 1/s” – serial fraction s’ • Total work in computation – parallel fraction p’ – serial fraction s – s’ + p’ = 1 (100%) – parallel fraction p p • S (n) Time’(1)/Time’(n) S’(n) = Time (1)/Time (n) – s + p = 1 (100%) = (s’ + p’n)/(s’ + p’) • S(n) = Time(1) / Time(n) = s’ + p’n = s’ + (1-s’)n = (s + p) / [s +(p/n)] = n +(1-n)s’ • Reevaluating Amdahl's law, = 1 / [s + (1-s) / n] John L. Gustafson, CACM May 1988, pp 532-533. ”Not a new = n / [1 + (n - 1)s] law, but Amdahl’s law with changed assumptions” • ”pessimistic and famous” 39 Lasse Natvig 40 Lasse Natvig How the serial fraction limits speedup • Amdahl’s law • Work hard to reduce the serial part of the application – remember IO – think different (than traditionally  = serial fraction or sequentially) 41 Lasse Natvig 7
  • 8.
    1 TDT4260 Computer architecture Mini-project PhD candidate Alexandru Ciprian Iordan Institutt for datateknikk og informasjonsvitenskap
  • 9.
    2 What is it…? How much…? • The mini-project is the exercise part of TDT4260 course • This year the students will need to develop and evaluate a PREFETCHER • The mini-project accounts for 20 % of the final grade in TDT4260 • 80 % for report • 20 % for oral presentation
  • 10.
    3 What will you work with… • Modified version of M5 (for development and evaluation) • Computing time on Kongull cluster (for benchmarking) • More at: http://dm-ark.idi.ntnu.no/
  • 11.
    4 M5 • Initially developed by the University of Michigan • Enjoys a large community of users and developers • Flexible object-oriented architecture • Has support for 3 ISA: ALPHA, SPARC and MIPS
  • 12.
    5 Team work… • You need to work in groups of 2-4 students • Grade is based on written paper AND oral presentation (chose you best speaker)
  • 13.
    6 Time Schedule and Deadlines More on It’s learning
  • 14.
    7 Web page presentation
  • 15.
    Contents • Instruction level parallelism Chap 2 • Pipelining (repetition) App A TDT 4260 ▫ Basic 5-step pipeline • Dependencies and hazards Chap 2.1 App A.1, Chap 2 ▫ Data, name, control, structural Instruction Level Parallelism • Compiler techniques for ILP Chap 2.2 • (Static prediction Chap 2.3) ▫ Read this on your own • Project introduction Pipelining Instruction level parallelism (ILP) (1/3) • A program is sequence of instructions typically written to be executed one after the other • Poor usage of CPU resources! (Why?) • Better: Execute instructions in parallel ▫ 1: Pipeline Partial overlap of instruction execution ▫ 2: Multiple issue Total overlap of instruction execution • Today: Pipelining Pipelining (2/3) Pipelining (3/3) • Multiple different stages executed in parallel • Good Utilization: All stages are ALWAYS in use ▫ Laundry in 4 different stages ▫ Washing, drying, folding, ... ▫ Wash / Dry / Fold / Store ▫ Great usage of resources! • Assumptions: • Common technique, used everywhere ▫ Task can be split into stages ▫ Manufacturing, CPUs, etc ▫ Storage of temporary data • Ideal: time_stage = time_instruction / stages ▫ But stages are not perfectly balanced ▫ Stages synchronized ▫ But transfer between stages takes time ▫ Next operation known before last finished? ▫ But pipeline may have to be emptied ▫ ...
  • 16.
    Example: MIPS64 (2/2) Example:MIPS64 (1/2) Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 • RISC • Pipeline I ALU • Load/store ▫ IF: Instruction fetch n Ifetch Reg DMem Reg s • Few instruction formats ▫ ID: Instruction decode / t register fetch r. ALU • Fixed instruction length ▫ EX: Execute / effective Ifetch Reg DMem Reg • 64-bit address (EA) O r ALU ▫ DADD = 64 bits ADD ▫ MEM: Memory access d Ifetch Reg DMem Reg ▫ LD = 64 bits L(oad) ▫ WB: Write back (reg) e r • 32 registers (R0 = 0) ALU Ifetch Reg DMem Reg • EA = offset(Register) Big Picture: Big Picture (continued): • What are some real world examples of • Computer Architecture is the study of design pipelining? tradeoffs!!!! • Why do we pipeline? • There is no “philosophy of architecture” and no • Does pipelining increase or decrease instruction “perfect architecture”. This is engineering, not throughput? science. • Does pipelining increase or decrease instruction • What are the costs of pipelining? latency? • For what types of devices is pipelining not a good choice? Improve speedup? Dependencies and hazards • Why not perfect speedup? • Dependencies ▫ Sequential programs ▫ Parallel instructions can be executed in parallel ▫ Dependent instructions are not parallel ▫ One instruction dependent on another I1: DADD R1, R2, R3 ▫ Not enough CPU resources I2: DSUB R4, R1, R5 • What can be done? ▫ Property of the instructions ▫ Forwarding (HW) • Hazards ▫ Situation where a dependency causes an instruction to ▫ Scheduling (SW / HW) give a wrong result ▫ Prediction (SW / HW) ▫ Property of the pipeline • Both hardware (dynamic) and compiler (static) ▫ Not all dependencies give hazards can help Dependencies must be close enough in the instruction stream to cause a hazard
  • 17.
    Dependencies Hazards • (True) data dependencies • Data hazards ▫ One instruction reads what an earlier has written ▫ Overlap will give different result from sequential • Name dependencies ▫ RAW / WAW / WAR ▫ Two instructions use the same register / mem loc • Control hazards ▫ But no flow of data between them ▫ Branches ▫ Two types: Anti and output dependencies ▫ Ex: Started executing the wrong instruction • Control dependencies • Structural hazards ▫ Instructions dependent on the result of a branch ▫ Pipeline does not support this combination of instr. • Again: Independent of pipeline implementation ▫ Ex: Register with one port, two stages want to read Data dependency Hazard? Figure A.6, Page A-16 Data Hazards (1/3) • Read After Write (RAW) I InstrJ tries to read operand before InstrI writes ALU Reg add r1,r2,r3 Ifetch Reg DMem n it s ALU t sub r4,r1,r3 Ifetch Reg DMem Reg I: add r1,r2,r3 r. J: sub r4,r1,r3 ALU Ifetch Reg DMem Reg O and r6,r1,r7 r • Caused by a true data dependency d • This hazard results from an actual need for ALU Ifetch Reg DMem Reg e or r8,r1,r9 r communication. ALU Ifetch Reg DMem Reg xor r10,r1,r11 Data Hazards (2/3) Data Hazards (3/3) • Write After Write (WAW) • Write After Read (WAR) InstrJ writes operand before InstrI writes it. InstrJ writes operand before InstrI reads it I: sub r1,r4,r3 I: sub r4,r1,r3 J: add r1,r2,r3 J: add r1,r2,r3 • Caused by an output dependency • Caused by an anti dependency This results from reuse of the name “r1” • Can’t happen in MIPS 5 stage pipeline because: ▫ All instructions take 5 stages, and • Can’t happen in MIPS 5 stage pipeline because: ▫ Writes are always in stage 5 ▫ All instructions take 5 stages, and • WAR and WAW can occur in more ▫ Reads are always in stage 2, and ▫ Writes are always in stage 5 complicated pipes
  • 18.
    Forwarding Can all data hazards be solved via Figure A.7, Page A-18 forwarding??? IF ID/RF EX MEM WB IF ID/RF EX MEM WB I I ALU ALU Reg Reg add r1,r2,r3 Ifetch Reg DMem Ld r1,r2 Ifetch Reg DMem n n s s ALU ALU t sub r4,r1,r3 Ifetch Reg DMem Reg t add r4,r1,r3 Ifetch Reg DMem Reg r. r. ALU ALU Ifetch Reg DMem Reg Ifetch Reg DMem Reg O and r6,r1,r7 O and r6,r1,r7 r r d d ALU ALU Ifetch Reg DMem Reg Ifetch Reg DMem Reg e or r8,r1,r9 e or r8,r1,r9 r r ALU ALU Ifetch Reg DMem Reg Ifetch Reg DMem Reg xor r10,r1,r11 xor r10,r1,r11 Structural Hazards (Memory Port) Hazards, Bubbles (Similar to Figure A.5, Page A-15) Figure A.4, Page A-14 Time (clock cycles) Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 ALU I Load Ifetch Reg DMem Reg ALU I Load Ifetch Reg DMem Reg n n s ALU Reg s t Instr 1 Ifetch Reg DMem ALU Reg t Instr 1 Ifetch Reg DMem r. r. ALU Ifetch Reg DMem Reg Ld r1, r2 ALU Ifetch Reg DMem Reg Instr 2 O O r r Stall Bubble Bubble Bubble Bubble Bubble d ALU Ifetch Reg DMem Reg d Instr 3 e e ALU r Add r1, r1, r1 Ifetch Reg DMem Reg ALU r Instr 4 Ifetch Reg DMem Reg How do you “bubble” the pipe? How can we avoid this hazard? Control hazards (1/2) Control hazards (2/2) • Sequential execution is predictable, • What can be done? (conditional) branches are not ▫ Always stop (previous slide) • May have fetched instructions that should not be executed Also called freeze or flushing of the pipeline • Simple solution (figure): Stall the pipeline (bubble) ▫ Assume no branch (=assume sequential) ▫ Performance loss depends on number of branches in the program Must not change state before branch instr. is complete and pipeline implementation ▫ Branch penaltyC ▫ Assume branch Only smart if the target address is ready early ▫ Delayed branch Execute a different instruction while branch is evaluated Static techniques (fixed rule or compiler) Possibly wrong instruction Correct instruction
  • 19.
    Example •Assume branch conditionals are evaluated in the EX Dynamic scheduling stage, and determine the fetch address for the following cycle. • So far: Static scheduling • If we always stall, how many cycles are bubbled? ▫ Instructions executed in program order • Assume branch not taken, how many bubbles for an ▫ Any reordering is done by the compiler incorrect assumption? • Is stalling on every branch ok? • Dynamic scheduling • What optimizations could be done to improve stall ▫ CPU reorders to get a more optimal order penalty? Fewer hazards, fewer stalls, ... ▫ Must preserve order of operations where reordering could change the result ▫ Covered by TDT 4255 Hardware design Example Compiler techniques for ILP Source code: Notice: for (i = 1000; i >0; i=i-1) • Lots of dependencies • For a given pipeline and superscalarity • No dependencies between iterations x[i] = x[i] + s; ▫ How can these be best utilized? • High loop overhead ▫ As few stalls from hazards as possible Loop unrolling • Dynamic scheduling MIPS: ▫ Tomasulo’s algorithm etc. (TDT4255) Loop: L.D F0,0(R1) ; F0 = x[i] ▫ Makes the CPU much more complicated ADD.D F4,F0,F2 ; F2 = s • What can be done by the compiler? S.D F4,0(R1) ; Store x[i] + s ▫ Has ”ages” to spend, but less knowledge DADDUI R1,R1,#-8 ; x[i] is 8 bytes ▫ Static scheduling, but what else? BNE R1,R2,Loop ; R1 = R2? Loop: L.D F0,0(R1) Static scheduling Loop unrolling ADD.D F4,F0,F2 S.D F4,0(R1) Loop: L.D F0,0(R1) Loop: L.D F0,0(R1) Loop: L.D F0,0(R1) L.D F6,-8(R1) stopp DADDUI R1,R1,#-8 ADD.D F4,F0,F2 ADD.D F8,F6,F2 ADD.D F4,F0,F2 ADD.D F4,F0,F2 S.D F4,0(R1) S.D F8,-8(R1) stopp stopp DADDUI R1,R1,#-8 L.D F10,-16(R1) stopp stopp BNE R1,R2,Loop ADD.D F12,F10,F2 S.D F4,0(R1) S.D F4,8(R1) S.D F12,-16(R1) DADDUI R1,R1,#-8 BNE R1,R2,Loop L.D F14,-24(R1) stopp • Reduced loop overhead ADD.D F16,F14,F2 BNE R1,R2,Loop • Requires number of iterations S.D F16,-24(R1) divisible by n (here n=4) DADDUI R1,R1,#-32 • Register renaming BNE R1,R2,Loop • Offsets have changed Result: From 9 cycles per iteration to 7 • Stalls not shown (Delays from table in figure 2.2)
  • 20.
    Loop: L.D F0,0(R1) Loop: L.D F0,0(R1) ADD.D F4,F0,F2 L.D F6,-8(R1) S.D F4,0(R1) L.D F10,-16(R1) Loop unrolling: Summary L.D F6,-8(R1) L.D F14,-24(R1) ADD.D F8,F6,F2 ADD.D F4,F0,F2 • Original code 9 cycles per element S.D F8,-8(R1) ADD.D F8,F6,F2 • Scheduling 7 cycles per element L.D F10,-16(R1) ADD.D F12,F10,F2 • Loop unrolling 6,75 cycles per element ADD.D F12,F10,F2 ADD.D F16,F14,F2 ▫ Unrolled 4 iterations S.D F12,-16(R1) S.D F4,0(R1) L.D F14,-24(R1) S.D F8,-8(R1) • Combination 3,5 cycles per element ADD.D F16,F14,F2 DADDUI R1,R1,#-32 ▫ Avoids stalls entirely S.D F16,-24(R1) S.D F12,-16(R1) DADDUI R1,R1,#-32 S.D F16,-24(R1) BNE R1,R2,Loop Compiler reduced execution time by 61% BNE R1,R2,Loop Avoids stall after: L.D(1), ADD.D(2), DADDUI(1) Loop unrolling in practice • Do not usually know upper bound of loop • Suppose it is n, and we would like to unroll the loop to make k copies of the body • Instead of a single unrolled loop, we generate a pair of consecutive loops: ▫ 1st executes (n mod k) times and has a body that is the original loop ▫ 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times • For large values of n, most of the execution time will be spent in the unrolled loop
  • 21.
    Review • Name real-world examples of pipelining • Does pipelining lower instruction latency? • What is the advantage of pipelining? • What are some disadvantages of pipelining? TDT 4260 • What can a compiler do to avoid processor Chap 2, Chap 3 stalls? Instruction Level Parallelism (cont) • What are the three types of data dependences? • What are the three types of pipeline hazards? Getting CPI below 1 Contents • CPI ≥ 1 if issue only 1 instruction every clock cycle • Multiple-issue processors come in 3 flavors: • Very Large Instruction Word Chap 2.7 1. Statically-scheduled superscalar processors ▫ IA-64 and EPIC • In-order execution • Instruction fetching Chap 2.9 • Varying number of instructions issued (compiler) 2. Dynamically-scheduled superscalar processors • Limits to ILP Chap 3.1/2 • Out-of-order execution • Multi-threading Chap 3.5 • Varying number of instructions issued (CPU) 3. VLIW (very long instruction word) processors • In-order execution • Fixed number of instructions issued VLIW: Very Large Instruction Word (2/2) VLIW: Very Large Instruction Word (1/2) • Assume 2 load/store, 2 fp, 1 int/branch ▫ VLIW with 0-5 operations. • Each VLIW has explicit coding for multiple ▫ Why 0? operations ▫ Several instructions combined into packets • Important to avoid empty instruction slots ▫ Possibly with parallelism indicated ▫ Loop unrolling ▫ Local scheduling • Tradeoff instruction space for simple decoding ▫ Global scheduling ▫ Room for many operations Scheduling across branches ▫ Independent operations => execute in parallel ▫ E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 • Difficult to find all dependencies in advance branch ▫ Solution1: Block on memory accesses ▫ Solution2: CPU detects some dependencies
  • 22.
    Loop: L.D F0,0(R1) Recall: L.D F6,-8(R1) Loop Unrolling in VLIW Unrolled Loop L.D L.D F10,-16(R1) F14,-24(R1) Memory Memory FP FP Int. op/ Clock reference 1 reference 2 operation 1 op. 2 branch that minimizes ADD.D F4,F0,F2 L.D F0,0(R1) L.D F6,-8(R1) 1 ADD.D F8,F6,F2 L.D F10,-16(R1) L.D F14,-24(R1) 2 stalls for Scalar ADD.D F12,F10,F2 L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3 L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4 ADD.D F16,F14,F2 Source code: ADD.D F20,F18,F2 ADD.D F24,F22,F2 5 S.D F4,0(R1) S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6 for (i = 1000; i >0; i=i-1) S.D -16(R1),F12 S.D -24(R1),F16 7 S.D F8,-8(R1) x[i] = x[i] + s; S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#48 8 DADDUI R1,R1,#-32 S.D -0(R1),F28 BNEZ R1,LOOP 9 S.D F12,-16(R1) Register mapping: Unrolled 7 iterations to avoid delays S.D F16,-24(R1) 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X) s F2 BNE R1,R2,Loop Average: 2.5 ops per clock, 50% efficiency i R1 Note: Need more registers in VLIW (15 vs. 6 in SS) Problems with 1st Generation VLIW VLIW Tradeoffs • Increase in code size • Advantages ▫ Loop unrolling ▫ “Simpler” hardware because the HW does not have to ▫ Partially empty VLIW identify independent instructions. • Operated in lock-step; no hazard detection HW • Disadvantages ▫ A stall in any functional unit pipeline causes entire processor to ▫ Relies on smart compiler stall, since all functional units must be kept synchronized ▫ Code incompatibility between generations ▫ Compiler might predict function units, but caches hard to predict ▫ There are limits to what the compiler can do (can’t move ▫ Moder VLIWs are “interlocked” (identify dependences between loads above branches, can’t move loads above stores) bundles and stall). • Binary code compatibility • Common uses ▫ Strict VLIW => different numbers of functional units and unit ▫ Embedded market where hardware simplicity is latencies require different versions of the code important, applications exhibit plenty of ILP, and binary compatibility is a non-issue. IA-64 and EPIC Instruction bundle (VLIW) • 64 bit instruction set architecture ▫ Not a CPU, but an architecture ▫ Itanium and Itanium 2 are CPUs based on IA-64 • Made by Intel and Hewlett-Packard (itanium 2 and 3 designed in Colorado) • Uses EPIC: Explicitly Parallel Instruction Computing • Departure from the x86 architecture • Meant to achieve out-of-order performance with in- order HW + compiler-smarts ▫ Stop bits to help with code density ▫ Support for control speculation (moving loads above branches) ▫ Support for data speculation (moving loads above stores) Details in Appendix G.6
  • 23.
    Functional units andtemplate • Functional units: Code example (1/2) ▫ I (Integer), M (Integer + Memory), F (FP), B (Branch), L + X (64 bit operands + special inst.) • Template field: ▫ Maps instruction to functional unit ▫ Indicates stops: Limitations to ILP Control Speculation Code example 2/2 • Can the compiler schedule an independent load above a branch? Bne R1, R2, TARGET Ld R3, R4(0) • What are the problems? • EPIC provides speculative loads Ld.s R3, R4(0) Bne R1, R2, TARGET Check R4(0) Data Speculation EPIC Conclusions • Goal of EPIC was to maintain advantages of VLIW, but • Can the compiler schedule an independent load achieve performance of out-of-order. above a store? • Results: St R5, R6(0) ▫ Complicated bundling rules saves some space, but Ld R3, R4(0) makes the hardware more complicated • What are the problems? ▫ Add special hardware and instructions for scheduling • EPIC provides “advanced loads” and an ALAT loads above stores and branches (new complicated (Advanced Load Address Table) hardware) Ld.a R3, R4(0) creates entry in ALAT ▫ Add special hardware to remove branch penalties St R5, R6(0) looks up ALAT, if match, jump to (predication) fixup code ▫ End result is a machine as complicated as an out-of- order, but now also requiring a super-sophisticated compiler.
  • 24.
    Branch Target Buffer(BTB) Instruction fetching • Predicts next instruction • Want to issue >1 instruction every cycle address, sends it out before • This means fetching >1 instruction decoding ▫ E.g. 4-8 instructions fetched every cycle instruction • PC of branch sent • Several problems to BTB ▫ Bandwidth / Latency • When match is ▫ Determining which instructions found, Predicted PC is returned Jumps • If branch Branches predicted taken, • Integrated instruction fetch unit instruction fetch continues at Predicted PC Branch Target Buffer (BTB) Possible Optimizations???? Return Address Predictor • Small buffer of 70% go • Predicts next return Misprediction frequency instruction addresses acts 60% m88ksim address, sends it as a stack cc1 out before 50% • Caches most compress decoding instruction recent return 40% xlisp • PC of branch sent addresses ijpeg 30% to BTB • Call ⇒ Push a perl • When match is return address 20% vortex found, Predicted on stack PC is returned 10% • Return ⇒ Pop • If branch an address off 0% predicted taken, stack & predict instruction fetch 0 1 2 4 8 16 as new PC continues at Return address buffer entries Predicted PC Chapter 3 Integrated Instruction Fetch Units Limits to ILP • Recent designs have implemented the fetch • Advances in compiler technology + significantly stage as a separate, autonomous unit new and different hardware techniques may be ▫ Multiple-issue in one simple pipeline stage is too able to overcome limitations assumed in studies complex • However, unlikely such advances when coupled • An integrated fetch unit provides: with realistic hardware will overcome these ▫ Branch prediction limits in near future ▫ Instruction prefetch • How much ILP is available using existing ▫ Instruction memory access and buffering mechanisms with increasing HW budgets?
  • 25.
    Ideal HW Model Upper Limit to ILP: Ideal Machine (Figure 3.1) 1. Register renaming – infinite virtual registers all register WAW & WAR hazards are avoided Instructions Per Clock 160 FP: 75 - 150 150.1 2. Branch prediction – perfect; no mispredictions 140 3. Jump prediction – all jumps perfectly predicted Integer: 18 - 60 118.7 120 2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available 100 75.2 4. Memory-address alias analysis – addresses known & 80 62.6 a load can be moved before a store provided addresses 60 54.8 not equal 40 1&4 eliminates all but RAW 20 17.9 5. perfect caches; 1 cycle latency for all instructions; 0 unlimited instructions issued/clock cycle gcc espresso li fpppp doducd tomcatv Programs More Realistic HW: Window Impact Figure 3.2 FP: 9 - 150 Instruction window 160 150 • Ideal HW need to know entire code 140 119 • Obviously not practical Instructions Per Clock 120 Integer: 8 - 63 ▫ Register dependencies scales quadratically 100 IPC • Window: The set of instructions examined for 80 75 63 simultaneous execution 60 55 61 49 59 60 45 • How does the size of the window affect IPC? 40 36 41 35 34 ▫ Too small window => Can’t see whole loops 15 13 18 1512 9 14 16 15 14 20 10 8 10 8 11 9 ▫ Too large window => Hard to implement 0 gcc espresso li fpppp doduc tomcatv Infinite 2048 512 128 32 Multi-threaded execution Thread Level Parallelism (TLP) • ILP exploits implicit parallel operations within • Multi-threading: multiple threads share the a loop or straight-line code segment functional units of 1 processor via overlapping ▫ Must duplicate independent state of each thread e.g., a • TLP explicitly represented by the use of separate copy of register file, PC and page table multiple threads of execution that are ▫ Memory shared through virtual memory mechanisms inherently parallel ▫ HW for fast thread switch; much faster than full • Use multiple instruction streams to improve: process switch ≈ 100s to 1000s of clocks 1. Throughput of computers that run many programs • When switch? 2. Execution time of a single application implemented ▫ Alternate instruction per thread (fine grain) as a multi-threaded program (parallel program) ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
  • 26.
Fine-Grained Multithreading
• Switches between threads on each instruction
  ▫ Multiple threads interleaved
• Usually in round-robin fashion, skipping stalled threads
• CPU must be able to switch threads every clock
• Hides both short and long stalls
  ▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads
  ▫ A thread ready to execute without stalls will be delayed by instructions from other threads

Coarse-Grained Multithreading
• Switch threads only on costly stalls (e.g. L2 cache miss)
• Advantages
  ▫ No need for very fast thread-switching
  ▫ Doesn't slow down a thread, since it switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  ▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
  ▫ The new thread must fill the pipeline before instructions can complete
• ⇒ Better for reducing the penalty of high-cost stalls, where pipeline refill << stall time
• Used on Sun's Niagara

Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Can a high-ILP processor also exploit TLP?
  ▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• ⇒ Simultaneous Multi-threading (SMT)
  ▫ Intel: Hyper-Threading
[Figure: issue slots per cycle for one thread vs. two threads on an 8-unit machine; M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  ▫ Large set of virtual registers (virtual = not all visible at ISA level; register renaming)
  ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
  ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; threads 1-5 and idle slots]
  • 27.
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
  ▫ How to reduce the impact on single-thread performance?
  ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
  ▫ Instruction issue – more candidate instructions need to be considered
  ▫ Instruction completion – choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
  • 28.
TDT 4260 – lecture 4 – 2011
• Contents
  – Computer architecture introduction
    • Trends
    • Moore's law
    • Amdahl's law
    • Gustafson's law
  – Why multiprocessor? Chap 4.1
    • Taxonomy
    • Memory architecture
    • Communication
  – Cache coherence Chap 4.2
    • The problem
    • Snooping protocols

Updated lecture plan pr. 4/2
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG) Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN) Multiprocessors continued
7: 24 Feb (IB) Memory and cache, cache coherence (Chap. 5)
8: 4 Mar (IB) Piranha CMP + Interconnection networks
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore, Fedorova asymmetric multicore
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) Wrap-up lecture, remaining stuff
13: 8 Apr Slack – no lecture planned

Trends
• For technology, costs, use
• Help predicting the future
• Product development time
  – 2-3 years ⇒ design for the next technology
  – Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system
  – hardware, runtime system, compiler, operating system, and application
  – In networking, this is called the "End to End argument"
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
  – E.g., the original RISC projects replaced complex instructions with a compiler + simple instructions

Computer Architecture is Design and Analysis
• Architecture is an iterative process between design and analysis
  – Searching the huge space of possible designs
  – At all levels of computer systems
  – Creativity plus cost/performance analysis
  – Good ideas, mediocre ideas and bad ideas are sorted by measurement & evaluation

TDT4260 Course Focus
• Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century
• Technology, parallelism, programming languages, applications, interface design (ISA), compilers, operating systems, measurement & evaluation, history
• Computer architecture: organization, hardware/software boundary
  • 29.
Holistic approach
• NTNU principle: teaching based on research
  – Example: PhD project of Alexandru Iordan – energy-aware task pool implementation
  – E.g., programmability combined with performance; TBP (Wool, TBB)
• Layers: parallel & concurrent programming – operating system & system software – multicore, interconnect, memory
• Multicore memory systems (Dybdahl PhD, Grannæs PhD, Jahre PhD, M5-sim, pfJudge)

Moore's Law: 2X transistors / "year"
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• # of transistors on a cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)

Tracking Technology Performance Trends
• 4 critical implementation technologies:
  – Disks
  – Memory (CPU high, memory low: the "Memory Wall")
  – Network
  – Processors
• Compare improvements in bandwidth vs. latency over time
• Bandwidth: number of events per unit time
  – E.g., Mbits/second over a network, Mbytes/second from disk
• Latency: elapsed time for a single event
  – E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)
• Performance milestones
  – Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
  – Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  – Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  – Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
• (Processor latency = typical # of pipeline stages × time per clock cycle)
[Figure: relative bandwidth improvement vs. relative latency improvement for processor, network, memory and disk; the line "latency improvement = bandwidth improvement" is shown for reference]

COST and COTS
• Cost
  – to produce one unit
  – include (development cost / # sold units)
  – benefit of large volume
• COTS
  – commodity off the shelf
  – much better performance/price per component
  – strong influence on the selection of components for building supercomputers for more than 20 years

Speedup
• General definition:
  Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time:
  Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
• Superlinear speedup?
  • 30.
Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in the computation
  – serial fraction s
  – parallel fraction p
  – s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
       = (s + p) / (s + p/n)
       = 1 / (s + (1 - s)/n)
       = n / (1 + (n - 1)s)
• "pessimistic and famous"

Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on a computer with n processors is fixed
  – serial fraction s'
  – parallel fraction p'
  – s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
        = (s' + p'n) / (s' + p')
        = s' + p'n = s' + (1 - s')n
        = n + (1 - n)s'
• "Reevaluating Amdahl's law", John L. Gustafson, CACM May 1988, pp 532-533. "Not a new law, but Amdahl's law with changed assumptions"

How the serial fraction limits speedup
[Figure: speedup vs. number of processors for different values of the serial fraction]
• Work hard to reduce the serial part of the application
  – remember I/O
  – think different (than traditionally or sequentially)

Single/ILP → Multi/TLP
• Uniprocessor trends
  – Getting too complex
  – Speed of light
  – Diminishing returns from ILP
  – Amdahl's law
• Multiprocessor
  – Focus in the textbook: 4-32 CPUs
  – Increased performance through parallelism
  – Multichip
  – Multicore ((single) chip multiprocessors – CMP)
  – Cost effective
• The right balance of ILP and TLP is unclear today
  – Desktop vs. server?

Other Factors → Multiprocessors
• Growth in data-intensive applications
  – Databases, file servers, multimedia, ...
• Growing interest in servers and server performance
• Increasing desktop performance is less important
  – Outside of graphics
• Improved understanding of how to use multiprocessors effectively
  – Especially in servers, where there is significant natural TLP
• Advantage of leveraging design investment by replication
  – Rather than unique design
• Power/cooling issues → multicore

Multiprocessor – Taxonomy
• Flynn's taxonomy (1966, 1972)
  – Taxonomy = classification
  – Widely used, but perhaps a bit coarse
• Single Instruction Single Data (SISD)
  – Common uniprocessor
• Single Instruction Multiple Data (SIMD)
  – "= Data Level Parallelism (DLP)"
• Multiple Instruction Single Data (MISD)
  – Not implemented?
  – Pipeline / stream processing / GPU?
• Multiple Instruction Multiple Data (MIMD)
  – Used today
  – "= Thread Level Parallelism (TLP)"
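As a quick illustration (not from the slides), the two speedup formulas can be evaluated side by side; the serial fraction and processor counts below are arbitrary example values.

#include <stdio.h>

/* Amdahl (fixed problem size): S(n) = 1 / (s + (1 - s)/n) */
double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

/* Gustafson (scaled problem size): S'(n) = n + (1 - n)*s' */
double gustafson_speedup(double s_prime, int n)
{
    return n + (1 - n) * s_prime;
}

int main(void)
{
    double s = 0.05;                 /* 5% serial code, illustrative value */
    int n[] = {4, 16, 64, 256};
    for (int i = 0; i < 4; i++)
        printf("n=%3d  Amdahl=%7.2f  Gustafson=%7.2f\n",
               n[i], amdahl_speedup(s, n[i]), gustafson_speedup(s, n[i]));
    return 0;
}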
  • 31.
Flynn's taxonomy (1/2)
Single/Multiple Instruction/Data Stream
[Figure: block diagrams of a SISD uniprocessor, a SIMD machine with distributed memory, and a MIMD machine with shared memory]

Flynn's taxonomy (2/2), MISD
[Figure: MISD as a software pipeline]

Advantages of MIMD
• Flexibility
  – High single-user performance, multiple programs, multiple threads
  – High multiple-user performance
  – Combination
• Built using commercial off-the-shelf (COTS) components
  – 2 x uniprocessor = multi-CPU
  – 2 x uniprocessor core on a single chip = multicore

MIMD: Memory architecture
• Centralized memory: processors (with caches) share one memory over an interconnection network
• Distributed memory: each node has its own memory, and the processor-memory nodes are connected by an interconnection network
[Figure: centralized memory vs. distributed memory organizations]
  • 32.
MP (MIMD), cluster of SMPs
• Combination of centralized and distributed memory: SMP nodes (processors with caches, memory, I/O and a node interconnection network) connected by a cluster interconnection network
• Conceptual model vs. implementation
• Like an early version of the Kongull cluster
[Figure: conceptual model and implementation of a cluster of SMPs]

Distributed memory
1. Shared address space
   • Logically shared, physically distributed
   • Distributed Shared Memory (DSM)
   • NUMA architecture
2. Separate address spaces
   • Every P-M module is a separate computer
   • Multicomputer
   • Clusters
   • Not a focus in this course

Communication models
• Shared memory
  – Centralized or Distributed Shared Memory
  – Communication using LOAD/STORE
  – Coordinated using traditional OS methods
    • Semaphores, monitors, etc.
    • Busy-wait more acceptable than for a uniprocessor
• Message passing
  – Using send (put) and receive (get)
    • Asynchronous / synchronous
  – Libraries, standards
    • ..., PVM, MPI, ...

Limits to parallelism
• We need separate processes and threads!
  – Can't split one thread among CPUs/cores
• Parallel algorithms needed
  – Separate field
  – Some problems are inherently serial
    • P-complete problems
    • Part of parallel complexity theory
  – See minicourse TDT6 - Heterogeneous and green computing
    • http://www.idi.ntnu.no/emner/tdt4260/tdt6
• Amdahl's law
  – The serial fraction of the code limits speedup
  – Example: speedup = 80 with 100 processors requires that at most 0.25% of the time is spent on serial code

SMP: Cache Coherence Problem
[Figure: P1, P2 and P3 with private caches and a shared memory holding u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again, (5) P2 reads u]
• Processors see different values for u after event 3
• The old (stale) value is read in event 4 (hit)
• Event 5 (miss) reads
  – the correct value (if write-through caches)
  – the old value (if write-back caches)
• Unacceptable to programs, and frequent!

Enforcing coherence
• Separate caches make multiple copies frequent
  – Migration
    • Data moves from shared memory to the local cache
    • Speeds up access, reduces memory bandwidth requirements
  – Replication
    • Several local copies when an item is read by several processors
    • Speeds up access, reduces memory contention
• Need coherence protocols to track shared data
  – Directory based
    • Status kept in a shared location (Chap. 4.4)
  – (Bus) snooping
    • Each cache maintains local status
    • All caches monitor a broadcast medium
    • Write invalidate / write update
  • 33.
Snooping: Write invalidate
• Several reads or one write: no change
• Writes require exclusive access
• Writes to shared data: all other cache copies are invalidated
  – The invalidate command and address are broadcast
  – All caches listen (snoop) and invalidate their copy if necessary
• Read miss:
  – Write-through: memory is always up to date
  – Write-back: caches listen, and any exclusive copy is put on the bus

Snooping: Write update
• Also called write broadcast
• Must know which cache blocks are shared
• Usually write-through
  – Write to shared data: broadcast; all caches listen and update their copy (if any)
  – Read miss: main memory is up to date

Snooping: Invalidate vs. Update
• Repeated writes to the same address (no reads) require several updates, but only one invalidate
• Invalidates are done at cache-block level, while updates are done on individual words
• The delay from a word is written until it can be read is shorter for updates
• Invalidate is most common
  – Less bus traffic
  – Less memory traffic
  – Bus and memory bandwidth are the typical bottleneck

An Example Snoopy Protocol
• Invalidation protocol, write-back cache
• Each cache block is in one state
  – Shared: clean in all caches and up-to-date in memory; the block can be read
  – Exclusive: one cache has the only copy; it is writeable and dirty
  – Invalid: the block contains no data

Snooping: Invalidation protocol (1/6)-(2/6)
[Figure: one processor reads x; the read miss goes on the interconnection network, x is fetched from main memory, and the processor caches x in the shared state]
  • 34.
Snooping: Invalidation protocol (3/6)-(6/6)
[Figure: (3/6) another processor reads x and misses; (4/6) both caches now hold x in the shared state; (5/6) one processor writes x = 1 and an invalidate is broadcast on the interconnection network; (6/6) the other copies are invalidated and the writing processor holds x in the exclusive state]
  • 35.
    Prefetching Marius Grannæs Feb 11th, 2011 www.ntnu.no M. Grannæs, Prefetching
  • 36.
    2 About Me • PhD from NTNU in Computer Architecture in 2010 • “Reducing Memory Latency by Improving Resource Utilization” • Supervised by Lasse Natvig • Now working for Energy Micro • Working on energy profiling, caching and prefetching • Software development www.ntnu.no M. Grannæs, Prefetching
  • 37.
    3 About Energy Micro • Fabless semiconductor company • Founded in 2007 by ex-chipcon founders • 50 employees • Offices around the world • Designing the world most energy friendly microcontrollers • Today: EFM32 Gecko • Next friday: EFM32 Tiny Gecko (cache) • May(ish): EFM32 Giant Gecko (cache + prefetching) • Ambition: 1% marketshare... www.ntnu.no M. Grannæs, Prefetching
  • 38.
    3 About Energy Micro • Fabless semiconductor company • Founded in 2007 by ex-chipcon founders • 50 employees • Offices around the world • Designing the world most energy friendly microcontrollers • Today: EFM32 Gecko • Next friday: EFM32 Tiny Gecko (cache) • May(ish): EFM32 Giant Gecko (cache + prefetching) • Ambition: 1% marketshare... • of a $30 bn market. www.ntnu.no M. Grannæs, Prefetching
  • 39.
4 What is Prefetching?
Prefetching is a technique for predicting future memory accesses and fetching the data into the cache before it is referenced.
  • 40.
5 The Memory Wall
[Figure: CPU performance vs. memory performance, 1980-2010, log scale; the gap between the two grows steadily.]
W. Wulf and S. McKee, "Hitting the Memory Wall: Implications of the Obvious"
  • 41.
    6 A Useful Analogy • An Intel Core i7 can execute 147600 Million Instructions per second. • ⇒ A carpenter can hammer one nail per second. www.ntnu.no M. Grannæs, Prefetching
  • 42.
    6 A Useful Analogy • An Intel Core i7 can execute 147600 Million Instructions per second. • ⇒ A carpenter can hammer one nail per second. • DDR3-1600 RAM can perform 65 Million transfers per second. www.ntnu.no M. Grannæs, Prefetching
  • 43.
    6 A Useful Analogy • An Intel Core i7 can execute 147600 Million Instructions per second. • ⇒ A carpenter can hammer one nail per second. • DDR3-1600 RAM can perform 65 Million transfers per second. • ⇒ The carpenter must wait 38 minutes per nail. www.ntnu.no M. Grannæs, Prefetching
  • 44.
    7 Solution www.ntnu.no M. Grannæs, Prefetching
  • 45.
    7 Solution Solution outline: 1 You bring an entire box of nails. 2 Keep the box close to the carpenter www.ntnu.no M. Grannæs, Prefetching
  • 46.
    8 Analysis: Carpenting How long (on average) does it take to get one nail? www.ntnu.no M. Grannæs, Prefetching
  • 47.
    8 Analysis: Carpenting How long (on average) does it take to get one nail? Nail latency LNail = LBox + pBox is empty · (LShop + LTraffic ) LNail Time to get one nail. LBox Time to check and fetch one nail from the box. pBox is empty Probabilty that the box you have is empty. LShop Time to go to the shop (38 minutes). LTraffic Time lost due to traffic. www.ntnu.no M. Grannæs, Prefetching
  • 48.
    9 Solution: (For computers) • Faster, but smaller memory closer to the processor. • Temporal locality • If you needed X in the past, you are probably going to need X in the near future. • Spatial locality • If you need X , you probably need X + 1 www.ntnu.no M. Grannæs, Prefetching
  • 49.
    9 Solution: (For computers) • Faster, but smaller memory closer to the processor. • Temporal locality • If you needed X in the past, you are probably going to need X in the near future. • Spatial locality • If you need X , you probably need X + 1 ⇒ If you need X, put it in the cache, along with everything else close to it (cache line) www.ntnu.no M. Grannæs, Prefetching
  • 50.
10 Analysis: Caches

System latency
L_System = L_Cache + p_Miss · (L_Main Memory + L_Congestion)

L_System: total system latency
L_Cache: latency of the cache
p_Miss: probability of a cache miss
L_Main Memory: main memory latency
L_Congestion: latency due to main memory congestion
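A small numeric sketch of the model above; the cycle counts are made-up, illustrative values, not measurements from the lecture.

#include <stdio.h>

/* L_system = L_cache + p_miss * (L_main_memory + L_congestion) */
int main(void)
{
    double l_cache       = 3.0;    /* cycles to hit in the cache             */
    double l_main_memory = 200.0;  /* cycles for a main memory access        */
    double l_congestion  = 50.0;   /* extra cycles lost to memory contention */
    double miss_rates[]  = {0.01, 0.05, 0.10};

    for (int i = 0; i < 3; i++) {
        double l_system = l_cache + miss_rates[i] * (l_main_memory + l_congestion);
        printf("p_miss = %.2f  ->  L_system = %.1f cycles\n", miss_rates[i], l_system);
    }
    return 0;
}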
  • 51.
11 DRAM in perspective
• "Incredibly slow" DRAM has a response time of 15.37 ns.
• The speed of light is 3 · 10^8 m/s.
• The physical distance from the processor to the DRAM chips is typically 20 cm.

    (2 · 20 · 10^-2 m) / (3 · 10^8 m/s) ≈ 1.3 ns    (1)

• Just over one order of magnitude!
• Intel Core i7 – 147600 million instructions per second.
• Ultimate laptop – 5 · 10^50 operations per second per kg.
  Lloyd, Seth, "Ultimate physical limits to computation"
  • 54.
  • 55.
12 When does caching not work?
The four Cs:
• Cold/Compulsory: the data has not been referenced before.
• Capacity: the data has been referenced before, but has been thrown out because of the limited size of the cache.
• Conflict: the data has been thrown out of a set-associative cache because it would not fit in the set.
• Coherence: another processor (in a multi-processor/multicore environment) has invalidated the cache line.
We can buy our way out of Capacity and Conflict misses, but not Cold or Coherence misses!
  • 56.
13 Cache Sizes
[Figure: on-chip cache size (kB, log scale) vs. year, 1985-2010, for processors from the 80486 and Pentium through Pentium 4, Core 2 and Core i7.]
  • 57.
    14 Core i7 (Lynnfield) - 2009 www.ntnu.no M. Grannæs, Prefetching
  • 58.
    15 Pentium M - 2003 www.ntnu.no M. Grannæs, Prefetching
  • 59.
    16 Prefetching Prefetching increases the performance of caches by predicting what data is needed and fetching that data into the cache before it is referenced. Need to know: • What to prefetch? • When to prefetch? • Where to put the data? • How do we prefetch? (Mechanism) www.ntnu.no M. Grannæs, Prefetching
  • 60.
    17 Prefetching Terminology Good Prefetch A prefetch is classified as Good if the prefetched block is referenced by the application before it is replaced. www.ntnu.no M. Grannæs, Prefetching
  • 61.
    17 Prefetching Terminology Good Prefetch A prefetch is classified as Good if the prefetched block is referenced by the application before it is replaced. Bad Prefetch A prefetch is classified as Bad if the prefetched block is not referenced by the application before it is replaced. www.ntnu.no M. Grannæs, Prefetching
  • 62.
18 Accuracy
The accuracy of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

    Accuracy = G / (G + B)
  • 63.
19 Coverage
If a conventional cache has M misses without using any prefetch algorithm, the coverage of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

    Coverage = G / M
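A tiny helper (not from the slides) that evaluates both metrics; the counts G = 9990, B = 10, M = 10000 anticipate the software prefetching example later in the talk.

#include <stdio.h>

double accuracy(long good, long bad)    { return (double)good / (good + bad); }
double coverage(long good, long misses) { return (double)good / misses; }

int main(void)
{
    long G = 9990, B = 10, M = 10000;   /* illustrative counts */
    printf("accuracy = %.3f, coverage = %.3f\n", accuracy(G, B), coverage(G, M));
    return 0;
}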
  • 64.
    20 Prefetching System Latency Lsystem = Lcache + pmiss · (Lmain memory + Lcongestion ) www.ntnu.no M. Grannæs, Prefetching
  • 65.
    20 Prefetching System Latency Lsystem = Lcache + pmiss · (Lmain memory + Lcongestion ) • If a prefetch is good: • pmiss is lowered • ⇒ Lsystem decreases www.ntnu.no M. Grannæs, Prefetching
  • 66.
    20 Prefetching System Latency Lsystem = Lcache + pmiss · (Lmain memory + Lcongestion ) • If a prefetch is good: • pmiss is lowered • ⇒ Lsystem decreases • If a prefetch is bad: • pmiss becomes higher because useful data might be replaced • Lcongestion becomes higher because of useless traffic • ⇒ Lsystem increases www.ntnu.no M. Grannæs, Prefetching
  • 67.
    21 Prefetching Techniques Types of prefetching: • Software • Special instructions. • Most modern high performance processors have them. • Very flexible. • Can be good at pointer chasing. • Requires compiler or programmer effort. • Processor executes prefetches instead of computation. • Static (performed at compile-time). • Hardware • Hybrid www.ntnu.no M. Grannæs, Prefetching
  • 68.
    21 Prefetching Techniques Types of prefetching: • Software • Hardware • Dedicated hardware analyzes memory references. • Most modern high performance processors have them. • Fixed functionality. • Requires no effort by the programmer or compiler. • Off-loads prefetching to hardware. • Dynamic (performed at run-time) • Hybrid www.ntnu.no M. Grannæs, Prefetching
  • 69.
    21 Prefetching Techniques Types of prefetching: • Software • Hardware • Hybrid • Dedicated hardware unit. • Hardware unit programmed by software. • Some effort required by the programmer or compiler. www.ntnu.no M. Grannæs, Prefetching
  • 70.
22 Software Prefetching

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

    MOV  r1, #0           ; acc
    MOV  r0, #0           ; i
Label:
    LOAD r2, r0(#data)    ; cache miss! (400 cycles!)
    ADD  r1, r2           ; acc += data[i]
    INC  r0               ; i++
    CMP  r0, #10000       ; i < 10000
    BL   Label            ; branch if less
  • 72.
23 Software Prefetching II

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

Simple optimization using __builtin_prefetch():

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Why add 10 (and not 1)?
Prefetch distance – memory latency >> computation latency.
  • 74.
24 Software Prefetching III

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Note:
• data[0] → data[9] will not be prefetched.
• data[10000] → data[10009] will be prefetched, but not used.

    Accuracy = G / (G + B) = 9990 / 10000 = 0.999 = 99.9%
    Coverage = G / M = 9990 / 10000 = 0.999 = 99.9%
  • 76.
25 Complex Software

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    if (someFunction(i) == True) {
        acc += data[i];
    }
}

Does prefetching pay off in this case?
• How many times is someFunction(i) true?
• How much memory bus traffic is generated by someFunction(i)?
• Does power matter?
We have to profile the program to know!
  • 78.
26 Dynamic Data Structures I

typedef struct node {
    int          data;
    struct node *next;
} node_t;

while ((node = node->next) != NULL) {
    acc += node->data;
}
  • 79.
27 Dynamic Data Structures II

typedef struct node {
    int          data;
    struct node *next;
    struct node *jump;
} node_t;

while ((node = node->next) != NULL) {
    __builtin_prefetch(node->jump);
    acc += node->data;
}
  • 80.
    28 Hardware Prefetching Software prefetching: • Need programmer effort to implement • Prefetch instructions is not computing • Compile-time • Very flexible www.ntnu.no M. Grannæs, Prefetching
  • 81.
    28 Hardware Prefetching Software prefetching: • Need programmer effort to implement • Prefetch instructions is not computing • Compile-time • Very flexible Hardware prefetching: • No programmer effort • Does not displace compute instructions • Run-time • Not flexible www.ntnu.no M. Grannæs, Prefetching
  • 82.
29 Sequential Prefetching
The simplest prefetcher, but surprisingly effective due to spatial locality.

Sequential Prefetching
Miss on address X ⇒ fetch X+n, X+n+1, ..., X+n+j
  n: prefetch distance
  j: prefetch degree
Collectively known as prefetch aggressiveness.
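A minimal sketch of the rule above; issue_prefetch() is an assumed hook into the cache, and the distance/degree values are illustrative.

/* Sequential prefetching: on a miss to block X, prefetch the j blocks
 * starting n blocks ahead of it. */
#define PREFETCH_DISTANCE 1   /* n */
#define PREFETCH_DEGREE   4   /* j */

extern void issue_prefetch(unsigned long block_addr);

void on_cache_miss(unsigned long block_addr)
{
    for (int k = 0; k < PREFETCH_DEGREE; k++)
        issue_prefetch(block_addr + PREFETCH_DISTANCE + k);
}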
  • 83.
30 Sequential Prefetching II
[Figure: speedup of sequential prefetching over no prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 84.
31 Reference Prediction Tables
Tien-Fu Chen and Jean-Loup Baer (1995)
• Builds upon sequential prefetching and stride-directed prefetching.
• Observation: non-unit strides in many applications
  – 2, 4, 6, 8, 10 (stride 2)
• Observation: each load instruction has a distinct access pattern
Reference Prediction Tables (RPT):
• Table indexed by the load instruction
• Simple state machine
• Stores a single delta of history.
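A compact sketch of the RPT idea as described above (allocate on the first miss, train on the first observed delta, prefetch once the same delta is seen twice). Table size, state names and issue_prefetch() are illustrative assumptions.

#include <stdint.h>

enum rpt_state { RPT_INITIAL, RPT_TRAINING, RPT_PREFETCH };

typedef struct {
    uint64_t pc, last_addr;
    int64_t  delta;
    enum rpt_state state;
} rpt_entry_t;

#define RPT_ENTRIES 64
static rpt_entry_t rpt[RPT_ENTRIES];

extern void issue_prefetch(uint64_t addr);

void rpt_miss(uint64_t pc, uint64_t addr)
{
    rpt_entry_t *e = &rpt[pc % RPT_ENTRIES];
    if (e->pc != pc) {                        /* new load: allocate entry */
        e->pc = pc; e->last_addr = addr;
        e->delta = 0; e->state = RPT_INITIAL;
        return;
    }
    int64_t d = (int64_t)(addr - e->last_addr);
    if (e->state != RPT_INITIAL && d == e->delta) {
        e->state = RPT_PREFETCH;              /* stride confirmed twice */
        issue_prefetch(addr + d);
    } else {
        e->delta = d;                         /* learn the new stride */
        e->state = RPT_TRAINING;
    }
    e->last_addr = addr;
}

Run on the walkthrough below (misses at 1, 3, 5 from the load at PC 100), the entry moves Initial → Training → Prefetch and issues a prefetch of address 7.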
  • 85.
32-36 Reference Prediction Tables
[Figures: RPT walkthrough. The load at PC 100 misses on addresses 1, 3 and 5. The first miss allocates an entry (last addr 1, no delta, state Initial); the second miss records delta 2 and moves the entry to Training; the third miss sees delta 2 again, moves the entry to Prefetch, and address 5 + 2 = 7 is prefetched.]
  • 90.
37 Reference Prediction Tables
[Figure: speedup of RPT vs. sequential prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 91.
    38 Global History Buffer K. Nesbit, A. Dhodapkar and J.Smith (2004) • Observation: Predicting more complex patterns require more history • Observation: A lot of history in the RPT is very old Program Counter/Delta Correlation (PC/DC) • Store all misses in a FIFO called Global History Buffer (GHB) • Linked list of all misses from one load instruction • Traversing linked list gives a history for that load www.ntnu.no M. Grannæs, Prefetching
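A sketch of the two GHB structures described above; sizes are illustrative, and the stale-pointer bookkeeping needed once the FIFO wraps around is omitted.

#include <stdint.h>

#define GHB_SIZE   256
#define INDEX_SIZE 64

typedef struct {
    uint64_t addr;
    int      prev;       /* index of the previous miss from the same PC, or -1 */
} ghb_entry_t;

static ghb_entry_t ghb[GHB_SIZE];
static int ghb_head = 0;                     /* FIFO insertion point */
static struct { uint64_t pc; int head; } index_table[INDEX_SIZE];

/* Record a miss and link it into the per-PC list. */
void ghb_insert(uint64_t pc, uint64_t addr)
{
    int slot = ghb_head;
    ghb_head = (ghb_head + 1) % GHB_SIZE;    /* the oldest entry is overwritten */

    int i = pc % INDEX_SIZE;
    ghb[slot].addr = addr;
    ghb[slot].prev = (index_table[i].pc == pc) ? index_table[i].head : -1;
    index_table[i].pc = pc;
    index_table[i].head = slot;
}

/* Walk the linked list to recover up to max deltas for this load. */
int ghb_deltas(uint64_t pc, int64_t *deltas, int max)
{
    int i = pc % INDEX_SIZE, n = 0;
    if (index_table[i].pc != pc) return 0;
    for (int e = index_table[i].head; ghb[e].prev >= 0 && n < max; e = ghb[e].prev)
        deltas[n++] = (int64_t)(ghb[e].addr - ghb[ghb[e].prev].addr);
    return n;                                /* deltas, most recent first */
}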
  • 92.
39-46 Global History Buffer
[Figures: GHB walkthrough. The index table entry for the load at PC 100 points to that load's most recent miss in the global history buffer; each GHB entry holds a miss address and a pointer to the previous miss from the same load. Following the pointer chain for PC 100 through the recorded addresses yields the delta buffer (2, 2).]
  • 100.
    47 Delta Correlation • In the previous example, the delta buffer only contained two values (2,2). • Thus it is easy to guess that the next delta is also 2. • We can then prefetch: Current address + Delta = 5 + 2 = 7 www.ntnu.no M. Grannæs, Prefetching
  • 101.
    47 Delta Correlation • In the previous example, the delta buffer only contained two values (2,2). • Thus it is easy to guess that the next delta is also 2. • We can then prefetch: Current address + Delta = 5 + 2 = 7 What if the pattern is repeating, but not regular? 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 www.ntnu.no M. Grannæs, Prefetching
  • 102.
48-56 Delta Correlation
[Figures: delta correlation walkthrough. The miss stream 10, 11, 13, 16, 17, 19, 22 gives the delta history 1, 2, 3, 1, 2, 3. The most recent delta pair (2, 3) is located earlier in the history, and the deltas that followed it there (1, 2) are replayed from the current address, predicting prefetches of 22 + 1 = 23 and 23 + 2 = 25.]
  • 111.
57 PC/DC
[Figure: speedup of PC/DC vs. RPT and sequential prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 112.
    58 Data Prefetching Championships • Organized by JILP • Held in conjunction with HPCA’09 • Branch prediction championships • Everyone uses the same API (six function calls) • Same set of benchmarks • Third party evaluates performance • 20+ prefetchers submitted http://www.jilp.org/dpc/ www.ntnu.no M. Grannæs, Prefetching
  • 113.
    59 Delta Correlating Prediction Tables • Our submission to DPC-1 • Observation: GHB pointer chasing is expensive. • Observation: History doesn’t really get old. • Observation: History would reach a steady state. • Observation: Deltas are typically small, while the address space is large. • Table indexed by the PC of the load • Each entry holds the history of the load in the form of deltas. • Delta Correlation www.ntnu.no M. Grannæs, Prefetching
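A compact sketch of the DCPT idea: a per-load delta history searched with delta correlation. Field names, table size and issue_prefetch() are assumptions; the real DCPT-P submission adds partial matching and L1 hoisting on top of this.

#include <stdint.h>

#define DCPT_DELTAS  6
#define DCPT_ENTRIES 64

typedef struct {
    uint64_t pc, last_addr, last_prefetch;
    int64_t  delta[DCPT_DELTAS];    /* delta[0] = oldest, delta[N-1] = newest */
} dcpt_entry_t;

static dcpt_entry_t dcpt[DCPT_ENTRIES];
extern void issue_prefetch(uint64_t addr);

void dcpt_miss(uint64_t pc, uint64_t addr)
{
    dcpt_entry_t *e = &dcpt[pc % DCPT_ENTRIES];
    if (e->pc != pc) {                        /* allocate on first use */
        *e = (dcpt_entry_t){ .pc = pc, .last_addr = addr };
        return;
    }
    /* Shift in the new delta. */
    for (int i = 0; i < DCPT_DELTAS - 1; i++)
        e->delta[i] = e->delta[i + 1];
    e->delta[DCPT_DELTAS - 1] = (int64_t)(addr - e->last_addr);
    e->last_addr = addr;

    /* Delta correlation: find the latest earlier occurrence of the newest
     * delta pair and replay the deltas that followed it. */
    int64_t d1 = e->delta[DCPT_DELTAS - 2], d2 = e->delta[DCPT_DELTAS - 1];
    for (int i = DCPT_DELTAS - 3; i >= 1; i--) {
        if (e->delta[i - 1] == d1 && e->delta[i] == d2) {
            uint64_t next = addr;
            for (int k = i + 1; k <= DCPT_DELTAS - 1; k++) {
                if (e->delta[k] == 0) break;          /* unfilled slot */
                next += e->delta[k];
                if (next > e->last_prefetch) {        /* avoid re-issuing */
                    issue_prefetch(next);
                    e->last_prefetch = next;
                }
            }
            return;
        }
    }
}

With the miss stream from the delta correlation walkthrough (10, 11, 13, 16, 17, 19, 22 from the load at PC 100), the entry detects the repeating 1, 2, 3 delta pattern and prefetches ahead along it.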
  • 114.
60-68 Delta Correlating Prefetch Tables
[Figures: DCPT walkthrough. The table entry for the load at PC 100 holds the PC, the last miss address, the last prefetch issued, a small circular buffer of deltas and a pointer into it. Misses at 10, 11, 13, 16, 17, 19, 22 fill the delta buffer with 1, 2, 3, 1, 2, 3, after which delta correlation on the buffer predicts the next addresses to prefetch.]
  • 123.
69 Delta Correlating Prefetch Tables
[Figure: speedup of DCPT vs. PC/DC, RPT and sequential prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 124.
    70 DPC-1 Results 1 Access Map Pattern Matching 2 Global History Buffer - Local Delta Buffer 3 Prefetching based on a Differential Finite Context Machine 4 Delta Correlating Prediction Tables www.ntnu.no M. Grannæs, Prefetching
  • 125.
    70 DPC-1 Results 1 Access Map Pattern Matching 2 Global History Buffer - Local Delta Buffer 3 Prefetching based on a Differential Finite Context Machine 4 Delta Correlating Prediction Tables What did the winning entries do differently? • AMPM - Massive reordering to expose more patterns. • GHB-LDB and PDFCM - Prefetch into the L1. www.ntnu.no M. Grannæs, Prefetching
  • 126.
    71 Access Map Pattern Matching • Winning entry by Ishii et al. • Divides memory into hot zones • Each zone is tracked by using a 2 bit vector • Examines each zone for constant strides • Ignores temporal information Lesson Because of reordering, modern processors/compilers can reorder loads, thus the temporal information might be off. www.ntnu.no M. Grannæs, Prefetching
  • 127.
    72 Global History Buffer - Local Delta Buffer • Second place by Dimitrov et al. • Somewhat similar to DCPT • Improves PC/DC prefetching by including global correlation • Most common stride • Prefetches directly into the L1 Lesson Prefetch into L1 gives that extra performance boost Most common stride www.ntnu.no M. Grannæs, Prefetching
  • 128.
    73 Prefetching based on a Differential Finite Context Machine • Third place by Ramos et al. • Table with the most recent history for each load. • A hash of the history is computed and used to look up into a table containing the predicted stride • Repeat process to increase prefetching degree/distance • Separate prefetcher for L1 Lesson Feedback to adjust prefetching degree/prefetching distance Prefetch into the L1 www.ntnu.no M. Grannæs, Prefetching
  • 129.
    74 Improving DCPT Partial Matching Technique for handling reordering, common strides, etc L1 Hoisting Technique for handling L1 prefetching www.ntnu.no M. Grannæs, Prefetching
  • 130.
75 Partial Matching
• AMPM ignores all temporal information
• Reordering the delta history is very expensive
  – Reorder 5 accesses: 5! = 120 possibilities
• Solution: reduce spatial resolution by ignoring low bits

Example delta stream:
    8, 9, 10, 8, 10, 9  ⇒ (ignore the lower 2 bits) ⇒  8, 8, 8, 8, 8, 8
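One possible way to implement partial matching is to mask the low bits before comparing deltas; the mask width here is illustrative.

/* Compare deltas with the low bits masked off, so slightly reordered
 * accesses (deltas 8, 9, 10, ...) still match a common stride of 8. */
#define PARTIAL_MASK (~0x3L)    /* ignore the lower 2 bits */

static inline int deltas_match(long a, long b)
{
    return (a & PARTIAL_MASK) == (b & PARTIAL_MASK);
}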
  • 133.
76 L1 Hoisting
• All three top entries had mechanisms for prefetching into the L1
• Problem: pollution
• Solution: use the same highly accurate mechanism to prefetch into the L1.
• In the steady state, only the last predicted delta will be used.
• All other deltas have been prefetched and are either in the L2 or on their way.
• Hoist the first delta from the L2 to the L1 to increase performance.
  • 134.
77 L1 Hoisting II
Example delta stream: 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3
Steady state:
• Prefetch the last delta into the L2
• Hoist the first delta into the L1
  • 137.
78 DCPT-P
[Figure: speedup of DCPT-P vs. AMPM, GHB-LDB, PDFCM, RPT and PC/DC on milc, GemsFDTD, libquantum, leslie3d, lbm and sphinx3.]
  • 138.
    79 Interaction with the memory controller • So far we’ve talked about what to prefetch (address) • When and how is equally important • Modern DRAM is complex • Modern DRAM controllers are even more complex • Bandwidth limited www.ntnu.no M. Grannæs, Prefetching
  • 139.
80 Modern DRAM
• Can have multiple independent memory controllers
• Can have multiple channels per controller
• Typically multiple banks
• Each bank contains several pages (rows) of data (typically 1k-8k)
• Each page that is accessed is put in a single page buffer
• Access time to the page buffer is much lower than a full access
  • 140.
81-85 The 3D structure of modern DRAM
[Figures: successive views of the DRAM hierarchy – channels, banks, rows/pages and the page buffer.]
  • 145.
    86 Example Suppose a processor requires data at locations X1 and X2 that are located on the same page at times T1 and T2 . There are two separate outcomes: www.ntnu.no M. Grannæs, Prefetching
  • 146.
87 Case 1: The requests occur at roughly the same time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Read 2 (T2) enters the memory controller
4. Data X1 is returned from DRAM
5. Data X2 is returned from DRAM
6. The page is closed
Although there are two separate reads, the page is only opened once.
  • 153.
88 Case 2: The requests are separated in time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Data X1 is returned from DRAM
4. The page is closed
5. Read 2 (T2) enters the memory controller
6. The page is opened again
7. Data X2 is returned from DRAM
8. The page is closed
The page is opened and closed twice. By prefetching X2 we can increase performance by reducing latency and increasing memory throughput.
  • 161.
89 When does prefetching pay off?
The break-even point:

    prefetching accuracy · cost of a single read = cost of prefetching

What is the cost of prefetching?
• Application dependent
• Less than the cost of a single read, because prefetches can:
  – Utilize open pages
    • Reduced latency
    • Increased throughput
  – Utilize multiple banks
    • Lower latency
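One way to read the break-even condition: prefetching pays off when the demand-read cost saved by good prefetches outweighs the cost of the prefetch traffic. A small check under assumed, made-up costs:

#include <stdio.h>

int main(void)
{
    double cost_read     = 1.0;   /* cost of a demand read (page may be closed)  */
    double cost_prefetch = 0.6;   /* cheaper: can reuse open pages and idle banks */

    for (double accuracy = 0.2; accuracy <= 1.001; accuracy += 0.2) {
        double saved = accuracy * cost_read;   /* read cost avoided per prefetch */
        printf("accuracy %.1f: %s (saved %.2f vs. prefetch cost %.2f)\n",
               accuracy, saved >= cost_prefetch ? "pays off" : "hurts",
               saved, cost_prefetch);
    }
    return 0;
}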
  • 162.
90 Performance vs. Accuracy
[Figure: prefetch accuracy (%) vs. IPC improvement (%) for sequential, scheduled region, CZone/delta correlation and reference prediction table prefetching, with the break-even threshold marked.]
  • 163.
    91 Q&A Thank you for listening! www.ntnu.no M. Grannæs, Prefetching
  • 164.
TDT 4260 – lecture 17/2
• Contents
  – Cache coherence Chap 4.2
    • Repetition
    • Snooping protocols
  – SMP performance Chap 4.3
    • Cache performance
  – Directory based cache coherence Chap 4.4
  – Synchronization Chap 4.5
  – UltraSPARC T1 (Niagara) Chap 4.8

Updated lecture plan pr. 17/2
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG) Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN, MJ) Multiprocessors continued // Writing a comp.arch. paper (relevant for the miniproject, by MJ)
7: 24 Feb (IB) Memory and cache, cache coherence (Chap. 5)
8: 3 Mar (IB) Piranha CMP + Interconnection networks
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore, Fedorova asymmetric multicore
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) Wrap-up lecture, remaining stuff
13: 8 Apr Slack – no lecture planned

IDI Open, a challenge for you?
• http://events.idi.ntnu.no/open11/
• 2 April, programming contest, informal, fun, pizza, coke (?), party (?), 100-150 people, mostly students, low threshold
• Teams: 3 persons, one PC, Java, C/C++?
• Problems: some simple, some tricky
• Our team "DM-gruppas beskjedne venner" is challenging you students!
  – And we will challenge some of all the ICT companies in Trondheim

Miniproject groups, updates?
Rank  Prefetcher                Group       Score
1     rpt64k4_pf                Farfetched  1.089
2     rpt_prefetcher_rpt_seq    L2Detour    1.072
3     teeest                    Group 6     1.000

SMP: Cache Coherence Problem (recap)
[Figure: P1, P2 and P3 with private caches sharing a memory with u = 5; after P3 writes u = 7, P1's cached copy is stale]
• Processors see different values for u after event 3
• The old (stale) value is read in event 4 (hit)
• Event 5 (miss) reads the correct value (write-through caches) or the old value (write-back caches)
• Unacceptable to programs, and frequent!

Enforcing coherence (recap)
• Separate caches speed up access
  – Migration: data moves from shared memory to the local cache
  – Replication: several local copies when an item is read by several processors
• Need coherence protocols to track shared data
  – (Bus) snooping
    • Each cache maintains local status
    • All caches monitor a broadcast medium
    • Write invalidate / write update
  • 165.
State Machine (1/3) – CPU requests, for each cache block
• Invalid --CPU read miss: place read miss on bus--> Shared (read only)
• Shared: CPU read hit stays in Shared
• Shared --CPU write: place write miss on bus--> Exclusive (read/write)
• Exclusive: CPU read hit and CPU write hit stay in Exclusive
• Exclusive --CPU read miss (replacement): write back block, place read miss on bus
• Exclusive --CPU write miss (replacement): write back cache block, place write miss on bus
• Invalid --CPU write: place write miss on bus--> Exclusive

State Machine (2/3) – bus requests, for each cache block
• Shared --write miss/invalidate for this block--> Invalid
• Exclusive --write miss for this block--> write back block (abort memory access), go to Invalid
• Exclusive --read miss for this block--> write back block (abort memory access), go to Shared

State Machine (3/3)
• The two machines combined: each cache block reacts both to its own CPU's requests and to bus requests from other caches

Directory based cache coherence (1/2)
• Large MP systems, lots of CPUs
• Distributed memory preferable
  – Increases memory bandwidth
• Snooping bus with broadcast?
  – A single bus becomes a bottleneck
  – Other ways of communicating are needed
    • With these, broadcasting is hard/expensive
  – Can avoid broadcast if we know exactly which caches have a copy ⇒ Directory

Directory based cache coherence (2/2)
• The directory knows which blocks are in which cache and their state
• The directory can be partitioned and distributed
• Typical states:
  – Shared
  – Uncached
  – Modified
• Protocol based on messages
  – Invalidate and update are sent only where needed
  – Avoids broadcast, reduces traffic (Fig 4.19)

SMP performance (shared memory)
• Focus on cache performance
• 3 types of cache misses in a uniprocessor (the 3 C's)
  – Capacity (cache too small for the working set)
  – Compulsory (cold-start)
  – Conflict (placement strategy)
• Multiprocessors also give coherence misses
  – True sharing
    • Misses because of sharing of data
  – False sharing
    • Misses because of invalidates that would not have happened with cache block size = one word
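A minimal sketch of the MSI-style snooping protocol in the state machines above, for a single cache block; replacement transitions are omitted and the bus helpers are assumed hooks.

enum state { INVALID, SHARED, EXCLUSIVE };
enum cpu_event { CPU_READ, CPU_WRITE };
enum bus_event { BUS_READ_MISS, BUS_WRITE_MISS_OR_INVALIDATE };

extern void place_read_miss_on_bus(void);
extern void place_write_miss_on_bus(void);
extern void write_back_block(void);

/* Transitions driven by this block's own CPU. */
enum state on_cpu(enum state s, enum cpu_event e)
{
    if (e == CPU_READ) {
        if (s == INVALID) { place_read_miss_on_bus(); return SHARED; }
        return s;                               /* read hit in Shared/Exclusive */
    }
    /* CPU_WRITE */
    if (s != EXCLUSIVE) place_write_miss_on_bus();  /* also invalidates others */
    return EXCLUSIVE;
}

/* Transitions driven by bus requests from other caches. */
enum state on_bus(enum state s, enum bus_event e)
{
    if (s == EXCLUSIVE) write_back_block();     /* we hold the only dirty copy */
    if (e == BUS_READ_MISS)
        return (s == INVALID) ? INVALID : SHARED;   /* downgrade */
    return INVALID;                                 /* write miss / invalidate */
}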
  • 166.
Example: L3 cache size (Fig 4.11)
• AlphaServer 4100
  – 4 x Alpha @ 300 MHz
  – L1: 8 KB I + 8 KB D
  – L2: 96 KB
  – L3: off-chip, 2 MB
[Figure 4.11: normalized execution time (instruction execution, L2/L3 cache access, memory access, PAL code, idle) vs. L3 cache size of 1, 2, 4 and 8 MB]

Example: L3 cache size (Fig 4.12)
[Figure 4.12: memory cycles per instruction broken down into instruction, capacity/conflict, cold, false sharing and true sharing misses, for L3 cache sizes of 1, 2, 4 and 8 MB]

Example: Increasing parallelism (Fig 4.13)
[Figure 4.13: memory cycles per instruction (same breakdown) for processor counts 1, 2, 4, 6 and 8]

Example: Increased block size (Fig 4.14)
[Figure 4.14: misses per 1,000 instructions (same breakdown) for block sizes of 32, 64, 128 and 256 bytes]
  • 167.
How to Write a Computer Architecture Paper
TDT4260 Computer Architecture, 18 February 2011, Magnus Jahre

2nd Branch Prediction Championship
• International competition similar to our prefetching exercise system
• Task: implement your best possible branch predictor and write a paper about it
• Submission deadline: 15 April 2011
• More info: http://www.jilp.org/jwac-2/

How does pfJudge work?
• Each submitted file is one Kongull job
  – Contains 12 M5 instances, since there are 12 CPUs per node
  – Each M5 instance runs a different SPEC 2000 benchmark
• The Kongull job is added to the job queue
  – Status "Running" can mean running or queued, be patient
  – Running a job can take a long time depending on load
  – Kongull is usually able to empty the queue during the night
• We can give you a regular user account on Kongull
  – Remember that Kongull is a shared resource!
  – Always calculate the expected CPU-hour demand of your experiment before submitting

Storage Estimation
• We impose a storage limit of 8 KB on your prefetchers
  – This limit is not checked by the exercise system
• This is realistic: hardware components are usually designed with an area budget in mind
• Estimating storage is simple
  – Table-based prefetcher: add up the bits used in each entry and multiply by the number of entries

Research Workflow
• Evaluate the solution on a compute cluster ... receive PhD (get a real job)

HOW TO USE A SIMULATOR
  • 168.
Why simulate?
• Model of a system
  – Model the interesting parts with high accuracy
  – Model the rest of the system with sufficient accuracy
• "All models are wrong but some are useful" (G. Box, 1979)
• The model does not necessarily have a one-to-one correspondence with the actual hardware
  – Try to model behavior
  – Simplify your code wherever possible

Know your model
• You need to figure out which system is being modeled!
• Pfsys is a help to get started, but to draw conclusions from your work you need to understand what you are modeling

Find Your Story
• A good computer architecture paper tells a story
  – All good stories have a bad guy: the problem
  – All good stories have a hero: the scheme
• Writing a good paper is all about finding and identifying your story
• Note that this story has to be told within the strict structure of a scientific article

HOW TO WRITE A PAPER

Paper Format
• You will be pressed for space
• Try to say things as precisely as possible
  – Your first write-up can be as much as 3x the page limit, and it is still easy (possible) to get it under the limit
• Think about your plots/figures
  – A good plot/figure gives a lot of information
  – Is this figure the best way of conveying this idea?
  – Is this plot the best way of visualizing this data?
  – Plots/figures need to be area efficient (but readable!)

Typical Paper Outline
• Abstract
• Introduction
• Background/Related Work
• The Scheme (substitute with a descriptive title)
• Methodology
• Results
• Discussion
• Conclusion (with optional further work)
  • 169.
Abstract
• An experienced reader should be able to understand exactly what you have done from only reading the abstract
  – This is different from a summary
• Should be short; the limit varies from 150 to 200 words maximum
• Should include a description of the problem, the solution and the main results
• Typically the last thing you write

Introduction
• Introduces the larger research area that the paper is a part of
• Introduces the problem at hand
• Explains the scheme
• Level of abstraction: "20 000 feet"

Related Work
• Reference the work that other researchers have done that is related to your scheme
• Should be complete (i.e. contain all relevant work)
  – Remember: you define the scope of your work
• Can be split into two sections: Background and Related Work
  – Background is an informative introduction to the field (often section 2)
  – Related work is a very dense section that includes all relevant references (often section n-1)

The Scheme
• Explain your scheme in detail
  – Choose an informative title
• Trick: add an informative figure that helps explain your scheme
• If your scheme is complex, an informative example may be in order

Methodology
• Explains your experimental setup
• Should answer the following questions:
  – Which simulator did you use?
  – How have you extended the simulator?
  – Which parameters did you use for your simulations? (aim: reproducibility)
  – Which benchmarks did you use?
  – Why did you choose these benchmarks?
• If you are unsure about a parameter, run a simulation to check its impact

Results
• Show that your scheme works
• Compare to other schemes that do the same thing
  – Hopefully you are better, but you need to compare anyway
• Trick: "Oracle Scheme"
  – Uses "perfect" information to create an upper bound on the performance of a class of schemes
  – Prefetching: the best case is that all L2 accesses are hits
• Important: should be realistic
• Sensitivity analysis
  – Check the impact of model assumptions on your scheme
  • 170.
Discussion
• Only include this if you need it
• Can be used if:
  – You have weaknesses in your model that you have not accounted for
  – You tested improvements to your scheme that did not give good enough results to be included in "The Scheme" section

Conclusion
• Repeat the main results of your work
• Remember that the abstract, introduction and conclusion are usually read before the rest of the paper
• Can include Further Work:
  – Things you thought about doing that you did not have time to do

Thank You — visit our website: http://research.idi.ntnu.no/multicore/
  • 171.
TDT 4260 — Chap 5: TLP & Memory Hierarchy

Review on ILP
• What is ILP?
• Let the compiler find the ILP
  ▫ Advantages? Disadvantages?
• Let the HW find the ILP
  ▫ Advantages? Disadvantages?

Contents
• Multi-threading (Chap 3.5)
• Memory hierarchy (Chap 5.1)
  ▫ 6 basic cache optimizations
• 11 advanced cache optimizations (Chap 5.2)

Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
  ▫ Must duplicate the independent state of each thread, e.g. a separate copy of register file, PC and page table
  ▫ Memory shared through virtual memory mechanisms
  ▫ HW for fast thread switch; much faster than a full process switch (≈ 100s to 1000s of clocks)
• When to switch?
  ▫ Alternate instruction per thread (fine grain)
  ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading
• Switches between threads on each instruction
  ▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads
• The CPU must be able to switch threads every clock
• Hides both short and long stalls
  ▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads
  ▫ A thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun's Niagara

Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
  ▫ No need for very fast thread-switching
  ▫ Doesn't slow down a thread, since it switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  ▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
  ▫ The new thread must fill the pipeline before instructions can complete
• => Better for reducing the penalty of high-cost stalls, where pipeline refill << stall time
  • 172.
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a system
• Can a high-ILP processor also exploit TLP?
  ▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)
  ▫ Intel: Hyper-Threading
[Figure: issue-slot usage per cycle for one thread vs. two threads on an 8-unit machine; M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  ▫ Large set of virtual registers (virtual = not all visible at ISA level), register renaming
  ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
  ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; threads 1-5 plus idle slots]

Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
  ▫ How to reduce the impact on single-thread performance?
  ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
  ▫ Instruction issue - more candidate instructions need to be considered
  ▫ Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
  • 173.
Why memory hierarchy? (fig 5.2)
[Fig 5.2: processor vs. memory performance, 1980-2010, log scale; the processor-memory performance gap keeps growing]

Why memory hierarchy?
• Principle of Locality
  ▫ Spatial Locality: addresses near each other are likely referenced close together in time
  ▫ Temporal Locality: the same address is likely to be reused in the near future
• Idea: store recently used elements in fast memories close to the processor
  ▫ Managed by software or hardware?

Memory hierarchy
• We want large, fast and cheap at the same time
[Figure: processor (control + datapath) backed by successively larger memory levels; speed from fastest to slowest, capacity from smallest to largest, cost from most expensive to cheapest]

Cache block placement
• Block 12 placed in a cache with 8 cache lines:
  ▫ Fully associative: block 12 can go anywhere
  ▫ Direct mapped: block 12 can go only into block 4 (12 mod 8)
  ▫ Set associative: block 12 can go anywhere in set 0 (12 mod 4)

Cache performance
• Average access time = Hit time + Miss rate * Miss penalty (illustrated in the sketch after this slide)
• Miss rate alone is not an accurate measure
• Cache performance is important for CPU performance
• More important with higher clock rate
• Cache design can also affect instructions that don't access memory!
  ▫ Example: a set associative L1 cache on the critical path requires extra logic which will increase the clock cycle time
• Trade-off: additional hits vs. cycle time reduction

6 Basic Cache Optimizations
• Reducing hit time
  1. Giving reads priority over writes: writes in the write buffer can be handled after a newer read if not causing dependency problems
  2. Avoiding address translation during cache indexing: e.g. use the virtual memory page offset to index the cache
• Reducing miss penalty
  3. Multilevel caches: both small and fast (L1) and large (& slower) (L2)
• Reducing miss rate
  4. Larger block size (compulsory misses)
  5. Larger cache size (capacity misses)
  6. Higher associativity (conflict misses)
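A small illustration of the average-access-time formula and the (block address mod lines/sets) placement rule above, using made-up cache parameters rather than lecture data:

    #include <stdio.h>

    int main(void)
    {
        /* Average access time = Hit time + Miss rate * Miss penalty */
        double hit_time     = 1.0;    /* cycles (example)  */
        double miss_rate    = 0.05;   /* 5% of accesses    */
        double miss_penalty = 100.0;  /* cycles            */
        printf("AMAT = %.2f cycles\n", hit_time + miss_rate * miss_penalty);

        /* Block placement for block 12 in an 8-line cache */
        int block = 12, lines = 8, sets = 4;
        printf("Direct mapped: line %d (12 mod 8)\n", block % lines);
        printf("4 sets (set associative): set %d (12 mod 4)\n", block % sets);
        return 0;
    }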
  • 174.
1: Giving Reads Priority over Writes
• Caches typically use a write buffer
  ▫ CPU writes to cache and write buffer
  ▫ Cache controller transfers from buffer to RAM
  ▫ Write buffer usually a FIFO with N elements
  ▫ Works well as long as the buffer does not fill faster than it can be emptied
• Optimization
  ▫ Handle read misses before write buffer writes
  ▫ Must check for conflicts with the write buffer first

Virtual memory
• Processes use a large virtual memory
• Virtual addresses are dynamically mapped to physical addresses using HW & SW
• Page, page frame, page fault, translation lookaside buffer (TLB) etc.
[Figure: virtual address spaces of two processes mapped via page translation onto physical DRAM]

2: Avoiding Address Translation during Cache Indexing
• Virtual cache: use virtual addresses in caches
  ▫ Saves time on translation VA -> PA
  ▫ Disadvantages
    - Must flush the cache on a process switch (can be avoided by including the PID in the tag)
    - Alias problem: the OS and a process can have two VAs pointing to the same PA
• Compromise: "virtually indexed, physically tagged"
  ▫ Use the page offset to index the cache; it is the same for VA and PA
  ▫ At the same time as data is read from the cache, VA -> PA translation is done for the tag
  ▫ Tag comparison using PA
  ▫ But: page size restricts cache size

3: Multilevel Caches (1/2)
• Make the cache faster to keep up with the CPU, or larger to reduce misses?
• Why not both?
  ▫ Multilevel caches: small and fast L1, large (and cheaper) L2
• L1 cache speed affects the CPU clock rate
• L2 cache speed affects only the L1 miss penalty
  ▫ Can use more complex mapping for L2
  ▫ L2 can be large

3: Multilevel Caches (2/2)
• Average access time = L1 Hit time + L1 Miss rate * (L2 Hit time + L2 Miss rate * L2 Miss penalty)
• Local miss rate
  ▫ #cache misses / #cache accesses
• Global miss rate
  ▫ #cache misses / #CPU memory accesses
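A minimal sketch of the two-level average access time and the local vs. global miss-rate distinction, with illustrative numbers (not lecture data):

    #include <stdio.h>

    int main(void)
    {
        double l1_hit = 1.0,  l1_miss_rate  = 0.05;   /* per L1 access        */
        double l2_hit = 10.0, l2_local_miss = 0.40;   /* per L2 access        */
        double l2_penalty = 200.0;                    /* DRAM access, cycles  */

        double amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss * l2_penalty);
        printf("AMAT = %.2f cycles\n", amat);

        /* Local miss rate: misses / accesses seen by that cache.
           Global miss rate: misses / all CPU memory accesses.     */
        double l2_global_miss = l1_miss_rate * l2_local_miss;
        printf("L2 local miss rate  = %.2f\n", l2_local_miss);
        printf("L2 global miss rate = %.3f\n", l2_global_miss);
        return 0;
    }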
  • 175.
4: Larger Block size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K-256K; compulsory misses fall while conflict/capacity misses eventually rise]
• Trade-off; 32 and 64 byte blocks are common

5: Larger Cache size
• Simple method
• Square-root rule (quadrupling the size of the cache will halve the miss rate)
• Disadvantages
  ▫ Longer hit time
  ▫ Higher cost
• Most used for L2/L3 caches

6: Higher Associativity
• Lower miss rate
• Disadvantages
  ▫ Can increase hit time
  ▫ Higher cost
• 8-way has similar performance to fully associative

11 Advanced Cache Optimizations
• Reducing hit time: 1. Small and simple caches, 2. Way prediction, 3. Trace caches
• Increasing cache bandwidth: 4. Pipelined caches, 5. Non-blocking caches, 6. Multibanked caches
• Reducing miss penalty: 7. Critical word first, 8. Merging write buffers
• Reducing miss rate: 9. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism: 10. Hardware prefetching, 11. Compiler prefetching

1: Small and simple caches
• Comparing the address to tag memory takes time
• ⇒ A small cache can help hit time
  ▫ E.g., L1 caches are the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  ▫ Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
• Simple ⇒ direct mapping
  ▫ Can overlap tag check with data transmission since there is no choice
• Access time estimate for 90 nm using the CACTI model 4.0
  ▫ Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Figure: CACTI access time (ns) vs. cache size 16 KB - 1 MB for 1-, 2-, 4- and 8-way caches]

2: Way prediction
• Extra bits are kept in the cache to predict which way (block) in a set the next access will hit
  ▫ Can retrieve the tag early for comparison
  ▫ Achieves a fast hit even with just one comparator
  ▫ Several cycles are needed to check the other blocks on misses
  • 176.
3: Trace caches
• Increasingly hard to feed modern superscalar processors with enough instructions
• Trace cache
  ▫ Stores dynamic instruction sequences rather than "bytes of data"
  ▫ An instruction sequence may include branches
    - Branch prediction is integrated with the cache
  ▫ Complex and relatively little used
  ▫ Used in Pentium 4: the trace cache stores up to 12K micro-ops decoded from x86 instructions (also saves decode time)

4: Pipelined caches
• Pipeline technology applied to cache lookups
  ▫ Several lookups in processing at once
  ▫ Results in a faster cycle time
  ▫ Examples: Pentium (1 cycle), Pentium-III (2 cycles), P4 (4 cycles)
  ▫ L1: increases the number of pipeline stages needed to execute an instruction
  ▫ L2/L3: increases throughput
    - Nearly for free, since the hit latency is on the order of 10-20 processor cycles and caches are easy to pipeline

5: Non-blocking caches (1/2)
• A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
• "Hit under miss" reduces the effective miss penalty by working during a miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  ▫ Requires that the lower-level memory can service multiple concurrent misses
  ▫ Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  ▫ Pentium Pro allows 4 outstanding memory misses

5: Non-Blocking Cache Implementation
• The cache can handle as many concurrent misses as there are MSHRs
• The cache must block when all valid bits (V) are set
• Very common
• MHA = Miss Handling Architecture, MSHR = Miss information/Status Holding Register, DMHA = Dynamic Miss Handling Architecture
[Figure: non-blocking cache performance]

6: Multibanked caches
• Divide the cache into independent banks that can support simultaneous accesses
  ▫ E.g., the T1 ("Niagara") L2 has 4 banks
• Banking works best when accesses naturally spread themselves across banks ⇒ the mapping of addresses to banks affects the behavior of the memory system
• A simple mapping that works well is "sequential interleaving"
  ▫ Spread block addresses sequentially across banks
  ▫ E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; ...
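A tiny sketch of the sequential-interleaving mapping described above (the block size and bank count are example values):

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64   /* bytes per cache block (example)  */
    #define NUM_BANKS   4   /* as in the T1 L2 example          */

    /* Sequential interleaving: consecutive block addresses go to
       consecutive banks (block address modulo number of banks).  */
    static int bank_of(uint64_t addr)
    {
        uint64_t block_addr = addr / BLOCK_SIZE;
        return (int)(block_addr % NUM_BANKS);
    }

    int main(void)
    {
        for (uint64_t addr = 0; addr < 8 * BLOCK_SIZE; addr += BLOCK_SIZE)
            printf("block address %llu -> bank %d\n",
                   (unsigned long long)(addr / BLOCK_SIZE), bank_of(addr));
        return 0;
    }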
  • 177.
7: Critical word first
• Don't wait for the full block before restarting the CPU
• Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
• Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
  ▫ Long blocks are more popular today ⇒ Critical Word First is widely used

8: Merging write buffers
• The write buffer allows the processor to continue while waiting to write to memory
  ▫ If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry
  ▫ If so, the new data are combined with that entry
• Multiword writes are more efficient to memory
• The Sun T1 (Niagara) processor, among many others, uses write merging

9: Compiler optimizations
• Instruction order can often be changed without affecting correctness
  ▫ May reduce conflict misses
  ▫ Profiling may help the compiler
• The compiler generates instructions grouped in basic blocks
  ▫ If the start of a basic block is aligned to a cache block, misses will be reduced
  ▫ Important for larger cache block sizes
• Data is even easier to move
  ▫ Lots of different compiler optimizations

10: Hardware prefetching
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Instruction prefetching
  ▫ Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  ▫ The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
• Data prefetching
  ▫ Pentium 4 can prefetch data into the L2 cache from up to 8 streams
  ▫ Prefetching is invoked on 2 successive L2 cache misses to a page
[Figure: performance improvement from prefetching, 1.16x to 1.97x, across SPECint2000 and SPECfp2000 benchmarks]

11: Compiler prefetching (a loop sketch follows below)
• Data prefetch
  ▫ Load data into a register (HP PA-RISC loads)
  ▫ Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9)
  ▫ Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing prefetch instructions takes time
  ▫ Is the cost of prefetch issues < the savings in reduced misses?

Cache Coherency
• Consider the following case. I have two processors that are sharing address X.
• Both cores read address X
• Address X is brought from memory into the caches of both processors
• Now, one of the processors writes to address X and changes the value.
• What happens? How does the other processor get notified that address X has changed?
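As an illustration of compiler/software prefetching (not code from the lecture), a loop can issue prefetches a few iterations ahead. This sketch uses GCC/Clang's __builtin_prefetch; the prefetch distance of 16 is an arbitrary example:

    /* Software prefetching sketch: sum an array while prefetching ahead.
       __builtin_prefetch is a compiler hint and cannot cause a fault.    */
    #include <stddef.h>

    double sum_with_prefetch(const double *a, size_t n)
    {
        const size_t dist = 16;            /* prefetch distance (example) */
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 0 /* read */, 1 /* low locality */);
            sum += a[i];
        }
        return sum;
    }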
  • 178.
    Two types ofcache coherence schemes • Snooping ▫ Broadcast writes, so all copies in all caches will be properly invalidated or updated. • Directory ▫ In a structure, keep track of which cores are caching each address. ▫ When a write occurs, query the directory and properly handle any other cached copies.
  • 179.
    Contents • Introduction App E.1 • Two devices App E.2 TDT 4260 • Multiple devices App E.3 • Topology App E.4 Appendix E Interconnection Networks • Routing, arbitration, switching App E.5 Conceptual overview Motivation • Basic network technology assumed known • Motivation ▫ Increased importance System-to-system connections Intra system connections ▫ Increased demands Bandwidth, latency, reliability, ... ▫ Vital part of system design E.2: Connecting two devices Types of networks Number of devices and distance • OCN – On-chip network ▫ Functional units, register files, caches, … ▫ Also known as: Network on Chip (NoC) Destination • SAN – System/storage area network implicit ▫ Multiprocessor and multicomputer, storage • LAN – Local area network • WAN – Wide area network • Trend: Switches replace buses
  • 180.
Software to Send and Receive
• SW Send steps
  1: Application copies data to OS buffer
  2: OS calculates checksum, starts timer
  3: OS sends data to network interface HW and says start
• SW Receive steps
  3: OS copies data from network interface HW to OS buffer
  2: OS calculates checksum; if it matches, send ACK; if not, delete message (sender resends when timer expires)
  1: If OK, OS copies data to user address space and signals application to continue
• Sequence of steps for SW: protocol

Network media
• Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect (telephone)
• Coaxial cable: plastic covering, braided outer conductor, insulator, copper core; used by cable companies: high BW, good noise immunity
• Fiber optics: 3 parts are cable, light source (LED or laser diode) and light detector (photodiode); total internal reflection; multimode lets light disperse, single mode uses a single wavelength (laser)

Basic Network Structure and Functions: Media and Form Factor
[Figure: media type vs. distance - metal layers for OCNs (~0.01 m), printed circuit boards and InfiniBand/Myrinet connectors for SANs (~1-10 m), Cat5E twisted pair and Ethernet connectors for LANs (~100 m), coaxial cables and fiber optics for WANs (>1,000 m)]

Packet latency
• Total Latency = Sender Overhead + Time of Flight + Message Size / Bandwidth + Receiver Overhead
  ▫ Sender overhead (processor busy)
  ▫ Transmission time (size/bandwidth)
  ▫ Time of flight
  ▫ Receiver overhead (processor busy)
  ▫ Transport latency spans time of flight and transmission time
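A minimal sketch of the total-latency formula above, with illustrative numbers (not from the lecture):

    #include <stdio.h>

    int main(void)
    {
        /* Total latency = sender overhead + time of flight
                         + message size / bandwidth + receiver overhead */
        double sender_ovh = 1.0e-6;     /* 1 us                         */
        double recv_ovh   = 1.5e-6;     /* 1.5 us                       */
        double flight     = 0.5e-6;     /* 0.5 us (short link)          */
        double msg_bytes  = 1500.0;     /* one Ethernet-size packet     */
        double bandwidth  = 125.0e6;    /* 1 Gbit/s = 125 MB/s          */

        double total = sender_ovh + flight + msg_bytes / bandwidth + recv_ovh;
        printf("Total latency = %.2f us\n", total * 1e6);
        return 0;
    }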
  • 181.
Connecting multiple devices (3/3)
• Switched media
  ▫ "Point-to-point" connections
  ▫ Routing for each packet
  ▫ Arbitration for each connection
• Comparison
  ▫ Much higher aggregate BW in a switched network than in a shared media network
  ▫ Shared media is cheaper
  ▫ Distributed arbitration is simpler for switched

E.4: Interconnection Topologies
• One switch or bus can connect a limited number of devices
  ▫ Complexity, cost, technology, ...
• Interconnected switches are needed for larger networks
• Topology: connection structure
  ▫ What paths are possible for packets?
  ▫ All pairs of devices must have path(s) available
• A network is partitioned by a set of links if their removal disconnects the graph
  ▫ Bisection bandwidth
  ▫ Important for performance

Crossbar
• Common topology for connecting CPUs and I/O units
• Also used for interconnecting CPUs
• Fast and expensive (O(N^2))
• Non-blocking

Omega network
• Example of a multistage network built from 2x2 switches (straight, crossover, upper broadcast, lower broadcast)
• Usually log2 n stages for n inputs - O(N log N)
• Can block
[Figure: 8x8 omega network, sources 000-111 to destinations 000-111]

Linear Arrays and Rings
• Distributed switched networks
• Node = switch + 1-n end nodes
• Linear array = 1D grid
• 2D grid
• Torus has wrap-around connections
• CRAY with 3D torus
• Fixed number of connections per node (i.e. fixed degree)
[Figure: node = processor, memory, cache, controller and network interface attached to a switch, plus external I/O]

Trees
• Diameter and average distance are logarithmic
  ▫ k-ary tree, height d = logk N
  ▫ address = d-vector of radix k coordinates describing the path down from the root
• Bisection bandwidth = 1 near the root
  • 182.
E.5: Routing, Arbitration, Switching
• Routing
  ▫ Which of the possible paths are allowable for packets?
  ▫ The set of operations needed to compute a valid path
  ▫ Executed at source, intermediate, or even at destination nodes
• Arbitration
  ▫ When are paths available for packets?
  ▫ Resolves packets requesting the same resources at the same time
  ▫ For every arbitration, there is a winner and possibly many losers
    - Losers are buffered (lossless) or dropped on overflow (lossy)
• Switching
  ▫ How are paths allocated to packets?
  ▫ The winning packet (from arbitration) proceeds towards its destination
  ▫ Paths can be established one fragment at a time or in their entirety

Routing
• Shared media
  ▫ Broadcast to everyone
• Switched media needs real routing. Options:
  ▫ Source-based routing: the message specifies the path to the destination (changes of direction)
  ▫ Virtual circuit: a circuit is established from source to destination, the message picks the circuit to follow
  ▫ Destination-based routing: the message specifies the destination, the switch must pick the path
    - Deterministic: always follow the same path
    - Adaptive: pick different paths to avoid congestion, failures
    - Randomized routing: pick between several good paths to balance network load

Routing mechanism (see the sketch after this slide)
• Need to select an output port for each input packet
  ▫ And fast ...
• Simple arithmetic in regular topologies
  ▫ Ex: ∆x, ∆y routing in a grid (first ∆x then ∆y)
    - west (-x) if ∆x < 0; east (+x) if ∆x > 0; south (-y) if ∆x = 0, ∆y < 0; north (+y) if ∆x = 0, ∆y > 0
• Unidirectional links are sufficient for a torus (+x, +y)
• Dimension-order routing
  ▫ Reduce the relative address of each dimension in order (avoids deadlock)

Deadlock
• How can it arise?
  ▫ Necessary conditions: shared resources, incrementally allocated, non-preemptible
• How do you handle it?
  ▫ Constrain how channel resources are allocated (deadlock avoidance)
  ▫ Add a mechanism that detects likely deadlocks and fixes them (deadlock recovery)
[Figure: 4x4 grid of routers TRC(0,0)-TRC(3,3) with a cyclic dependency marked]

Arbitration (1/2)
• Several simultaneous requests to shared network resources
• Ideal: maximize usage of network resources
• But:
  ▫ Fairness needed
  ▫ Problem: starvation
• Figure: two-phase arbitration
  ▫ Request, Grant
  ▫ Poor usage

Arbitration (2/2)
• Three phases
• Multiple resource requests
• Better usage, but increased latency
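A small sketch of dimension-order (XY) routing as described above: route fully in x first, then in y. The port names are made up for illustration:

    #include <stdio.h>

    /* Hypothetical output ports of a mesh router. */
    typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

    /* Dimension-order (XY) routing: correct delta-x first, then delta-y. */
    static port_t route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
    {
        int dx = dst_x - cur_x;
        int dy = dst_y - cur_y;
        if (dx > 0) return EAST;
        if (dx < 0) return WEST;
        if (dy > 0) return NORTH;
        if (dy < 0) return SOUTH;
        return LOCAL;                   /* arrived at destination */
    }

    int main(void)
    {
        /* Route one hop at a time from (0,0) to (2,3) in a grid. */
        static const char *names[] = { "EAST", "WEST", "NORTH", "SOUTH", "LOCAL" };
        int x = 0, y = 0;
        while (!(x == 2 && y == 3)) {
            port_t p = route_xy(x, y, 2, 3);
            printf("(%d,%d) -> %s\n", x, y, names[p]);
            if (p == EAST) x++; else if (p == WEST) x--;
            else if (p == NORTH) y++; else if (p == SOUTH) y--;
        }
        return 0;
    }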
  • 183.
Switching
• Allocating paths for packets
• Two techniques:
  ▫ Circuit switching (connection oriented)
    - Communication channel allocated before the first packet
    - Packet headers don't need routing info
    - Wastes bandwidth
  ▫ Packet switching (connectionless)
    - Each packet handled independently
    - Can't guarantee response time
    - Two types - next slide

Store & Forward vs Cut-Through Routing
[Figure: store & forward buffers the whole packet at every hop; cut-through forwards the header as soon as it is decoded, so latency grows much more slowly with hop count]
• Cut-through (on blocking)
  ▫ Virtual cut-through (spools the rest of the packet into a buffer)
  ▫ Wormhole (buffers only a few flits, leaves the tail along the route)
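To make the difference concrete, a rough latency model (my own simplification, not lecture numbers): store-and-forward pays the full packet serialization at every hop, while cut-through pays it once plus a small per-hop header delay:

    #include <stdio.h>

    int main(void)
    {
        double packet_bytes = 1024, header_bytes = 8;
        double bw = 1.0e9;          /* 1 GB/s link bandwidth (example) */
        int    hops = 4;

        /* Store & forward: whole packet serialized at each hop. */
        double sf = hops * (packet_bytes / bw);

        /* Cut-through: header delay per hop, packet serialized once. */
        double ct = hops * (header_bytes / bw) + packet_bytes / bw;

        printf("store & forward: %.2f us, cut-through: %.2f us\n",
               sf * 1e6, ct * 1e6);
        return 0;
    }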
  • 184.
Piranha: Designing a Scalable CMP-based System for Commercial Workloads
Luiz André Barroso, Western Research Laboratory
April 27, 2001, Asilomar Microcomputer Workshop
  • 185.
What is Piranha?
• A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever increasing processor complexity and system design/verification cycles
  • 186.
Importance of Commercial Applications
[Pie chart: Worldwide Server Customer Spending (IDC 1999) - infrastructure 29%, business processing 22%, decision support 14%, software development 14%, collaborative 12%, other 6%, scientific & engineering 3%]
• Total server market size in 1999: ~$55-60B
  – technical applications: less than $6B
  – commercial applications: ~$40B
  • 187.
Price Structure of Servers
• IBM eServer 680 (220K tpmC; $43/tpmC)
  – 24 CPUs
  – 96 GB DRAM, 18 TB Disk
  – $9M price tag
• Compaq ProLiant ML370 (32K tpmC; $12/tpmC)
  – 4 CPUs
  – 8 GB DRAM, 2 TB Disk
  – $240K price tag
[Bar chart: normalized breakdown of HW cost into base, CPU, DRAM and I/O for the two systems]

Price per component:
  System                 | $/CPU   | $/MB DRAM | $/GB Disk
  IBM eServer 680        | $65,417 | $9        | $359
  Compaq ProLiant ML570  | $6,048  | $4        | $64

– Storage prices dominate (50%-70% in customer installations)
– Software maintenance/management costs are even higher (up to $100M)
– Price of expensive CPUs/memory system is amortized
  • 188.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary
  • 189.
    Studies of CommercialWorkloads l Collaboration with Kourosh Gharachorloo (Compaq WRL) – ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion) – ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh) – ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve) – HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese) – ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)
  • 190.
    Studies of CommercialWorkloads: summary l Memory system is the main bottleneck – astronomically high CPI – dominated by memory stall times – instruction stalls as important as data stalls – fast/large L2 caches are critical l Very poor Instruction Level Parallelism (ILP) – frequent hard-to-predict branches – large L1 miss ratios – Ld-Ld dependencies – disappointing gains from wide-issue out-of-order techniques!
  • 191.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary
  • 192.
Increasing Complexity of Processor Designs
• Pushing the limits of instruction-level parallelism
  – multiple instruction issue
  – speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size

  Processor (SGI MIPS) | Year Shipped | Transistor Count (millions) | Design Team Size | Design Time (months) | Verification Team Size (% of total)
  R2000                | 1985         | 0.10                        | 20               | 15                   | 15%
  R4000                | 1991         | 1.40                        | 55               | 24                   | 20%
  R10000               | 1996         | 6.80                        | >100             | 36                   | >35%
  (courtesy: John Hennessy, IEEE Computer, 32(8))

• Yielding diminishing returns in performance
  • 193.
Exploiting Higher Levels of Integration
• Alpha 21364: a single chip integrating a 1 GHz 21264 CPU core, 64 KB I$ and 64 KB D$, 1.5 MB L2$, memory controller, coherence engine and network interface
• Glueless multiprocessing: 21364 chips connect directly to each other, to memory and to I/O
• Lower latency, higher bandwidth
• Incrementally scalable
• Reuse of an existing CPU core addresses complexity issues
  • 194.
Exploiting Parallelism in Commercial Apps
• Simultaneous Multithreading (SMT): several threads share one wide core cycle by cycle (example: Alpha 21464)
• Chip Multiprocessing (CMP): multiple CPU cores with private L1 I$/D$ share the L2 cache, coherence network and memory controllers (example: IBM Power4)
• SMT is superior in single-thread performance
• CMP addresses complexity by using simpler cores
  • 195.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha – Architecture – Performance l Design Methodology l Summary
  • 196.
    Piranha Project l Explorechip multiprocessing for scalable servers l Focus on parallel commercial workloads l Small team, modest investment, short design time l Address complexity by using: – simple processor cores – standard ASIC methodology Give up on ILP, embrace TLP
  • 197.
    Piranha Team Members Research NonStop Hardware Development – Luiz André Barroso (WRL) ASIC Design Center – Kourosh Gharachorloo (WRL) – Tom Heynemann – David Lowell (WRL) – Dan Joyce – Harland Maxwell – Joel McCormack (WRL) – Harold Miller – Mosur Ravishankar (WRL) – Sanjay Singh – Rob Stets (WRL) – Scott Smith – Yuan Yu (SRC) – Jeff Sprouse – … several contractors Former Contributors Robert McNamara Brian Robinson Basem Nayfeh Barton Sano Andreas Nowatzyk Daniel Scales Joan Pendleton Ben Verghese Shaz Qadeer
  • 198.
Piranha Processing Node (single chip)
• Alpha core: 1-issue, in-order, 500 MHz
• L1 caches: I & D, 64 KB, 2-way
• Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
• L2 cache: shared, 1 MB, 8-way
• Memory Controller (MC): RDRAM, 12.8 GB/sec
• Protocol Engines (HE & RE): µprogrammed, 1K µinstr., even/odd interleaving
• System interconnect: 4-port crossbar router, topology independent, 32 GB/sec total bandwidth
[Figure: 8 CPU cores with L1 I$/D$ and L2 banks around the intra-chip switch, plus memory controllers, protocol engines (HE/RE) and router - all on a single chip]
  • 199.
    Piranha I/O Node Router CPU 2 Links @ HE 8GB/s I$ D$D$ PCI-X FB ICS FB RE L2$ MEM-CTL l I/O node is a full-fledged member of system interconnect – CPU indistinguishable from Processing Node CPUs – participates in global coherence protocol
  • 200.
    Example Configuration P P P P- I/O P- I/O P P P l Arbitrary topologies l Match ratio of Processing to I/O nodes to application requirements
  • 201.
    L2 Cache andIntra-Node Coherence l No inclusion between L1s and L2 cache – total L1 capacity equals L2 capacity – L2 misses go directly to L1 – L2 filled by L1 replacements l L2 keeps track of all lines in the chip – sends Invalidates, Forwards – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization – cooperates with Protocol Engines to enforce system-wide coherence
  • 202.
Inter-Node Coherence Protocol
• 'Stealing' ECC bits for the memory directory
  – computing ECC over wider words (8x(64+8) -> 4x(128+9+7) -> 2x(256+10+22) -> 1x(512+11+53)) frees up to 53 bits per line for directory state
• Directory entry: 2b state + 40b sharing info
• Dual representation: limited pointer + coarse vector
• "Cruise Missile" Invalidations (CMI)
  – limit fan-out/fan-in serialization compared with a coarse vector
• Several new protocol optimizations
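A speculative sketch (my reading of the slide, not Piranha's actual encoding) of a directory entry with the dual limited-pointer / coarse-vector representation:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical entry: 2 state bits + 40 sharing bits, stored either as
       up to two exact node pointers or as a coarse bit vector.             */
    struct dir_entry {
        uint8_t state;
        uint8_t num_ptrs;          /* sharers stored as exact pointers (0-2) */
        bool    coarse_mode;       /* true once the pointers overflow        */
        union {
            uint32_t ptr[2];       /* exact sharer node IDs                  */
            uint64_t vector;       /* coarse vector, 1 bit per node group    */
        } sharers;
    };

    /* Record a new sharer; fall back to the coarse vector on overflow.
       nodes_per_bit (group size) is an assumption for the sketch.           */
    void add_sharer(struct dir_entry *e, uint32_t node, uint32_t nodes_per_bit)
    {
        if (!e->coarse_mode && e->num_ptrs < 2) {
            e->sharers.ptr[e->num_ptrs++] = node;        /* still exact      */
            return;
        }
        if (!e->coarse_mode) {                           /* switch to coarse */
            uint64_t v = 0;
            for (int i = 0; i < e->num_ptrs; i++)
                v |= 1ull << (e->sharers.ptr[i] / nodes_per_bit);
            e->coarse_mode = true;
            e->sharers.vector = v;
        }
        e->sharers.vector |= 1ull << (node / nodes_per_bit);
    }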
  • 203.
  • 204.
Single-Chip Piranha Performance
[Bar chart: normalized execution time (CPU, L2 hit, L2 miss) for P1 (500 MHz, 1-issue), INO (1 GHz, 1-issue), OOO (1 GHz, 4-issue) and P8 (500 MHz, 1-issue, 8 cores); OLTP: 350, 191, 100, 34 - DSS: 233, 145, 100, 44]
• Piranha's performance margin: 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses => better utilizes the memory system
  • 205.
Single-Chip Performance (Cont.)
[Left chart: speedup vs. number of cores (500 MHz, 1-issue), near-linear up to 8 cores. Right chart: normalized breakdown of L1 misses (L2 hit, L2 forward, L2 miss) for P1, P2, P4 and P8]
• Near-linear scalability
  – low memory latencies
  – effectiveness of the highly associative L2 and non-inclusive caching
  • 206.
Potential of a Full-Custom Piranha
[Bar chart: normalized execution time (CPU, L2 hit, L2 miss) for OOO (1 GHz, 4-issue), P8 (500 MHz, 1-issue) and P8F (full custom, 1.25 GHz, 1-issue); OLTP: 100, 34, 19 - DSS: 100, 43, 20]
• 5x margin over OOO for OLTP and DSS
• A full-custom design benefits substantially from the boost in core speed
  • 207.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary
  • 208.
    Managing Complexity inthe Architecture l Use of many simpler logic modules – shorter design – easier verification – only short wires* – faster synthesis – simpler chip-level layout l Simplify intra-chip communication – all traffic goes through ICS (no backdoors) l Use of microprogrammed protocol engines l Adoption of large VM pages l Implement sub-set of Alpha ISA – no VAX floating point, no multimedia instructions, etc.
  • 209.
    Methodology Challenges l Isolated sub-module testing – need to create robust bus functional models (BFM) – sub-modules’ behavior highly inter-dependent – not feasible with a small team l System-level (integrated) testing – much easier to create tests – only one BFM at the processor interface – simpler to assert correct operation – Verilog simulation is too slow for comprehensive testing
  • 210.
    Our Approach: l Design in stylized C++ (synthesizable RTL level) – use mostly system-level, semi-random testing – simulations in C++ (faster & cheaper than Verilog) § simulation speed ~1000 clocks/second – employ directed tests to fill test coverage gaps l Automatic C++ to Verilog translation – single design database – reduce translation errors – faster turnaround of design changes – risk: untested methodology l Using industry-standard synthesis tools l IBM ASIC process (Cu11)
  • 211.
    Piranha Methodology: Overview C++ RTL C++ RTL Models: Cycle Models accurate and “synthesizeable” PS1: Fast (C++) Logic CLevel Simulator Verilog Verilog Models: Machine cxx cxx translated from C++ models Models Physical Design: leverages industry standard Verilog-based tools PS1 PS1V Physical PS1V: Can “co-simulate” C++ Design and Verilog module versions and check correspondence cxx: C++ compiler CLevel: C++-to-Verilog Translator
  • 212.
Summary
• CMP architectures are inevitable in the near future
• Piranha investigates an extreme point in CMP design
  – many simple cores
• Piranha has a large architectural advantage over complex single-core designs (> 3x) for database applications
• The Piranha methodology enables faster design turnaround
• Key to Piranha is application focus:
  – One-size-fits-all solutions may soon be infeasible
  • 213.
    Reference l Papers on commercial workload performance & Piranha research.compaq.com/wrl/projects/Database
  • 215.
TDT 4260 – lecture 11/3 - 2011
• Miniproject status, update, presentation
• Synchronization, Textbook Chap 4.5
  – And a short note on BSP (with excellent timing ...)
• Short presentation of NUTS, NTNU Test Satellite System http://nuts.iet.ntnu.no/
• UltraSPARC T1 (Niagara), Chap 4.8
• And more on multicores

Miniproject – after the first deadline
Categories of submitted projects:
  1. Implementing an existing prefetcher: sequential prefetcher, sequential prefetcher (tagged or adaptive), RPT
  2. Comparison of 2 or more existing prefetchers: RPT and DCPT
  3. Improving on an existing prefetcher: improving the sequential prefetcher, improving DCPT

Miniproject – after the first deadline
• Feedback
  – RPT and DCPT are popular choices; the report should properly motivate each group's choice of prefetcher (the motivation should not be: "The code was easily available")
  – Several groups work on similar methods => "find your story"
  – Too much focus on getting the highest result in the PfJudge ranking; as stated in section 2.3 of the guidelines, the miniproject will be evaluated based on the following criteria:
    • good use of language
    • clarity of the problem statement
    • overall document structure
    • depth of understanding for the field of prefetching
    • quality of presentation

Miniproject presentations
• Friday 15/4 at 1415-1700 (max)
• OK for all?
  – No ... we are working on finding a time schedule that is OK for all

IDI Open, a challenge for you?

Synchronization
• Important concept
  – Synchronize access to shared resources
  – Order events from cooperating processes correctly
• Smaller MP systems
  – Implemented by uninterrupted instruction(s) atomically accessing a value
  – Requires special hardware support
  – Simplifies construction of OS / parallel apps
• Larger MP systems => Appendix H (not in course)
  • 216.
Atomic exchange (swap)
• Swaps the value in a register for the value in memory
  – Mem = 0 means not locked, Mem = 1 means locked
• How does this work?
  – Register <= 1                 ; processor wants to lock
  – Exchange(Register, Mem)
  – If Register = 0 => success
    • Mem was 0 => was unlocked; Mem is now 1 => now locked
  – If Register = 1 => fail
    • Mem was 1 => was locked; Mem is now 1 => still locked
• The exchange must be atomic!

Implementing atomic exchange (1/2)
• One alternative: Load Linked (LL) and Store Conditional (SC)
  – Used in sequence
  – If the memory location accessed by LL changes, SC fails
  – If there is a context switch between LL and SC, SC fails
  – Implemented using a special link register
    • Contains the address used in LL
    • Reset if the matching cache block is invalidated or if we get an interrupt
    • SC checks if the link register contains the same address. If so, we have atomic execution of LL & SC

Implementing atomic exchange (2/2)
• Example code EXCH (R4, 0(R1)):
    try:   MOV  R3, R4       ; mov exchange value
           LL   R2, 0(R1)    ; load linked
           SC   R3, 0(R1)    ; store conditional
           BEQZ R3, try      ; branch if SC failed
           MOV  R4, R2       ; put load value in R4
• This can now be used to implement e.g. spin locks (a C11 version is sketched after this slide)
           DADDUI R2, R0, #1 ; R0 always = 0
    lockit: EXCH R2, 0(R1)   ; atomic exchange
           BNEZ R2, lockit   ; already locked?

Barrier sync. in BSP
• The BSP model
  – Leslie G. Valiant, A bridging model for parallel computation, [CACM 1990]
  – Computations organised in supersteps
  – Algorithms adapt to the compute platform, represented through 4 parameters
  – Helps the combination of portability & performance
  – http://www.seas.harvard.edu/news-events/press-releases/valiant_turing

Multicore
• Important and early example: UltraSPARC T1
  – In all market segments from mobile phones to supercomputers

Why multicores?
• Motivation (see lecture 1)
  – End of Moore's law for single-core
  – The power wall
  – The memory wall
  – The bandwidth problem
  – ILP limitations
  – The complexity wall
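The same spin lock idea in portable C11 atomics (a sketch; on LL/SC ISAs the compiler lowers atomic_exchange to a loop much like the assembly above):

    #include <stdatomic.h>

    /* 0 = unlocked, 1 = locked, as in the slide. */
    static atomic_int lock = 0;

    void acquire(void)
    {
        /* Atomically swap in 1; if the old value was 1 the lock was
           already held, so spin and retry.                           */
        while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1)
            ;  /* spin */
    }

    void release(void)
    {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }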
  • 217.
Chip Multithreading: Opportunities and challenges
• Paper by Spracklen & Abraham, HPCA-11 (2005) [SA05]
• CMT processors = Chip Multi-Threaded processors
• A spectrum of processor architectures
  – Uni-processors with SMT (one core)
  – (pure) Chip Multiprocessors (CMP) (one thread per core)
  – Combination of SMT and CMP (they call it CMT)
• Best suited to server workloads (with high TLP)

Off-chip Bandwidth
• A bottleneck
• Bandwidth is increasing, but so is latency [Patt04]
• Need more than 100 in-flight requests to fully utilize the available bandwidth

Sharing processor resources
• SMT
  – Hardware strand: "HW for storing the state of a thread of execution"
  – Several strands can share resources within the core, such as execution resources
    • This improves utilization of processor resources
    • Reduces the application's sensitivity to off-chip misses
  – Switching between threads can be very efficient
• (pure) CMP
  – Multiple cores can share chip resources such as the memory controller, off-chip bandwidth and L2 cache
  – No sharing of HW resources between strands within a core
• Combination (CMT)

1st generation CMT
• 2 cores per chip
• Cores derived from earlier uniprocessor designs
• Cores do not share any resources, except off-chip data paths
• Examples: Sun's Gemini, Sun's UltraSPARC IV (Jaguar), AMD dual-core Opteron, Intel dual-core Itanium (Montecito), Intel dual-core Xeon (Paxville, server)

2nd generation CMT
• 2 or more cores per chip
• Cores still derived from earlier uniprocessor designs
• Cores now share the L2 cache
  – Speeds inter-core communication
  – Advantageous as most commercial applications have significant instruction footprints
• Examples: Sun's UltraSPARC IV+, IBM's Power 4/5
  • 218.
3rd generation CMT
• CMT processors are best designed from the ground up, optimized for a CMT design point
  – Lower power consumption
• Multiple cores per chip
• Examples:
  – Sun's Niagara (T1)
    • 8 cores, each is 4-way SMT
    • Each core single-issue, short pipeline
    • Shared 3 MB L2 cache
  – IBM's Power-5
    • 2 cores, each 2-way SMT

Multicore generations (?)

CMT/Multicore design space
• Number of cores
  – Multiple simple or a few complex?
    • Recent paper by Hill & Marty ... see http://www.youtube.com/watch?v=KfgWmQpzD74
  – Heterogeneous cores
    • Serial fraction of a parallel application - remember Amdahl's law
    • One powerful core for single-threaded applications
• Resource sharing
  – L2 cache! (and L3) (terminology: LL = Last Level cache)
  – Floating point units
  – New, more expensive resources (amortized over multiple cores)
    • Shadow tags, more advanced cache techniques, HW accelerators, cryptographic or OS functions (e.g. memcopy), XML parsing, compression
    • Your innovation !!!

CMT/Multicore challenges
• Multiple threads (strands) share resources
  – Maximize overall performance
  – Good resource utilization
  – Avoid "starvation" (units without work to do)
  – Cores must be "good neighbours"
  – Fairness, research by Magnus Jahre
    • See http://research.idi.ntnu.no/multicore/pub
• Prefetching
  – Aggressive prefetching is OK in a single-thread system since the entire system is idle on a miss
  – CMT/Multicore requires more careful prefetching
    • A prefetch operation may take resources used by other threads
  – See research by Marius Grannæs (same link as above)
• Speculative operations
  – OK if using idle resources (delay until the resource is idle)
  – Otherwise needs more care (just as prefetching) / seldom power efficient

UltraSPARC T1 ("Niagara")
• Target: commercial server applications
  – High thread level parallelism (TLP): large numbers of parallel client requests
  – Low instruction level parallelism (ILP): high cache miss rates, many unpredictable branches
• Power, cooling, and space are major concerns for data centers
• Metric: (Performance / Watt) / Sq. Ft.
• Approach: multicore, fine-grain multithreading, simple pipeline, small L1 caches, shared L2

T1 processor – "logical" overview
• 1.2 GHz at 72 W typical, 79 W peak power consumption
  • 219.
T1 Architecture
• Also ships with 6 or 4 processors

T1 pipeline / 4 threads
• Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
• Shared units: L1 cache, L2 cache, TLB, exec. units, pipe registers
• Separate units: PC, instruction buffer, reg file, store buffer

Miss Rates: L2 Cache Size, Block Size (fig. 4.27)
[Chart: T1 L2 miss rate (0-2.5%) for TPC-C and SPECJBB, for 1.5, 3 and 6 MB L2 with 32 B and 64 B blocks]

Miss Latency: L2 Cache Size, Block Size (fig. 4.28)
[Chart: T1 L2 miss latency (up to ~200 cycles) for TPC-C and SPECJBB, for the same cache configurations]

Average thread status (fig 4.30)

CPI Breakdown of Performance
  Benchmark  | Per-thread CPI | Per-core CPI | Effective CPI for 8 cores | Effective IPC for 8 cores
  TPC-C      | 7.20           | 1.80         | 0.23                      | 4.4
  SPECJBB    | 5.60           | 1.40         | 0.18                      | 5.7
  SPECWeb99  | 6.60           | 1.65         | 0.21                      | 4.8
  • 220.
Performance Relative to Pentium D
[Chart: performance relative to Pentium D for Power5+, Opteron and Sun T1 on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05 and a TPC-like workload]

Not Ready Breakdown (fig 4.31)
[Chart: fraction of cycles not ready for TPC-C, SPECJBB and SPECWeb99, split into L1 I miss, L1 D miss, L2 miss, pipeline delay and other]
• Other = ?
  – TPC-C: store buffer full is the largest contributor
  – SPEC-JBB: atomic instructions are the largest contributor
  – SPECWeb99: both factors contribute

Performance/mm^2, Performance/Watt
[Chart: efficiency normalized to Pentium D for Power5+, Opteron and Sun T1, measured as SPECIntRate, SPECFPRate, SPECJBB05 and TPC-C per mm^2 and per Watt]
  • 221.
    Cache Coherency And Memory Models
  • 222.
    Review ● Doespipelining help instruction latency? ● Does pipelining help instruction throughput? ● What is Instruction Level Parallelism? ● What are the advantages of OoO machines? ● What are the disadvantages of OoO machines? ● What are the advantages of VLIW? ● What are the disadvantages of VLIW? ● What is an example of Data Spatial Locality? ● What is an example of Data Temporal Locality? ● What is an example of Instruction Spatial Locality? ● What is an example of Instruction Temporal Locality? ● What is a TLB? ● What is a packet switched network?
  • 223.
    Memory Models (MemoryConsistency) Memory Model: The system supports a given model if operations on memory follow specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.
  • 224.
    Memory Models (MemoryConsistency) Memory Model: The system supports a given model if operations on memory follow specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable. Huh??????
  • 225.
  • 226.
    Simple Case ● Consider a simple two processor system Memory Interconnect CPU 0 CPU 1 ● The two processors are coherent ● Programs running in parallel may communicate via memory addresses ● Special hardware is required in order to enable communication via memory addresses. ● Shared memory addresses are the standard form of communication for parallel programming
  • 227.
    Simple Case ● CPU 0 wants to send a data word to CPU 1 Memory Interconnect CPU 0 CPU 1 ● What does the code look like ???
  • 228.
    Simple Case ● CPU 0 wants to send a data word to CPU 1 Memory Interconnect CPU 0 CPU 1 ● What does the code look like ??? ● Code on CPU0 writes a value to an address ● Code on CPU1 reads the address to get the new value
  • 229.
Simple Case

    int shared_flag = 0;
    int shared_value = 0;

    void sender_thread() {
        shared_value = 42;
        shared_flag = 1;
    }

    void receiver_thread() {
        while (shared_flag == 0) { }
        int new_value = shared_value;
        printf("%i\n", new_value);
    }

[Diagram: CPU 0 and CPU 1 connected through an interconnect to memory]
  • 230.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Memory void sender_thread() { shared_value = 42; Interconnect shared_flag = 1; } CPU 0 CPU 1 void receiver_thread() { while (shared_flag == 0) { } Int new_value = shared_value; printf(“%in”, new_value); }
  • 231.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Sender writes to Memory void sender_thread() the shared data, then sets a { shared data flag shared_value = 42; that the receiver Interconnect shared_flag = 1; is polling } CPU 0 CPU 1 void receiver_thread() { while (shared_flag == 0) { } Int new_value = shared_value; printf(“%in”, new_value); }
  • 232.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Sender writes to Memory void sender_thread() the shared data, then sets a { shared data flag shared_value = 42; that the receiver Interconnect shared_flag = 1; is polling } CPU 0 CPU 1 void receiver_thread() { while (shared_flag == 0) { } Receiver is polling on the flag. When the flag is no longer zero, the Int new_value = shared_value; receiver reads the shared_value and printf(“%in”, new_value); prints it out. }
  • 233.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Sender writes to Memory void sender_thread() the shared data, then sets a { shared data flag shared_value = 42; that the receiver Interconnect shared_flag = 1; is polling } CPU 0 CPU 1 Any Problems??? void receiver_thread() { while (shared_flag == 0) { } Receiver is polling on the flag. When the flag is no longer zero, the Int new_value = shared_value; receiver reads the shared_value and printf(“%in”, new_value); prints it out. }
  • 234.
    Simple CMP CacheCoherency Directory Directory Directory Directory ● Four core machine supporting cache coherency L2 Bank L2 Bank L2 Bank L2 Bank ●Each core has a local L1 Data and Instruction cache. Interconnect ●The L2 cache is shared amongst all cores, and physically distributed into 4 L1 L1 L1 L1 disparate banks CPU 0 CPU 0 CPU 0 CPU 0 ●The interconnect sends memory requests and responses back and forth between the caches
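A toy sketch of the bookkeeping the per-bank directory does in this four-core example (invalidate-based, write-through L1s as assumed later in these slides; all names are made up):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CPUS 4

    /* One directory entry per L2 line: a bit per CPU whose L1 holds a copy. */
    struct dir_entry {
        uint8_t sharers;                 /* bit i set => CPU i has the line  */
    };

    static void on_read(struct dir_entry *e, int cpu)
    {
        e->sharers |= 1u << cpu;         /* remember the new sharer          */
    }

    static void on_write(struct dir_entry *e, int cpu)
    {
        /* Invalidate every other cached copy, then record the writer.       */
        for (int i = 0; i < NUM_CPUS; i++)
            if (i != cpu && (e->sharers & (1u << i)))
                printf("send invalidate to CPU %d\n", i);
        e->sharers = 1u << cpu;
    }

    int main(void)
    {
        struct dir_entry x = { 0 };
        on_read(&x, 0);                  /* CPU 0 loads X                     */
        on_read(&x, 3);                  /* CPU 3 loads X                     */
        on_write(&x, 0);                 /* CPU 0 stores X -> CPU 3 invalidated */
        return 0;
    }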
  • 235.
    The Coherency Problem Directory Directory Directory Directory L2 Bank L2 Bank L2 Bank L2 Bank Interconnect L1 L1 L1 L1 CPU 0 CPU 0 CPU 0 CPU 0 Ld R1,X
  • 236.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank Interconnect Miss! L1 L1 L1 L1 CPU 0 CPU 0 CPU 0 CPU 0 Ld R1,X
  • 237.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) L1 L1 L1 L1 CPU 0 CPU 0 CPU 0 CPU 0 Ld R1,X
  • 238.
    To The Coherency Problem Memory Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) ●If miss at home L1 L1 L1 L1 L2, read data from CPU 0 CPU 0 CPU 0 CPU 0 memory Ld R1,X
  • 239.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) ●If miss at home L1 L1 L1 L1 L2, read data from CPU 0 CPU 0 CPU 0 CPU 0 memory ●Deposit data in Ld R1,X both home L2 and Local L1
  • 240.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) ●If miss at home L1 L1 L1 L1 L2, read data from CPU 0 CPU 0 CPU 0 CPU 0 memory ●Deposit data in Ld R1,X both home L2 and Local L1 Mem(X) is now in both the L2 and ONE L1 cache
  • 241.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X
  • 242.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address ● Miss in L1 Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Miss! Ld R1,X Ld R2,X
  • 243.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address ● Miss in L1 Interconnect ● Sends request to L2 ● Hits in L2 L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X
  • 244.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address ● Miss in L1 Interconnect ● Sends request to L2 ● Hits in L2 L1 L1 L1 L1 ●Data is placed in L1 CPU 0 CPU 1 CPU 2 CPU 3 cache for CPU 3 Ld R1,X Ld R2,X
  • 245.
    The Coherency Problem Directory Directory Directory Directory ● CPU now STORES L2 Bank L2 Bank L2 Bank L2 Bank to address X Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X What happens?????
  • 246.
    The Coherency Problem Directory Directory Directory Directory ● CPU now STORES L2 Bank L2 Bank L2 Bank L2 Bank to address X Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X Special hardware is needed in order to either update or invalidate the data in CPU 3's cache
  • 247.
    The Coherency Problem Directory Directory Directory Directory ● For this example, we L2 Bank L2 Bank L2 Bank L2 Bank will assume a directory based invalidate protocol, with write-thru L1 Interconnect caches L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X
  • 248.
    The Coherency Problem Directory Directory Directory Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X
  • 249.
    The Coherency Problem Directory Directory 0, 3 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X
  • 250.
    The Coherency Problem Directory Directory 0, 3 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 ●The data in CPU3's CPU 0 CPU 1 CPU 2 CPU 3 cache is invalidated Ld R1,X Ld R2,X Store R2, X
  • 251.
    The Coherency Problem Directory Directory 0 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 ●The data in CPU3's CPU 0 CPU 1 CPU 2 CPU 3 cache is invalidated ●The L2 cache is updated with the new Ld R1,X Ld R2,X value Store R2, X
  • 252.
    The Coherency Problem Directory Directory 0 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 ●The data in CPU3's CPU 0 CPU 1 CPU 2 CPU 3 cache is invalidated ●The L2 cache is updated with the new Ld R1,X Ld R2,X value ● The system is now Store R2, X “coherent” ● Note that CPU3 was removed from the directory
  • 253.
Ordering
[Figure: the same four-CPU system; two stores to different addresses, Store R1,X and Store R2,Y, travel through the interconnect toward different L2 banks.]
● Our protocol relies on stores writing through to the L2 cache.
● If the stores are to different addresses, there are multiple points within the system where the stores may be reordered.
● In the slide animation, the second store ("purple") leaves the network first: the stores are written to the shared L2 out of order (purple first, then red)!
● The interconnect is not the only cause of out-of-order behaviour:
  – the processor core may issue instructions out of order (remember out-of-order machines?)
  – the L2 pipeline may also reorder requests to different addresses
L2 Pipeline Ordering
[Figure: the L2 pipeline; requests arriving from the network pass through resource allocation, L2 tag access, L2 data access, and coherence and conflict detection, with a retry FIFO for requests that cannot proceed.]
● Two memory requests arrive on the network.
● The requests are initially serviced in order.
● A conflict is detected; conflicting requests are sent to the retry FIFO.
● The network is given priority over the retry FIFO.
● The requests are now executing in a different order!
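A toy model in C (an illustration, not the real M5 or hardware pipeline) of how "conflicts go to a retry FIFO" plus "the network has priority" can reorder requests to different addresses:

#include <stdio.h>

#define QLEN 8

typedef struct { int id; int line; } Req;                 /* request id and cache line */
typedef struct { Req buf[QLEN]; int head, tail; } Fifo;   /* tiny circular FIFO        */

static int  fifo_empty(Fifo *f)       { return f->head == f->tail; }
static void fifo_push(Fifo *f, Req r) { f->buf[f->tail % QLEN] = r; f->tail++; }
static Req  fifo_pop(Fifo *f)         { Req r = f->buf[f->head % QLEN]; f->head++; return r; }

int main(void) {
    Fifo network = {0}, retry = {0};

    /* Requests 1 and 2 target the same cache line; request 3 targets another. */
    fifo_push(&network, (Req){1, 5});
    fifo_push(&network, (Req){2, 5});
    fifo_push(&network, (Req){3, 7});

    int busy_line = -1, busy_until = 0;   /* line being serviced, and until which cycle */

    for (int cycle = 0; !fifo_empty(&network) || !fifo_empty(&retry); cycle++) {
        /* The network input has priority over the retry FIFO. */
        Req r = !fifo_empty(&network) ? fifo_pop(&network) : fifo_pop(&retry);

        if (r.line == busy_line && cycle < busy_until) {
            printf("cycle %d: request %d conflicts, sent to retry FIFO\n", cycle, r.id);
            fifo_push(&retry, r);
        } else {
            printf("cycle %d: request %d (line %d) serviced\n", cycle, r.id, r.line);
            busy_line = r.line;
            busy_until = cycle + 2;       /* the line stays busy for two cycles */
        }
    }
    /* Requests complete in the order 1, 3, 2: requests 2 and 3 were reordered. */
    return 0;
}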
Simple Case (revisited)
[Figure: two CPUs connected through an interconnect to memory.]

int shared_flag = 0;
int shared_value = 0;

void sender_thread() {
    shared_value = 42;
    shared_flag = 1;
}

void receiver_thread() {
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited), step by step
[Figure: the four-CPU system; CPU 0 runs the sender (shared_value = 42; shared_flag = 1;), CPU 3 runs the receiver (while (shared_flag == 0) { } new_value = shared_value;).]
● The receiver is spinning on "shared_flag": CPU 3 has the flag cached (value 0) and is recorded as a sharer in the directory.
● "shared_value" has its reset value of 0 in the L2.
● The store to "shared_value" (42) writes through CPU 0's L1 and enters the network.
● The store to "shared_flag" (1) also writes through the L1; both stores are now sitting in the network.
● The store to "shared_flag" is the first to leave the network.
● "shared_flag" is updated in the L2, and the coherence protocol invalidates the copy in CPU 3's cache.
● The polling receiver now misses in its cache and sends a request to the L2!
● The response comes back: the flag is now set, so it is time to read "shared_value".
● Note that the write to "shared_value" is still sitting in the network!
● CPU 3 reads "shared_value" and gets the old value, 0 (the directory now records CPU 3 as a sharer of that line).
● The write of "42" to "shared_value" finally escapes the network, but it is TOO LATE!
● Our code doesn't always work! WTF???
● The architecture needs to expose ordering properties to the programmer, so that the programmer may write correct code. This is called the "Memory Model".
Sequential Consistency
Hardware GUARANTEES that all memory operations are ordered globally.
● Benefits
  ● Simplifies programming (our initial code would have worked)
● Costs
  ● Hard to implement micro-architecturally
  ● Can hurt performance
  ● Hard to verify
Weak Consistency
Loads and stores to different addresses may be re-ordered.
● Benefits
  ● Much easier to implement and build
  ● Higher performing
  ● Easy to verify
● Costs
  ● More complicated for the programmer
  ● Requires special "ordering" instructions for synchronization
Instructions for Weak Memory Models
● Write Barrier: don't issue a write until all preceding writes have completed.
● Read Barrier: don't issue a read until all preceding reads have completed.
● Memory Barrier: don't issue a memory operation until all preceding memory operations have completed.
● ...and so on.
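The slides use a generic __write_barrier() intrinsic. As a hedged illustration of how such barriers appear in portable code today, here is a sketch using standard C11 atomics and POSIX threads; the fence placement mirrors the slide example, the atomic_thread_fence, atomic load/store and pthread calls are standard, and everything else is made up for the example:

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

int shared_value = 0;            /* plain data published by the sender       */
_Atomic int shared_flag = 0;     /* flag used to signal that data is ready   */

static void *sender(void *arg) {
    (void)arg;
    shared_value = 42;                               /* write the data        */
    atomic_thread_fence(memory_order_release);       /* acts as a write barrier */
    atomic_store_explicit(&shared_flag, 1, memory_order_relaxed);
    return NULL;
}

static void *receiver(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&shared_flag, memory_order_relaxed) == 0) { }
    atomic_thread_fence(memory_order_acquire);       /* acts as a read barrier  */
    printf("%i\n", shared_value);                    /* guaranteed to print 42  */
    return NULL;
}

int main(void) {
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}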
Simple Case (write barrier)
[Figure: two CPUs connected through an interconnect to memory.]

int shared_flag = 0;
int shared_value = 0;

void sender_thread() {
    shared_value = 42;
    __write_barrier();
    shared_flag = 1;
}

void receiver_thread() {
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited, with write barrier), step by step
[Figure: the four-CPU system; CPU 0 runs the sender with __write_barrier() between the two stores, CPU 3 runs the receiver.]
● The receiver is spinning on "shared_flag": CPU 3 has the flag cached (value 0) and is recorded as a sharer in the directory.
● "shared_value" has its reset value of 0 in the L2.
● The store to "shared_value" (42) writes through the L1 and enters the network.
● The write barrier prevents the issue of "shared_flag = 1" until "shared_value = 42" is complete. This is tracked via acknowledgments; the flag store is blocked.
● The write of 42 eventually leaves the network and updates the L2 (the flag store is still blocked).
● The write is acknowledged (still blocked).
● The barrier is now complete!
● The store to "shared_flag" writes through the L1 and leaves the network.
● "shared_flag" is updated in the L2, and the coherence protocol invalidates the copy in CPU 3's cache.
● The polling receiver misses in its cache and sends a request to the L2!
● The response comes back: the flag is now set, so it is time to read "shared_value".
● CPU 3 reads "shared_value" and gets 42. Correct code!!!
● What about reads...?
Weak or Strong?
● The academic community pushed hard for sequential consistency: "Multiprocessors Should Support Simple Memory Consistency Models", Mark Hill, IEEE Computer, August 1998.
● WRONG!!! Most new architectures support relaxed memory models (ARM, IA64, TILE, etc.). Much easier to implement and verify. Not a programming issue, because the complexity is hidden behind a library, and 99.9% of programmers don't have to worry about these issues!
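A sketch of what "the complexity is hidden behind a library" means in practice: a standard synchronization primitive such as a POSIX mutex and condition variable already contains the required barriers, so ordinary application code never issues fences directly (the variable names mirror the earlier slides; the pthread calls are standard):

#include <pthread.h>
#include <stdio.h>

static int shared_value = 0;
static int ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *sender(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);     /* lock/unlock imply the needed barriers */
    shared_value = 42;
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *receiver(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                 /* no busy-waiting on a plain flag */
        pthread_cond_wait(&cond, &lock);
    printf("%i\n", shared_value);  /* always prints 42 */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}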
Break Problem
You are one of P recently arrested prisoners. The warden makes the following announcement: "You may meet together today and plan a strategy, but after today you will be in isolated cells and have no communication with one another. I have set up a 'switch room' which contains a light switch, which is either on or off. The switch is not connected to anything. Every now and then, I will select one prisoner at random to enter the switch room. This prisoner may throw the switch (from on to off, or vice versa), or may leave the switch unchanged. Nobody else will ever enter this room. Each prisoner will visit the switch room arbitrarily often. More precisely, for any N, eventually each of you will visit the switch room at least N times. At any time, any of you may declare: 'we have all visited the switch room at least once.' If the claim is correct, I will set you free. If the claim is incorrect, I will feed all of you to the sharks." Devise a winning strategy when you know that the initial state of the switch is off. Hint: not all prisoners need to do the same thing.
TDT4260: Introduction to Green Computing / Asymmetric multicore processors (Alexandru Iordan)
● Outline: What do we mean by Green Computing? Why Green Computing? Measuring "greenness". Research into energy consumption reduction.
● What do we mean by Green Computing?
  – "The green computing movement is a multifaceted global effort to reduce energy consumption and to promote sustainable development in the IT world." [Patrick Kurp, Green computing, Communications of the ACM, 2008]
● Why Green Computing?
  – Heat dissipation problems
  – High energy bills
  – Growing environmental impact
● Measuring "greenness"
  – Non-standard metrics: Energy (Joules), Power (Watts), Energy-per-instruction (Joules / no. of instructions), Energy-delay^N product (Joules * seconds^N), Performance^N / Watt ((no. of instructions / second)^N / Watt)
  – Standard metrics: for data centers, the Power Usage Effectiveness metric (The Green Grid consortium); for servers, the ssj_ops / Watt metric (SPEC consortium)
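The metrics above are simple ratios and products. A small illustrative C sketch of how they could be computed from measured values (all numbers and variable names are made up):

#include <math.h>
#include <stdio.h>

int main(void) {
    double energy_J = 50.0;    /* measured energy                 */
    double time_s   = 2.0;     /* measured execution time         */
    double insts    = 4.0e9;   /* retired instructions            */
    int    N        = 2;       /* exponent in ED^N and perf^N/W   */

    double power_W    = energy_J / time_s;
    double epi        = energy_J / insts;                 /* energy per instruction */
    double edn        = energy_J * pow(time_s, N);        /* energy-delay^N product */
    double perfN_watt = pow(insts / time_s, N) / power_W; /* performance^N per Watt */

    printf("power=%g W  EPI=%g J/inst  ED%d=%g  perf^%d/W=%g\n",
           power_W, epi, N, edn, N, perfN_watt);
    return 0;
}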
Research into energy consumption reduction: "Maximizing Power Efficiency with Asymmetric Multicore Systems", Fedorova et al., Communications of the ACM, 2009
● Outline: asymmetric multicore processors; scheduling for parallel and serial applications; scheduling for CPU- and memory-intensive applications.
● Asymmetric multicore processors (AMPs)
  – What makes a multicore asymmetric? A few powerful cores (high clock frequency, complex pipelines, OoO execution) and many simple cores (low clock frequency, simple pipeline, low power requirement).
  – Homogeneous-ISA AMP: the same binary code can run on both types of cores.
  – Heterogeneous-ISA AMP: code compiled separately for each type of core; examples: IBM Cell, Intel Larrabee.
● Efficient utilization of AMPs: efficient mapping of threads/workloads
  – Parallel applications: serial part to complex cores, scalable parallel part to simple cores.
  – Microarchitectural characteristics of workloads: CPU-intensive applications to complex cores, memory-intensive applications to simple cores.
● Sequential vs. parallel characteristics
  – Sequential programs: high degree of ILP; can utilize features of a complex core (super-scalar pipeline, OoO execution, complex branch prediction).
  – Parallel programs: high number of parallel threads/tasks (compensates for low ILP and masks memory delays).
  – Having both complex and simple cores gives AMPs applicability for a wider range of applications.
● Parallelism-aware (PA) scheduling
  – Goal: improve overall system efficiency (not the performance of a particular application).
  – Idea: assign sequential applications/phases to run on the complex cores.
  – Does NOT provide fairness.
● Challenges of PA scheduling
  – Detecting serial and parallel phases: limited scalability of threads can yield wrong solutions.
  – Thread migration overhead: migration across memory domains is expensive; the scheduler must be topology aware.
● "Heterogeneity"-aware (HA) scheduling
  – Goal: improve overall system efficiency.
  – Idea: CPU-intensive applications/phases to complex cores; memory-intensive applications/phases to simple cores.
  – Inherently unfair.
● Challenges of HA scheduling
  – Classifying threads/phases as CPU- or memory-bound: two approaches presented, direct measurement and modeling.
  – Long execution time (direct measurement approach) or need for offline information (modeling approach).
● Summary
  – Green Computing focuses on improving energy efficiency and sustainable development in the IT world.
  – AMPs promise higher energy efficiency than symmetric processors.
  – Schedulers must be designed to take advantage of the asymmetric hardware.
● References
  – Kirk W. Cameron, "The road to greener IT pastures", IEEE Computer, 2009
  – Dan Herrick and Mark Ritschard, "Greening your computing technology, the near and far perspectives", Proceedings of the 37th ACM SIGUCCS, 2009
  – Luiz A. Barroso, "The price of performance", ACM Queue, 2005
NTNU HPC Infrastructure: IBM AIX Power5+ (Njord), CentOS AMD Istanbul (Kongull)
Jørn Amundsen, IDI/NTNU IT, 2011-03-25

Contents
1 Njord Power5+ hardware
2 Kongull AMD Istanbul hardware
3 Resource Managers
4 Documentation

Power5+ hardware: cache and memory
● 16 x 64-bit-word cache lines (32 in L3)
● Hardware cache-line prefetch on loads
● Reads from memory are written into L2
● External L3, acts as a victim cache for L2
● L2 and L3 are shared between cores
● L1 is write-through
● Cache coherence is maintained system-wide at the L2 level
● 4K page size by default; the kernel supports 64K and 16M pages
Chip design
[Figure: Power5+ chip layout. Two cores, each with a 64K 2-way L1 I-cache, a 32K 4-way L1 D-cache, 2 LSU, 2 FXU, 2 FPU, 1 BXU, 1 CRL and 64-bit register files (32 GPR, 32 FPR); a shared 1.92M 10-way L2 cache; an external 36M 12-way L3 cache at 35.2 GB/s; switch fabric and memory controller with 25.6 GB/s to 16-128 GB DDR2 main memory.]

SMT
● In a concrete application, the processor core might be idle 50-80% of the time, waiting for memory.
● An obvious solution would be to let another thread execute while our thread is waiting for memory.
● This is known as hyper-threading in the Intel/AMD world, and Simultaneous Multithreading (SMT) with IBM.
● SMT is supported in hardware throughout the processor core.
● SMT is more efficient than hyper-threading, with less context-switch overhead.
● Power5 and Power6 support 1 thread/core or SMT with 2 threads/core, while the latest Power7 supports 4 threads/core.
● SMT is enabled or disabled dynamically on a node with the (privileged) command smtctl.

SMT (2)
● SMT is beneficial if you are doing a lot of memory references and your application performance is memory bound.
● Enabling SMT doubles the number of MPI tasks per node, from 16 to 32. Requires your application to be sufficiently scalable.
● SMT is only available in user space with batch processing, by adding the structured comment string:
  #@ requirements = ( Feature == "SMT" )

Chip module packaging
● 4 chips and 4 L3 caches are HW-integrated onto an MCM
● 90.25 cm², 89 layers of metal
The system level
● On a p575 system, a node is 2 MCMs / 8 chips / 16 1.9 GHz cores.
● The Njord system is:
  – 2 x 16-way 32 GiB login nodes
  – 4 x 16-way 16 GiB I/O nodes (used with GPFS)
  – 186 x 16-way 32 GiB compute nodes
  – 6 x 16-way 128 GiB compute nodes
● GPFS parallel file system: 33 TiB fiber disks, 62 TiB SATA disks
● Interconnect: IBM Federation, a multistage crossbar network providing 2 GiB/s bidirectional bandwidth and 5 µs system-wide MPI latency

GPFS
● An important feature of an HPC system is the capability of moving large amounts of data from or to memory, across nodes, and from or to permanent storage.
● In this respect a high-quality and high-performance global file system is essential.
● GPFS is a robust parallel FS geared at high-bandwidth I/O, used extensively in HPC and in the database industry.
● Disk access is ≈ 1000 times slower than memory access, hence key factors for performance are:
  – spreading (striping) files across many disk units
  – using memory to cache files
  – hiding latencies in software

GPFS and parallel I/O (2)
● High transfer rates are achieved by distributing files in blocks round-robin across a large number of disk units, up to thousands of disks.
● On Njord, the GPFS block size and stripe unit is 1 MB.
● In addition to multiple disks servicing file I/O, multiple threads might read, write or update (R+W) a file simultaneously.
● GPFS uses multiple I/O servers (4 dedicated nodes on Njord), working in parallel for performance and maintaining file and file-metadata consistency.
● High performance comes at a cost: although GPFS can handle directories with millions of files, it is usually best to use fewer and larger files, and to access files in larger chunks.

File buffering
● The kernel does read-aheads and write-behinds of file blocks.
● The kernel applies heuristics to I/O to discover sequential and strided forward and backward reads.
● The disadvantage is memory copying of all data.
● This can be bypassed with DIRECT_IO, which can be useful with large (MB-sized) I/O, utilizing application I/O patterns.
AMD Istanbul hardware: cache and memory
● 6 x 128 KiB L1 cache
● 6 x 512 KiB L2 cache
● 1 x 6 MiB L3 cache
● 24 or 48 GiB DDR3 RAM

The system level (Kongull)
● A node is 2 chips / 12 2.4 GHz cores.
● The Kongull system is:
  – 1 x 12-way 24 GiB login node
  – 4 x 12-way 24 GiB I/O nodes (used with GPFS)
  – 52 x 12-way 24 GiB compute nodes
  – 44 x 12-way 48 GiB compute nodes
● Nodes compute-0-0 – compute-0-39 and compute-1-0 – compute-1-11 are 24 GiB @ 800 MHz, while compute-1-12 – compute-1-15 and compute-2-0 – compute-2-39 are 48 GiB @ 667 MHz bus frequency.
● GPFS parallel file system, 73 TiB.
● Interconnect: a fat tree implemented with HP ProCurve switches, 1 Gb/s from node to rack switch, then 10 Gb/s from the rack switch to the top-level switch. Bandwidth and latency is left as a programming exercise.
Resource Managers
● We need efficient (and fair) utilization of the large pool of resources. This is the domain of queueing (batch) systems, or resource managers.
● A resource manager administers the execution of (computational) jobs and provides resource accounting across users and accounts.
● This includes distribution of parallel (OpenMP/MPI) threads/processes across physical cores and gang scheduling of parallel execution.
● Jobs are Unix shell scripts with batch-system keywords embedded within structured comments.
● Both Njord and Kongull employ a series of queues (classes) administering various sets of possibly overlapping nodes with possibly different priorities.
● IBM LoadLeveler on Njord; Torque (a development from OpenPBS) on Kongull.

Njord job class overview
class      min-max nodes   max nodes/job   max runtime   description
forecast   1-180           180             unlimited     top-priority class dedicated to forecast jobs
bigmem     1-6             4               7 days        high-priority 115 GB memory class
large      4-180           128             21 days       high-priority class for jobs of 64 processors or more
normal     1-52            42              21 days       default class
express    1-186           4               1 hour        high-priority class for debugging and test runs
small      1/2             1/2             14 days       low-priority class for serial or small SMP jobs
optimist   1-186           48              unlimited     checkpoint-restart jobs

Njord job class overview (2)
● forecast is the highest-priority queue; it suspends everything else.
● Beware: node memory (except bigmem) is split in 2, to guarantee available memory for forecast jobs.
● A C-R (checkpoint-restart) job runs at the very lowest priority; any other job will terminate and requeue an optimist-queue job if not enough nodes are available.
● optimist-class jobs need an internal checkpoint-restart mechanism.
● AIX LoadLeveler imposes node job memory limits, e.g. jobs oversubscribing available node memory are aborted with an email.

LoadLeveler sample jobscript

# @ job_name = hybrid_job
# @ account_no = ntnuXXX
# @ job_type = parallel
# @ node = 3
# @ tasks_per_node = 8
# @ class = normal
# @ ConsumableCpus(2) ConsumableMemory(1664mb)
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ queue
export OMP_NUM_THREADS=2
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER/test
if [ ! -d $w ]; then mkdir -p $w; fi
cd $w
$HOME/a.out
llq -w $LOADL_STEP_ID
exit 0
LoadLeveler sample C-R email
[Excerpt of a LoadLeveler notification email for job step f05n02io.791345.0 (z2rank_s_5.job): the program exited normally with exit code 0; the job step was dispatched to run 18 time(s) and rejected by the Starter 0 time(s); submitted Mon Mar 21 10:02:56 2011, started 18:16:59, exited 18:31:37; Real Time 0 08:28:41, Total Job Step Time 16 06:55:44; the state is listed for each machine used (f14n06, f09n06, f13n04, f14n04, f08n06, f12n06, f15n07, f18n04).]

Kongull job queue overview
class      min-max nodes   max nodes/job   max runtime   description
default    1-52            52              35 days       default queue except IPT, SFI IO and Sintef Petroleum
express    1-96            96              1 hour        high-priority queue for debugging and test runs
bigmem     1-44            44              7 days        default queue for IPT, SFI IO and Sintef Petroleum
optimist   1-96            48              28 days       checkpoint-restart jobs

● Oversubscribing node physical memory crashes the node.
  – This might happen if you do not specify the following in your job script: #PBS -lnodes=1:ppn=12
● If all nodes are not reserved, the batch system will attempt to share nodes by default.

Documentation
● Njord User Guide: http://docs.notur.no/ntnu/njord-ibm-power-5
● Notur load stats: http://www.notur.no/hardware/status/
● Kongull support wiki: http://hpc-support.idi.ntnu.no/
● Kongull load stats: http://kongull.hpc.ntnu.no/ganglia/
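For Kongull (Torque/OpenPBS), a minimal jobscript sketch in the same spirit as the LoadLeveler example above; only the #PBS -lnodes=1:ppn=12 line is taken from the text, while the job name, account, walltime and program are placeholders that would need to be adapted to the local setup:

#!/bin/sh
# Minimal Torque/PBS jobscript sketch (illustrative; adapt names and limits).
#PBS -N myjob
#PBS -A ntnuXXX
#PBS -lnodes=1:ppn=12
#PBS -lwalltime=01:00:00

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR
./a.out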
TDT4260 Computer Architecture Mini-Project Guidelines
Alexandru Ciprian Iordan (iordan@idi.ntnu.no), January 10, 2011

1 Introduction
The Mini-Project accounts for 20% of the final grade in TDT4260 Computer Architecture. Your task is to develop and evaluate a prefetcher using the M5 simulator. M5 is currently one of the most popular simulators for computer architecture research and has a rich feature set. Consequently, it is a very complex piece of software. To make your task easier, we have created a simple interface to the memory system that you can use to develop your prefetcher. Furthermore, you can evaluate your prefetchers by submitting your code via a web interface. This web interface runs your code on the Kongull cluster with the default simulator setup. It is also possible to experiment with other parameters, but then you will have to run the simulator yourself. The web interface, the modified M5 simulator and more documentation can be found at http://dm-ark.idi.ntnu.no/.
The Mini-Project is carried out in groups of 2 to 4 students. In some cases we will allow students to work alone. You will be graded based on both a written paper and a short oral presentation. Make sure you clearly cite the source of information, data and figures. Failure to do so is regarded as cheating and is handled according to NTNU guidelines. If you have any questions, send an e-mail to teaching assistant Alexandru Ciprian Iordan (iordan@idi.ntnu.no).

1.1 Mini-Project Goals
The Mini-Project has the following goals:
• Many computer architecture topics are best analyzed by experiments and/or detailed studies. The Mini-Project should provide training in such exercises.
• Writing about a topic often increases the understanding of it. Consequently, we require that the result of the Mini-Project is a scientific paper.

2 Practical Guidelines

2.1 Time Schedule and Deadlines
The Mini-Project schedule is shown in Table 1. If these deadlines collide with deadlines in other subjects, we suggest that you consider handing in the Mini-Project earlier than the deadline. If you miss the final deadline, this will reduce the maximum score you can be awarded.
Table 1: Mini-Project Deadlines
Deadline                        Description
Friday 21. January              List of group members delivered to Alexandru Ciprian Iordan (iordan@idi.ntnu.no) by e-mail
Friday 4. March                 Short status report and an outline of the final report delivered to Alexandru Ciprian Iordan (iordan@idi.ntnu.no) by e-mail
Friday 8. April 12:00 (noon)    Final paper deadline. Deliver the paper through It's Learning. Detailed report layout requirements can be found in section 2.2.
Week 15 (11. - 15. April)       Compulsory 10 minute oral presentations

2.2 Paper Layout
The paper must follow the IEEE Transactions style guidelines available here:
http://www.ieee.org/publications_standards/publications/authors/authors_journals.html#sect2
Both Latex and Word templates are available, but we recommend that you use Latex. The paper must use a maximum of 8 pages. Failure to comply with these requirements will reduce the maximum score you can be awarded. In addition, we will deduct points if:
• The paper does not have a proper scientific structure. All reports must contain the following sections: Abstract, Introduction, Related Work or Background, Prefetcher Description, Methodology, Results, Discussion and Conclusion. You may rename the "Prefetcher Description" section to a more descriptive title. Acknowledgements and Author biographies are optional.
• You do not use citations correctly. If you use a figure that somebody else has made, a citation must appear in the figure text.
• NTNU has acquired an automated system that checks for plagiarism. We may run this system on your papers, so make sure you write all text yourself.

2.3 Evaluation
The Mini-Project accounts for 20% of the total grade in TDT4260 Computer Architecture. Within the Mini-Project, the report counts 80% and the oral presentation 20%. The report grade will be based on the following criteria:
• Language and use of figures
• Clarity of the problem statement
• Overall document structure
• Depth of understanding for the field of computer architecture
• Depth of understanding of the investigated problem
The oral presentation grade will be based on the following criteria:
• Presentation structure
• Quality and clarity of the slides
• Presentation style
• If you use more than the provided time, you will lose points.
M5 simulator system
TDT4260 Computer Architecture: User documentation
Last modified: November 23, 2010
Contents
1 Introduction
  1.1 Overview
  1.2 Chapter outlines
2 Installing and running M5
  2.1 Download
  2.2 Installation
    2.2.1 Linux
    2.2.2 VirtualBox disk image
  2.3 Build
  2.4 Run
    2.4.1 CPU2000 benchmark tests
    2.4.2 Running M5 with custom test programs
  2.5 Submitting the prefetcher for benchmarking
3 The prefetcher interface
  3.1 Memory model
  3.2 Interface specification
  3.3 Using the interface
    3.3.1 Example prefetcher
4 Statistics
5 Debugging the prefetcher
  5.1 m5.debug and trace flags
  5.2 GDB
  5.3 Valgrind
Chapter 1
Introduction

You are now going to write your own hardware prefetcher, using a modified version of M5, an open-source hardware simulator system. This modified version presents a simplified interface to M5's cache, allowing you to concentrate on a specific part of the memory hierarchy: a prefetcher for the second-level (L2) cache.

1.1 Overview
This documentation covers the following:
• Installing and running the simulator
• Machine model and memory hierarchy
• Prefetcher interface specification
• Using the interface
• Testing and debugging the prefetcher on your local machine
• Submitting the prefetcher for benchmarking
• Statistics

1.2 Chapter outlines
The first chapter gives a short introduction, and contains an outline of the documentation.
The second chapter starts with the basics: how to install the M5 simulator. There are two possible ways to install and use it. The first is as a stand-alone VirtualBox disk image, which requires the installation of VirtualBox. This is the best option for those who use Windows as their operating system of choice. For Linux enthusiasts, there is also the option of downloading a tarball and installing a few required software packages. The chapter then continues to walk you through the necessary steps to get M5 up and running: building from source, running with command-line options that enable prefetching, running local benchmarks, compiling and running custom test programs, and finally, how to submit your prefetcher for testing on a computing cluster.

The third chapter gives an overview of the simulated system and describes its memory model. There is also a detailed specification of the prefetcher interface, and tips on how to use it when writing your own prefetcher. It includes a very simple example prefetcher with extensive comments.

The fourth chapter contains definitions of the statistics used to quantitatively measure prefetchers.

The fifth chapter gives details on how to debug prefetchers using advanced tools such as GDB and Valgrind, and how to use trace flags to get detailed debug printouts.
Chapter 2
Installing and running M5

2.1 Download
Download the modified M5 simulator from the PfJudgeβ website.

2.2 Installation

2.2.1 Linux
Software requirements (specific Debian/Ubuntu packages mentioned in parentheses):
• g++ >= 3.4.6
• Python and libpython >= 2.4 (python and python-dev)
• SCons > 0.98.1 (scons)
• SWIG >= 1.3.31 (swig)
• zlib (zlib1g-dev)
• m4 (m4)
To install all required packages in one go, issue instructions to apt-get:
sudo apt-get install g++ python-dev scons swig zlib1g-dev m4
The simulator framework comes packaged as a gzipped tarball. Start the adventure by unpacking with tar xvzf framework.tar.gz. This will create a directory named framework.
2.2.2 VirtualBox disk image
If you do not have convenient access to a Linux machine, you can download a virtual machine with M5 preconfigured. You can run the virtual machine with VirtualBox, which can be downloaded from http://www.virtualbox.org. The virtual machine is available as a zip archive from the PfJudgeβ website. After unpacking the archive, you can import the virtual machine into VirtualBox by selecting "Import Appliance" in the file menu and opening "Prefetcher framework.ovf".

2.3 Build
M5 uses the scons build system:
scons -j2 ./build/ALPHA_SE/m5.opt
builds the optimized version of the M5 binaries. -j2 specifies that the build process should build two targets in parallel. This is a useful option to cut down on compile time if your machine has several processors or cores. The included build script compile.sh encapsulates the necessary build commands and options.

2.4 Run
Before running M5, it is necessary to specify the architecture and parameters for the simulated system. This is a nontrivial task in itself. Fortunately there is an easy way: use the included example Python script for running M5 in syscall emulation mode, m5/configs/example/se.py. When using a prefetcher with M5, this script needs some extra options, described in Table 2.1. For an overview of all possible options to se.py, do
./build/ALPHA_SE/m5.opt configs/example/se.py --help
When combining all these options, the command line will look something like this:
./build/ALPHA_SE/m5.opt configs/example/se.py --detailed --caches --l2cache --l2size=1MB --prefetcher=policy=proxy --prefetcher=on_access=True
This command will run se.py with a default program, which prints out "Hello, world!" and exits. To run something more complicated, use the --cmd option to specify another program.
Table 2.1: Basic se.py command line options
Option                        Description
--detailed                    Detailed timing simulation
--caches                      Use caches
--l2cache                     Use a level-two cache
--l2size=1MB                  Level-two cache size
--prefetcher=policy=proxy     Use the C-style prefetcher interface
--prefetcher=on_access=True   Have the cache notify the prefetcher on all accesses, both hits and misses
--cmd                         The program (an Alpha binary) to run

See subsection 2.4.2 about cross-compiling binaries for the Alpha architecture. Another possibility is to run a benchmark program, as described in the next section.

2.4.1 CPU2000 benchmark tests
The test_prefetcher.py script can be used to evaluate the performance of your prefetcher against the SPEC CPU2000 benchmarks. It runs a selected suite of CPU2000 tests with your prefetcher, and compares the results to some reference prefetchers. The per-test statistics that M5 generates are written to output/<testname-prefetcher>/stats.txt. The statistics most relevant for hardware prefetching are then filtered and aggregated to a stats.txt file in the framework base directory. See chapter 4 for an explanation of the reported statistics.
Since programs often do some initialization and setup on startup, a sample from the start of a program run is unlikely to be representative for the whole program. It is therefore desirable to begin the performance tests after the program has been running for some time. To save simulation time, M5 can resume a program state from a previously stored checkpoint. The prefetcher framework comes with checkpoints for the CPU2000 benchmarks taken after 10^9 instructions.
It is often useful to run a specific test to reproduce a bug. To run the CPU2000 tests outside of test_prefetcher.py, you will need to set the M5_CPU2000 environment variable. If this is set incorrectly, M5 will give the error message "Unable to find workload". To export this as a shell variable, do
export M5_CPU2000=lib/cpu2000

Near the top of test_prefetcher.py there is a commented-out call to dry_run(). If this is uncommented, test_prefetcher.py will print the command line it would use to run each test. This will typically look like this:

m5/build/ALPHA_SE/m5.opt --remote-gdb-port=0 -re --outdir=output/ammp-user m5/configs/example/se.py --checkpoint-dir=lib/cp --checkpoint-restore=1000000000 --at-instruction --caches --l2cache --standard-switch --warmup-insts=10000000 --max-inst=10000000 --l2size=1MB --bench=ammp --prefetcher=on_access=true:policy=proxy

This uses some additional command line options; these are explained in Table 2.2.

Table 2.2: Advanced se.py command line options
Option                     Description
--bench=ammp               Run one of the SPEC CPU2000 benchmarks.
--checkpoint-dir=lib/cp    The directory where program checkpoints are stored.
--at-instruction           Restore at an instruction count.
--checkpoint-restore=n     The instruction count to restore at.
--standard-switch          Warm up caches with a simple CPU model, then switch to an advanced model to gather statistics.
--warmup-insts=n           Number of instructions to run warmup for.
--max-inst=n               Exit after running this number of instructions.

2.4.2 Running M5 with custom test programs
If you wish to run your self-written test programs with M5, it is necessary to cross-compile them for the Alpha architecture. The easiest way to achieve this is to download the precompiled compiler binaries provided by crosstool from the M5 website. Install the one that fits your host machine best (32- or 64-bit version). When cross-compiling your test program, you must use the -static option to enforce static linkage. To run the cross-compiled Alpha binary with M5, pass it to the script with the --cmd option. Example:

./build/ALPHA_SE/m5.opt configs/example/se.py --detailed --caches --l2cache --l2size=512kB --prefetcher=policy=proxy --prefetcher=on_access=True --cmd /path/to/testprogram
2.5 Submitting the prefetcher for benchmarking
First of all, you need a user account on the PfJudgeβ web pages. The teaching assistant in TDT4260 Computer Architecture will create one for you. You must also be assigned to a group to submit prefetcher code or view earlier submissions.
Sign in with your username and password, then click "Submit prefetcher" in the menu. Select your prefetcher file, and optionally give the submission a name. This is the name that will be shown in the highscore list, so choose with care. If no name is given, it defaults to the name of the uploaded file. If you check "Email on complete", you will receive an email when the results are ready. This could take some time, depending on the cluster's current workload.
When you click "Submit", a job will be sent to the Kongull cluster, which then compiles your prefetcher and runs it with a subset of the CPU2000 tests. You are then shown the "View submissions" page, with a list of all your submissions, the most recent at the top. When the prefetcher is uploaded, the status is "Uploaded". As soon as it is sent to the cluster, it changes to "Compiling". If it compiles successfully, the status will be "Running". If your prefetcher does not compile, the status will be "Compile error"; check "Compilation output" found under the detailed view. When the results are ready, the status will be "Completed", and a score will be given.
The highest-scoring prefetcher for each group is listed on the highscore list, found under "Top prefetchers" in the menu. Click on the prefetcher name to go to a more detailed view, with per-test output and statistics. If the prefetcher crashes on some or all tests, the status will be "Runtime error". To locate the failed tests, check the detailed view. You can take a look at the output from the failed tests by clicking on the "output" link found after each test statistic.
To allow easier exploration of different prefetcher configurations, it is possible to submit several prefetchers at once, bundled into a zipped file. Each .cc file in the archive is submitted independently for testing on the cluster. The submission is named after the compressed source file, possibly prefixed with the name specified in the submission form. There is a limit of 50 prefetchers per archive.
Chapter 3
The prefetcher interface

3.1 Memory model
The simulated architecture is loosely based on the DEC Alpha Tsunami system, specifically the Alpha 21264 microprocessor. This is a superscalar, out-of-order (OoO) CPU which can reorder a large number of instructions and do speculative execution.
The L1 cache is split into a 32 kB instruction cache and a 64 kB data cache. Each cache block is 64 B. The L2 cache size is 1 MB, also with a cache block size of 64 B. The L2 prefetcher is notified on every access to the L2 cache, both hits and misses. There is no prefetching for the L1 cache. The memory bus runs at 400 MHz, is 64 bits wide, and has a latency of 30 ns.

3.2 Interface specification
The interface the prefetcher will use is defined in a header file located at prefetcher/interface.hh. To use the prefetcher interface, you should include interface.hh by putting the line #include "interface.hh" at the top of your source file.

Table 3.1: Interface #defines
#define             Value      Description
BLOCK_SIZE          64         Size of cache blocks (cache lines) in bytes
MAX_QUEUE_SIZE      100        Maximum number of pending prefetch requests
MAX_PHYS_MEM_SIZE   2^28 - 1   The largest possible physical memory address

NOTE: All interface functions that take an address as a parameter block-align the address before issuing requests to the cache.
Table 3.2: Functions called by the simulator
Function                                Description
void prefetch_init(void)                Called before any memory access to let the prefetcher initialize its data structures
void prefetch_access(AccessStat stat)   Notifies the prefetcher about a cache access
void prefetch_complete(Addr addr)       Notifies the prefetcher about a prefetch load that has just completed

Table 3.3: Functions callable from the user-defined prefetcher
Function                                Description
void issue_prefetch(Addr addr)          Called by the prefetcher to initiate a prefetch
int get_prefetch_bit(Addr addr)         Is the prefetch bit set for addr?
int set_prefetch_bit(Addr addr)         Set the prefetch bit for addr
int clear_prefetch_bit(Addr addr)       Clear the prefetch bit for addr
int in_cache(Addr addr)                 Is addr currently in the L2 cache?
int in_mshr_queue(Addr addr)            Is there a prefetch request for addr in the MSHR (miss status holding register) queue?
int current_queue_size(void)            Returns the number of queued prefetch requests
void DPRINTF(trace, format, ...)        Macro to print debug information. trace is a trace flag (HWPrefetch), and format is a printf format string.

Table 3.4: AccessStat members
Member          Description
Addr pc         The address of the instruction that caused the access (program counter)
Addr mem_addr   The memory address that was requested
Tick time       The simulator time cycle when the request was sent
int miss        Whether this demand access was a cache hit or miss
The prefetcher must implement the three functions prefetch_init, prefetch_access and prefetch_complete. The implementation may be empty.
The function prefetch_init(void) is called at the start of the simulation to allow the prefetcher to initialize any data structures it will need.
When the L2 cache is accessed by the CPU (through the L1 cache), the function void prefetch_access(AccessStat stat) is called with an argument (AccessStat stat) that gives various information about the access.
When the prefetcher decides to issue a prefetch request, it should call issue_prefetch(Addr addr), which queues up a prefetch request for the block containing addr. When a cache block that was requested by issue_prefetch arrives from memory, prefetch_complete is called with the address of the completed request as parameter.
Prefetches issued by issue_prefetch(Addr addr) go into a prefetch request queue. The cache will issue requests from the queue when it is not fetching data for the CPU. This queue has a fixed size (available as MAX_QUEUE_SIZE), and when it gets full, the oldest entry is evicted. If you want to check the current size of this queue, use the function current_queue_size(void).

3.3 Using the interface
Start by studying interface.hh. This is the only M5-specific header file you need to include in your source file. You might want to include standard header files for things like printing debug information and memory allocation. Have a look at the supplied example prefetcher (a very simple sequential prefetcher) to see what it does.
If your prefetcher needs to initialize something, prefetch_init is the place to do so. If not, just leave the implementation empty.
You will need to implement the prefetch_access function, which the cache calls when accessed by the CPU. This function takes an argument, AccessStat stat, which supplies information from the cache: the address of the executing instruction that accessed the cache, what memory address was accessed, the cycle tick number, and whether the access was a cache miss. The block size is available as BLOCK_SIZE. Note that you probably will not need all of this information for a specific prefetching algorithm.
If your algorithm decides to issue a prefetch request, it must call the issue_prefetch function with the address to prefetch from as argument. The cache block containing this address is then added to the prefetch request
queue. This queue has a fixed limit of MAX_QUEUE_SIZE pending prefetch requests. Unless your prefetcher is using a high degree of prefetching, the number of outstanding prefetches will stay well below this limit.
Every time the cache has loaded a block requested by the prefetcher, prefetch_complete is called with the address of the loaded block.
Other functionality available through the interface are the functions for getting, setting and clearing the prefetch bit. Each cache block has one such tag bit. You are free to use this bit as you see fit in your algorithms. Note that this bit is not automatically set if the block has been prefetched; it has to be set manually by calling set_prefetch_bit. set_prefetch_bit on an address that is not in cache has no effect, and get_prefetch_bit on an address that is not in cache will always return false.
When you are ready to write code for your prefetching algorithm of choice, put it in prefetcher/prefetcher.cc. When you have several prefetchers, you may want to make prefetcher.cc a symlink. The prefetcher is statically compiled into M5. After prefetcher.cc has been changed, recompile with ./compile.sh. No options needed.
3.3.1 Example prefetcher

/*
 * A sample prefetcher which does sequential one-block lookahead.
 * This means that the prefetcher fetches the next block _after_ the one that
 * was just accessed. It also ignores requests to blocks already in the cache.
 */
#include "interface.hh"

void prefetch_init(void)
{
    /* Called before any calls to prefetch_access. */
    /* This is the place to initialize data structures. */
    DPRINTF(HWPrefetch, "Initialized sequential-on-access prefetcher\n");
}

void prefetch_access(AccessStat stat)
{
    /* pf_addr is now an address within the _next_ cache block */
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /*
     * Issue a prefetch request if a demand miss occurred,
     * and the block is not already in cache.
     */
    if (stat.miss && !in_cache(pf_addr)) {
        issue_prefetch(pf_addr);
    }
}

void prefetch_complete(Addr addr)
{
    /*
     * Called when a block requested by the prefetcher has been loaded.
     */
}
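Building on the supplied example, here is a sketch of a slightly more elaborate prefetcher written against the same interface: a per-PC stride detector. The table size and the StrideEntry structure are choices made for this illustration; only the interface calls (prefetch_access, issue_prefetch, in_cache, in_mshr_queue, DPRINTF, MAX_PHYS_MEM_SIZE, the Addr type and the AccessStat fields) come from this documentation.

/*
 * Illustrative stride prefetcher (not supplied with the framework).
 * For each load/store PC we remember the last address and the last observed
 * stride; when the same non-zero stride is seen twice in a row, we prefetch
 * one stride ahead of the current access.
 */
#include "interface.hh"

#define TABLE_SIZE 256

struct StrideEntry {
    Addr pc;         /* PC that owns this entry            */
    Addr last_addr;  /* last address accessed by this PC   */
    long stride;     /* last observed stride               */
    int  confident;  /* same stride seen twice in a row?   */
};

static struct StrideEntry table[TABLE_SIZE];

void prefetch_init(void)
{
    /* Zero-initialized static storage is sufficient here. */
    DPRINTF(HWPrefetch, "Initialized stride prefetcher\n");
}

void prefetch_access(AccessStat stat)
{
    struct StrideEntry *e = &table[(stat.pc / 4) % TABLE_SIZE];

    if (e->pc == stat.pc) {
        long stride = (long)stat.mem_addr - (long)e->last_addr;
        e->confident = (stride != 0 && stride == e->stride);
        e->stride = stride;
    } else {
        /* A new PC maps to this entry: reset it. */
        e->pc = stat.pc;
        e->stride = 0;
        e->confident = 0;
    }
    e->last_addr = stat.mem_addr;

    if (e->confident) {
        Addr pf_addr = stat.mem_addr + e->stride;
        if (pf_addr <= MAX_PHYS_MEM_SIZE &&
            !in_cache(pf_addr) && !in_mshr_queue(pf_addr)) {
            issue_prefetch(pf_addr);
        }
    }
}

void prefetch_complete(Addr addr)
{
    /* Nothing to do in this sketch. */
}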
Chapter 4
Statistics

This chapter gives an overview of the statistics by which your prefetcher is measured and ranked.

IPC: instructions per cycle. Since we are using a superscalar architecture, IPC rates > 1 are possible.

Speedup: a commonly used proxy for overall performance when running benchmark test suites.

    speedup = execution_time(no prefetcher) / execution_time(with prefetcher)
            = IPC(with prefetcher) / IPC(no prefetcher)

Good prefetch: the prefetched block is referenced by the application before it is replaced.

Bad prefetch: the prefetched block is replaced without being referenced.

Accuracy: measures the fraction of useful prefetches issued by the prefetcher.

    acc = good prefetches / total prefetches

Coverage: how many of the potential candidates for prefetches were actually identified by the prefetcher?

    cov = good prefetches / cache misses without prefetching

Identified: number of prefetches generated and queued by the prefetcher.
    Issued Number ofprefetches issued by the cache controller. This can be significantly less than the number of identified prefetches, due to duplicate prefetches already found in the prefetch queue, duplicate prefetches found in the MSHR queue, and prefetches dropped due to a full prefetch queue. Misses Total number of L2 cache misses. Degree of prefetching Number of blocks fetched from memory in a single prefetch request. Harmonic mean A kind of average used to aggregate each benchmark speedup score into a final average speedup. n n Havg = 1 1 1 = n 1 x1 + x2 + ... + xn i=1 xi 15
Chapter 5  Debugging the prefetcher

5.1 m5.debug and trace flags

When debugging M5 it is best to use binaries built with debugging support (m5.debug) instead of the standard build (m5.opt). So let us start by recompiling M5 to be better suited to debugging:

    scons -j2 ./build/ALPHA_SE/m5.debug

To see in detail what is going on inside M5, one can enable trace flags, which selectively enable output from specific parts of M5. The most useful flag when debugging a prefetcher is HWPrefetch. Pass the option --trace-flags=HWPrefetch to M5:

    ./build/ALPHA_SE/m5.debug --trace-flags=HWPrefetch [...]

Warning: this can produce a lot of output! It might be better to redirect stdout to a file when running with --trace-flags enabled.

5.2 GDB

The GNU Project Debugger, gdb, can be used to inspect the state of the simulator while it is running, and to investigate the cause of a crash. Pass GDB the executable you want to debug when starting it:

    gdb --args m5/build/ALPHA_SE/m5.debug --remote-gdb-port=0 -re --outdir=output/ammp-user m5/configs/example/se.py --checkpoint-dir=lib/cp --checkpoint-restore=1000000000 --at-instruction --caches --l2cache --standard-switch --warmup-insts=10000000 --max-inst=10000000 --l2size=1MB --bench=ammp --prefetcher=on_access=true:policy=proxy

You can then use the run command to start the executable.
Some useful GDB commands:

    run <args>    Restart the executable with the given command line arguments.
    run           Restart the executable with the same arguments as last time.
    where         Show the stack trace.
    up            Move up one stack frame.
    down          Move down one stack frame.
    print <expr>  Print the value of an expression.
    help          Get help for commands.
    quit          Exit GDB.

GDB has many other useful features; for more information, consult the GDB User Manual at http://sourceware.org/gdb/current/onlinedocs/gdb/.

5.3 Valgrind

Valgrind is a very useful tool for memory debugging and memory leak detection. If your prefetcher causes M5 to crash or behave strangely, it is useful to run it under Valgrind and see if it reports any potential problems.

By default, M5 uses a custom memory allocator instead of malloc. This does not work with Valgrind, which tracks memory by replacing malloc with its own instrumented allocator. Fortunately, M5 can be recompiled with NO_FAST_ALLOC=True to use normal malloc:

    scons NO_FAST_ALLOC=True ./m5/build/ALPHA_SE/m5.debug

To avoid spurious warnings from Valgrind, it can be fed a file with warning suppressions. To run M5 under Valgrind, use

    valgrind --suppressions=lib/valgrind.suppressions ./m5/build/ALPHA_SE/m5.debug [...]

Note that everything runs much slower under Valgrind.
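In addition to these external tools, extra trace output can be added directly in the prefetcher with the DPRINTF macro already used in the example code; the messages only appear when M5 is run with --trace-flags=HWPrefetch. The snippet below is only an illustrative sketch of such instrumentation added to the sequential example's prefetch_access; the extra DPRINTF calls are not part of the handed-out code.

#include "interface.hh"

void prefetch_access(AccessStat stat)
{
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /* Trace every access; only printed when HWPrefetch tracing is enabled. */
    DPRINTF(HWPrefetch, "access addr=%#x miss=%d\n", stat.mem_addr, stat.miss);

    if (stat.miss && !in_cache(pf_addr)) {
        DPRINTF(HWPrefetch, "issuing prefetch for %#x\n", pf_addr);
        issue_prefetch(pf_addr);
    }
}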
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Course responsible: Professor Lasse Natvig
Quality assurance of the exam: PhD Jon Olav Hauglid
Contact person during the exam: Magnus Jahre
Deadline for examination results: 23rd of June 2009

EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Tuesday 2nd of June 2009, Time: 0900-1300

Supporting materials: No written or handwritten examination support materials are permitted. A specified, simple calculator is permitted.

By answering in short sentences it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise. The exam counts for 80% of the total evaluation in the course. The maximum score is therefore 80 points.

Exercise 1) Instruction level parallelism (Max 10 points)

a) (Max 5 points) What is the difference between (true) data dependencies and name dependencies? Which of the two presents the most serious problem? Explain why such dependencies will not always result in a data hazard.

Solution sketch: True data dependency: one instruction reads what an earlier instruction has written (data flows) (RAW). Name dependency: two instructions use the same register or memory location, but there is no flow of data between them; one instruction writes what an earlier instruction has read (WAR) or written (WAW). True data dependencies are the most serious problem, as name dependencies can be removed by register renaming. Also, many pipelines are designed so that name dependencies will not cause a hazard. A dependency between two instructions will only result in a data hazard if the instructions are close enough together and the processor executes them out of order.

b) (Max 5 points) Explain why loop unrolling can improve performance. Are there any potential downsides to using loop unrolling?

Solution sketch: Loop unrolling can improve performance by reducing the loop overhead (e.g. the loop overhead instructions are executed once every 4th element rather than for each element). It also makes it possible for scheduling techniques to further improve the instruction order, as instructions for different elements (iterations) can now be interchanged. Downsides include increased code size, which may lead to more cache misses, and an increased number of registers used.
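As an illustration of the unrolling idea (not part of the original exam solution), the C++ loop below is unrolled by a factor of four: the loop overhead (compare, branch, index update) is paid once per four elements, and the four independent statements give the compiler's scheduler more freedom. A remainder loop handles values of n that are not a multiple of four, which is one source of the code-size increase mentioned above.

// Illustrative example: y[i] += a * x[i], unrolled by 4.
void saxpy_unrolled(float a, const float *x, float *y, int n)
{
    int i = 0;
    // Unrolled body: loop overhead paid once per four elements.
    for (; i + 3 < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    // Remainder loop for the last n % 4 elements.
    for (; i < n; i++)
        y[i] += a * x[i];
}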
Exercise 2) Multithreading (Max 15 points)

a) (Max 5 points) What are the differences between fine-grained and coarse-grained multithreading?

Solution sketch: Fine-grained: switch between threads after each instruction. Coarse-grained: switch only on costly stalls (e.g. cache misses).

b) (Max 5 points) Can techniques for instruction level parallelism (ILP) and thread level parallelism (TLP) be used simultaneously? Why/why not?

Solution sketch: ILP and TLP can be used simultaneously. TLP exploits parallelism between different threads, while ILP exploits parallelism inside a single instruction stream/thread.

c) (Max 5 points) Assume that you are asked to redesign a processor from single threaded to simultaneous multithreading (SMT). How would that change the requirements for the caches? (I.e., what would you look at to ensure that the caches would not degrade performance when moving to SMT?)

Solution sketch: Several threads executing at once will lead to increased cache traffic and more cache conflicts. Techniques that could help: increased cache size, more cache ports/banks, higher associativity, non-blocking caches.

Exercise 3) Multiprocessors (Max 15 points)

a) (Max 5 points) Give a short example illustrating the cache coherence problem for multiprocessors.

Solution sketch: See Figure 4.3 on page 206 of the textbook. (A reads X, B reads X, A stores X, B now has an inconsistent value for X.)

b) (Max 5 points) Why does bus snooping scale badly with the number of processors? Discuss how cache block size could influence the choice between write invalidate and write update.

Solution sketch: Bus snooping relies on a common bus where information is broadcast. As the number of devices increases, this common medium becomes a bottleneck. Invalidates are done at cache block level, while updates are done on individual words. False sharing coherence misses only appear when using write invalidate with block sizes larger than
one word. So as cache block size increases, the number of false sharing coherence misses will increase, thereby making write update increasingly more appealing.

c) (Max 5 points) What makes the architecture of UltraSPARC T1 ("Niagara") different from most other processor architectures?

Solution sketch: High focus on TLP, low focus on ILP. Poor single-thread performance, but great multithread performance. Thread switch on any stall. Short pipeline, in-order, no branch prediction.

Exercise 4) Memory, vector processors and networks (Max 15 points)

a) (Max 5 points) Briefly describe 5 different optimizations of cache performance.

Solution sketch: (1 point per optimization) 6 techniques listed on page 291 in the textbook, 11 more in Section 5.2 on page 293.

b) (Max 5 points) What makes vector processors fast at executing a vector operation?

Solution sketch: A vector operation can be executed with a single instruction, reducing code size and improving cache utilization. Further, the single instruction has none of the loop overhead and control dependencies that a scalar processor would have. Hazard checks can also be done per vector, rather than per element. A vector processor also contains deep pipelines especially designed for vector operations.

c) (Max 5 points) Discuss how the number of devices to be connected influences the choice of topology.

Solution sketch: This is a classic example of performance vs. cost. Different topologies scale differently with respect to performance and cost as the number of devices grows. A crossbar scales well in performance, but badly in cost. A ring or bus scales badly in performance, but well in cost.

Exercise 5) Multicore architectures and programming (Max 25 points)

a) (Max 6 points) Explain briefly the research method called design space exploration (DSE). When doing DSE, explain how a cache sensitive application can be made processor bound, and how it can be made bandwidth bound.

Solution sketch: (Lecture 10, slide 4) DSE is to try out different points in an n-dimensional space of possible designs, where n is the number of main design parameters, such as the number of cores, core types (in-order vs. out-of-order, etc.), cache size, etc. Cache sensitive applications can become processor bound by
increasing the cache size, and they can be made bandwidth bound by decreasing it.

b) (Max 5 points) In connection with GPU programming (shader programming), David Blythe uses the concept "computational coherence". Explain it briefly.

LF: See lecture 10, slide 36, and possibly the paper.

c) (Max 8 points) Give an overview of the architecture of the Cell processor.

Solution sketch: All details of this figure are not expected, only the main elements.
* One main processor (Power architecture, called PPE = Power Processing Element), which acts as a host (master) processor. (Power architecture, 64 bit, in-order two-issue superscalar, SMT (simultaneous multithreading). Has a vector media extension (VMX). (Kahle, figure 2))
* 8 identical SIMD processors (called SPE = Synergistic Processing Element); each of these consists of an SPU (Synergistic Processor Unit) and local storage (LS, 256 KB SRAM, not a cache). On-chip memory controller and bus interface. (Can operate on integers in different formats, 8, 16 and 32 bit, and on floating point numbers in 32 and 64 bit (64-bit floats in a later version).)
* The interconnect is a ring bus (Element Interconnect Bus, EIB) that connects the PPE and the 8 SPEs, with two unidirectional busses in each direction. Worst case latency is half the ring distance; it can support up to three simultaneous transfers.
* Highly programmable DMA controller.

d) (Max 6 points) The Cell design team made several design decisions that were motivated by a wish to make it easier to develop programs with predictable (more deterministic) processing time (performance). Describe two of these.

Solution sketch:
1) They discarded the common out-of-order execution in the Power processor and developed a simpler in-order processor.
2) The local store memory (LS) in the SPE processing elements does not use HW cache-coherence snooping protocols, to avoid the indeterminate nature of cache misses; the programmer handles memory in a more explicit way.
3) Also, the large number of registers (128) might help make the processing more deterministic with respect to execution time.
4) Extensive timers and counters (probably performance counters) that may be used by the SW/programmer to monitor/adjust/control performance.

…---oooOOOooo---…
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Contact person for questions regarding exam exercises: Lasse Natvig, phone 906 44 580

EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Monday 26th of May 2008, Time: 0900-1300
Solution sketches in blue text

Supporting materials: No handwritten or printed materials allowed; a simple specified calculator is allowed.

By answering in short sentences it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise. The exam counts for 80% of the total evaluation in the course. The maximum score is therefore 80 points.

Exercise 1) Parallel Architecture (Max 25 points)

a) (Max 5 points) The feature size of integrated circuits is now often 65 nanometres or smaller, and it is still decreasing. Explain briefly how the number of transistors on a chip and the wire delay change with shrinking feature size.

Solution sketch: The number of transistors can be 4 times larger when the feature size is halved. However, the wire delay does not improve (it scales poorly). (The textbook, page 17, gives more details, but here we ask only for the main trends.)

b) (Max 5 points) In a cache coherent multiprocessor, the concepts migration and replication of shared data items are central. Explain both concepts briefly, and also how they influence the latency of accesses to shared data and the bandwidth demand on the shared memory.

Solution sketch: Migration means that data move to a place closer to the requesting/accessing unit. Replication simply means storing several copies. Having a local copy in general means faster access, and it is harmless to have several copies of read-only data. (Textbook page 207)

c) (Max 5 points) Explain briefly how a write buffer can be used in cache systems to increase performance. Explain also what "write merging" is in this context.

Solution sketch: The main purpose of the write buffer is to temporarily store data that are evicted from the cache so new data can reuse the cache space as fast as possible, i.e. to avoid waiting for the latency of the memory one level further away from the processor. If several writes go to the same cache block (address), these writes can be combined, resulting in reduced traffic towards the next memory level. (Textbook page 300) ((Also slides 11-6-3)) // Grading: 3 points for understanding of the write buffer and 2 for write merging.
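To make the write-merging idea concrete, here is a minimal, purely illustrative C++ sketch (not from the textbook or the exam) of a write buffer that merges a new write into an existing entry when the write falls into a cache block that already has a buffered entry; all names and sizes are invented for the example.

#include <array>
#include <cstddef>
#include <cstdint>

// Arbitrary example parameters.
constexpr uint64_t    kBlockSize = 64;   // bytes per cache block
constexpr std::size_t kEntries   = 4;    // write-buffer entries

struct WriteBufferEntry {
    bool     valid = false;
    uint64_t block_addr = 0;             // block-aligned address
    uint8_t  data[kBlockSize] = {};      // buffered bytes
    bool     dirty[kBlockSize] = {};     // which bytes are valid
};

struct WriteBuffer {
    std::array<WriteBufferEntry, kEntries> entries{};

    // Returns true if the write was buffered (merged or new entry),
    // false if the buffer is full and the write must wait for a drain.
    bool write(uint64_t addr, uint8_t byte) {
        uint64_t block  = addr & ~(kBlockSize - 1);
        uint64_t offset = addr & (kBlockSize - 1);

        // Write merging: reuse an entry that already holds this block.
        for (auto &e : entries) {
            if (e.valid && e.block_addr == block) {
                e.data[offset]  = byte;
                e.dirty[offset] = true;
                return true;
            }
        }
        // Otherwise allocate a free entry.
        for (auto &e : entries) {
            if (!e.valid) {
                e = WriteBufferEntry{};
                e.valid = true;
                e.block_addr = block;
                e.data[offset]  = byte;
                e.dirty[offset] = true;
                return true;
            }
        }
        return false;  // buffer full: stall until an entry drains to memory
    }
};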
d) (Max 5 points) Sketch a figure that shows how a hypercube with 16 nodes is built by combining two smaller hypercubes. Compare the hypercube topology with the 2-dimensional mesh topology with respect to connectivity and node cost (number of links/ports per node).

Solution sketch: (Figure E-14 c) A mesh has a fixed degree of connectivity and in general becomes slower when the number of nodes is increased, since the average number of hops needed to reach another node increases. For a hypercube it is the other way around: the connectivity increases for larger networks, so the communication time does not increase much, but the node cost also increases. When going to a larger network, increasing the dimension, every node must be extended with a new port, and this is a drawback when it comes to building computers using such networks.

e) (Max 5 points) When messages are sent between nodes in a multiprocessor, two possible strategies are source routing and distributed routing. Explain the difference between these two.

Solution sketch: For source routing, the entire routing path is precomputed by the source (possibly by table lookup) and placed in the packet header. This usually consists of the output port or ports supplied for each switch along the predetermined path from the source to the destination, which can be stripped off by the routing control mechanism at each switch. An additional bit field can be included in the header to signify whether adaptive routing is allowed (i.e., that any one of the supplied output ports can be used). For distributed routing, the routing information usually consists of the destination address. This is used by the routing control mechanism in each switch along the path to determine the next output port, either by computing it with a finite-state machine or by looking it up in a local routing table (i.e., forwarding table). (Textbook page E-48)

Exercise 2) Parallel processing (Max 15 points)

a) (Max 5 points) Explain briefly the main difference between a VLIW processor and a dynamically scheduled superscalar processor. Include the role of the compiler in your explanation.

Solution sketch: In a VLIW processor, parallel execution of several operations is scheduled (analysed and planned) at compile time and assembled into very long/wide instructions. (Such work done at compile time is often called static.) In a dynamically scheduled superscalar processor, dependency and resource analysis are done at run time (dynamically) to find opportunities to execute operations in parallel. (Textbook page 114 onwards, and the VLIW paper)

b) (Max 5 points) What function has the vector mask register in a vector processor?

Solution sketch: If you want to update just a subset of the elements in a vector register, e.g. to implement IF A[i] != 0 THEN A[i] = A[i] - B[i] for (i = 0..n) in a simple way, this can be done by setting the vector mask register to 1 only for the elements with A[i] != 0. In this way, the vector instruction A = A - B can be performed without testing every element explicitly.
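As a purely illustrative addition (not part of the exam solution), the scalar C++ code below emulates what the masked vector operation does: the first loop corresponds to a vector compare that sets the vector mask register, and the second to a masked vector subtract that updates only the elements whose mask bit is set, with no per-element branching inside the vector instruction itself.

#include <cstddef>
#include <vector>

// Scalar emulation of IF A[i] != 0 THEN A[i] = A[i] - B[i] using a mask.
void masked_subtract(std::vector<double> &A, const std::vector<double> &B)
{
    const std::size_t n = A.size();
    std::vector<bool> mask(n);

    // 1) Compare: corresponds to setting the vector mask register.
    for (std::size_t i = 0; i < n; i++)
        mask[i] = (A[i] != 0.0);

    // 2) Masked subtract: only elements with the mask bit set are updated.
    for (std::size_t i = 0; i < n; i++)
        if (mask[i])
            A[i] -= B[i];
}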
c) (Max 5 points) Explain briefly the principle of vector chaining in vector processors.

Solution sketch: The execution of instructions using several/different functional and memory pipelines can be chained together, directly or through vector registers. The chaining forms one longer pipeline. (This is the technique of forwarding, used in processors as in Tomasulo's algorithm, extended to vector registers.) (Textbook F-23) ((Lecture 9, slide 20)) // should be checked

Exercise 3) Multicore processors (Max 20 points)

a) (Max 5 points) In the paper Chip Multithreading: Opportunities and Challenges by Spracklen & Abraham, the concept chip multithreaded processor (CMT) is described. The authors describe three generations of CMT processors. Describe each of these briefly. Make simple drawings if you like.

Solution sketch: 1st generation: typically 2 cores per chip, every core is a traditional processor core, no shared resources except the off-chip bandwidth. 2nd generation: shared L2 cache, but still traditional processor cores. 3rd generation: as the 2nd generation, but the cores are now custom-made for use in a CMP, and might also use simultaneous multithreading (SMT). (This description is a bit "biased" and colored by the background of the authors (at Sun Microsystems), who were involved in the design of Niagara 1 and 2 (T1).) // Fig. 1 in the paper, and slides // Was a sub-exercise in May 2007

b) (Max 5 points) Outline the main architecture of SUN's T1 (Niagara) multicore processor. Describe the placement of the L1 and L2 caches, as well as how the L1 caches are kept coherent.

Solution sketch: Figure 4.24 on page 250 in the textbook shows 8 cores, each with its own L1 cache (described in the text), 4 L2 cache banks, each having a channel to external memory, one FPU unit, and a crossbar as interconnect. Coherence
is maintained by a catalog (directory) associated with each L2 cache, which knows which L1 caches have a copy of data in that L2 cache. // Textbook pages 249-250, also lecture

c) (Max 6 points) In the paper Exploring the Design Space of Future CMPs the authors perform a design space exploration where several main architectural parameters are varied, assuming a fixed total chip area of 400 mm2. Outline the approach by explaining the following figure.

Solution sketch: Technology-independent area models, found empirically; core area and cache area are measured in cache byte equivalents (CBE). Study the relative costs in area versus the associated performance gains, i.e. maximize performance per unit area for future technology generations. With smaller feature sizes, the available area for cache banks and processing cores increases. Table 3 displays die area in terms of cache byte equivalents (CBE), and the PIN and POUT columns show how many of each type of processor with 32 KB separate L1 instruction and data caches could be implemented on the chip if no L2 cache area were required. (PIN is a simple in-order-execution processor, POUT is a larger out-of-order execution processor.) And, for reference, lambda squared, where lambda is equal to one half of the feature size. The primary goal of the paper is to determine the best balance between per-processor cache area, area consumed by different processor organizations, and the number of cores on a single die. LF: New exercise / medium/difficult / slides 1-6 and 2-3

d) (Max 4 points) Explain the argument of the authors of the paper Exploring the Design Space of Future CMPs that we in the future may have chips with useless area that performs no other function than as a placeholder for pin area.

Solution sketch: As applications become bandwidth bound and global wire delays increase, an interesting scenario may arise. It is likely that monolithic caches cannot be grown past a certain point in 50 or 35 nm technologies, since the wire delays will make them too slow. It is also likely that, given a ceiling on cache size, off-chip bandwidth will limit the number of cores. Thus, there may be useless area on the chip which cannot be used for cache or processing logic, and which performs no function other than as a placeholder for pin area. That area may be useful for compression engines, or intelligent controllers to manage the caches and memory channels. (From lecture 8, slide 6 on page 4)

Exercise 4) Research prototypes (Max 20 points)

a) (Max 5 points) Sketch a figure of the main system structure of the Manchester Dataflow Machine (MDM). Include the following units: Matching Unit, Token Queue, I/O Switch, Instruction Store, Overflow Unit and Processing Unit. Show also how these are connected.

Solution sketch: See figure 5 in the paper, and the slides. The Overflow Unit is coupled to the Matching Unit, in parallel.
[Figure: the MDM structure, with the I/O Switch (Input/Output), Token Queue, Matching Unit, Instruction Store and Processing Unit (P0...P19) connected in a ring.]

b) (Max 5 points) What was the function of the Overflow Unit in MDM? Explain very briefly how it was implemented.

Solution sketch: If an operand does not find its corresponding operand in the Matching Unit (MU), and there is no space in the MU to store it (to wait for the other operand), the operand is stored in the overflow store. This is a separate and much slower subsystem with much larger storage capacity. It is composed of a separate overflow bus, memory and a microcoded processor, in other words a SW solution. See also figure 7 in the paper.

c) (Max 5 points) In the paper The Stanford FLASH Multiprocessor by Kuskin et al., the FLASH computer is described. FLASH is an abbreviation for FLexible Architecture for SHared memory. What kind of flexibility was the main goal of the project?

Solution sketch: Flexibility in programming paradigm: the choice between distributed shared memory (DSM), i.e. cache coherent shared memory, and message passing, but also other alternative ways of communication between the nodes could be explored.

d) (Max 5 points) Outline the main architecture of a node in a FLASH system. What was the most central design choice to achieve this flexibility?

Solution sketch: Fig. 2.1 explains much of this. The processing elements are interconnected in a mesh. The most central design choice was the MAGIC unit, a specially designed node controller. All memory accesses go through it, and it can, as an example, implement a cache-coherence protocol. Every node is identical. The whole computer has one single address space, but the memory is physically distributed.

---oooOOOooo---