TDT 4260 – lecture 1 – 2011
• Course introduction
   – course goals
   – staff
   – contents
   – evaluation
   – web, ITSL
• Textbook
   – Computer Architecture, A Quantitative Approach, Fourth Edition
      • by John Hennessy & David Patterson (HP 90 - 96 - 03 - 06)
• Today: Introduction (Chapter 1)
   – Partly covered

Course goal
• To get a general and deep understanding of the organization of modern computers and the motivation for different computer architectures. Give a base for understanding of research themes within the field.
• High level
• Mostly HW and low-level SW
• HW/SW interplay
• Parallelism
• Principles, not details → inspire to learn more




Contents
• Computer architecture fundamentals, trends, measuring performance, quantitative principles. Instruction set architectures and the role of compilers. Instruction-level parallelism, thread-level parallelism, VLIW.
• Memory hierarchy design, cache. Multiprocessors, shared memory architectures, vector processors, NTNU/Notur supercomputers, distributed shared memory, synchronization, multithreading.
• Interconnection networks, topologies
• Multicores, homogeneous and heterogeneous, principles and product examples
• Green computing (introduction)
• Miniproject – prefetching

TDT-4260 / DT8803
• Recommended background
   – Course TDT4160 Computer Fundamentals, or equivalent.
• http://www.idi.ntnu.no/emner/tdt4260/
   – And Its Learning
• Friday 1215-1400
   – And/or some Thursdays 1015-1200
   – 12 lectures planned
   – some exceptions may occur
• Evaluation
   – Obligatory exercise (counts 20%). Written exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit examination, the examination form may change from written to oral.







Lecture plan (subject to change)
 1: 14 Jan (LN, AI)   Introduction, Chapter 1 / Alex: PfJudge
 2: 21 Jan (IB)       Pipelining, Appendix A; ILP, Chapter 2
 3: 28 Jan (IB)       ILP, Chapter 2; TLP, Chapter 3
 4: 4 Feb (LN)        Multiprocessors, Chapter 4
 5: 11 Feb (MG?)      Prefetching + Energy Micro guest lecture
 6: 18 Feb (LN)       Multiprocessors continued
 7: 25 Feb (IB)       Piranha CMP + Interconnection networks
 8: 4 Mar (IB)        Memory and cache, cache coherence (Chap. 5)
 9: 11 Mar (LN)       Multicore architectures (Wiley book chapter) + Hill & Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
10: 18 Mar (IB)       Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI)   (1) Kongull and other NTNU and NOTUR supercomputers  (2) Green computing
12: 1 Apr (IB/LN)     Wrap-up lecture, remaining stuff
13: 8 Apr             Slack – no lecture planned

EMECS, new European Master's Course in Embedded Computing Systems




Preliminary reading list, subject to change!!!
• Chap.1: Fundamentals, sections 1.1 - 1.12 (pages 2-54)
• Chap.2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11 - 2.12 (pages 138-141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course)
• Chap.3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5 - 3.8 (pages 172-185).
• Chap.4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10
• Chap.5: Memory hierarchy, sections 5.1 - 5.3 (pages 288-315).
• App. A: section A.1 (Expected to be repetition from other courses)
• Appendix E: interconnection networks, pages E2-E14, E20-E25, E29-E37 and E45-E51.
• App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F-44 - F-45)
• Data prefetch mechanisms (ACM Computing Surveys)
• Piranha (to be announced)
• Multicores (new book chapter) (to be announced)
• (App. D: embedded systems?) – see our new course TDT4258 Mikrokontroller systemdesign (Microcontroller System Design)

People involved
Lasse Natvig – Course responsible, lecturer – lasse@idi.ntnu.no
Ian Bratt – Lecturer (also at Tilera.com) – ianbra@idi.ntnu.no
Alexandru Iordan – Teaching assistant (also PhD student) – iordan@idi.ntnu.no
http://www.idi.ntnu.no/people/




research.idi.ntnu.no/multicore
Some few highlights:
- Green computing, 2 x PhD + master students
- Multicore memory systems, 3 x PhD theses
- Multicore programming and parallel computing
- Cooperation with industry

Prefetching – pfjudge








”Computational computer architecture”
• Computational science and engineering (CSE)
   – Computational X, X = comp.arch.
• Simulates new multicore architectures
   – Last level, shared cache fairness (PhD student M. Jahre)
   – Bandwidth aware prefetching (PhD student M. Grannæs)
• Complex cycle-accurate simulators
   – 80 000 lines C++, 20 000 lines Python
   – Open source, Linux-based
• Design space exploration (DSE)
   – one dimension for each arch. parameter
   – DSE sample point = specific multicore configuration
   – performance of a selected set of configurations evaluated by simulating the execution of a set of workloads (see the sketch after the next slide)

Experiment Infrastructure
• Stallo compute cluster
   – 60 Teraflop/s peak
   – 5632 processing cores
   – 12 TB total memory
   – 128 TB centralized disk
   – Weighs 16 tons
• Multi-core research
   – About 60 CPU years allocated per year to our projects
   – Typical research paper uses 5 to 12 CPU years for simulation (extensive, detailed design space exploration)
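To make the DSE terminology concrete, here is a minimal sketch (not the group's actual tooling; the parameters and their values are invented for illustration) that enumerates a small design space as the cross product of architectural parameters, one sample point per configuration:

```cpp
#include <cstdio>

// Hypothetical design-space dimensions: one per architectural parameter.
const int coreCounts[]  = {2, 4, 8};          // number of cores
const int l2SizesKB[]   = {512, 1024, 2048};  // shared L2 capacity (KB)
const int prefetchDeg[] = {0, 1, 4};          // prefetch degree

int main() {
    int point = 0;
    // Each combination is one DSE sample point, i.e. one specific multicore
    // configuration whose performance would be evaluated by simulating a
    // set of workloads on it.
    for (int cores : coreCounts)
        for (int l2 : l2SizesKB)
            for (int deg : prefetchDeg)
                std::printf("point %2d: %d cores, %4d KB L2, prefetch degree %d\n",
                            ++point, cores, l2, deg);
    return 0;
}
```

Even this toy space has 3 x 3 x 3 = 27 sample points; real spaces grow combinatorially, which is why a single paper can consume several CPU years of simulation.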




The End of Moore's law for single-core microprocessors
[Figure] But Moore's law still holds for FPGA, memory and multicore processors

Motivational background
• Why multicores
   – in all market segments from mobile phones to supercomputers
• The ”end” of Moore's law
• The power wall
• The memory wall
• The bandwidth problem
• ILP limitations
• The complexity wall




Energy & Heat Problems
• Large power consumption
   – Costly
   – Heat problems
   – Restricted battery operation time
• Google ”Open House Trondheim 2006”
   – ”Performance/Watt is the only flat trend line”

The Memory Wall
[Figure: performance 1980-2000 on a log scale; CPU performance grows ~60%/year (”Moore's Law”) while DRAM improves ~9%/year, so the processor-memory gap grows ~50% per year.]
• The Processor Memory Gap
• Consequence: deeper memory hierarchies
   – P – Registers – L1 cache – L2 cache – L3 cache – Memory - - -
   – Complicates understanding of performance
      • cache usage has an increasing influence on performance
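One standard way to quantify how the hierarchy shows up in performance is average memory access time (AMAT); this is the usual textbook formula (not specific to this slide), shown for a two-level hierarchy:

$$\text{AMAT} = t_{\text{hit,L1}} + m_{\text{L1}}\left(t_{\text{hit,L2}} + m_{\text{L2}} \cdot t_{\text{mem}}\right)$$

For example, with a 1 ns L1 hit time, 5% L1 miss rate, 10 ns L2 hit time, 10% L2 miss rate and 100 ns memory latency: AMAT = 1 + 0.05(10 + 0.1 · 100) = 2 ns. Small shifts in either miss rate move overall performance noticeably, which is why cache usage increasingly dominates.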




The I/O pin or Bandwidth problem
• # I/O signaling pins
   – limited by physical technology
   – speeds have not increased at the same rate as processor clock rates
• Projections
   – from ITRS (International Technology Roadmap for Semiconductors)
[Figure from Huh, Burger and Keckler 2001]

The limitations of ILP (Instruction Level Parallelism) in Applications
[Figure: two plots – fraction of total cycles (%) versus number of instructions issued (0 to 6+), and speedup versus instructions issued per cycle (0 to 15).]




Reduced Increase in Clock Frequency
[Figure]

Solution: Multicore architectures
(also called Chip Multi-processors - CMP)
• More power-efficient
   – Two cores with clock frequency f/2 can potentially achieve the same speed as one at frequency f with 50% reduction in total energy consumption [Olukotun & Hammond 2005]
• Exploits Thread Level Parallelism (TLP)
   – in addition to ILP
   – requires multiprogramming or parallel programming
• Opens new possibilities for architectural innovations
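A back-of-the-envelope sketch of that claim, assuming dynamic power $P \approx \alpha C V^2 f$ and that halving the clock lets the supply voltage be scaled down by a factor $\beta$ (the achievable $\beta$ is technology-dependent):

$$\frac{P_{\text{two cores at } f/2}}{P_{\text{one core at } f}} = \frac{2\,\alpha C (\beta V)^2 (f/2)}{\alpha C V^2 f} = \beta^2$$

With $\beta \approx 0.7$ this gives roughly the 50% energy reduction cited above, at (ideally) unchanged total throughput.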




Why heterogeneous multicores?
• Specialized HW is faster than general HW
   – Math co-processor
   – GPU, DSP, etc.
• Benefits of customization
   – Similar to ASIC vs. general purpose programmable HW
• Amdahl's law
   – Parallel speedup limited by serial fraction
      • → 1 super-core
[Figure: Cell BE processor]

CPU – GPU – convergence (Performance – Programmability)
• Processors: Larrabee, Fermi, …
• Languages: CUDA, OpenCL, …




Parallel processing – conflicting goals
The P6-model: Parallel Processing challenges: Performance, Portability, Programmability and Power efficiency
[Figure: the four goals – Performance, Portability, Programmability, Power efficiency – in tension.]
• Examples:
   – Performance tuning may reduce portability
      • E.g. data structures adapted to cache block size
   – New languages for higher programmability may reduce performance and increase power consumption

Multicore programming challenges
• Instability, diversity, conflicting goals … what to do?
• What kind of parallel programming?
   – Homogeneous vs. heterogeneous
   – DSL vs. general languages
   – Memory locality
• What to teach?
   – Teaching should be founded on active research
• Two layers of programmers
   – The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]
      • Krste Asanovic presentation at ACACES Summer School 2007
   – 1) Programmability layer (Productivity layer) (80 - 90%)
      • ”Joe the programmer”
   – 2) Performance layer (Efficiency layer) (10 - 20%)
• Both layers involved in HPC
• Programmability an issue also at the performance layer




Parallel Computing Laboratory, U.C. Berkeley
(Slide adapted from Dave Patterson)
Easy to write correct programs that run efficiently on manycore
[Figure: the Par Lab stack – applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) over Design Patterns/Motifs and a Composition & Coordination Language (C&CL) with its compiler/interpreter; Parallel Libraries and Parallel Frameworks on the productivity path; Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synchronization Primitives and Efficiency Language Compilers on the efficiency path; OS Libraries & Services, Legacy OS and Hypervisor below; Multicore/GPGPU and RAMP Manycore hardware at the bottom; Diagnosing Power/Performance spans all layers.]

Classes of computers
• Servers
   – storage servers
   – compute servers (supercomputers)
   – web servers
   – high availability
   – scalability
   – throughput oriented (response time of less importance)
• Desktop (price 3 000 NOK – 50 000 NOK)
   – the largest market
   – price/performance focus
   – latency oriented (response time)
• Embedded systems
   – the fastest growing market (”everywhere”)
   – TDT 4258 Microcontroller system design
   – ATMEL, Nordic Semic., ARM, EM, ++




Falanx (Mali)
ARM Norway
[Figure]

Borgar – FXI Technologies
”An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and applications eco-systems (i.e. build an ARM based SoC, put it in a memory card, connect it to the web – and voilà, you got iPhone for the masses).”
• http://www.fxitech.com/
   – ”Headquartered in Trondheim
      • But also an office in Silicon Valley …”




Trends
• For technology, costs, use
• Help predicting the future
• Product development time
   – 2-3 years
   – → design for the next technology
   – Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system
   – hardware, runtime system, compiler, operating system, and application
   – In networking, this is called the ”End to End argument”
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
   – E.g., original RISC projects replaced complex instructions with a compiler + simple instructions




Computer Architecture is Design and Analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: the cycle Design → Analysis → Creativity → Cost/Performance Analysis, sorting bad and mediocre ideas from good ideas.]

TDT4260 Course Focus
Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century
[Figure: Computer Architecture (organization, hardware/software boundary) at the centre of Technology, Parallelism, Programming Languages, Applications, Interface Design (ISA), Compilers, Operating Systems, Measurement & Evaluation, and History.]




Holistic approach
e.g., to programmability
• Parallel & concurrent programming
• Operating System & system software
• Multicore, interconnect, memory

Moore's Law: 2X transistors / ”year”
• ”Cramming More Components onto Integrated Circuits”
   – Gordon Moore, Electronics, 1965
• # of transistors / cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
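Stated as a formula (a direct restatement of the bullet above): with a doubling period of N months, the transistor count after t months is

$$T(t) = T_0 \cdot 2^{\,t/N}$$

so with N = 24, one decade gives $2^{120/24} = 2^5 = 32\times$ more transistors per cost-effective chip.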




Tracking Technology Performance Trends
• 4 critical implementation technologies:
   – Disks
   – Memory
   – Network
   – Processors
• Compare improvements in performance over time for Bandwidth vs. Latency
• Bandwidth: number of events per unit time
   – E.g., Mbits/second over network, Mbytes/second from disk
• Latency: elapsed time for a single event
   – E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)
[Figure: relative bandwidth improvement vs. relative latency improvement (log-log) for processor, network, memory and disk; CPU high, memory low (”Memory Wall”); all four lie above the line where latency improvement equals bandwidth improvement.]
• Performance Milestones
   – Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
   – Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
   – Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
   – Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(Processor latency = typical # of pipeline stages * time per clock cycle)




COST and COTS
• Cost
   – to produce one unit
   – include (development cost / # sold units)
   – benefit of large volume
• COTS
   – commodity off the shelf

Speedup – Superlinear speedup?
• General definition:
   Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time
   – Speedup_fixed problem (p processors) = Time (1 processor) / Time (p processors)
• Note: use best sequential algorithm in the uni-processor solution, not the parallel algorithm with p = 1
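A minimal sketch of the fixed-problem-size definition (the timings are invented; note that the baseline is the best sequential algorithm, not the parallel algorithm run with p = 1):

```cpp
#include <cstdio>

// Speedup(p) = Time(best sequential) / Time(p processors).
double speedup(double tBestSeq, double tPar) { return tBestSeq / tPar; }

int main() {
    const double tBestSeq = 100.0;                // s, best sequential algorithm
    const int    procs[]  = {1, 2, 4};
    const double tPar[]   = {120.0, 64.0, 35.0};  // s, parallel algorithm on p procs

    for (int i = 0; i < 3; ++i)
        std::printf("p = %d: speedup = %.2f\n",
                    procs[i], speedup(tBestSeq, tPar[i]));
    return 0;
}
```

With these (made-up) numbers the parallel algorithm on one processor gives a ”speedup” of 0.83, i.e. a slowdown from parallelization overhead – exactly why the best sequential time must be the baseline.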




Amdahl's Law (1967) (fixed problem size)
• ”If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s”
• Total work in computation
   – serial fraction s
   – parallel fraction p
   – s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
       = (s + p) / [s + (p/n)]
       = 1 / [s + (1-s)/n]
       = n / [1 + (n-1)s]
• ”pessimistic and famous”

Gustafson's ”law” (1987) (scaled problem size, fixed execution time)
• Total execution time on parallel computer with n processors is fixed
   – serial fraction s'
   – parallel fraction p'
   – s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
        = (s' + p'n) / (s' + p')
        = s' + p'n = s' + (1-s')n
        = n + (1-n)s'
• Reevaluating Amdahl's law, John L. Gustafson, CACM May 1988, pp. 532-533. ”Not a new law, but Amdahl's law with changed assumptions”
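A small sketch contrasting the two formulas above (the 5% serial fraction is illustrative):

```cpp
#include <cstdio>

// Amdahl (fixed problem size): S(n) = 1 / (s + (1 - s)/n)
double amdahl(double s, int n)    { return 1.0 / (s + (1.0 - s) / n); }

// Gustafson (scaled problem size): S'(n) = s' + (1 - s') * n
double gustafson(double s, int n) { return s + (1.0 - s) * n; }

int main() {
    const double s = 0.05;  // serial fraction
    for (int n : {2, 8, 64, 1024})
        std::printf("n = %4d: Amdahl %6.2f   Gustafson %8.2f\n",
                    n, amdahl(s, n), gustafson(s, n));
    return 0;
}
```

With s = 0.05, Amdahl's speedup saturates below 1/s = 20 no matter how many processors are added, while Gustafson's grows linearly: the two laws answer different questions (same problem solved faster vs. a larger problem solved in the same time).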




How the serial fraction limits speedup
• Amdahl's law
• Work hard to reduce the serial part of the application
   – remember IO
   – think different (than traditionally or sequentially)
[Figure: speedup vs. number of processors for different values of the serial fraction.]








TDT4260 Computer architecture
Mini-project

PhD candidate Alexandru Ciprian Iordan
Department of Computer and Information Science




What is it…? How much…?
• The mini-project is the exercise part of the TDT4260 course

• This year the students will need to develop and evaluate a PREFETCHER

• The mini-project accounts for 20 % of the final grade in TDT4260
   – 80 % for report
   – 20 % for oral presentation




    What will you work with…

    • Modified version of M5 (for development and
      evaluation)

    • Computing time on Kongull cluster (for
      benchmarking)

    • More at: http://dm-ark.idi.ntnu.no/
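To give a feel for the task, here is a minimal sketch of the classic sequential (next-line) prefetcher. The interface (`MemAccess`, `on_access`, `issue_prefetch`) is a hypothetical stand-in, not the API of the modified M5; the real framework's hooks will differ:

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint64_t BLOCK_SIZE = 64;  // assumed cache block size in bytes

// Hypothetical record describing one memory access seen by the prefetcher.
struct MemAccess {
    uint64_t addr;  // byte address of the access
    bool     miss;  // did the access miss in the cache?
};

// Stub for the (hypothetical) framework call that enqueues a prefetch.
void issue_prefetch(uint64_t addr) {
    std::printf("prefetch block at 0x%llx\n", (unsigned long long)addr);
}

// Hook called on every access: on a miss, fetch the next cache block,
// betting that the access pattern is sequential.
void on_access(const MemAccess& a) {
    if (a.miss)
        issue_prefetch((a.addr / BLOCK_SIZE + 1) * BLOCK_SIZE);
}

int main() {
    on_access({0x1000, true});   // miss -> prefetch the block at 0x1040
    on_access({0x1040, false});  // hit, ideally thanks to the prefetch
    return 0;
}
```

The mini-project is about doing better than this baseline: more accurate pattern detection (e.g. strides) and care with timeliness and bandwidth, evaluated in the simulator on real benchmarks.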




    M5
    • Initially developed by the University of Michigan

    • Enjoys a large community of users and developers

    • Flexible object-oriented architecture

• Has support for 3 ISAs: Alpha, SPARC and MIPS




    Team work…

    • You need to work in groups of 2-4 students



• Grade is based on written paper AND oral presentation (choose your best speaker)




    Time Schedule and Deadlines




              More on It’s learning




    Web page presentation
TDT 4260
App A.1, Chap 2
Instruction Level Parallelism

Contents
• Instruction level parallelism – Chap 2
• Pipelining (repetition) – App A
   ▫ Basic 5-step pipeline
• Dependencies and hazards – Chap 2.1
   ▫ Data, name, control, structural
• Compiler techniques for ILP – Chap 2.2
• (Static prediction – Chap 2.3)
   ▫ Read this on your own
• Project introduction
Instruction level parallelism (ILP)
• A program is a sequence of instructions typically written to be executed one after the other
• Poor usage of CPU resources! (Why?)
• Better: Execute instructions in parallel
   ▫ 1: Pipeline – partial overlap of instruction execution
   ▫ 2: Multiple issue – total overlap of instruction execution
• Today: Pipelining

Pipelining (1/3)
[Figure]




Pipelining (2/3)
• Multiple different stages executed in parallel
   ▫ Laundry in 4 different stages
   ▫ Wash / Dry / Fold / Store
• Assumptions:
   ▫ Task can be split into stages
   ▫ Storage of temporary data
   ▫ Stages synchronized
   ▫ Next operation known before last finished?

Pipelining (3/3)
• Good Utilization: All stages are ALWAYS in use
   ▫ Washing, drying, folding, ...
   ▫ Great usage of resources!
• Common technique, used everywhere
   ▫ Manufacturing, CPUs, etc.
• Ideal: time_stage = time_instruction / stages (see the formula below)
   ▫ But stages are not perfectly balanced
   ▫ But transfer between stages takes time
   ▫ But pipeline may have to be emptied
   ▫ ...
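The ideal case in formula form (standard pipeline arithmetic, subject to the caveats just listed): with k balanced stages of delay $t_{\text{stage}} = t_{\text{instruction}}/k$, executing n instructions takes

$$T_{\text{pipelined}} = (k + n - 1)\,t_{\text{stage}}
\qquad\Rightarrow\qquad
\text{Speedup} = \frac{n \cdot k \cdot t_{\text{stage}}}{(k + n - 1)\,t_{\text{stage}}} \;\to\; k \ \text{as } n \to \infty$$

so a 5-stage pipeline approaches, but never quite reaches, a 5x gain in throughput.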
Example: MIPS64 (1/2)
• RISC
• Load/store
• Few instruction formats
• Fixed instruction length
• 64-bit
   ▫ DADD = 64 bits ADD
   ▫ LD = 64 bits L(oad)
• 32 registers (R0 = 0)
• EA = offset(Register)

Example: MIPS64 (2/2)
• Pipeline
   ▫ IF: Instruction fetch
   ▫ ID: Instruction decode / register fetch
   ▫ EX: Execute / effective address (EA)
   ▫ MEM: Memory access
   ▫ WB: Write back (reg)
[Figure: five instructions in flight over clock cycles 1-7; each flows through Ifetch, Reg, ALU, DMem and Reg (write-back), with a new instruction entering the pipeline every cycle.]
Big Picture:
• What are some real world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?

Big Picture (continued):
• Computer Architecture is the study of design tradeoffs!!!!
• There is no ”philosophy of architecture” and no ”perfect architecture”. This is engineering, not science.
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?
Improve speedup?
• Why not perfect speedup?
   ▫ Sequential programs
   ▫ One instruction dependent on another
   ▫ Not enough CPU resources
• What can be done?
   ▫ Forwarding (HW)
   ▫ Scheduling (SW / HW)
   ▫ Prediction (SW / HW)
• Both hardware (dynamic) and compiler (static) can help

Dependencies and hazards
• Dependencies
   ▫ Parallel instructions can be executed in parallel
   ▫ Dependent instructions are not parallel
        I1: DADD R1, R2, R3
        I2: DSUB R4, R1, R5
   ▫ Property of the instructions
• Hazards
   ▫ Situation where a dependency causes an instruction to give a wrong result
   ▫ Property of the pipeline
   ▫ Not all dependencies give hazards
        Dependencies must be close enough in the instruction stream to cause a hazard
Dependencies
• (True) data dependencies
  ▫ One instruction reads what an earlier instruction has written
• Name dependencies
  ▫ Two instructions use the same register / memory location
  ▫ But there is no flow of data between them
  ▫ Two types: anti dependencies and output dependencies
• Control dependencies
  ▫ Instructions dependent on the result of a branch
• Again: independent of the pipeline implementation (see the C sketch after these lists)

Hazards
• Data hazards
  ▫ Overlap would give a different result from sequential execution
  ▫ RAW / WAW / WAR
• Control hazards
  ▫ Branches
  ▫ Example: started executing the wrong instruction
• Structural hazards
  ▫ The pipeline does not support this combination of instructions
  ▫ Example: a register file with one port, but two stages want to read
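
To make the dependence classes concrete, here is a minimal C sketch (the variable names are illustrative only; each statement plays the role of one instruction):

    int f(int x, int y, int z, int w) {
        int a = x + y;   /* I1: writes a                                   */
        int b = a + z;   /* I2: true (data) dependence on I1: reads a      */
        a = w * 2;       /* I3: anti dependence on I2 (I2 read a first),
                                output dependence on I1 (both write a)     */
        if (b > 0)       /* I4: branch                                     */
            a = 0;       /* I5: control dependent on I4                    */
        return a + b;
    }

Renaming (giving I3's result a fresh variable) would remove the two name dependences; the true and control dependences would remain.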




Data dependency => Hazard?
(Figure A.6, page A-16)

[Pipeline diagram: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. The following instructions read r1 before the add has written it back.]

Data Hazards (1/3)
• Read After Write (RAW)
  InstrJ tries to read an operand before InstrI writes it:
      I: add r1,r2,r3
      J: sub r4,r1,r3
• Caused by a true data dependency
• This hazard results from an actual need for communication
Data Hazards (2/3)
• Write After Read (WAR)
  InstrJ writes an operand before InstrI reads it:
      I: sub r4,r1,r3
      J: add r1,r2,r3
• Caused by an anti dependency; it results from reuse of the name "r1"
• Can't happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Reads are always in stage 2, and
  ▫ Writes are always in stage 5

Data Hazards (3/3)
• Write After Write (WAW)
  InstrJ writes an operand before InstrI writes it:
      I: sub r1,r4,r3
      J: add r1,r2,r3
• Caused by an output dependency
• Can't happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Writes are always in stage 5
• WAR and WAW can occur in more complicated pipelines
Forwarding
(Figure A.7, page A-18)

[Pipeline diagram: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. The ALU result of the add is forwarded directly to the ALU inputs of the dependent instructions, removing the stalls.]

Can all data hazards be solved via forwarding?

[Pipeline diagram: Ld r1,r2 followed by add r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. The loaded value is available only after MEM, but the add needs it at the start of EX; even with forwarding, this load-use hazard costs a stall cycle.]
Structural Hazards (Memory Port)
(Figure A.4, page A-14)

[Pipeline diagram, cycles 1-7: a Load followed by Instr 1-4. With a single memory port, Instr 3's instruction fetch conflicts with the Load's data-memory access in the same cycle.]

Hazards, Bubbles
(Similar to Figure A.5, page A-15)

[Pipeline diagram: Load, Instr 1, then Ld r1,r2. A stall inserts bubbles so that the following Add r1,r1,r1 reads r1 only after the load has written it.]

How do you "bubble" the pipe? How can we avoid this hazard?
Control hazards (1/2)
• Sequential execution is predictable; (conditional) branches are not
• We may have fetched instructions that should not be executed
• Simple solution (figure): stall the pipeline (bubble)
  ▫ The performance loss depends on the number of branches in the program and on the pipeline implementation
  ▫ Branch penalty: the cycles lost between fetching a possibly wrong instruction and fetching the correct instruction

Control hazards (2/2)
• What can be done?
  ▫ Always stall (previous slide)
      Also called freezing or flushing the pipeline
  ▫ Assume no branch (= assume sequential execution)
      Must not change state before the branch instruction is complete
  ▫ Assume branch
      Only smart if the target address is ready early
  ▫ Delayed branch
      Execute a different instruction while the branch is evaluated
  These are static techniques (fixed rule or compiler)
Example
• Assume branch conditions are evaluated in the EX stage and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• If we assume branch-not-taken, how many bubbles follow an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to reduce the stall penalty?

Dynamic scheduling
• So far: static scheduling
  ▫ Instructions are executed in program order
  ▫ Any reordering is done by the compiler
• Dynamic scheduling
  ▫ The CPU reorders instructions to get a more optimal order
      Fewer hazards, fewer stalls, ...
  ▫ Must preserve the order of operations where reordering could change the result
  ▫ Covered by TDT 4255 Hardware Design
Compiler techniques for ILP
• For a given pipeline and degree of superscalarity
  ▫ How can these be best utilized?
  ▫ As few stalls from hazards as possible
• Dynamic scheduling
  ▫ Tomasulo's algorithm etc. (TDT4255)
  ▫ Makes the CPU much more complicated
• What can be done by the compiler?
  ▫ Has "ages" to spend, but less knowledge
  ▫ Static scheduling, but what else?

Example
Source code:
for (i = 1000; i > 0; i = i - 1)
  x[i] = x[i] + s;

Notice:
• Lots of dependencies
• No dependencies between iterations
• High loop overhead => loop unrolling

MIPS:
Loop: L.D      F0,0(R1)       ; F0 = x[i]
      ADD.D    F4,F0,F2       ; F2 = s
      S.D      F4,0(R1)       ; store x[i] + s
      DADDUI   R1,R1,#-8      ; x[i] is 8 bytes
      BNE      R1,R2,Loop     ; repeat until R1 = R2
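
For experimentation, here is a self-contained C version of the example (a sketch: the array contents and the value of s are arbitrary choices; compiling at different optimization levels and inspecting the assembly shows a compiler applying the transformations discussed next):

    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double x[N + 1];   /* elements x[1] .. x[N], as in the slide */
        double s = 2.0;           /* arbitrary value for s */

        for (int i = N; i > 0; i = i - 1)
            x[i] = x[i] + s;

        printf("%f\n", x[1]);     /* use the result so the loop is not removed */
        return 0;
    }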




Static scheduling

Original (9 cycles per iteration):          Scheduled (7 cycles per iteration):
Loop: L.D      F0,0(R1)                     Loop: L.D      F0,0(R1)
      stall                                       DADDUI   R1,R1,#-8
      ADD.D    F4,F0,F2                           ADD.D    F4,F0,F2
      stall                                       stall
      stall                                       stall
      S.D      F4,0(R1)                           S.D      F4,8(R1)
      DADDUI   R1,R1,#-8                          BNE      R1,R2,Loop
      stall
      BNE      R1,R2,Loop

Result: from 9 cycles per iteration to 7
(Delays from the table in Figure 2.2)

Loop unrolling

Original:                                   Unrolled 4 times:
Loop: L.D      F0,0(R1)                     Loop: L.D      F0,0(R1)
      ADD.D    F4,F0,F2                           ADD.D    F4,F0,F2
      S.D      F4,0(R1)                           S.D      F4,0(R1)
      DADDUI   R1,R1,#-8                          L.D      F6,-8(R1)
      BNE      R1,R2,Loop                         ADD.D    F8,F6,F2
                                                  S.D      F8,-8(R1)
                                                  L.D      F10,-16(R1)
                                                  ADD.D    F12,F10,F2
                                                  S.D      F12,-16(R1)
                                                  L.D      F14,-24(R1)
                                                  ADD.D    F16,F14,F2
                                                  S.D      F16,-24(R1)
                                                  DADDUI   R1,R1,#-32
                                                  BNE      R1,R2,Loop

• Reduced loop overhead
• Requires the number of iterations to be divisible by n (here n = 4)
• Register renaming
• Offsets have changed
• Stalls not shown
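
At the source level, the unrolled loop corresponds to roughly the following C sketch, reusing x, s, and N from the C example above and assuming, as the slide does, a trip count divisible by 4. The temporaries t0..t3 play the role of the renamed registers F4, F8, F12, F16:

    /* Body copied 4 times; each copy uses its own temporary, so the
       copies are independent and can be scheduled freely. */
    for (int i = N; i > 0; i -= 4) {
        double t0 = x[i]     + s;
        double t1 = x[i - 1] + s;   /* -8(R1):  one double below x[i] */
        double t2 = x[i - 2] + s;   /* -16(R1)                        */
        double t3 = x[i - 3] + s;   /* -24(R1)                        */
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }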
Unrolled (left) versus unrolled and scheduled (right):

Loop: L.D      F0,0(R1)           Loop: L.D      F0,0(R1)
      ADD.D    F4,F0,F2                 L.D      F6,-8(R1)
      S.D      F4,0(R1)                 L.D      F10,-16(R1)
      L.D      F6,-8(R1)                L.D      F14,-24(R1)
      ADD.D    F8,F6,F2                 ADD.D    F4,F0,F2
      S.D      F8,-8(R1)                ADD.D    F8,F6,F2
      L.D      F10,-16(R1)              ADD.D    F12,F10,F2
      ADD.D    F12,F10,F2               ADD.D    F16,F14,F2
      S.D      F12,-16(R1)              S.D      F4,0(R1)
      L.D      F14,-24(R1)              S.D      F8,-8(R1)
      ADD.D    F16,F14,F2               DADDUI   R1,R1,#-32
      S.D      F16,-24(R1)              S.D      F12,16(R1)
      DADDUI   R1,R1,#-32               S.D      F16,8(R1)
      BNE      R1,R2,Loop               BNE      R1,R2,Loop

The scheduled version avoids the stalls after L.D (1 cycle each), ADD.D (2 cycles each), and DADDUI (1 cycle). Note that the last two stores are moved below the DADDUI, so their offsets are adjusted by +32.

Loop unrolling: Summary
• Original code:    9 cycles per element
• Scheduling:       7 cycles per element
• Loop unrolling:   6.75 cycles per element
  ▫ Unrolled 4 iterations: 27 cycles (14 instructions + 13 stall cycles) for 4 elements
• Combination:      3.5 cycles per element
  ▫ Avoids stalls entirely: the 14 instructions of the unrolled, scheduled loop take 14 cycles, and 14/4 = 3.5

The compiler reduced execution time by 61% (from 9 to 3.5 cycles per element).
Loop unrolling in practice
• We do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops (see the sketch below):
  ▫ The 1st executes (n mod k) times and has the original loop body
  ▫ The 2nd is the unrolled body, surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
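
A minimal C sketch of this transformation for k = 4 (the function and variable names are illustrative; x is indexed 1..n as in the earlier example):

    void add_s(double *x, int n, double s) {
        int i = n;

        /* 1st loop: executes (n mod k) times with the original body */
        for (; i % 4 != 0; i--)
            x[i] += s;

        /* 2nd loop: the unrolled body, iterating n/k times */
        for (; i > 0; i -= 4) {
            x[i]     += s;
            x[i - 1] += s;
            x[i - 2] += s;
            x[i - 3] += s;
        }
    }

After the prologue, i is a multiple of 4, so the unrolled loop never over-runs the array regardless of n.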
Review
• Name real-world examples of pipelining
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?

TDT 4260
Chap 2, Chap 3
Instruction Level Parallelism (cont.)
Contents
• Very Long Instruction Word           Chap 2.7
  ▫ IA-64 and EPIC
• Instruction fetching                 Chap 2.9
• Limits to ILP                        Chap 3.1/3.2
• Multi-threading                      Chap 3.5

Getting CPI below 1
• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
1. Statically scheduled superscalar processors
   • In-order execution
   • Varying number of instructions issued (compiler)
2. Dynamically scheduled superscalar processors
   • Out-of-order execution
   • Varying number of instructions issued (CPU)
3. VLIW (very long instruction word) processors
   • In-order execution
   • Fixed number of instructions issued
VLIW: Very Long Instruction Word (1/2)
• Each VLIW has explicit coding for multiple operations
  ▫ Several instructions combined into packets (see the C sketch after these slides)
  ▫ Possibly with parallelism indicated
• Trades instruction space for simple decoding
  ▫ Room for many operations
  ▫ Independent operations => execute in parallel
  ▫ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch

VLIW: Very Long Instruction Word (2/2)
• Assume 2 load/store, 2 FP, 1 int/branch
  ▫ A VLIW with 0-5 operations
  ▫ Why 0?
• Important to avoid empty instruction slots
  ▫ Loop unrolling
  ▫ Local scheduling
  ▫ Global scheduling
      Scheduling across branches
• Difficult to find all dependencies in advance
  ▫ Solution 1: block on memory accesses
  ▫ Solution 2: the CPU detects some dependencies
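
As an illustration only (not any real machine's encoding), the 5-slot VLIW word assumed above could be modeled like this in C. An unused slot must still be encoded as a NOP, which is exactly the code-size problem discussed below:

    /* Hypothetical 5-slot VLIW word: 2 memory ops, 2 FP ops, 1 int/branch. */
    enum opcode { NOP, LOAD, STORE, FADD, FMUL, IADD, BRANCH };

    struct operation {
        enum opcode op;
        int dest, src1, src2;      /* register numbers (unused fields = 0) */
    };

    struct vliw_word {
        struct operation mem[2];   /* load/store slots      */
        struct operation fp[2];    /* floating-point slots  */
        struct operation intbr;    /* integer / branch slot */
    };

All five operations in one vliw_word are issued together and must be independent; filling the slots is the compiler's job.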
Recall: the unrolled loop that minimizes stalls for scalar

Source code:
for (i = 1000; i > 0; i = i - 1)
  x[i] = x[i] + s;

Register mapping: s => F2, i => R1

Loop: L.D      F0,0(R1)
      L.D      F6,-8(R1)
      L.D      F10,-16(R1)
      L.D      F14,-24(R1)
      ADD.D    F4,F0,F2
      ADD.D    F8,F6,F2
      ADD.D    F12,F10,F2
      ADD.D    F16,F14,F2
      S.D      F4,0(R1)
      S.D      F8,-8(R1)
      DADDUI   R1,R1,#-32
      S.D      F12,16(R1)
      S.D      F16,8(R1)
      BNE      R1,R2,Loop

Loop Unrolling in VLIW

Memory ref 1    | Memory ref 2    | FP op 1          | FP op 2          | Int op/branch    | Clock
L.D F0,0(R1)    | L.D F6,-8(R1)   |                  |                  |                  |   1
L.D F10,-16(R1) | L.D F14,-24(R1) |                  |                  |                  |   2
L.D F18,-32(R1) | L.D F22,-40(R1) | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |                  |   3
L.D F26,-48(R1) |                 | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |                  |   4
                |                 | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |                  |   5
S.D 0(R1),F4    | S.D -8(R1),F8   | ADD.D F28,F26,F2 |                  |                  |   6
S.D -16(R1),F12 | S.D -24(R1),F16 |                  |                  |                  |   7
S.D -32(R1),F20 | S.D -40(R1),F24 |                  |                  | DSUBUI R1,R1,#48 |   8
S.D -0(R1),F28  |                 |                  |                  | BNEZ R1,LOOP     |   9

• Unrolled 7 iterations to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: need more registers in VLIW (15 vs. 6 in SS)
Problems with 1st Generation VLIW
• Increase in code size
  ▫ Loop unrolling
  ▫ Partially empty VLIWs
• Operated in lock-step; no hazard detection HW
  ▫ A stall in any functional-unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
  ▫ The compiler might predict functional-unit latencies, but caches are hard to predict
  ▫ Modern VLIWs are "interlocked" (they identify dependences between bundles and stall)
• Binary code compatibility
  ▫ Strict VLIW => different numbers of functional units and unit latencies require different versions of the code

VLIW Tradeoffs
• Advantages
  ▫ "Simpler" hardware, because the HW does not have to identify independent instructions
• Disadvantages
  ▫ Relies on a smart compiler
  ▫ Code incompatibility between generations
  ▫ There are limits to what the compiler can do (it can't move loads above branches or loads above stores)
• Common uses
  ▫ The embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue
IA-64 and EPIC
• 64-bit instruction set architecture
  ▫ Not a CPU, but an architecture
  ▫ Itanium and Itanium 2 are CPUs based on IA-64
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
• Uses EPIC: Explicitly Parallel Instruction Computing
• A departure from the x86 architecture
• Meant to achieve out-of-order performance with in-order HW + compiler smarts
  ▫ Stop bits to help with code density
  ▫ Support for control speculation (moving loads above branches)
  ▫ Support for data speculation (moving loads above stores)
Details in Appendix G.6

Instruction bundle (VLIW)
[Figure: IA-64 instruction bundle layout; not reproduced]
Functional units and template
• Functional units:
  ▫ I (Integer), M (Integer + Memory), F (FP), B (Branch), L + X (64-bit operands + special instructions)
• Template field:
  ▫ Maps each instruction to a functional unit
  ▫ Indicates stops: limits to ILP

Code example (1/2)
[Figure: bundle code example; not reproduced]
Code example (2/2)
[Figure: bundle code example, continued; not reproduced]

Control Speculation
• Can the compiler schedule an independent load above a branch?
      Bne R1, R2, TARGET
      Ld  R3, R4(0)
• What are the problems?
• EPIC provides speculative loads:
      Ld.s  R3, R4(0)
      Bne   R1, R2, TARGET
      Check R4(0)
Data Speculation
• Can the compiler schedule an independent load above a store?
      St R5, R6(0)
      Ld R3, R4(0)
• What are the problems?
• EPIC provides "advanced loads" and an ALAT (Advanced Load Address Table):
      Ld.a R3, R4(0)     ; creates an entry in the ALAT
      St   R5, R6(0)     ; looks up the ALAT; on a match, jump to fixup code

EPIC Conclusions
• The goal of EPIC was to maintain the advantages of VLIW, but achieve the performance of out-of-order execution.
• Results:
  ▫ Complicated bundling rules save some space, but make the hardware more complicated
  ▫ Special hardware and instructions for scheduling loads above stores and branches (new, complicated hardware)
  ▫ Special hardware to remove branch penalties (predication)
  ▫ The end result is a machine as complicated as an out-of-order design, but one that also requires a super-sophisticated compiler.
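
A minimal software model of the ALAT idea, under the simplifying assumption that stores invalidate matching entries and the later check redoes the load if its entry is gone (the entry count and function names are hypothetical, not the IA-64 encoding):

    #include <stddef.h>

    #define ALAT_ENTRIES 8
    static long *alat[ALAT_ENTRIES];   /* remembered advanced-load addresses */

    /* Ld.a: perform the load early and record its address. */
    static long ld_a(int entry, long *addr) {
        alat[entry] = addr;
        return *addr;
    }

    /* St: stores snoop the ALAT and invalidate entries with the same address. */
    static void st(long *addr, long value) {
        *addr = value;
        for (int i = 0; i < ALAT_ENTRIES; i++)
            if (alat[i] == addr)
                alat[i] = NULL;
    }

    /* Check: if the entry survived, the speculated value is safe to use;
       otherwise fall back to the "fixup": redo the load. */
    static long check(int entry, long *addr, long speculated) {
        return alat[entry] ? speculated : *addr;
    }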
tdt4260

  • 1. TDT 4260 - lecture 1 - 2011: course goal, introduction, textbook and contents (slides 1-4). Contents, continued: multicore product examples; green computing (introduction); miniproject on prefetching. Evaluation: the obligatory exercise counts 20% and the written exam counts 80%; the final grade (A to F) is given at the end of the semester. If there is a re-sit examination, the examination form may change from written to oral.

Lecture plan (subject to change; date, lecturer, topic):
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 28 Jan (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG?) Prefetching + Energy Micro guest lecture
6: 18 Feb (LN) Multiprocessors continued
7: 25 Feb (IB) Piranha CMP + interconnection networks
8: 4 Mar (IB) Memory and cache, cache coherence (Chap. 5)
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty on Amdahl and multicore + Fedorova on asymmetric multicores
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers; (2) green computing
12: 1 Apr (IB/LN) Wrap-up lecture, remaining material
13: 8 Apr Slack; no lecture planned

EMECS: the new European Master's Course in Embedded Computing Systems.
  • 2. Preliminary reading list, subject to change!!!
• Chap. 1: Fundamentals, sections 1.1-1.12 (pages 2-54)
• Chap. 2: ILP, sections 2.1-2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11-2.12 (pages 138-141). (Sections 2.4-2.6 are covered by similar material in our computer design course.)
• Chap. 3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5-3.8 (pages 172-185)
• Chap. 4: Multiprocessors and TLP, sections 4.1-4.5, 4.8-4.10
• Chap. 5: Memory hierarchy, sections 5.1-5.3 (pages 288-315)
• App. A: section A.1 (expected to be repetition from other courses)
• App. E: interconnection networks, pages E2-E14, E20-E25, E29-E37 and E45-E51
• App. F: vector processors, sections F1-F4 and F8 (pages F-2 to F-32, F-44 to F-45)
• Data prefetch mechanisms (ACM Computing Surveys)
• Piranha (to be announced)
• Multicores (new book chapter, to be announced)
• (App. D, embedded systems? See our new course TDT4258 Mikrokontroller systemdesign.)

People involved (http://www.idi.ntnu.no/people/):
• Lasse Natvig: course responsible, lecturer. lasse@idi.ntnu.no
• Ian Bratt: lecturer (also at Tilera.com). ianbra@idi.ntnu.no
• Alexandru Iordan: teaching assistant (also PhD student). iordan@idi.ntnu.no

research.idi.ntnu.no/multicore. Some few highlights:
- Green computing: 2 PhD students + master students
- Multicore memory systems: 3 PhD theses
- Multicore programming and parallel computing
- Cooperation with industry
Prefetching: pfjudge.

"Computational computer architecture":
• Computational science and engineering (CSE): computational X, with X = computer architecture.
• Simulates new multicore architectures: last-level shared-cache fairness (PhD student M. Jahre); bandwidth-aware prefetching (PhD student M. Grannæs).
• Complex cycle-accurate simulators: 80 000 lines of C++, 20 000 lines of Python; open source, Linux-based.
• Design space exploration (DSE): one dimension for each architectural parameter; a DSE sample point = a specific multicore configuration; the performance of a selected set of configurations is evaluated by simulating the execution of a set of workloads (extensive, detailed design space exploration).

Experiment infrastructure: the Stallo compute cluster:
• 60 Teraflop/s peak, 5632 processing cores
• 12 TB total memory, 128 TB centralized disk
• Weighs 16 tons
• Multi-core research: about 60 CPU years allocated per year to our projects; a typical research paper uses 5 to 12 CPU years for simulation.
  • 3. Motivational background: why multicores, in all market segments from mobile phones to supercomputers:
• The "end" of Moore's law
• The power wall
• The memory wall
• The bandwidth problem
• ILP limitations
• The complexity wall

The end of Moore's law for single-core microprocessors: but Moore's law still holds for FPGAs, memories and multicore processors.

Energy & heat problems: large power consumption is costly, causes heat problems, and restricts battery operation time. Google "Open House Trondheim 2006": "Performance/Watt is the only flat trend line".

The memory wall. (Figure: performance 1980-2006 on a log scale; "Moore's Law"; CPU performance improves 60%/year, DRAM 9%/year; the processor-memory gap grows 50%/year.) Consequence of the processor-memory gap: deeper memory hierarchies (P, registers, L1 cache, L2 cache, L3 cache, memory, ...), which complicates the understanding of performance; cache usage has an increasing influence on performance.

The I/O pin or bandwidth problem: the number of I/O signaling pins is limited by physical technology, and pin speeds have not increased at the same rate as processor clock rates. Projections from the ITRS (International Technology Roadmap for Semiconductors) [Huh, Burger and Keckler 2001].

The limitations of ILP (instruction-level parallelism) in applications. (Figure: fraction of total cycles (%) versus number of instructions issued per cycle (0 to 6+), and speedup versus instructions issued per cycle (0 to 15).)
  • 4. Reduced increase in clock frequency. Solution: multicore architectures (also called chip multiprocessors, CMPs).
• More power-efficient: two cores with clock frequency f/2 can potentially achieve the same speed as one core at frequency f, with a 50% reduction in total energy consumption [Olukotun & Hammond 2005]. (A sketch of the arithmetic behind this claim follows below.)
• Exploits thread-level parallelism (TLP) in addition to ILP; requires multiprogramming or parallel programming.
• Opens new possibilities for architectural innovations.

Why heterogeneous multicores?
• Specialized HW is faster than general HW: math co-processors; GPUs, DSPs, etc.
• Benefits of customization: similar to an ASIC vs. general-purpose programmable HW.
• Amdahl's law: parallel speedup is limited by the serial fraction, which favors having one fast "super-core".

CPU-GPU convergence (performance vs. programmability): the Cell BE processor; processors: Larrabee, Fermi, ...; languages: CUDA, OpenCL, ...

Parallel processing: conflicting goals. The P6 model: Parallel Processing challenges: Performance, Portability, Programmability and Power efficiency.

Multicore programming challenges: instability, diversity, conflicting goals ... what to do?
• What kind of parallel programming? Homogeneous vs. heterogeneous; DSLs vs. general languages; memory locality.
• What to teach? Teaching should be founded on active research.
• Two layers of programmers (The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]; Krste Asanovic's presentation at the ACACES Summer School 2007): 1) the programmability layer (productivity layer, 80-90% of programmers), "Joe the programmer"; 2) the performance layer (efficiency layer, 10-20%). Both layers are involved in HPC, and programmability is an issue also at the performance layer.
• Examples: performance tuning may reduce portability (e.g. data structures adapted to the cache block size); new languages for higher programmability may reduce performance and increase power consumption.
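A back-of-the-envelope check of the two-cores-at-f/2 claim above. It uses the standard CMOS dynamic-power model and an assumed supply-voltage reduction; the specific numbers are illustrative, not from the slides:

    % Standard CMOS dynamic-power model (illustrative sketch):
    P_{dyn} = \alpha C V^2 f
    P_{1\,\text{core}} = \alpha C V^2 f
    P_{2\,\text{cores}} = 2\,\alpha C V'^2 (f/2) = \alpha C V'^2 f
    % Halving f permits a lower supply voltage V'. With V' = 0.7 V,
    % P_{2 cores} = 0.49 * P_{1 core} at the same aggregate throughput,
    % i.e. roughly the 50% energy reduction cited.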
  • 5. Parallel Computing Laboratory, U.C. Berkeley (slide adapted from Dave Patterson): easy to write correct programs that run efficiently on manycore. (Figure: the ParLab stack. Applications: personal health, image retrieval, hearing/music, speech, parallel browser. Layers: design patterns/motifs; composition & coordination language (C&CL); C&CL compiler/interpreter; parallel libraries; parallel frameworks; efficiency languages; sketching; autotuners; schedulers; communication & synchronization primitives; efficiency-language compilers; OS libraries & services; legacy OS; hypervisor; multicore/GPGPU; RAMP manycore; diagnosing power/performance; legacy code.)

Classes of computers:
• Servers: storage servers, compute servers (supercomputers), web servers; high availability, scalability, throughput-oriented (response time of less importance).
• Desktop (price 3 000 NOK to 50 000 NOK): the largest market; price/performance focus; latency-oriented (response time).
• Embedded systems: the fastest-growing market ("everywhere"); see TDT 4258 Microcontroller system design; ATMEL, Nordic Semiconductor, ARM, EM, and others.

FXI Technologies (Borgar; Falanx (Mali), ARM; Norway): "An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and application eco-systems (i.e. build an ARM-based SoC, put it in a memory card, connect it to the web, and voila, you got iPhone for the masses)." http://www.fxitech.com/ Headquartered in Trondheim, but also an office in Silicon Valley.

Trends: for technology, costs and use; they help in predicting the future. Product development time is 2-3 years, so design for the next technology. Why should an architecture live longer than a product?

Computer architecture is an integrated approach: what really matters is the functioning of the complete system: hardware, runtime system, compiler, operating system and application (in networking this is called the "end-to-end argument"). Computer architecture is not just about transistors, individual instructions, or particular implementations. E.g., the original RISC projects replaced complex instructions with a compiler plus simple instructions.
  • 6. Computer architecture is design and analysis. Architecture is an iterative process: searching the space of possible designs, at all levels of computer systems. (Figure: design, analysis, creativity; cost/performance analysis; measurement and evaluation; good ideas, mediocre ideas, bad ideas.)

TDT4260 course focus: understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century. (Figure: technology, parallelism, programming languages, applications, interface design, measurement & evaluation, history; Computer Architecture: ISA, organization, hardware/software boundary; compilers, operating systems.)

Moore's law: 2X transistors per "year". "Cramming More Components onto Integrated Circuits", Gordon Moore, Electronics, 1965. The number of transistors per cost-effective integrated circuit doubles every N months, with 12 <= N <= 24. (A compact restatement follows below.)

Holistic approach, e.g. to programmability: parallel & concurrent programming; operating system & system software; multicore, interconnect, memory.

Tracking technology performance trends. Four critical implementation technologies: disks, memory ("memory wall"), networks, processors. Compare bandwidth vs. latency improvements in performance over time. Bandwidth: the number of events per unit time (e.g. Mbit/s over a network, Mbyte/s from disk). Latency: the elapsed time for a single event (e.g. one-way network delay in microseconds, average disk access time in milliseconds).

Latency lags bandwidth (last ~20 years): CPU high, memory low. Performance milestones (figure: relative bandwidth improvement vs. relative latency improvement, log-log; the diagonal marks "latency improvement = bandwidth improvement"; all four technologies fall below it):
• Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth). (Processor latency = typical number of pipeline stages x time per clock cycle.)
• Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x).
• Memory module: 16-bit plain DRAM, page-mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x).
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x).
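The doubling rule above, written out as a formula (a restatement; the 10-year example period is chosen for illustration):

    % Transistors per cost-effective IC after t months, doubling every N months:
    T(t) = T(0) \cdot 2^{t/N}, \qquad 12 \le N \le 24
    % Example: N = 24 gives 2^{120/24} = 32x over 10 years.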
  • 7. COST and COTS.
• Cost to produce one unit: include (development cost / # sold units); hence the benefit of large volume.
• COTS: commodity off the shelf.

Speedup; superlinear speedup?
• General definition: Speedup(p processors) = Performance(p processors) / Performance(1 processor).
• For a fixed problem size (input data set), performance = 1/time, so Speedup_fixed_problem(p processors) = Time(1 processor) / Time(p processors).
• Note: use the best sequential algorithm for the uniprocessor solution, not the parallel algorithm with p = 1.

Amdahl's law (1967), fixed problem size: "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s." Total work in the computation: serial fraction s, parallel fraction p, s + p = 1 (100%).
  S(n) = Time(1) / Time(n) = (s + p) / (s + p/n) = 1 / (s + (1 - s)/n) = n / (1 + (n - 1)s)
"Pessimistic and famous."

Gustafson's "law" (1987), scaled problem size and fixed execution time: the total execution time on a computer with n processors is fixed; serial fraction s', parallel fraction p', s' + p' = 1 (100%).
  S'(n) = Time'(1) / Time'(n) = (s' + p'n) / (s' + p') = s' + p'n = s' + (1 - s')n = n + (1 - n)s'
Reevaluating Amdahl's Law, John L. Gustafson, CACM, May 1988, pp. 532-533: "not a new law, but Amdahl's law with changed assumptions".

How the serial fraction limits speedup. (Figure: Amdahl's-law speedup curves for varying serial fraction s.) Work hard to reduce the serial part of the application; remember I/O; think differently (than traditionally or sequentially).
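The two formulas side by side in a small C program; the 5% serial fraction is an arbitrary example value:

    #include <stdio.h>

    /* Amdahl: fixed problem size, serial fraction s */
    static double amdahl(double s, int n)
    {
        return 1.0 / (s + (1.0 - s) / n);
    }

    /* Gustafson: scaled problem size, serial fraction s of the n-processor run */
    static double gustafson(double s, int n)
    {
        return n + (1.0 - n) * s;
    }

    int main(void)
    {
        const double s = 0.05;              /* example: 5% serial */
        for (int n = 2; n <= 1024; n *= 4)
            printf("n=%4d  Amdahl=%6.2f  Gustafson=%7.2f\n",
                   n, amdahl(s, n), gustafson(s, n));
        return 0;
    }

Amdahl's curve saturates near 1/s = 20 while Gustafson's grows almost linearly; that difference is exactly the "changed assumptions" the slide refers to.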
  • 8. TDT4260 Computer Architecture mini-project. PhD candidate Alexandru Ciprian Iordan, Institutt for datateknikk og informasjonsvitenskap.
• 9. What is it? How much? The mini-project is the exercise part of the TDT4260 course. This year the students will develop and evaluate a PREFETCHER. The mini-project accounts for 20% of the final grade in TDT4260: 80% for the report and 20% for the oral presentation.
• 10. What will you work with: a modified version of M5 (for development and evaluation); computing time on the Kongull cluster (for benchmarking). More at: http://dm-ark.idi.ntnu.no/
• 11. M5: initially developed by the University of Michigan; enjoys a large community of users and developers; flexible object-oriented architecture; supports 3 ISAs: Alpha, SPARC and MIPS.
• 12. Team work: you need to work in groups of 2-4 students. The grade is based on the written paper AND the oral presentation (choose your best speaker).
• 13. Time schedule and deadlines: more on It's Learning.
• 14. Web page presentation.
  • 15. TDT 4260: Instruction Level Parallelism (App A.1, Chap 2). Contents:
• Instruction-level parallelism (Chap 2)
• Pipelining (repetition): the basic 5-step pipeline (App A)
• Dependencies and hazards: data, name, control, structural (Chap 2.1)
• Compiler techniques for ILP (Chap 2.2)
• (Static prediction (Chap 2.3): read this on your own)
• Project introduction

Instruction-level parallelism (ILP) (1/3):
• A program is a sequence of instructions, typically written to be executed one after the other.
• Poor usage of CPU resources! (Why?)
• Better: execute instructions in parallel.
▫ 1: Pipelining: partial overlap of instruction execution.
▫ 2: Multiple issue: total overlap of instruction execution.
• Today: pipelining.

Pipelining (2/3):
• Laundry in 4 different stages: wash / dry / fold / store; great usage of resources!
• A common technique, used everywhere: manufacturing, CPUs, etc.

Pipelining (3/3):
• Multiple different stages executed in parallel.
• Good utilization: all stages are ALWAYS in use (washing, drying, folding, ...).
• Assumptions: the task can be split into stages; storage of temporary data; stages synchronized; next operation known before the last has finished?
• Ideal: time_stage = time_instruction / stages. But stages are not perfectly balanced, transfer between stages takes time, and the pipeline may have to be emptied.
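The ideal relation in the last bullet, written out as a speedup bound; the latch-overhead term is the standard refinement, not from the slide:

    % k pipeline stages, each ideally of length t_instruction / k:
    \text{Speedup}_{\text{ideal}} = k
    % In practice the clock period is set by the slowest stage plus
    % the latch/transfer overhead t_l, so
    \text{Speedup} = \frac{t_{\text{instruction}}}{\max_i t_i + t_l} < k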
  • 16. Example: MIPS64 (1/2):
• RISC, load/store architecture
• Few instruction formats, fixed instruction length
• 64-bit: DADD = 64-bit ADD, LD = 64-bit L(oad)
• 32 registers (R0 = 0)
• EA = offset(Register)

Example: MIPS64 (2/2), the pipeline:
▫ IF: instruction fetch
▫ ID: instruction decode / register fetch
▫ EX: execute / effective address (EA)
▫ MEM: memory access
▫ WB: write back (to register)
(Figure: time in clock cycles 1-7; successive instructions, each passing through Ifetch, Reg, ALU, DMem, Reg, overlap by one cycle.)

Big picture:
• What are some real-world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?

Big picture (continued): computer architecture is the study of design tradeoffs!!!! There is no "philosophy of architecture" and no "perfect architecture". This is engineering, not science.

Improve speedup? Why not perfect speedup?
• Sequential programs: parallel (independent) instructions can be executed in parallel, but dependent instructions are not parallel.
• One instruction dependent on another, e.g.:
  I1: DADD R1, R2, R3
  I2: DSUB R4, R1, R5
• Not enough CPU resources.
What can be done? Forwarding (HW); scheduling (SW/HW); prediction (SW/HW). Both hardware (dynamic) and the compiler (static) can help.

Dependencies and hazards:
• Dependency: a property of the instructions.
• Hazard: a situation where a dependency causes an instruction to give a wrong result; a property of the pipeline. Not all dependencies give hazards: the instructions must be close enough in the instruction stream to cause a hazard.
  • 17. Dependencies:
• (True) data dependencies: one instruction reads what an earlier instruction has written.
• Name dependencies: two instructions use the same register or memory location, but there is no flow of data between them; two types: anti-dependencies and output dependencies.
• Control dependencies: instructions dependent on the result of a branch.
• Again: independent of the pipeline implementation.

Hazards:
• Data hazards: overlap would give a different result from sequential execution; RAW / WAW / WAR.
• Control hazards: branches; e.g. we have started executing the wrong instruction.
• Structural hazards: the pipeline does not support this combination of instructions; e.g. a register file with one port while two stages want to read.

Data dependency: hazard? (Figure A.6, page A-16: add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11; the readers of r1 overlap the writer in the pipeline.)

Data hazards (1/3). Read After Write (RAW): instruction J tries to read an operand before instruction I has written it.
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a true data dependency; this hazard results from an actual need for communication.

Data hazards (2/3). Write After Write (WAW): instruction J writes an operand before instruction I writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
Caused by an output dependency. Cannot happen in the MIPS 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5.

Data hazards (3/3). Write After Read (WAR): instruction J writes an operand before instruction I reads it.
  I: sub r4,r1,r3
  J: add r1,r2,r3
Caused by an anti-dependency; it results from reuse of the name "r1". Cannot happen in the MIPS 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5. WAR and WAW can occur in more complicated pipelines.
  • 18. Forwarding (Figure A.7, page A-18): ALU results are fed directly back to a following instruction's EX stage, so the add r1,r2,r3 / sub r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11 sequence runs without stalls.

Can all data hazards be solved via forwarding??? (Figure: Ld r1,r2 followed by add r4,r1,r3 and further users of r1.) No: a load delivers its data at the end of MEM, one cycle too late for the immediately following instruction's EX stage.

Structural hazards (memory port) (similar to Figure A.5, page A-15; Figure A.4, page A-14): with a single memory port, a load's DMem access and a later instruction's Ifetch collide in the same cycle.

Hazards and bubbles: Ld r1, r2 immediately followed by Add r1, r1, r1 forces a stall; bubbles are inserted into the pipeline. How do you "bubble" the pipe? How can we avoid this hazard?

Control hazards (1/2): sequential execution is predictable; (conditional) branches are not. We may have fetched instructions that should not be executed. Simple solution (figure): stall the pipeline (bubble); we must not change state before the branch instruction is complete. The performance loss (branch penalty) depends on the number of branches in the program and on the pipeline implementation.

Control hazards (2/2). What can be done?
▫ Always stop (as on the previous slide); also called freezing or flushing the pipeline.
▫ Assume no branch (= assume sequential execution): possibly the wrong instruction is started.
▫ Assume branch: only smart if the target address is ready early.
▫ Delayed branch: execute a different (correct) instruction while the branch is evaluated.
These are static techniques (fixed rule or compiler).
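A standard way to put a number on the stall cost just described (the relation is the usual pipeline-CPI formula; the example figures are invented):

    % Pipeline CPI with branch stalls:
    CPI = CPI_{ideal} + f_{branch} \times \text{branch penalty}
    % Example: CPI_ideal = 1, 20% branches, 3 bubbles per branch:
    % CPI = 1 + 0.2 * 3 = 1.6, i.e. a 60% slowdown from branches alone.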
  • 19. Example: assume branch conditionals are evaluated in the EX stage and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• Assume branch-not-taken: how many bubbles for an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to improve the stall penalty?

Dynamic scheduling. So far, static scheduling: instructions are executed in program order, and any reordering is done by the compiler. Dynamic scheduling: the CPU reorders instructions to get a more optimal order (fewer hazards, fewer stalls, ...), but must preserve the order of operations where reordering could change the result. Covered by TDT 4255 Hardware Design.

Compiler techniques for ILP: for a given pipeline and degree of superscalarity, how can they best be utilized, with as few stalls from hazards as possible? Dynamic scheduling (Tomasulo's algorithm etc., TDT4255) makes the CPU much more complicated. What can be done by the compiler? It has "ages" to spend, but less knowledge: static scheduling, but what else?

Example, source code:
  for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;
Notice: no dependencies between iterations; lots of dependencies within an iteration; high loop overhead (motivates loop unrolling).

MIPS:
  Loop: L.D    F0,0(R1)    ; F0 = x[i]
        ADD.D  F4,F0,F2    ; F2 = s
        S.D    F4,0(R1)    ; store x[i] + s
        DADDUI R1,R1,#-8   ; x[i] is 8 bytes
        BNE    R1,R2,Loop  ; R1 = R2?
With stalls shown (delays from the table in Figure 2.2), this takes 9 cycles per iteration: stalls after L.D (1), ADD.D (2) and DADDUI (1).

Static scheduling:
  Loop: L.D    F0,0(R1)
        DADDUI R1,R1,#-8
        ADD.D  F4,F0,F2
        stall
        stall
        S.D    F4,8(R1)    ; offset adjusted after DADDUI
        BNE    R1,R2,Loop
Result: from 9 cycles per iteration down to 7.

Loop unrolling (4 iterations; stalls not shown):
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        L.D    F6,-8(R1)
        ADD.D  F8,F6,F2
        S.D    F8,-8(R1)
        L.D    F10,-16(R1)
        ADD.D  F12,F10,F2
        S.D    F12,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F16,F14,F2
        S.D    F16,-24(R1)
        DADDUI R1,R1,#-32
        BNE    R1,R2,Loop
• Reduced loop overhead.
• Requires the number of iterations to be divisible by n (here n = 4).
• Register renaming (F0/F4, F6/F8, F10/F12, F14/F16).
• The offsets have changed.
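The same unroll-by-4 transformation at the C level (a sketch; the slide does it in MIPS assembly, and register renaming corresponds to the four independent statements):

    /* Minimal sketch of unrolling by 4; assumes the trip count (1000)
       is divisible by 4 and that x[1..1000] are valid elements. */
    void add_s_unrolled4(double *x, double s)
    {
        for (int i = 1000; i > 0; i = i - 4) {
            x[i]     = x[i]     + s;   /* the 4 statements are independent, */
            x[i - 1] = x[i - 1] + s;   /* so a scheduler can interleave them */
            x[i - 2] = x[i - 2] + s;   /* to hide load and FP-add latency    */
            x[i - 3] = x[i - 3] + s;
        }
    }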
  • 20. Loop unrolling combined with scheduling: all loads first, then all adds, then all stores:
  Loop: L.D    F0,0(R1)
        L.D    F6,-8(R1)
        L.D    F10,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F4,F0,F2
        ADD.D  F8,F6,F2
        ADD.D  F12,F10,F2
        ADD.D  F16,F14,F2
        S.D    F4,0(R1)
        S.D    F8,-8(R1)
        DADDUI R1,R1,#-32
        S.D    F12,16(R1)  ; offsets adjusted after DADDUI
        S.D    F16,8(R1)
        BNE    R1,R2,Loop
This avoids the stalls after L.D (1), ADD.D (2) and DADDUI (1).

Loop unrolling: summary.
• Original code: 9 cycles per element.
• Scheduling: 7 cycles per element.
• Loop unrolling (4 iterations): 6.75 cycles per element.
• Combination: 3.5 cycles per element; avoids stalls entirely.
The compiler reduced execution time by 61%.

Loop unrolling in practice: we do not usually know the upper bound of the loop. Suppose it is n and we would like to unroll the loop to make k copies of the body. Instead of a single unrolled loop, we generate a pair of consecutive loops: the first executes (n mod k) times and has the original loop body; the second is the unrolled body surrounded by an outer loop that iterates (n/k) times. For large values of n, most of the execution time is spent in the unrolled loop. (A sketch of this scheme follows below.)
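A C sketch of the two-loop scheme, assuming 1-based elements x[1..n] as in the slide's example:

    /* Unroll by k with an unknown trip count n: a first loop runs the
       (n mod k) leftover iterations with the original body, then the
       unrolled loop runs n/k times. Sketch only. */
    void add_s(double *x, double s, int n)
    {
        enum { K = 4 };
        int i = n;

        /* 1st loop: executes (n mod K) times, original body */
        for (; i % K != 0; i = i - 1)
            x[i] = x[i] + s;

        /* 2nd loop: unrolled body, iterates n/K times */
        for (; i > 0; i = i - K) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }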
  • 21. Review:
• Name real-world examples of pipelining.
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?

TDT 4260, Chap 2 and Chap 3: Instruction Level Parallelism (cont.). Contents:
• Very Long Instruction Word (Chap 2.7): IA-64 and EPIC
• Instruction fetching (Chap 2.9)
• Limits to ILP (Chap 3.1/3.2)
• Multithreading (Chap 3.5)

Getting CPI below 1: CPI >= 1 if we issue only 1 instruction every clock cycle. Multiple-issue processors come in 3 flavors:
1. Statically scheduled superscalar processors: in-order execution, varying number of instructions issued (by the compiler).
2. Dynamically scheduled superscalar processors: out-of-order execution, varying number of instructions issued (by the CPU).
3. VLIW (very long instruction word) processors: in-order execution, fixed number of instructions issued.

VLIW: Very Long Instruction Word (1/2):
• Each VLIW has explicit coding for multiple operations: several instructions combined into packets, possibly with the parallelism indicated.
• Tradeoff: instruction space for simple decoding; room for many operations; independent operations execute in parallel; e.g. 2 integer operations, 2 FP operations, 2 memory references, 1 branch.

VLIW (2/2):
• Assume 2 load/store units, 2 FP units, 1 integer/branch unit: a VLIW with 0-5 operations. (Why 0?)
• Important to avoid empty instruction slots: loop unrolling; local scheduling; global scheduling (scheduling across branches).
• Difficult to find all dependencies in advance. Solution 1: block on memory accesses. Solution 2: the CPU detects some dependencies.
  • 22. Loop unrolling in VLIW. Recall the unrolled loop that minimizes stalls for the scalar pipeline; in a VLIW with 2 memory references, 2 FP operations and 1 integer op/branch per instruction word, 7 iterations are unrolled to avoid delays:

  Memory ref. 1    | Memory ref. 2    | FP op. 1         | FP op. 2         | Int. op/branch   | Clock
  L.D F0,0(R1)     | L.D F6,-8(R1)    |                  |                  |                  | 1
  L.D F10,-16(R1)  | L.D F14,-24(R1)  |                  |                  |                  | 2
  L.D F18,-32(R1)  | L.D F22,-40(R1)  | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |                  | 3
  L.D F26,-48(R1)  |                  | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |                  | 4
                   |                  | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |                  | 5
  S.D 0(R1),F4     | S.D -8(R1),F8    | ADD.D F28,F26,F2 |                  |                  | 6
  S.D -16(R1),F12  | S.D -24(R1),F16  |                  |                  |                  | 7
  S.D -32(R1),F20  | S.D -40(R1),F24  |                  |                  | DSUBUI R1,R1,#48 | 8
  S.D -0(R1),F28   |                  |                  |                  | BNEZ R1,LOOP     | 9

7 iterations in 9 clocks, i.e. 1.3 clocks per iteration (1.8X). Average: 2.5 operations per clock, 50% efficiency. Note: VLIW needs more registers (15 vs. 6 in the superscalar version).

Problems with 1st-generation VLIW:
• Increase in code size: loop unrolling; partially empty VLIWs.
• Operated in lock-step with no hazard-detection HW: a stall in any functional-unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized. The compiler might predict functional units, but caches are hard to predict. Modern VLIWs are "interlocked" (they identify dependences between bundles and stall).
• Binary code compatibility: in a strict VLIW, different numbers of functional units and different unit latencies require different versions of the code.

VLIW tradeoffs:
• Advantages: "simpler" hardware, because the HW does not have to identify independent instructions.
• Disadvantages: relies on a smart compiler; code incompatibility between generations; there are limits to what the compiler can do (it cannot move loads above branches or above stores).
• Common uses: the embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue.

IA-64 and EPIC:
• A 64-bit instruction set architecture: not a CPU, but an architecture; Itanium and Itanium 2 are CPUs based on IA-64.
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 were designed in Colorado).
• Uses EPIC: Explicitly Parallel Instruction Computing; a departure from the x86 architecture.
• Meant to achieve out-of-order performance with in-order HW plus compiler smarts: stop bits to help with code density; support for control speculation (moving loads above branches); support for data speculation (moving loads above stores). Instructions are grouped in bundles (VLIW). Details in Appendix G.6.
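Checking the utilization numbers quoted under the table (simple counting, worked out here for clarity):

    % 7 L.D + 7 ADD.D + 7 S.D + DSUBUI + BNEZ = 23 operations in 9 clocks:
    23 / 9 \approx 2.5 \ \text{ops per clock}
    % 9 instruction words x 5 slots = 45 issue slots available:
    23 / 45 \approx 51\% \ \text{efficiency}
    % 9 clocks / 7 iterations \approx 1.3 clocks per iteration.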
  • 23. Functional units and template:
• Functional units: I (integer), M (integer + memory), F (FP), B (branch), L + X (64-bit operands + special instructions).
• Template field: maps instructions to functional units and indicates stops.
(Code examples 1/2 and 2/2 on the slides show bundles with stop bits.)

Limitations to ILP: control speculation. Can the compiler schedule an independent load above a branch?
  Bne R1, R2, TARGET
  Ld  R3, R4(0)
What are the problems? EPIC provides speculative loads:
  Ld.s  R3, R4(0)
  Bne   R1, R2, TARGET
  Check R4(0)

Data speculation. Can the compiler schedule an independent load above a store?
  St R5, R6(0)
  Ld R3, R4(0)
What are the problems? EPIC provides "advanced loads" and an ALAT (Advanced Load Address Table):
  Ld.a R3, R4(0)   ; creates an entry in the ALAT
  St   R5, R6(0)   ; looks up the ALAT; on a match, jump to fixup code

EPIC conclusions: the goal of EPIC was to maintain the advantages of VLIW but achieve the performance of out-of-order execution. Results:
• Complicated bundling rules save some space, but make the hardware more complicated.
• Special hardware and instructions were added for scheduling loads above stores and branches (new, complicated hardware).
• Special hardware was added to remove branch penalties (predication).
• The end result is a machine as complicated as an out-of-order design, but one that also requires a super-sophisticated compiler.
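What "the problems" are in the two questions above, sketched in C (function and variable names are invented for illustration):

    /* Control speculation: the branch may guard the load. Hoisting the
       dereference above the test can fault when p == NULL; EPIC's Ld.s
       defers such a fault until the Check instruction. */
    int safe_read(int *p)
    {
        if (p == 0)
            return 0;      /* Bne: the load below must not run here */
        return *p;         /* Ld: safe only after the test */
    }

    /* Data speculation: the store may alias the load. If q == r, moving
       the load above the store would return a stale value; the ALAT
       detects the conflict and redirects to fixup code. */
    void store_then_load(int *q, int *r, int v, int *out)
    {
        *q = v;            /* St R5, R6(0) */
        *out = *r;         /* Ld R3, R4(0): movable above the store only
                              if q and r never alias */
    }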