Transcript of "tdt4260"

1. TDT 4260 – lecture 1 – 2011
• Course introduction
  ▫ course goals
  ▫ staff
  ▫ contents
  ▫ evaluation
  ▫ web, ITSL
• Textbook
  ▫ Computer Architecture, A Quantitative Approach, Fourth Edition
  ▫ by John Hennessy & David Patterson (HP 90 - 96 - 03 - 06)
• Today: Introduction (Chapter 1)

Course goal
• To get a general and deep understanding of the organization of modern computers and the motivation for different computer architectures. Give a base for understanding research themes within the field.
• High level
• Mostly HW and low-level SW
• HW/SW interplay
• Parallelism
• Principles, not details
  ▫ Partly covered → inspire to learn more

Contents
• Computer architecture fundamentals, trends, measuring performance, quantitative principles. Instruction set architectures and the role of compilers. Instruction-level parallelism, thread-level parallelism, VLIW.
• Memory hierarchy design, cache. Multiprocessors, shared memory architectures, vector processors, NTNU/Notur supercomputers, distributed shared memory, synchronization, multithreading.
• Interconnection networks, topologies
• Multicores, homogeneous and heterogeneous, principles and examples
• Green computing (introduction)
• Miniproject – prefetching

TDT-4260 / DT8803
• Recommended background
  ▫ Course TDT4160 Computer Fundamentals, or equivalent
• http://www.idi.ntnu.no/emner/tdt4260/
  ▫ And It's Learning
• Friday 1215-1400
  ▫ And/or some Thursdays 1015-1200
  ▫ 12 lectures planned – some exceptions may occur
• Evaluation
  ▫ Obligatory exercise (counts 20%). Written exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit examination, the examination form may change from written to oral.

Lecture plan (subject to change)
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 28 Jan (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG?) Prefetching + Energy Micro guest lecture
6: 18 Feb (LN) Multiprocessors continued
7: 25 Feb (IB) Piranha CMP + Interconnection networks
8: 4 Mar (IB) Memory and cache, cache coherence (Chap. 5)
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore, Fedorova asymmetric multicore
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers; (2) Green computing
12: 1 Apr (IB/LN) Wrap-up lecture, remaining stuff
13: 8 Apr Slack – no lecture planned

EMECS – new European Masters Course in Embedded Computing Systems
2. Preliminary reading list, subject to change!
• Chap. 1: Fundamentals, sections 1.1 - 1.12 (pages 2-54)
• Chap. 2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11 - 2.12 (pages 138-141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course)
• Chap. 3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5 - 3.8 (pages 172-185)
• Chap. 4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10
• Chap. 5: Memory hierarchy, sections 5.1 - 5.3 (pages 288-315)
• App. A: section A.1 (expected to be repetition from other courses)
• Appendix E: interconnection networks, pages E2-E14, E20-E25, E29-E37 and E45-E51
• App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F-44 - F-45)
• Data prefetch mechanisms (ACM Computing Surveys)
• Piranha (to be announced)
• Multicores (new book chapter) (to be announced)
• (App. D: embedded systems?) → see our new course TDT4258 Mikrokontroller systemdesign

People involved
• Lasse Natvig – course responsible, lecturer – lasse@idi.ntnu.no
• Ian Bratt – lecturer (also at Tilera.com) – ianbra@idi.ntnu.no
• Alexandru Iordan – teaching assistant (also PhD student) – iordan@idi.ntnu.no
• http://www.idi.ntnu.no/people/

research.idi.ntnu.no/multicore
• Prefetching – pfjudge
• Some few highlights:
  ▫ Green computing, 2 x PhD + master students
  ▫ Multicore memory systems, 3 x PhD theses
  ▫ Multicore programming and parallel computing
  ▫ Cooperation with industry

"Computational computer architecture"
• Computational science and engineering (CSE)
  ▫ Computational X, X = comp.arch.
• Simulates new multicore architectures
  ▫ Last-level, shared cache fairness (PhD student M. Jahre)
  ▫ Bandwidth-aware prefetching (PhD student M. Grannæs)
• Complex cycle-accurate simulators
  ▫ 80 000 lines C++, 20 000 lines Python
  ▫ Open source, Linux-based
• Design space exploration (DSE)
  ▫ one dimension for each arch. parameter
  ▫ DSE sample point = specific multicore configuration
  ▫ performance of a selected set of configurations evaluated by simulating the execution of a set of workloads (extensive, detailed design space exploration)

Experiment Infrastructure
• Stallo compute cluster
  ▫ 60 Teraflop/s peak
  ▫ 5632 processing cores
  ▫ 12 TB total memory
  ▫ 128 TB centralized disk
  ▫ Weighs 16 tons
• Multi-core research
  ▫ About 60 CPU years allocated per year to our projects
  ▫ Typical research paper uses 5 to 12 CPU years for simulation
3. The End of Moore's law for single-core microprocessors
• But Moore's law still holds for FPGA, memory and multicore processors

Motivational background
• Why multicores – in all market segments from mobile phones to supercomputers
  ▫ The "end" of Moore's law
  ▫ The power wall
  ▫ The memory wall
  ▫ The bandwidth problem
  ▫ ILP limitations
  ▫ The complexity wall

Energy & Heat Problems
• Large power consumption
  ▫ Costly
  ▫ Heat problems
  ▫ Restricted battery operation time
• Google "Open House Trondheim 2006"
  ▫ "Performance/Watt is the only flat trend line"

The Memory Wall
• (Figure: CPU performance grows ca. 60%/year ("Moore's Law") while DRAM improves ca. 9%/year; the processor-memory gap grows ca. 50%/year)
• The Processor-Memory Gap
• Consequence: deeper memory hierarchies
  ▫ P – Registers – L1 cache – L2 cache – L3 cache – Memory - - -
  ▫ Complicates understanding of performance
  ▫ cache usage has an increasing influence on performance

The I/O pin or Bandwidth problem
• # I/O signaling pins
  ▫ limited by physical technology
  ▫ speeds have not increased at the same rate as processor clock rates
• Projections
  ▫ from ITRS (International Technology Roadmap for Semiconductors)
  ▫ [Huh, Burger and Keckler 2001]

The limitations of ILP (Instruction Level Parallelism) in Applications
• (Figures: fraction of total cycles (%) vs. number of instructions issued, and speedup vs. instructions issued per cycle)
4. Reduced Increase in Clock Frequency
• (Figure: clock-frequency growth flattening out)

Solution: Multicore architectures (also called Chip Multi-processors, CMP)
• More power-efficient
  ▫ Two cores with clock frequency f/2 can potentially achieve the same speed as one core at frequency f with a 50% reduction in total energy consumption [Olukotun & Hammond 2005] (a small worked sketch follows after this slide group)
• Exploits Thread Level Parallelism (TLP)
  ▫ in addition to ILP
  ▫ requires multiprogramming or parallel programming
• Opens new possibilities for architectural innovations

Why heterogeneous multicores?
• Specialized HW is faster than general HW
  ▫ Math co-processor
  ▫ GPU, DSP, etc.
• Benefits of customization
  ▫ Similar to ASIC vs. general-purpose programmable HW
• Amdahl's law
  ▫ Parallel speedup limited by serial fraction → 1 super-core

CPU – GPU convergence (Performance – Programmability)
• Cell BE processor
• Processors: Larrabee, Fermi, …
• Languages: CUDA, OpenCL, …

Parallel processing – conflicting goals
• The P6 model – Parallel Processing challenges: Performance, Portability, Programmability and Power efficiency
• Examples:
  ▫ Performance tuning may reduce portability
    e.g. data structures adapted to cache block size
  ▫ New languages for higher programmability may reduce performance and increase power consumption

Multicore programming challenges
• Instability, diversity, conflicting goals … what to do?
• What kind of parallel programming?
  ▫ Homogeneous vs. heterogeneous
  ▫ DSL vs. general languages
  ▫ Memory locality
• What to teach?
  ▫ Teaching should be founded on active research
• Two layers of programmers
  ▫ The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]
  ▫ Krste Asanovic presentation at ACACES Summer School 2007
  ▫ 1) Programmability layer (Productivity layer) (80 - 90%) – "Joe the programmer"
  ▫ 2) Performance layer (Efficiency layer) (10 - 20%)
• Both layers involved in HPC
• Programmability an issue also at the performance layer
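The 50% figure on the multicore slide above can be made concrete with the usual first-order dynamic-power model. The sketch below is only an illustration under assumed scaling rules (dynamic power P ≈ C·V²·f, and voltage scaled roughly in proportion to frequency); it is not a reproduction of the Olukotun & Hammond calculation, and the numbers are normalized, invented values.

  #include <stdio.h>

  /* First-order dynamic power model: P = C * V^2 * f.
   * Assumption (not from the slides): voltage can be scaled roughly in
   * proportion to frequency, so halving f also allows roughly halving V. */
  int main(void) {
      double C = 1.0, V = 1.0, f = 1.0;         /* normalized single-core design */
      double p_single = C * V * V * f;          /* baseline power                */

      double f2 = f / 2.0, V2 = V / 2.0;        /* each of the two slower cores  */
      double p_dual = 2.0 * (C * V2 * V2 * f2); /* two cores, same total throughput
                                                   if the workload parallelizes  */

      printf("dual-core power relative to single core: %.2f\n", p_dual / p_single);
      /* Prints 0.25 under these assumptions; even with a less aggressive
       * voltage reduction the saving easily reaches the ~50% cited above. */
      return 0;
  }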
5. Parallel Computing Laboratory, U.C. Berkeley (slide adapted from Dave Patterson)
• Goal: easy to write correct programs that run efficiently on manycore
• Application areas: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
• (Diagram of the Par Lab software stack: Design Patterns/Motifs; Composition & Coordination Language (C&CL); C&CL Compiler/Interpreter; Parallel Libraries and Parallel Frameworks; Efficiency Languages, Sketching, Autotuners, Schedulers; Communication & Synch. Primitives; Efficiency Language Compilers; Legacy Code; OS Libraries & Services; Legacy OS; Hypervisor; Multicore/GPGPU; RAMP Manycore; Diagnosing Power/Performance)

Classes of computers
• Servers
  ▫ storage servers
  ▫ compute servers (supercomputers)
  ▫ web servers
  ▫ high availability
  ▫ scalability
  ▫ throughput oriented (response time of less importance)
• Desktop (price 3000 NOK – 50 000 NOK)
  ▫ the largest market
  ▫ price/performance focus
  ▫ latency oriented (response time)
• Embedded systems
  ▫ the fastest growing market ("everywhere")
  ▫ TDT 4258 Microcontroller system design
  ▫ ATMEL, Nordic Semic., ARM, EM, ++

FXI Technologies (Borgar; Falanx (Mali) → ARM; Norway)
• "An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and applications eco-systems (i.e. build an ARM-based SoC, put it in a memory card, connect it to the web, and voila, you got iPhone for the masses)."
• http://www.fxitech.com/
  ▫ "Headquartered in Trondheim … but also an office in Silicon Valley"

Trends
• For technology, costs, use
• Help predicting the future
• Product development time
  ▫ 2-3 years
  ▫ → design for the next technology
  ▫ Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system
  ▫ hardware, runtime system, compiler, operating system, and application
  ▫ In networking, this is called the "End to End argument"
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
  ▫ E.g., original RISC projects replaced complex instructions with a compiler + simple instructions
6. Computer Architecture is Design and Analysis
• Architecture is an iterative process:
  ▫ Searching the space of possible designs
  ▫ At all levels of computer systems
• (Diagram: Creativity feeds Design; Design and Analysis alternate; Cost/Performance Analysis, Measurement & Evaluation and History sort the results into Good Ideas, Mediocre Ideas and Bad Ideas)

TDT4260 Course Focus
• Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century
• (Diagram: Technology, Parallelism, Programming Languages, Applications, Interface Design, Compilers, Operating Systems and History all feed into Computer Architecture: Organization, Hardware/Software Boundary (ISA))

Holistic approach, e.g. to programmability
• Parallel & concurrent programming
• Operating System & system software
• Multicore, interconnect, memory

Moore's Law: 2X transistors / "year"
• "Cramming More Components onto Integrated Circuits"
  ▫ Gordon Moore, Electronics, 1965
• # of transistors on a cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)

Tracking Technology Performance Trends
• 4 critical implementation technologies:
  ▫ Disks
  ▫ Memory ("Memory Wall")
  ▫ Network
  ▫ Processors
• Compare improvements in Bandwidth vs. Latency over time
• Bandwidth: number of events per unit time
  ▫ E.g., Mbits/second over a network, Mbytes/second from disk
• Latency: elapsed time for a single event
  ▫ E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)
• (Log-log plot of relative bandwidth improvement vs. relative latency improvement; CPU high, memory low ("Memory Wall"); diagonal reference line where latency improvement = bandwidth improvement)
• Performance milestones
  ▫ Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
  ▫ Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  ▫ Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  ▫ Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
• (Processor latency = typical # of pipeline stages * time per clock cycle)
7. COST and COTS
• Cost
  ▫ to produce one unit
  ▫ include (development cost / # sold units)
  ▫ benefit of large volume
• COTS
  ▫ commodity off the shelf

Speedup
• General definition:
  Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
  Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
• Superlinear speedup?

Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in computation
  ▫ serial fraction s
  ▫ parallel fraction p
  ▫ s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
       = (s + p) / [s + (p/n)]
       = 1 / [s + (1-s)/n]
       = n / [1 + (n-1)s]
• "pessimistic and famous"

Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on a computer with n processors is fixed
  ▫ serial fraction s'
  ▫ parallel fraction p'
  ▫ s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
        = (s' + p'n) / (s' + p')
        = s' + p'n = s' + (1-s')n
        = n + (1-n)s'
• Reevaluating Amdahl's law, John L. Gustafson, CACM May 1988, pp 532-533. "Not a new law, but Amdahl's law with changed assumptions"

How the serial fraction limits speedup
• Amdahl's law (figure: speedup vs. number of processors for different serial fractions α)
• Work hard to reduce the serial part of the application
  ▫ remember IO
  ▫ think different (than traditionally or sequentially)
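To make the two speedup formulas above concrete, here is a small, self-contained C sketch that evaluates both expressions for an assumed serial fraction. The chosen value s = 0.1 and the processor counts are illustrative only; they are not taken from the slides.

  #include <stdio.h>

  /* Amdahl (fixed problem size):   S(n)  = 1 / (s + (1 - s)/n)  */
  double amdahl(double s, double n)    { return 1.0 / (s + (1.0 - s) / n); }

  /* Gustafson (scaled problem):    S'(n) = s' + (1 - s')*n      */
  double gustafson(double s, double n) { return s + (1.0 - s) * n; }

  int main(void) {
      double s = 0.1;                       /* assumed serial fraction */
      for (int n = 2; n <= 64; n *= 2)
          printf("n=%2d  Amdahl: %5.2f   Gustafson: %5.2f\n",
                 n, amdahl(s, n), gustafson(s, n));
      return 0;   /* Amdahl saturates near 1/s = 10; Gustafson keeps growing */
  }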
8. TDT4260 Computer architecture – Mini-project
PhD candidate Alexandru Ciprian Iordan
Institutt for datateknikk og informasjonsvitenskap
9. What is it…? How much…?
• The mini-project is the exercise part of the TDT4260 course
• This year the students will need to develop and evaluate a PREFETCHER
• The mini-project accounts for 20% of the final grade in TDT4260
  ▫ 80% for the report
  ▫ 20% for the oral presentation
10. What will you work with…
• Modified version of M5 (for development and evaluation)
• Computing time on the Kongull cluster (for benchmarking)
• More at: http://dm-ark.idi.ntnu.no/
11. M5
• Initially developed by the University of Michigan
• Enjoys a large community of users and developers
• Flexible object-oriented architecture
• Has support for 3 ISAs: ALPHA, SPARC and MIPS
12. Team work…
• You need to work in groups of 2-4 students
• Grade is based on the written paper AND the oral presentation (choose your best speaker)
13. Time Schedule and Deadlines
• More on It's Learning
14. Web page presentation
15. TDT 4260 – App. A.1, Chap. 2 – Instruction Level Parallelism

Contents
• Instruction level parallelism – Chap. 2
• Pipelining (repetition) – App. A
  ▫ Basic 5-step pipeline
• Dependencies and hazards – Chap. 2.1
  ▫ Data, name, control, structural
• Compiler techniques for ILP – Chap. 2.2
• (Static prediction – Chap. 2.3)
  ▫ Read this on your own
• Project introduction

Instruction level parallelism (ILP) (1/3)
• A program is a sequence of instructions, typically written to be executed one after the other
• Poor usage of CPU resources! (Why?)
• Better: execute instructions in parallel
  ▫ 1: Pipelining – partial overlap of instruction execution
  ▫ 2: Multiple issue – total overlap of instruction execution
• Today: Pipelining

Pipelining (2/3)
• Multiple different stages executed in parallel
  ▫ Laundry in 4 different stages
  ▫ Wash / Dry / Fold / Store
• Assumptions:
  ▫ Task can be split into stages
  ▫ Storage of temporary data
  ▫ Stages synchronized
  ▫ Next operation known before the last one has finished?

Pipelining (3/3)
• Good utilization: all stages are ALWAYS in use
  ▫ Washing, drying, folding, ...
  ▫ Great usage of resources!
• Common technique, used everywhere
  ▫ Manufacturing, CPUs, etc.
• Ideal: time_stage = time_instruction / stages (see the sketch below)
  ▫ But stages are not perfectly balanced
  ▫ But transfer between stages takes time
  ▫ But the pipeline may have to be emptied
  ▫ ...
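The "ideal vs. real" bullets above can be illustrated with a small calculation. The stage latencies and latch (pipeline-register) overhead below are invented for the example and are not taken from the slides; the point is only that imbalance and stage-transfer overhead keep the speedup below the stage count.

  #include <stdio.h>

  /* Unpipelined: one instruction takes the sum of all stage times.
   * Pipelined: the clock is set by the slowest stage plus latch overhead,
   * and (ignoring hazards) one instruction completes per cycle.          */
  int main(void) {
      double stage_ns[] = {50, 50, 60, 50, 50};  /* assumed IF ID EX MEM WB times  */
      double latch_ns   = 5;                     /* assumed register overhead      */
      int    n_stages   = 5;

      double unpipelined = 0, slowest = 0;
      for (int i = 0; i < n_stages; i++) {
          unpipelined += stage_ns[i];
          if (stage_ns[i] > slowest) slowest = stage_ns[i];
      }
      double cycle = slowest + latch_ns;

      printf("ideal speedup (balanced, no overhead): %d\n", n_stages);
      printf("actual speedup: %.2f\n", unpipelined / cycle);   /* 260/65 = 4.0 */
      return 0;
  }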
16. Example: MIPS64 (1/2)
• RISC
• Load/store
• Few instruction formats
• Fixed instruction length
• 64-bit
  ▫ DADD = 64-bit ADD
  ▫ LD = 64-bit L(oad)
• 32 registers (R0 = 0)
• EA = offset(Register)

Example: MIPS64 (2/2)
• Pipeline
  ▫ IF: Instruction fetch
  ▫ ID: Instruction decode / register fetch
  ▫ EX: Execute / effective address (EA)
  ▫ MEM: Memory access
  ▫ WB: Write back (reg)
• (Pipeline diagram: time in clock cycles 1-7, successive instructions overlapping in the Ifetch, Reg, ALU, DMem, Reg stages)

Big Picture:
• What are some real-world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?

Big Picture (continued):
• Computer Architecture is the study of design tradeoffs!
• There is no "philosophy of architecture" and no "perfect architecture". This is engineering, not science.
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?

Improve speedup?
• Why not perfect speedup?
  ▫ Sequential programs
  ▫ One instruction dependent on another
  ▫ Not enough CPU resources
• What can be done?
  ▫ Forwarding (HW)
  ▫ Scheduling (SW / HW)
  ▫ Prediction (SW / HW)
• Both hardware (dynamic) and compiler (static) can help

Dependencies and hazards
• Dependencies
  ▫ Parallel instructions can be executed in parallel
  ▫ Dependent instructions are not parallel
      I1: DADD R1, R2, R3
      I2: DSUB R4, R1, R5
  ▫ Property of the instructions
• Hazards
  ▫ Situation where a dependency causes an instruction to give a wrong result
  ▫ Property of the pipeline
  ▫ Not all dependencies give hazards
    Dependencies must be close enough in the instruction stream to cause a hazard
17. Dependencies
• (True) data dependencies
  ▫ One instruction reads what an earlier one has written
• Name dependencies
  ▫ Two instructions use the same register / memory location
  ▫ But no flow of data between them
  ▫ Two types: anti- and output dependencies (see the source-level sketch after this slide group)
• Control dependencies
  ▫ Instructions dependent on the result of a branch
• Again: independent of the pipeline implementation

Hazards
• Data hazards
  ▫ Overlap will give a different result from sequential execution
  ▫ RAW / WAW / WAR
• Control hazards
  ▫ Branches
  ▫ Ex: started executing the wrong instruction
• Structural hazards
  ▫ Pipeline does not support this combination of instructions
  ▫ Ex: register file with one port, two stages want to read

Data dependency – Hazard? (Figure A.6, page A-16)
• (Pipeline diagram for: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 – the later instructions need the r1 written by the add)

Data Hazards (1/3)
• Read After Write (RAW)
  InstrJ tries to read an operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a true data dependency
• This hazard results from an actual need for communication

Data Hazards (2/3)
• Write After Write (WAW)
  InstrJ writes an operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
• Caused by an output dependency
  This results from reuse of the name "r1"
• Can't happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Writes are always in stage 5

Data Hazards (3/3)
• Write After Read (WAR)
  InstrJ writes an operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
• Caused by an anti dependency
• Can't happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Reads are always in stage 2, and
  ▫ Writes are always in stage 5
• WAR and WAW can occur in more complicated pipes
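As a source-level analogy to the name dependences above (the hardware cases are about registers, but the idea is the same): reusing one variable creates anti- and output dependences that disappear once each value gets its own name, which is exactly what register renaming does. The variable names below are invented for the illustration.

  #include <stdio.h>

  int main(void) {
      int a = 3, b = 4, c = 5, d = 6;

      /* Reusing 't' serializes the two computations: the second write to t
       * must not overtake the read of t in 'x = t + c' (anti/WAR) nor the
       * first write of t (output/WAW), even though no data flows between
       * the two chains.                                                   */
      int t, x, y;
      t = a + b;
      x = t + c;
      t = c + d;      /* name dependence on t only */
      y = t + a;

      /* "Renaming" at source level: give the second value its own name and
       * the two chains become independent (only true/RAW dependences
       * remain inside each chain).                                        */
      int t1 = a + b, t2 = c + d;
      int x2 = t1 + c, y2 = t2 + a;

      printf("%d %d %d %d\n", x, y, x2, y2);   /* same results either way */
      return 0;
  }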
18. Forwarding (Figure A.7, page A-18)
• (Pipeline diagram for: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 – the ALU result of the add is forwarded to the later instructions instead of waiting for the register write in WB)

Can all data hazards be solved via forwarding???
• (Pipeline diagram for: Ld r1,r2; add r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 – the loaded value is not available until after MEM)

Structural Hazards (Memory Port) (similar to Figure A.5, page A-15)
• (Pipeline diagram: a Load followed by Instr 1-4 sharing a single memory port, so an instruction fetch conflicts with the load's data access)
• How can we avoid this hazard?

Hazards, Bubbles (Figure A.4, page A-14)
• (Pipeline diagram: Ld r1, r2 followed by Add r1, r1, r1, with stall/bubble cycles inserted between them)
• How do you "bubble" the pipe?

Control hazards (1/2)
• Sequential execution is predictable, (conditional) branches are not
• May have fetched instructions that should not be executed
• Simple solution (figure): stall the pipeline (bubble)
  ▫ Must not change state before the branch instruction is complete
  ▫ Performance loss depends on the number of branches in the program and on the pipeline implementation
  ▫ Branch penalty

Control hazards (2/2)
• What can be done? (see the CPI sketch below)
  ▫ Always stall (previous slide)
    Also called freeze or flushing of the pipeline
  ▫ Assume no branch (= assume sequential)
    Possibly wrong instruction
  ▫ Assume branch
    Only smart if the target address is ready early
  ▫ Delayed branch
    Execute a different instruction while the branch is evaluated
    Correct instruction
• Static techniques (fixed rule or compiler)
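The branch-penalty discussion above, and the "Example" questions on the next slide, can be quantified with the usual stall-cycle model: effective CPI = ideal CPI + branch frequency × stall cycles per branch. The branch frequency, taken fraction and penalties in the sketch below are assumed values chosen for illustration, not figures from the slides.

  #include <stdio.h>

  /* Effective CPI with control hazards only (ideal CPI assumed to be 1). */
  int main(void) {
      double f_branch = 0.20;   /* assumed: 20% of instructions are branches   */
      double penalty  = 2.0;    /* assumed: 2 bubble cycles per resolved branch */
      double f_taken  = 0.60;   /* assumed: 60% of branches are taken           */

      double always_stall = 1.0 + f_branch * penalty;
      double predict_nt   = 1.0 + f_branch * f_taken * penalty; /* pay only when
                                                                   the guess is wrong */

      printf("CPI, always stall:     %.2f\n", always_stall);
      printf("CPI, assume not taken: %.2f\n", predict_nt);
      return 0;
  }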
19. Example
• Assume branch conditionals are evaluated in the EX stage, and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• Assume branch not taken: how many bubbles for an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to improve the stall penalty?

Dynamic scheduling
• So far: static scheduling
  ▫ Instructions executed in program order
  ▫ Any reordering is done by the compiler
• Dynamic scheduling
  ▫ CPU reorders to get a more optimal order
    Fewer hazards, fewer stalls, ...
  ▫ Must preserve the order of operations where reordering could change the result
  ▫ Covered by TDT 4255 Hardware design

Compiler techniques for ILP
• For a given pipeline and superscalarity
  ▫ How can these be best utilized?
  ▫ As few stalls from hazards as possible
• Dynamic scheduling
  ▫ Tomasulo's algorithm etc. (TDT4255)
  ▫ Makes the CPU much more complicated
• What can be done by the compiler?
  ▫ Has "ages" to spend, but less knowledge
  ▫ Static scheduling, but what else?

Example
Source code:
  for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;
MIPS:
  Loop: L.D    F0,0(R1)    ; F0 = x[i]
        ADD.D  F4,F0,F2    ; F2 = s
        S.D    F4,0(R1)    ; Store x[i] + s
        DADDUI R1,R1,#-8   ; x[i] is 8 bytes
        BNE    R1,R2,Loop  ; R1 = R2?
Notice:
• Lots of dependencies
• No dependencies between iterations
• High loop overhead → loop unrolling

Static scheduling
• Original:
  Loop: L.D    F0,0(R1)
        stall
        ADD.D  F4,F0,F2
        stall
        stall
        S.D    F4,0(R1)
        DADDUI R1,R1,#-8
        stall
        BNE    R1,R2,Loop
• Scheduled:
  Loop: L.D    F0,0(R1)
        DADDUI R1,R1,#-8
        ADD.D  F4,F0,F2
        stall
        stall
        S.D    F4,8(R1)
        BNE    R1,R2,Loop
• Result: from 9 cycles per iteration to 7

Loop unrolling
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        L.D    F6,-8(R1)
        ADD.D  F8,F6,F2
        S.D    F8,-8(R1)
        L.D    F10,-16(R1)
        ADD.D  F12,F10,F2
        S.D    F12,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F16,F14,F2
        S.D    F16,-24(R1)
        DADDUI R1,R1,#-32
        BNE    R1,R2,Loop
• Reduced loop overhead
• Requires the number of iterations to be divisible by n (here n = 4)
• Register renaming
• Offsets have changed
• Stalls not shown (delays from the table in Figure 2.2)
20. Unrolled loop vs. unrolled and scheduled loop (the combination)
• Unrolled:
  Loop: L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        L.D    F6,-8(R1)
        ADD.D  F8,F6,F2
        S.D    F8,-8(R1)
        L.D    F10,-16(R1)
        ADD.D  F12,F10,F2
        S.D    F12,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F16,F14,F2
        S.D    F16,-24(R1)
        DADDUI R1,R1,#-32
        BNE    R1,R2,Loop
• Unrolled and scheduled:
  Loop: L.D    F0,0(R1)
        L.D    F6,-8(R1)
        L.D    F10,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F4,F0,F2
        ADD.D  F8,F6,F2
        ADD.D  F12,F10,F2
        ADD.D  F16,F14,F2
        S.D    F4,0(R1)
        S.D    F8,-8(R1)
        DADDUI R1,R1,#-32
        S.D    F12,16(R1)
        S.D    F16,8(R1)
        BNE    R1,R2,Loop
• Avoids the stalls after: L.D (1), ADD.D (2), DADDUI (1)

Loop unrolling: Summary
• Original code: 9 cycles per element
• Scheduling: 7 cycles per element
• Loop unrolling: 6.75 cycles per element
  ▫ Unrolled 4 iterations
• Combination: 3.5 cycles per element
  ▫ Avoids stalls entirely
• Compiler reduced execution time by 61%

Loop unrolling in practice
• Do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops (see the C sketch below):
  ▫ the 1st executes (n mod k) times and has a body that is the original loop
  ▫ the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
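A minimal C sketch of the "pair of consecutive loops" strategy described on the last slide above, applied to the running x[i] = x[i] + s example. The unroll factor k = 4, the function name and the way the remainder is handled are one possible choice made for illustration, not the only way a compiler would do it.

  #include <stdio.h>
  #include <stddef.h>

  /* Unroll by 4 with a prologue that runs (n mod 4) times, so the main
   * loop body can assume a multiple of 4 remaining iterations.          */
  void add_scalar(double *x, size_t n, double s) {
      size_t i = 0, rem = n % 4;

      for (; i < rem; i++)            /* 1st loop: original body, n mod k times */
          x[i] += s;

      for (; i < n; i += 4) {         /* 2nd loop: unrolled body, n/k times     */
          x[i]     += s;
          x[i + 1] += s;
          x[i + 2] += s;
          x[i + 3] += s;
      }
  }

  int main(void) {
      double x[10];
      for (int i = 0; i < 10; i++) x[i] = i;
      add_scalar(x, 10, 2.0);
      printf("%.1f %.1f\n", x[0], x[9]);   /* prints 2.0 11.0 */
      return 0;
  }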
21. Review
• Name real-world examples of pipelining
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?

TDT 4260 – Chap. 2, Chap. 3 – Instruction Level Parallelism (cont.)

Contents
• Very Long Instruction Word – Chap. 2.7
  ▫ IA-64 and EPIC
• Instruction fetching – Chap. 2.9
• Limits to ILP – Chap. 3.1/3.2
• Multi-threading – Chap. 3.5

Getting CPI below 1
• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
  1. Statically scheduled superscalar processors
     • In-order execution
     • Varying number of instructions issued (compiler)
  2. Dynamically scheduled superscalar processors
     • Out-of-order execution
     • Varying number of instructions issued (CPU)
  3. VLIW (very long instruction word) processors
     • In-order execution
     • Fixed number of instructions issued

VLIW: Very Long Instruction Word (1/2)
• Each VLIW has explicit coding for multiple operations
  ▫ Several instructions combined into packets
  ▫ Possibly with parallelism indicated
• Tradeoff instruction space for simple decoding
  ▫ Room for many operations
  ▫ Independent operations => execute in parallel
  ▫ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch

VLIW: Very Long Instruction Word (2/2)
• Assume 2 load/store, 2 FP, 1 int/branch
  ▫ VLIW with 0-5 operations
  ▫ Why 0?
• Important to avoid empty instruction slots
  ▫ Loop unrolling
  ▫ Local scheduling
  ▫ Global scheduling
    Scheduling across branches
• Difficult to find all dependencies in advance
  ▫ Solution 1: block on memory accesses
  ▫ Solution 2: CPU detects some dependencies
22. Recall: Unrolled loop that minimizes stalls for scalar
Source code:
  for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;
Register mapping: s → F2, i → R1
  Loop: L.D    F0,0(R1)
        L.D    F6,-8(R1)
        L.D    F10,-16(R1)
        L.D    F14,-24(R1)
        ADD.D  F4,F0,F2
        ADD.D  F8,F6,F2
        ADD.D  F12,F10,F2
        ADD.D  F16,F14,F2
        S.D    F4,0(R1)
        S.D    F8,-8(R1)
        DADDUI R1,R1,#-32
        S.D    F12,16(R1)
        S.D    F16,8(R1)
        BNE    R1,R2,Loop

Loop Unrolling in VLIW
  Clock  Mem ref 1         Mem ref 2         FP operation 1     FP op. 2           Int. op/branch
  1      L.D F0,0(R1)      L.D F6,-8(R1)
  2      L.D F10,-16(R1)   L.D F14,-24(R1)
  3      L.D F18,-32(R1)   L.D F22,-40(R1)   ADD.D F4,F0,F2     ADD.D F8,F6,F2
  4      L.D F26,-48(R1)                     ADD.D F12,F10,F2   ADD.D F16,F14,F2
  5                                          ADD.D F20,F18,F2   ADD.D F24,F22,F2
  6      S.D 0(R1),F4      S.D -8(R1),F8     ADD.D F28,F26,F2
  7      S.D -16(R1),F12   S.D -24(R1),F16
  8      S.D -32(R1),F20   S.D -40(R1),F24                                         DSUBUI R1,R1,#48
  9      S.D 0(R1),F28                                                             BNEZ R1,LOOP
• Unrolled 7 iterations to avoid delays
• 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
• Average: 2.5 ops per clock, 50% efficiency
• Note: need more registers in VLIW (15 vs. 6 in SS)

Problems with 1st Generation VLIW
• Increase in code size
  ▫ Loop unrolling
  ▫ Partially empty VLIWs
• Operated in lock-step; no hazard detection HW
  ▫ A stall in any functional unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
  ▫ The compiler might predict functional units, but caches are hard to predict
  ▫ Modern VLIWs are "interlocked" (identify dependences between bundles and stall)
• Binary code compatibility
  ▫ Strict VLIW => different numbers of functional units and unit latencies require different versions of the code

VLIW Tradeoffs
• Advantages
  ▫ "Simpler" hardware because the HW does not have to identify independent instructions
• Disadvantages
  ▫ Relies on a smart compiler
  ▫ Code incompatibility between generations
  ▫ There are limits to what the compiler can do (can't move loads above branches, can't move loads above stores)
• Common uses
  ▫ Embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue

IA-64 and EPIC
• 64-bit instruction set architecture
  ▫ Not a CPU, but an architecture
  ▫ Itanium and Itanium 2 are CPUs based on IA-64
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
• Uses EPIC: Explicitly Parallel Instruction Computing
• Departure from the x86 architecture
• Meant to achieve out-of-order performance with in-order HW + compiler smarts
  ▫ Stop bits to help with code density
  ▫ Support for control speculation (moving loads above branches)
  ▫ Support for data speculation (moving loads above stores)
• Details in Appendix G.6

Instruction bundle (VLIW)
• (Figure: an IA-64 bundle packs three instruction slots together with a template field)
23. Functional units and template
• Functional units:
  ▫ I (Integer), M (Integer + Memory), F (FP), B (Branch), L + X (64-bit operands + special inst.)
• Template field:
  ▫ Maps instructions to functional units
  ▫ Indicates stops

Code example (1/2)
• (Figure: example IA-64 code laid out in bundles)

Code example (2/2)
• (Figure continued)

Limitations to ILP

Control Speculation
• Can the compiler schedule an independent load above a branch?
    Bne R1, R2, TARGET
    Ld  R3, R4(0)
• What are the problems?
• EPIC provides speculative loads
    Ld.s  R3, R4(0)
    Bne   R1, R2, TARGET
    Check R4(0)

Data Speculation
• Can the compiler schedule an independent load above a store?
    St R5, R6(0)
    Ld R3, R4(0)
• What are the problems? (see the C sketch below)
• EPIC provides "advanced loads" and an ALAT (Advanced Load Address Table)
    Ld.a R3, R4(0)   ; creates an entry in the ALAT
    St   R5, R6(0)   ; looks up the ALAT; if it matches, jump to fixup code

EPIC Conclusions
• Goal of EPIC was to maintain the advantages of VLIW but achieve the performance of out-of-order
• Results:
  ▫ Complicated bundling rules save some space, but make the hardware more complicated
  ▫ Added special hardware and instructions for scheduling loads above stores and branches (new complicated hardware)
  ▫ Added special hardware to remove branch penalties (predication)
  ▫ End result is a machine as complicated as an out-of-order, but now also requiring a super-sophisticated compiler
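A C-level view of why the compiler cannot, in general, hoist a load above a store (the motivation for the advanced loads and the ALAT on the Data Speculation slide above): if the two pointers may alias, reordering the accesses changes the result. The function and variable names below are invented for the illustration; the comments only relate the accesses back to the St/Ld pattern on the slide.

  #include <stdio.h>

  /* The compiler would like to start the (possibly slow) load of *q early,
   * i.e. before the store to *p, but it may only do so if p != q.
   * EPIC's ld.a/ALAT mechanism lets it reorder anyway and fix up on a clash. */
  int store_then_load(int *p, int *q, int v) {
      *p = v;          /* corresponds to  St R5, R6(0)  */
      return *q + 1;   /* corresponds to  Ld R3, R4(0)  */
  }

  int main(void) {
      int a = 10, b = 20;
      printf("%d\n", store_then_load(&a, &b, 99));  /* no alias: prints 21        */
      printf("%d\n", store_then_load(&a, &a, 99));  /* alias: must see 99 -> 100  */
      return 0;
  }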
