TDT4260 – Lecture 1 – 2011

Course introduction
• Course goals
• Staff
• Contents
• Evaluation
• Web, ITSL (It's Learning)
• Textbook
– Computer Architecture, A Quantitative Approach, Fourth Edition, by John Hennessy & David Patterson (HP 1990 – 96 – 03 – 06)
• Today: Introduction (Chapter 1)
– partly covered; meant to inspire you to learn more

Course goal
• To get a general and deep understanding of the organization of modern computers and the motivation for different computer architectures. Give a base for understanding research themes within the field.
• High level
• Mostly HW and low-level SW
• HW/SW interplay
• Parallelism
• Principles, not details

Lasse Natvig
Contents
• Computer architecture fundamentals, trends, measuring performance, quantitative principles. Instruction set architectures and the role of compilers. Instruction-level parallelism, thread-level parallelism, VLIW.
• Memory hierarchy design, cache. Multiprocessors, shared memory architectures, vector processors, NTNU/Notur supercomputers, distributed shared memory, synchronization, multithreading.
• Interconnection networks, topologies
• Multicores, homogeneous and heterogeneous, principles and product examples
• Green computing (introduction)
• Miniproject – prefetching

TDT-4260 / DT8803
• Recommended background
– Course TDT4160 Computer Fundamentals, or equivalent
• http://www.idi.ntnu.no/emner/tdt4260/
– and It's Learning
• Friday 1215-1400
– and/or some Thursdays 1015-1200
– 12 lectures planned
– some exceptions may occur
• Evaluation
– Obligatory exercise (counts 20%). Written exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit examination, the examination form may change from written to oral.
Lecture plan (subject to change)
(EMECS: new European Master's Course in Embedded Computing Systems)

Date and lecturer — Topic
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 28 Jan (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG?) Prefetching + Energy Micro guest lecture
6: 18 Feb (LN) Multiprocessors continued
7: 25 Feb (IB) Piranha CMP + interconnection networks
8: 4 Mar (IB) Memory and cache, cache coherence (Chapter 5)
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore, Fedorova asymmetric multicore, ...
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers; (2) Green computing
12: 1 Apr (IB/LN) Wrap-up lecture, remaining material
13: 8 Apr Slack – no lecture planned
Preliminary reading list (subject to change!)
• Chap. 1: Fundamentals, sections 1.1-1.12 (pages 2-54)
• Chap. 2: ILP, sections 2.1-2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11-2.12 (pages 138-141). (Sections 2.4-2.6 are covered by similar material in our computer design course.)
• Chap. 3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5-3.8 (pages 172-185)
• Chap. 4: Multiprocessors and TLP, sections 4.1-4.5, 4.8-4.10
• Chap. 5: Memory hierarchy, sections 5.1-5.3 (pages 288-315)
• App. A: section A.1 (expected to be repetition from other courses)
• App. E: Interconnection networks, pages E-2-E-14, E-20-E-25, E-29-E-37 and E-45-E-51
• App. F: Vector processors, sections F.1-F.4 and F.8 (pages F-2-F-32, F-44-F-45)
• Data prefetch mechanisms (ACM Computing Surveys)
• Piranha (to be announced)
• Multicores (new book chapter) (to be announced)
• (App. D: embedded systems?) See our new course TDT4258 Microcontroller System Design

People involved
• Lasse Natvig – course responsible, lecturer – lasse@idi.ntnu.no
• Ian Bratt – lecturer (also at Tilera.com) – ianbra@idi.ntnu.no
• Alexandru Iordan – teaching assistant (also PhD student) – iordan@idi.ntnu.no
• http://www.idi.ntnu.no/people/
research.idi.ntnu.no/multicore — Prefetching (pfjudge)
A few highlights:
– Green computing: 2 PhD + master students
– Multicore memory systems: 3 PhD theses
– Multicore programming and parallel computing
– Cooperation with industry
"Computational computer architecture"
• Computational science and engineering (CSE)
– Computational X, X = comp. arch.
• Simulates new multicore architectures
– last-level shared cache fairness (PhD student M. Jahre)
– bandwidth-aware prefetching (PhD student M. Grannæs)
• Complex cycle-accurate simulators
– 80 000 lines C++, 20 000 lines Python
– open source, Linux-based
• Design space exploration (DSE)
– one dimension for each architectural parameter
– a DSE sample point = a specific multicore configuration
– performance of a selected set of configurations evaluated by simulating the execution of a set of workloads (extensive, detailed design space exploration)

Experiment infrastructure
• Stallo compute cluster
– 60 Teraflop/s peak
– 5632 processing cores
– 12 TB total memory
– 128 TB centralized disk
– weighs 16 tons
• Multi-core research
– about 60 CPU years allocated per year to our projects
– a typical research paper uses 5 to 12 CPU years for simulation
Motivational background
• Why multicores
– in all market segments, from mobile phones to supercomputers
• The "end" of Moore's law
• The power wall
• The memory wall
• The bandwidth problem
• ILP limitations
• The complexity wall

The end of Moore's law for single-core microprocessors
• But Moore's law still holds for FPGAs, memory and multicore processors
Energy & heat problems
• Large power consumption
– costly
– heat problems
– restricted battery operation time
• Google "Open House Trondheim 2006"
– "Performance/Watt is the only flat trend line"

The memory wall
• The processor-memory gap
– (figure, 1980-2000: CPU performance grows 60%/year under "Moore's Law", DRAM only 9%/year; the gap grows 50%/year)
• Consequence: deeper memory hierarchies
– P – registers – L1 cache – L2 cache – L3 cache – memory – ...
– complicates understanding of performance
• Cache usage has an increasing influence on performance
The I/O pin or bandwidth problem
• Number of I/O signaling pins
– limited by physical technology
– speeds have not increased at the same rate as processor clock rates
• Projections
– from ITRS (International Technology Roadmap for Semiconductors)

The limitations of ILP (instruction-level parallelism) in applications
• (Figure [Huh, Burger and Keckler 2001]: fraction of total cycles (%) vs. number of instructions issued, and speedup vs. instructions issued per cycle)
Reduced increase in clock frequency — solution: multicore architectures
(also called chip multiprocessors, CMPs)
• More power-efficient
– two cores with clock frequency f/2 can potentially achieve the same speed as one core at frequency f, with a 50% reduction in total energy consumption [Olukotun & Hammond 2005]
• Exploits thread-level parallelism (TLP)
– in addition to ILP
– requires multiprogramming or parallel programming
• Opens new possibilities for architectural innovations
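The "two cores at f/2" energy argument can be sketched with the standard dynamic-power model P ≈ C·V²·f. This is a back-of-envelope illustration, not from the slides: how much energy is saved depends entirely on how far the supply voltage V can be scaled down along with the frequency.

```python
# Sketch of the multicore energy argument (assumed model: P ~ C * V^2 * f).

def dynamic_power(c, v, f):
    """Dynamic power of one core (arbitrary normalized units)."""
    return c * v * v * f

C, V, F = 1.0, 1.0, 1.0                      # single-core baseline
single = dynamic_power(C, V, F)

# Two cores at f/2, voltage unchanged: same total power, same total
# throughput -> no energy advantage.
dual_fixed_v = 2 * dynamic_power(C, V, F / 2)

# Two cores at f/2 with voltage scaled linearly to V/2: a 4x power drop
# per core, 2 cores -> 75% total reduction.
dual_scaled_v = 2 * dynamic_power(C, V / 2, F / 2)

print(single)         # 1.0
print(dual_fixed_v)   # 1.0
print(dual_scaled_v)  # 0.25
```

The slide's 50% figure sits between these two extremes, corresponding to partial voltage scaling.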
Why heterogeneous multicores?
• Specialized HW is faster than general HW
– math co-processors
– GPUs, DSPs, etc.
• Benefits of customization
– similar to ASIC vs. general-purpose programmable HW
• Amdahl's law
– parallel speedup limited by the serial fraction
– 1 "super-core" (for the serial fraction)

CPU – GPU convergence (performance – programmability)
• Cell BE processor
• Processors: Larrabee, Fermi, ...
• Languages: CUDA, OpenCL, ...
Parallel processing – conflicting goals
• The P6 model: Parallel Processing challenges — Performance, Portability, Programmability and Power efficiency
• Examples:
– performance tuning may reduce portability
• e.g. data structures adapted to cache block size
– new languages for higher programmability may reduce performance and increase power consumption

Multicore programming challenges
• Instability, diversity, conflicting goals ... what to do?
• What kind of parallel programming?
– homogeneous vs. heterogeneous
– DSLs vs. general languages
– memory locality
• What to teach?
– teaching should be founded on active research
• Two layers of programmers
– The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]
– Krste Asanovic presentation at ACACES Summer School 2007
– 1) Programmability layer (productivity layer) (80-90%)
• "Joe the programmer"
– 2) Performance layer (efficiency layer) (10-20%)
• Both layers involved in HPC
• Programmability is an issue also at the performance layer
Parallel Computing Laboratory, U.C. Berkeley
(slide adapted from Dave Patterson)
• Goal: easy to write correct programs that run efficiently on manycore
• Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
• Software stack (figure): Design Patterns/Motifs; Composition & Coordination Language (C&CL); C&CL Compiler/Interpreter; Parallel Libraries and Parallel Frameworks; Efficiency Languages; Sketching; Autotuners; Legacy Code; Communication & Synch. Primitives; Schedulers; Efficiency Language Compilers; OS Libraries & Services; Legacy OS; Hypervisor; Multicore/GPGPU; RAMP Manycore
• Cross-cutting: Diagnosing Power/Performance

Classes of computers
• Servers
– storage servers
– compute servers (supercomputers)
– web servers
– high availability
– scalability
– throughput-oriented (response time of less importance)
• Desktop (price 3000 NOK – 50 000 NOK)
– the largest market
– price/performance focus
– latency-oriented (response time)
• Embedded systems
– the fastest growing market ("everywhere")
– TDT4258 Microcontroller System Design
– ATMEL, Nordic Semiconductor, ARM, EM, ...
FXI Technologies
(Borgar; Falanx (Mali) → ARM Norway)
• "An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and application ecosystems (i.e. build an ARM-based SoC, put it in a memory card, connect it to the web – and voilà, you have an iPhone for the masses)."
• http://www.fxitech.com/
– "Headquartered in Trondheim
• but also an office in Silicon Valley ..."
Trends
• For technology, costs, use
• Help predicting the future
• Product development time
– 2-3 years
– design for the next technology
– why should an architecture live longer than a product?

Computer architecture is an integrated approach
• What really matters is the functioning of the complete system
– hardware, runtime system, compiler, operating system, and application
– in networking, this is called the "end-to-end argument"
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
– e.g., the original RISC projects replaced complex instructions with a compiler + simple instructions
Computer architecture is design and analysis
• Architecture is an iterative process (design ↔ analysis):
– searching the space of possible designs
– at all levels of computer systems
– creativity + cost/performance analysis sorts good ideas from mediocre and bad ideas

TDT4260 course focus
• Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century
• Inputs: technology, parallelism, programming languages, applications, interface design, compilers, operating systems, measurement & evaluation, history
• Computer architecture: instruction set architecture (ISA), organization, hardware/software boundary
Holistic approach
• e.g., to programmability:
– parallel & concurrent programming
– operating system & system software
– multicore, interconnect, memory

Moore's Law: 2X transistors per "year"
• "Cramming More Components onto Integrated Circuits"
– Gordon Moore, Electronics, 1965
• Number of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
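The doubling rule is easy to turn into a small calculation (a sketch; the doubling period is whatever value of N between 12 and 24 months you assume):

```python
def transistors(t_months, n0=1.0, doubling_months=24):
    """Relative transistor count after t_months, starting from n0 and
    doubling every `doubling_months` (Moore's-law style growth)."""
    return n0 * 2 ** (t_months / doubling_months)

# One decade (120 months) with a 24-month doubling period:
print(transistors(120))                        # 2**5 = 32.0
# The same decade with an 18-month period grows much faster:
print(transistors(120, doubling_months=18))    # 2**(120/18), roughly 102
```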
Tracking technology performance trends: latency lags bandwidth (last ~20 years)
• 4 critical implementation technologies:
– disks
– memory (the "memory wall")
– networks
– processors
• Compare improvements in bandwidth vs. latency over time
– Bandwidth: number of events per unit time (e.g., Mbit/s over a network, MB/s from disk)
– Latency: elapsed time for a single event (e.g., one-way network delay in microseconds, average disk access time in milliseconds)
• Performance milestones (figure: relative bandwidth improvement vs. relative latency improvement; CPU high, disk and memory low)
– Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
– Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
– Memory module: 16-bit plain DRAM, page-mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
– Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
• (Processor latency = typical number of pipeline stages × time per clock cycle)
COST and COTS
• Cost
– to produce one unit
– includes (development cost / number of units sold)
– benefit of large volume
• COTS
– commodity off the shelf

Speedup — superlinear speedup?
• General definition:
Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time:
Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
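The definition translates directly into code. A sketch with made-up timings, illustrating why the baseline must be the best sequential algorithm rather than the parallel algorithm run with p = 1:

```python
def speedup(t_baseline, t_parallel):
    """Fixed-problem-size speedup: Time(1 processor) / Time(p processors).
    t_baseline must be the BEST sequential algorithm's time."""
    return t_baseline / t_parallel

t_best_seq = 10.0   # hypothetical: best sequential algorithm
t_par_p1 = 12.0     # hypothetical: parallel algorithm with p = 1 (slower)
t_par_p8 = 2.0      # hypothetical: parallel algorithm on 8 processors

print(speedup(t_best_seq, t_par_p8))  # 5.0  -- the honest speedup
print(speedup(t_par_p1, t_par_p8))    # 6.0  -- inflated by a weak baseline
```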
Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in computation
– serial fraction s
– parallel fraction p
– s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
= (s + p) / [s + (p/n)]
= 1 / [s + (1-s)/n]
= n / [1 + (n-1)s]
• "pessimistic and famous"

Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on a parallel computer with n processors is fixed
– serial fraction s'
– parallel fraction p'
– s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
= (s' + p'n) / (s' + p')
= s' + p'n = s' + (1-s')n
= n + (1-n)s'
• Reevaluating Amdahl's Law, John L. Gustafson, CACM May 1988, pp. 532-533: "not a new law, but Amdahl's law with changed assumptions"
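Both formulas above are easy to check numerically; the contrast between them shows up immediately. A sketch:

```python
def amdahl(n, s):
    """Fixed-size speedup on n processors, serial fraction s:
    S(n) = 1 / (s + (1 - s) / n)."""
    return 1.0 / (s + (1.0 - s) / n)

def gustafson(n, s_prime):
    """Scaled-size speedup, where s' is the serial fraction of the
    fixed-time PARALLEL run: S'(n) = n + (1 - n) * s'."""
    return n + (1 - n) * s_prime

# 5% serial code, 100 processors:
print(round(amdahl(100, 0.05), 2))   # 16.81 -- capped by 1/s = 20
print(round(gustafson(100, 0.05), 2))  # 95.05 -- scales almost linearly
```

Amdahl answers "how much faster does this fixed problem get?"; Gustafson answers "how much more work can I do in the same time?".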
How the serial fraction limits speedup
• Amdahl's law (figure: speedup vs. number of processors for different serial fractions)
• Work hard to reduce the serial part of the application
– remember I/O
– think differently (than traditionally or sequentially)
TDT4260 Computer architecture: Mini-project
PhD candidate Alexandru Ciprian Iordan
Department of Computer and Information Science (IDI), NTNU

What is it, and how much does it count?
• The mini-project is the exercise part of the TDT4260 course
• This year the students will develop and evaluate a PREFETCHER
• The mini-project accounts for 20% of the final grade in TDT4260
– 80% for the report
– 20% for the oral presentation
What will you work with?
• A modified version of M5 (for development and evaluation)
• Computing time on the Kongull cluster (for benchmarking)
• More at: http://dm-ark.idi.ntnu.no/

M5
• Initially developed at the University of Michigan
• Enjoys a large community of users and developers
• Flexible object-oriented architecture
• Supports 3 ISAs: Alpha, SPARC and MIPS

Team work
• You need to work in groups of 2-4 students
• The grade is based on the written paper AND the oral presentation (choose your best speaker)

Time schedule and deadlines
• More on It's Learning
TDT4260 — App A.1, Chap 2: Instruction Level Parallelism

Contents
• Instruction-level parallelism — Chap 2
• Pipelining (repetition) — App A
▫ basic 5-stage pipeline
• Dependencies and hazards — Chap 2.1
▫ data, name, control, structural
• Compiler techniques for ILP — Chap 2.2
• (Static prediction — Chap 2.3)
▫ read this on your own
• Project introduction
Pipelining
Instruction level parallelism (ILP) (1/3)
• A program is a sequence of instructions, typically written to be executed one after the other
▫ poor usage of CPU resources! (Why?)
• Better: execute instructions in parallel
▫ 1: Pipelining — partial overlap of instruction execution
▫ 2: Multiple issue — total overlap of instruction execution
• Today: pipelining
Pipelining (2/3)
• Multiple different stages executed in parallel
▫ laundry in 4 different stages: wash / dry / fold / store
• Assumptions:
▫ the task can be split into stages
▫ storage of temporary data
▫ stages synchronized
▫ next operation known before the last finished?

Pipelining (3/3)
• Good utilization: all stages are ALWAYS in use
▫ washing, drying, folding, ...
▫ great usage of resources!
• Common technique, used everywhere
▫ manufacturing, CPUs, etc.
• Ideal: time_stage = time_instruction / stages
▫ but stages are not perfectly balanced
▫ but transfer between stages takes time
▫ but the pipeline may have to be emptied
▫ ...
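The ideal relation and the first two "but"s above can be sketched numerically (an illustration with invented timings; it ignores pipeline fill/drain):

```python
def cycle_time(t_instr, stages, latch_overhead=0.0):
    """Cycle time of an evenly divided pipeline: one stage's share of the
    unpipelined instruction time, plus the latch/transfer overhead
    between stages. Ideal case: t_instr / stages with zero overhead."""
    return t_instr / stages + latch_overhead

t_instr = 10.0                       # unpipelined instruction time (say, ns)
ideal = cycle_time(t_instr, 5)       # 2.0 ns per instruction issued
real = cycle_time(t_instr, 5, 0.5)   # 2.5 ns once transfer overhead is added

print(t_instr / ideal)  # 5.0x throughput speedup (ideal, = number of stages)
print(t_instr / real)   # 4.0x speedup with overhead
```

Unbalanced stages make this worse still: the clock must fit the slowest stage, not the average one.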
Example: MIPS64 (1/2)
• RISC
• Load/store architecture
• Few instruction formats
• Fixed instruction length
• 64-bit addresses
▫ DADD = 64-bit ADD
▫ LD = 64-bit Load
• 32 registers (R0 = 0)
• Effective address: EA = offset(register)

Example: MIPS64 (2/2)
• Pipeline stages:
▫ IF: instruction fetch
▫ ID: instruction decode / register fetch
▫ EX: execute / effective address (EA)
▫ MEM: memory access
▫ WB: write back (register)
• (Pipeline diagram, time in clock cycles 1-7: successive instructions each pass through Ifetch, Reg, ALU, DMem, Reg, offset by one cycle)
Big picture:
• What are some real-world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?

Big picture (continued):
• Computer architecture is the study of design tradeoffs!
• There is no "philosophy of architecture" and no "perfect architecture". This is engineering, not science.
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?
Improve speedup?
• Why not perfect speedup?
▫ sequential programs
▫ one instruction dependent on another
▫ not enough CPU resources
• What can be done?
▫ forwarding (HW)
▫ scheduling (SW / HW)
▫ prediction (SW / HW)
• Both hardware (dynamic) and the compiler (static) can help

Dependencies and hazards
• Dependencies
▫ parallel instructions can be executed in parallel
▫ dependent instructions are not parallel:
  I1: DADD R1, R2, R3
  I2: DSUB R4, R1, R5
▫ a property of the instructions
• Hazards
▫ a situation where a dependency causes an instruction to give a wrong result
▫ a property of the pipeline
▫ not all dependencies give hazards — dependencies must be close enough in the instruction stream to cause a hazard
Dependencies
• (True) data dependencies
▫ one instruction reads what an earlier one has written
• Name dependencies
▫ two instructions use the same register / memory location
▫ but no flow of data between them
▫ two types: anti- and output dependencies
• Control dependencies
▫ instructions dependent on the result of a branch
• Again: independent of the pipeline implementation

Hazards
• Data hazards
▫ overlap would give a different result from sequential execution
▫ RAW / WAW / WAR
• Control hazards
▫ branches
▫ e.g., started executing the wrong instruction
• Structural hazards
▫ the pipeline does not support this combination of instructions
▫ e.g., a register file with one port while two stages want to read
Data dependency — hazard? (Figure A.6, page A-16)
• Instruction sequence (pipeline diagram: each later instruction reads r1, written by the add):
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

Data hazards (1/3)
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
• Caused by a true data dependency
• This hazard results from an actual need for communication
Data hazards (2/3)
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
• Caused by an anti-dependency — results from reuse of the name "r1"
• Can't happen in the MIPS 5-stage pipeline because:
▫ all instructions take 5 stages, and
▫ reads are always in stage 2, and
▫ writes are always in stage 5

Data hazards (3/3)
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
  I: sub r1,r4,r3
  J: add r1,r2,r3
• Caused by an output dependency
• Can't happen in the MIPS 5-stage pipeline because:
▫ all instructions take 5 stages, and
▫ writes are always in stage 5
• WAR and WAW can occur in more complicated pipes
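The three hazard classes follow mechanically from which registers two instructions read and write. A small classifier (a sketch, modeling instructions simply as sets of written and read registers, like the three-operand examples on the slides):

```python
def classify_hazards(instr_i, instr_j):
    """Classify potential hazards between an earlier instruction i and a
    later instruction j. Each instruction is (written_regs, read_regs)."""
    wi, ri = instr_i
    wj, rj = instr_j
    hazards = []
    if wi & rj:
        hazards.append("RAW")  # j reads what i writes (true dependency)
    if ri & wj:
        hazards.append("WAR")  # j writes what i reads (anti-dependency)
    if wi & wj:
        hazards.append("WAW")  # both write the same reg (output dependency)
    return hazards

# The slides' examples:
add = ({"r1"}, {"r2", "r3"})      # add r1,r2,r3
sub = ({"r4"}, {"r1", "r3"})      # sub r4,r1,r3
print(classify_hazards(add, sub))   # ['RAW']
print(classify_hazards(sub, add))   # ['WAR']

sub2 = ({"r1"}, {"r4", "r3"})     # sub r1,r4,r3
print(classify_hazards(sub2, add))  # ['WAW']
```

Whether a listed dependency actually becomes a hazard then depends on the pipeline, as the slides note: the simple MIPS pipeline turns only RAW into a hazard.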
Forwarding (Figure A.7, page A-18)
• (Pipeline diagram: the ALU result of add r1,r2,r3 is forwarded directly to the ALU inputs of the following sub r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11, avoiding stalls)

Can all data hazards be solved via forwarding?
• (Pipeline diagram: Ld r1,r2 followed by add r4,r1,r3 / and r6,r1,r7 / or r8,r1,r9 / xor r10,r1,r11)
• The loaded value is not available until after MEM, so the immediately following add cannot receive it in time through forwarding alone
Structural hazards (memory port) (Figure A.4, page A-14)
• (Pipeline diagram, cycles 1-7: Load followed by Instr 1-4; with a single memory port, a later instruction's Ifetch collides with Load's DMem access in the same cycle)

Hazards and bubbles (similar to Figure A.5, page A-15)
• (Pipeline diagram: Ld r1,r2 immediately followed by Add r1,r1,r1; the pipeline stalls, inserting bubbles until the loaded value is available)
• How do you "bubble" the pipe? How can we avoid this hazard?
Control hazards (1/2)
• Sequential execution is predictable; (conditional) branches are not
• May have fetched instructions that should not be executed
• Simple solution (figure): stall the pipeline (bubble)
▫ performance loss depends on the number of branches in the program and the pipeline implementation
▫ branch penalty

Control hazards (2/2)
• What can be done?
▫ Always stop (previous slide)
  Also called freezing or flushing the pipeline
▫ Assume no branch (= assume sequential)
  Must not change state before the branch instruction is complete (possibly wrong instruction → correct instruction)
▫ Assume branch
  Only smart if the target address is ready early
▫ Delayed branch
  Execute a different instruction while the branch is evaluated
• Static techniques (fixed rule or compiler)
Example
• Assume branch conditionals are evaluated in the EX stage and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• Assume branch not taken: how many bubbles for an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to improve the stall penalty?

Dynamic scheduling
• So far: static scheduling
▫ instructions executed in program order
▫ any reordering is done by the compiler
• Dynamic scheduling
▫ the CPU reorders instructions to get a more optimal order
  Fewer hazards, fewer stalls, ...
▫ must preserve the order of operations where reordering could change the result
▫ covered by TDT4255 Hardware Design
Compiler techniques for ILP
• For a given pipeline and degree of superscalarity:
▫ how can these best be utilized?
▫ as few stalls from hazards as possible
• Dynamic scheduling
▫ Tomasulo's algorithm etc. (TDT4255)
▫ makes the CPU much more complicated
• What can be done by the compiler?
▫ has "ages" to spend, but less knowledge
▫ static scheduling, but what else?

Example
Source code:
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

MIPS:
Loop: L.D    F0,0(R1)    ; F0 = x[i]
      ADD.D  F4,F0,F2    ; F2 = s
      S.D    F4,0(R1)    ; store x[i] + s
      DADDUI R1,R1,#-8   ; x[i] is 8 bytes
      BNE    R1,R2,Loop  ; R1 = R2?

Notice:
• Lots of dependencies
• No dependencies between iterations
• High loop overhead → loop unrolling

Static scheduling (stall cycles from the table in Figure 2.2):

Original (9 cycles):          Scheduled (7 cycles):
Loop: L.D    F0,0(R1)         Loop: L.D    F0,0(R1)
      stall                         DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2              ADD.D  F4,F0,F2
      stall                         stall
      stall                         stall
      S.D    F4,0(R1)              S.D    F4,8(R1)
      DADDUI R1,R1,#-8             BNE    R1,R2,Loop
      stall
      BNE    R1,R2,Loop

Result: from 9 cycles per iteration to 7

Loop unrolling (4 iterations, stalls not shown):

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop

• Reduced loop overhead
• Requires the number of iterations to be divisible by n (here n = 4)
• Register renaming
• Offsets have changed
Combination: unrolled AND scheduled

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,-16(R1)
      S.D    F16,-24(R1)
      BNE    R1,R2,Loop

Avoids the stalls after L.D (1), ADD.D (2) and DADDUI (1)

Loop unrolling: summary
• Original code: 9 cycles per element
• Scheduling: 7 cycles per element
• Loop unrolling: 6.75 cycles per element
▫ unrolled 4 iterations
• Combination: 3.5 cycles per element
▫ avoids stalls entirely
• The compiler reduced execution time by 61%
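The cycles-per-element numbers in the summary can be reproduced by counting instructions plus stall cycles (a sketch, using the stall counts implied by the slides: 1 after L.D, 2 after ADD.D, 1 after DADDUI):

```python
def cycles_per_element(instructions, stalls, elements):
    """Total cycles (instructions + stall cycles) per array element."""
    return (instructions + stalls) / elements

# Original loop: 5 instructions, 4 stalls (1 + 2 + 1), 1 element per trip.
print(cycles_per_element(5, 4, 1))    # 9.0
# Scheduled: same 5 instructions, only 2 stalls remain.
print(cycles_per_element(5, 2, 1))    # 7.0
# Unrolled 4x, unscheduled: 14 instructions, 4*(1 + 2) + 1 = 13 stalls.
print(cycles_per_element(14, 13, 4))  # 6.75
# Unrolled 4x AND scheduled: 14 instructions, no stalls.
print(cycles_per_element(14, 0, 4))   # 3.5
```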
Loop unrolling in practice
• We do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
▫ the 1st executes (n mod k) times and has a body that is the original loop
▫ the 2nd is the unrolled body, surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
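The two-loop transformation above can be sketched directly, using the lecture's x[i] = x[i] + s example as the (hypothetical) loop body:

```python
def add_s_unrolled(x, s, k=4):
    """Add s to every element of x, unrolled by factor k:
    a prologue loop runs (n mod k) original iterations, then the
    unrolled loop runs (n // k) times with k copies of the body."""
    n = len(x)
    i = 0
    for _ in range(n % k):       # 1st loop: n mod k copies of the original body
        x[i] += s
        i += 1
    for _ in range(n // k):      # 2nd loop: the unrolled body, n // k times
        x[i] += s                # (the explicit copies below assume k == 4)
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += k
    return x

print(add_s_unrolled([1.0] * 10, 2.0))  # all ten elements become 3.0
```

With n = 10 and k = 4, the prologue handles 2 elements and the unrolled loop the remaining 8, matching the (n mod k) + k·(n/k) split.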
Review
• Name real-world examples of pipelining
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?

TDT4260 — Chap 2, Chap 3: Instruction Level Parallelism (cont.)

Contents
• Very Long Instruction Word — Chap 2.7
▫ IA-64 and EPIC
• Instruction fetching — Chap 2.9
• Limits to ILP — Chap 3.1/3.2
• Multi-threading — Chap 3.5

Getting CPI below 1
• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
1. Statically scheduled superscalar processors
▫ in-order execution
▫ varying number of instructions issued (compiler)
2. Dynamically scheduled superscalar processors
▫ out-of-order execution
▫ varying number of instructions issued (CPU)
3. VLIW (very long instruction word) processors
▫ in-order execution
▫ fixed number of instructions issued
VLIW: Very Long Instruction Word (1/2)
• Each VLIW has explicit coding for multiple operations
▫ several instructions combined into packets
▫ possibly with parallelism indicated
• Tradeoff: instruction space for simple decoding
▫ room for many operations
▫ independent operations => execute in parallel
▫ e.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch

VLIW: Very Long Instruction Word (2/2)
• Assume 2 load/store, 2 FP, 1 int/branch
▫ VLIW with 0-5 operations. Why 0?
• Important to avoid empty instruction slots
▫ loop unrolling
▫ local scheduling
▫ global scheduling
  Scheduling across branches
• Difficult to find all dependencies in advance
▫ solution 1: block on memory accesses
▫ solution 2: the CPU detects some dependencies
▫ Solution2: CPU detects some dependencies
22. Loop: L.D F0,0(R1)
Recall: L.D F6,-8(R1)
Loop Unrolling in VLIW
Unrolled Loop L.D
L.D
F10,-16(R1)
F14,-24(R1)
Memory Memory FP FP Int. op/ Clock
reference 1 reference 2 operation 1 op. 2 branch
that minimizes ADD.D F4,F0,F2 L.D F0,0(R1) L.D F6,-8(R1) 1
ADD.D F8,F6,F2 L.D F10,-16(R1) L.D F14,-24(R1) 2
stalls for Scalar ADD.D F12,F10,F2 L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3
L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4
ADD.D F16,F14,F2
Source code: ADD.D F20,F18,F2 ADD.D F24,F22,F2 5
S.D F4,0(R1) S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6
for (i = 1000; i >0; i=i-1) S.D -16(R1),F12 S.D -24(R1),F16 7
S.D F8,-8(R1)
x[i] = x[i] + s; S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#48 8
DADDUI R1,R1,#-32
S.D -0(R1),F28 BNEZ R1,LOOP 9
S.D F12,-16(R1)
Register mapping: Unrolled 7 iterations to avoid delays
S.D F16,-24(R1)
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
s F2 BNE R1,R2,Loop
Average: 2.5 ops per clock, 50% efficiency
i R1 Note: Need more registers in VLIW (15 vs. 6 in SS)
Problems with 1st-generation VLIW
• Increase in code size
▫ loop unrolling
▫ partially empty VLIWs
• Operated in lock-step; no hazard detection HW
▫ a stall in any functional-unit pipeline caused the entire processor to stall, since all functional units had to be kept synchronized
▫ the compiler might predict functional-unit latencies, but caches are hard to predict
▫ modern VLIWs are "interlocked" (they identify dependences between bundles and stall)
• Binary code compatibility
▫ strict VLIW => different numbers of functional units and unit latencies require different versions of the code

VLIW tradeoffs
• Advantages
▫ "simpler" hardware, because the HW does not have to identify independent instructions
• Disadvantages
▫ relies on a smart compiler
▫ code incompatibility between generations
▫ there are limits to what the compiler can do (it can't move loads above branches, or loads above stores)
• Common uses
▫ the embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue
IA-64 and EPIC
• 64-bit instruction set architecture
▫ not a CPU, but an architecture
▫ Itanium and Itanium 2 are CPUs based on IA-64
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
• Uses EPIC: Explicitly Parallel Instruction Computing
• A departure from the x86 architecture
• Meant to achieve out-of-order performance with in-order HW + compiler smarts
▫ stop bits to help with code density
▫ support for control speculation (moving loads above branches)
▫ support for data speculation (moving loads above stores)
• Instruction bundle (VLIW); details in Appendix G.6
Functional units and template
• Functional units:
▫ I (integer), M (integer + memory), F (FP), B (branch), L + X (64-bit operands + special instructions)
• Template field:
▫ maps instructions to functional units
▫ indicates stops
• (Code examples 1/2 and 2/2: bundle encoding figures)

Control speculation
• Can the compiler schedule an independent load above a branch?
  Bne R1, R2, TARGET
  Ld  R3, R4(0)
• What are the problems?
• EPIC provides speculative loads:
  Ld.s  R3, R4(0)
  Bne   R1, R2, TARGET
  Check R4(0)
Data speculation
• Can the compiler schedule an independent load above a store?
  St R5, R6(0)
  Ld R3, R4(0)
• What are the problems?
• EPIC provides "advanced loads" and an ALAT (Advanced Load Address Table):
  Ld.a R3, R4(0)   ; creates an entry in the ALAT
  St   R5, R6(0)   ; looks up the ALAT; on a match, jump to fixup code

EPIC conclusions
• The goal of EPIC was to maintain the advantages of VLIW, but achieve the performance of out-of-order execution.
• Results:
▫ complicated bundling rules save some space, but make the hardware more complicated
▫ special hardware and instructions for scheduling loads above stores and branches (new, complicated hardware)
▫ special hardware to remove branch penalties (predication)
▫ the end result is a machine as complicated as an out-of-order design, but now also requiring a super-sophisticated compiler.