Advanced Computer
Architectures
– HB49 –
Part 2.2
Vincenzo De Florio
K.U.Leuven / ESAT / ELECTA
© V. De Florio
KULeuven 2002

Advanced Computer Architectures – Part 2.2

Part 2.2 of the slides I wrote for the course "Advanced Computer Architectures", which I taught in the framework of the Advanced Masters Programme in Artificial Intelligence of the Catholic University of Leuven, Leuven (B)

  1. 1. Advanced Computer Architectures – HB49 – Part 2.2 Vincenzo De Florio K.U.Leuven / ESAT / ELECTA
  2. 2. Course contents
     • Basic Concepts
     → Computer Design
     • Computer Architectures for AI
     • Computer Architectures in Practice
  3. 3. Computer Design → IS
     • IS Classification
     • Role of the compilers
     → DLX
  4. 4. IS → DLX Architecture
     • An example RISC architecture designed by Patterson and Hennessy
     • Simple register-register (load-store) instruction set
     • Designed for efficiency
       - From the HW viewpoint
       - From the compiler viewpoint
     • Useful as an example of good IS design
  5. 5. IS → DLX Architecture
     • Registers:
       - 32 general-purpose registers R0, R1, …, R31, where R0 always holds 0
       - 32 single-precision floating-point registers, or 16 double-precision floating-point registers F0, F2, …, F30
     • Data types
       - Like in C: 1-byte, 2-byte, and 4-byte integers; 4-byte and 8-byte floats
     • Addressing modes: just 2
       - Immediate (example: Add R4, #3)
       - Displacement (example: Add R4, 100(R1))
       - 16-bit fields
       - Displacement also gives register deferred (Add R4, 0(R1)) and absolute (Add R4, 100(R0))
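The way displacement mode covers the register deferred and absolute cases can be sketched as follows. This is a minimal illustration, not DLX tooling: the function name `effective_address`, the register contents, and the signed 16-bit range check are assumptions made for the example.

```python
def effective_address(regs, base, displacement):
    """Displacement addressing: base register contents plus a 16-bit offset."""
    # Assumption: the 16-bit field is treated as a signed value.
    if not -(1 << 15) <= displacement < (1 << 15):
        raise ValueError("displacement must fit in a 16-bit field")
    return regs[base] + displacement


regs = [0] * 32       # R0 holds 0, as on the slide
regs[1] = 0x2000      # hypothetical content of R1

print(effective_address(regs, 1, 100))  # displacement: 100(R1)
print(effective_address(regs, 1, 0))    # register deferred: 0(R1)
print(effective_address(regs, 0, 100))  # absolute: 100(R0)
```

Register deferred is simply a zero displacement, and absolute addressing is a displacement from R0, which is why two modes suffice.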
  6. 6. IS → DLX Architecture
     • Big endian
     • DLX instruction format
       - Just two addressing modes ⇒ easy to encode in the opcode
       - All instructions have the same length and start with a 6-bit opcode ⇒ easier decoding algorithm ⇒ faster processing ⇒ a shorter cycle is possible
     • Layout: P&H p.99
  7. 7. IS → DLX Architecture
     • Mnemonics: L = load, S = store, followed by
       - B = byte, H = half word, W = word, F = float, D = double
       - Examples: LB R1, 50(R9) and SF 50(R0), F2
     • ADD… (arithmetic ops)
     • SL… (shift left, logical ops)
     • J…, B… (jump and branch ops)
  8. 8. IS → DLX Architecture
     • How good is the DLX architecture?
     • DLX is a RISC architecture
     • What’s a RISC architecture, and what’s the difference between a RISC and a non-RISC architecture?
  9. 9. IS → DLX Architecture
     • CISC = complex IS architecture
       - Architecture of the ’70s
       - Axioms: (1) the IS must be easy to program with; (2) the IS must be easy to compile for
       - IS not too far away from a HLL
       - IS includes high-level constructs: loop instructions vs. gotos; complex CALL instructions preserving the register file; case/switch instructions
       - Large set of addressing modes
       - All addressing modes are available with all the instructions
       - Key requirement of the ’70s: minimize code size
  10. 10. IS → DLX Architecture
     • Why?
     • Because, in the ’70s, RAM memories were 1000 times smaller than today
     • Code space was a key factor
  11. 11. IS → DLX Architecture
     • RISC = reduced IS architecture
       - Key architecture today
       - Axioms: (1) the IS must be simple; (2) it must be easy to implement in HW; (3) it should match clever design solutions (e.g., pipelining); (4) it should be a good target for today’s optimising compilers
  12. 12. IS → DLX Architecture
     • RISC = reduced IS architecture
       - Simple instructions
       - A few simple addressing modes
       - Fixed-length instructions
       - “Many” general purpose registers
       - Key goal: help the machine go fast
       - Recall: $\mathrm{CPUTIME}(p) = \mathrm{IC}(p) \times \mathrm{CPI}(p) \,/\, \text{clock rate}$
       - In general, RISCs increase the number of instructions executed (IC)…
       - …but at the same time they decrease CPI
       - CPI decreases by a larger factor than IC increases ⇒ shorter CPUTIME
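The trade-off stated on this slide can be written as a ratio. The following is a short worked form, assuming both machines run at the same clock rate (an assumption made here so that the clock term cancels):

```latex
\frac{\mathrm{CPUTIME}_{\mathrm{RISC}}}{\mathrm{CPUTIME}_{\mathrm{CISC}}}
  = \frac{\mathrm{IC}_{\mathrm{RISC}}}{\mathrm{IC}_{\mathrm{CISC}}}
    \times
    \frac{\mathrm{CPI}_{\mathrm{RISC}}}{\mathrm{CPI}_{\mathrm{CISC}}}
```

The first factor is greater than 1 and the second is smaller than 1; the RISC machine is faster whenever their product is below 1.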
  13. 13. IS → DLX Architecture
  14. 14. IS → DLX Architecture
     • Clock cycles: assumed to be the same
     • Results: $IC_{MIPS} \approx 2 \times IC_{VAX}$ and $CPI_{MIPS} \approx CPI_{VAX} / 6$
     • The performance of the MIPS M2000 is about 3 times the performance of the VAX 8700
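The factor of 3 follows directly from the CPUTIME formula. Here is a small sketch with normalised, hypothetical absolute numbers; only the ratios (IC doubled, CPI divided by 6, equal clocks) come from the slide.

```python
def cpu_time(instruction_count, cpi, clock_rate):
    """CPUTIME = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate


# Normalised units: only the ratios between the two machines matter.
vax_time = cpu_time(instruction_count=1.0, cpi=6.0, clock_rate=1.0)
mips_time = cpu_time(instruction_count=2.0, cpi=1.0, clock_rate=1.0)

print(vax_time / mips_time)  # 3.0
```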
  15. 15. Computer Design
     • Quantitative assessments
     • Instruction sets
     → Pipelining
     • Parallelism
  16. 16. Pipelining
     • Pipelining = “an implementation technique whereby multiple instructions are overlapped in execution” (P&H)
     • An assembly line: different steps (pipe stages) are completing different parts of different instructions in parallel
  17. 17. Pipelining
     • Four persons (A, B, C, and D) have to perform a certain job on 4 sets of items. The job consists of 4 phases.
     • Phase 1 (washing) takes 30 minutes
     • Phase 2 (drying) takes another 30 minutes
     • Phase 3 (packaging) takes another 30 minutes
     • Phase 4 (delivering) also takes 30 minutes
  18. 18. Pipelining
     [Figure: timeline from 6 PM to 2 AM; A, B, C, and D run their four 30-minute phases one after another, sixteen phases in total]
     • Doing the job sequentially takes 8 hours
  19. 19. Pipelining
     [Figure: timeline from 6 PM to 2 AM; the phases of A, B, C, and D overlap across seven 30-minute slots]
     • Key idea: one starts a new phase as soon as possible
     • The whole job is now finished in just 3.5 hours
  20. 20. Pipelining
     [Figure: the same overlapped timeline for A, B, C, and D]
     • Between 7.30 and 8 pm, each person is busy
     • What if they had more jobs to do?
  21. 21. Pipelining
     [Figure: overlapped timeline in which further jobs keep arriving after A, B, C, and D]
     • Between 7.30 and 9.30 pm, a whole job is completed every 30 minutes
     • During that period, each worker is permanently at work…
     • …but a new input must arrive within 30 minutes
  22. 22. Pipelining
     • Important issues in this example
     • Each phase has the same complexity ⇒ each phase takes the same amount of time!
     • In the sequential processing example, the requirement was: a new input must be ready for processing every four phases
     • Now, a new input must be available every phase time!
       - The means that brings the input needs to be four times as fast
       - One gets more from the system, though one also asks more of it
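The 8-hour and 3.5-hour figures of the laundry example can be reproduced with a few lines. This is a sketch under the example's own assumptions (equal phase times, a new job starting every phase); the function names are made up for illustration.

```python
def sequential_minutes(jobs, phases, phase_minutes):
    """Each job runs all its phases before the next job starts."""
    return jobs * phases * phase_minutes


def pipelined_minutes(jobs, phases, phase_minutes):
    """The first job needs `phases` slots; each further job adds one slot."""
    return (phases + jobs - 1) * phase_minutes


print(sequential_minutes(4, 4, 30) / 60)  # 8.0 hours
print(pipelined_minutes(4, 4, 30) / 60)   # 3.5 hours
```

With many jobs, the pipelined time approaches one phase time per job, which is the steady state described for the period between 7.30 and 9.30 pm.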
  23. 23. Pipelining
     • Also in the execution of, e.g., DLX instructions, we distinguish a number of distinct phases; we call them cycles, because each one takes one clock cycle
     • In DLX, an instruction is completed in at most five cycles
     • A number of special purpose registers are used for this:
       - PC (program counter) = address of the instruction to be executed
       - IR (instruction register) = instruction to be executed = *(PC)
       - NPC (next program counter), etc.
  24. 24. Memory and special purpose registers in DLX
     [Figure: memory words at addresses 100, 104, 108, 10C, 110, 114, 118, holding among others BEQ R1, R3, eq3; BEQ R1, R5, eq5; and BGT R1, #0, positive; next to it the special purpose registers PC, IR, NPC, IMM, TMP1, TMP2, ALUOUT, COND, and LMD]
  25. 25. Executing DLX Instructions: Phase 1: Instruction Fetch (IF)
     [Figure: same memory and register layout; PC = 00 00 01 00, and IR is loaded with the word at that address, 52 71 73 10 (BEQ R1, R3, eq3)]
  26. 26. Executing DLX Instructions: Phase 1: Instruction Fetch (IF)
     [Figure: same layout; PC = 00 00 01 00, IR = 52 71 73 10, and NPC ← PC + 4 = 00 00 01 04]
  27. 27. Executing DLX Instructions: Phase 2: Instruction Decode and Register Fetch (ID)
     [Figure: same layout; PC = 00 00 01 00, IR = 52 71 73 10, NPC = 00 00 01 04, IMM = 00 00 00 10, TMP1 = (R1), TMP2 = (R3)]
  28. 28. Executing DLX Instructions: Phase 3: Execution (EX, branch)
     [Figure: same layout; ALUOUT ← NPC + IMM = 00 00 01 04 + 00 00 00 10 = 00 00 01 14]
  29. 29. Executing DLX Instructions: Phase 3: Execution (EX, branch)
     [Figure: same layout; ALUOUT = 00 00 01 14 and COND ← ((R1) == (R3))]
  30. 30. Executing DLX Instructions: Phase 3: Execution
     • An instruction only enters an active phase when it reaches phase EX
     • At that point, the instruction is said to have issued or to have committed
     • The machine state is only changed when an instruction has committed
  31. 31. Executing DLX Instructions: Phase 4: Memory access/branch completion (MEM, branch)
     [Figure: same layout; COND = ((R1) == (R3)), ALUOUT = 00 00 01 14, and PC now holds 00 00 01 14]
  32. 32. Executing DLX Instructions
     • DLX branch instructions have only 4 phases
     • The fifth phase is the write-back (WB), in which registers are loaded with an output from the ALU (ALUOUT) or from LMD (see P&H Chapter 3)
     • For instance, when the instruction is LW R1, 100(R0), phases 3 – 5 are as follows:
       3. ALUOUT ← TMP1 + IMM   /* i.e., R0 + 100 */
       4. LMD ← Mem[ALUOUT]
       5. R1 ← LMD
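Phases 3 to 5 of the load can be traced in a few lines. This is a minimal sketch, not a DLX simulator: the function name `run_lw`, the dictionary used as memory, and the value stored at address 100 are assumptions for illustration.

```python
def run_lw(regs, memory, dest, base, imm):
    """Trace LW dest, imm(base) through the phases named on the slide."""
    tmp1 = regs[base]        # ID: register fetch into TMP1
    aluout = tmp1 + imm      # phase 3 (EX): ALUOUT <- TMP1 + IMM
    lmd = memory[aluout]     # phase 4 (MEM): LMD <- Mem[ALUOUT]
    regs[dest] = lmd         # phase 5 (WB): R1 <- LMD
    return aluout, lmd


regs = [0] * 32              # R0 holds 0
memory = {100: 42}           # hypothetical content of address 100
print(run_lw(regs, memory, dest=1, base=0, imm=100))  # (100, 42)
print(regs[1])                                        # 42
```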
  33. 33. Pipelined execution (F = fetch, D = decode, E = execute, W = write back)
              T1   T2   T3   T4   T5   T6
     Instr 1  F1   D1   E1   W1
     Instr 2       F2   D2   E2   W2
     Instr 3            F3   D3   E3   W3
     Instr 4                 F4   D4   E4
     Instr 5                      F5   D5
     Instr 6                           F6
     [Figure: cache/memory → fetch unit → decode unit → execute unit → write back → register file, with successive instructions occupying the units]
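The staircase pattern of the diagram follows from one rule: instruction i enters stage s at time i + s. A small sketch that regenerates the table is shown below; the function name and the four-stage list mirror the diagram and are otherwise arbitrary.

```python
STAGES = ["F", "D", "E", "W"]  # fetch, decode, execute, write back


def pipeline_table(n_instr, n_cycles):
    """Return rows of cells such as 'F1' or '' for each instruction and cycle."""
    rows = []
    for i in range(1, n_instr + 1):
        row = []
        for t in range(1, n_cycles + 1):
            stage = t - i  # instruction i is fetched in cycle i
            row.append(f"{STAGES[stage]}{i}" if 0 <= stage < len(STAGES) else "")
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_table(6, 6), start=1):
    print(f"Instr {i}: " + " ".join(cell.ljust(2) for cell in row))
```

Reading a column gives the instructions in flight in that cycle; from T4 onward all four units are busy.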
  34. 34. Pipelining
     • With respect to a non-pipelined machine, the memory system must deliver n times the bandwidth (n being the number of pipeline stages)
     • In pipelined operation, n instructions are being processed concurrently: on average n memory accesses per clock cycle
       - This worsens the memory bottleneck: even apart from technological advances, this architectural modification increases the number of memory accesses per clock cycle
  35. 35. Pipelining
     • In DLX, each instruction takes 5 clock cycles to complete…
     • …but during each clock cycle, the HW initiates a new instruction and is executing some part of 5 different instructions
  36. 36. Pipelining
     • Clearly pipelining increases the complexity of the HW
       - Each stage involves a set of HW resources; we need to guarantee that the same HW resource is used by at most one pipeline stage in any cycle
       - When the pipeline is in steady state, in each cycle the register file is accessed twice: in ID (for reading) and in WB (for writing) ⇒ each clock cycle, we need to perform two reads and one write
       - We need to guarantee consistent operation even when we read from and write to, e.g., the same register
  37. 37. Pipelining
     • In order to realize the pipeline, values and control information must “move through” the pipeline from one stage to the next
     • Special registers, called pipeline registers or pipeline latches, convey that information
     • This is because, instead of having, e.g., a single NPC register, we need to have NPC’, NPC’’, NPC’’’… representing the values of NPC during the different stages of different instructions
     • For instance, bwIDandEX.NPC ← bwIFandID.NPC
  38. 38. Pipelining
     Stage   Actions and pipeline registers
     IF      bwIFandID.IR ← *PC
             if (bwEXandMEM.COND == TRUE) bwIFandID.NPC ← bwEXandMEM.NPC
             else bwIFandID.NPC ← PC + 4
     ID      bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
             bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
     Example: BEQ R1, R3, eq3, stored as the bytes 52 71 73 10
  39. 39. Pipelining
     Stage   Actions and pipeline registers
     IF      bwIFandID.IR ← *PC
             if (bwEXandMEM.COND == TRUE) bwIFandID.NPC ← bwEXandMEM.NPC
             else bwIFandID.NPC ← PC + 4
     ID      bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
             bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
             bwIDandEX.NPC ← bwIFandID.NPC    (new reg ← old reg)
             bwIDandEX.IR ← bwIFandID.IR      (new reg ← old reg)
  40. 40. Pipelining
     Stage   Actions and pipeline registers
     IF      bwIFandID.IR ← *PC
             if (bwEXandMEM.COND == TRUE) bwIFandID.NPC ← bwEXandMEM.NPC
             else bwIFandID.NPC ← PC + 4
     ID      bwIDandEX.TMP1 ← R[bwIFandID.IR[1]]
             bwIDandEX.TMP2 ← R[bwIFandID.IR[2]]
             bwIDandEX.NPC ← bwIFandID.NPC
             bwIDandEX.IR ← bwIFandID.IR
             bwIDandEX.IMM ← bwIFandID.IR[3]
     Example: BEQ R1, R3, eq3, stored as the bytes 52 71 73 10
  41. 41. Pipelining
     Stage   Actions and pipeline registers
     EX      bwEXandMEM.ALUOUT ← bwIDandEX.NPC + bwIDandEX.IMM
             bwEXandMEM.COND ← bwIDandEX.TMP1 rel bwIDandEX.TMP2
     …and so forth (see P&H, p.136)
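The stage actions above can be tried out on the BEQ example from the earlier slides. The sketch below is a loose illustration under stated assumptions: the `Latch` class, the `fields` helper, and the register contents are hypothetical; the field layout is read off the slides' byte picture (52 71 73 10) and is not the real DLX encoding; only the not-taken path of IF is modelled, and `rel` is taken to be equality for BEQ.

```python
from dataclasses import dataclass


@dataclass
class Latch:
    """One pipeline register; fields a stage does not set stay at 0."""
    IR: int = 0
    NPC: int = 0
    TMP1: int = 0
    TMP2: int = 0
    IMM: int = 0


def fields(ir):
    # Hypothetical layout from the slide picture: IR[1] -> 71, IR[2] -> 73, IR[3] -> 10.
    return (ir >> 16) & 0xF, (ir >> 8) & 0xF, ir & 0xFF


def stage_if(memory, pc):
    # bwIFandID.IR <- *PC ; bwIFandID.NPC <- PC + 4 (no taken branch pending)
    return Latch(IR=memory[pc], NPC=pc + 4)


def stage_id(if_id, regs):
    rs1, rs2, imm = fields(if_id.IR)
    # New latch <- old latch for IR and NPC, plus the fetched operands.
    return Latch(IR=if_id.IR, NPC=if_id.NPC,
                 TMP1=regs[rs1], TMP2=regs[rs2], IMM=imm)


def stage_ex(id_ex):
    # ALUOUT <- NPC + IMM ; COND <- TMP1 rel TMP2 (equality for BEQ)
    return id_ex.NPC + id_ex.IMM, id_ex.TMP1 == id_ex.TMP2


memory = {0x100: 0x52717310}   # BEQ R1, R3, eq3 as drawn on the slides
regs = [0] * 32
regs[1] = regs[3] = 7          # hypothetical equal values, so the branch is taken

if_id = stage_if(memory, 0x100)
id_ex = stage_id(if_id, regs)
aluout, cond = stage_ex(id_ex)
print(hex(id_ex.NPC), hex(aluout), cond)  # 0x104 0x114 True
```

The values 0x104 and 0x114 match the NPC and ALUOUT contents shown in the single-instruction walkthrough.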
  42. 42. Pipelining
     • More registers are required ⇒ a more complex design is to be carried out
     • More complex algorithm ⇒ takes more time to complete
  43. 43. Pipelining
     • Indeed, implementing an instruction pipeline increases the instruction throughput (average number of instructions completed in one time unit)…
     • …though it slightly increases the execution time of each instruction
       - Overhead for controlling the pipeline
       - Overhead for avoiding “hazards” (to be discussed later on)
  44. Pipelining → Quantitative measurements
      • Let U be an unpipelined machine
      • Clock cycle of U = ccU = 10 ns
      • Cycle distribution of U is as follows:
          - ALU instructions (40%) take 4 cycles
          - Branches (20%) take 4 cycles
          - Memory operations (40%) take 5 cycles
      • P = pipelined version of U
      • Clock cycle of P = ccP = 11 ns (overhead: 1 ns per cycle)
      • How fast is P w.r.t. U? (Assumption: continuous flow is available, no pipeline stalls...)
  45. Pipelining → Quantitative measurements
      • Average Instruction Execution Time = T
      • TU = ccU × average CPI
           = 10 ns × ( (40% + 20%) × 4 + 40% × 5 )   (ALU and BRANCH take 4 cycles, MEM takes 5 cycles)
           = 10 ns × 4.4 = 44 ns
      • TP = ccP × average CPI = ccP × 1 = 11 ns
      • Speedup = TU / TP = 44 ns / 11 ns = 4
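The arithmetic of this example can be re-checked with a few lines of Python. This is only a sketch: the function and variable names are illustrative and do not come from the slides, while the numbers are the ones given for U and P.

```python
# Sketch: recompute the speedup of the pipelined machine P over the
# unpipelined machine U. All names here are illustrative.

def average_cpi(mix):
    """mix: list of (fraction_of_instructions, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)

CC_U_NS = 10  # clock cycle of U, in ns
CC_P_NS = 11  # clock cycle of P, in ns (1 ns of overhead per cycle)

# ALU (40%, 4 cycles), branches (20%, 4 cycles), memory operations (40%, 5 cycles)
mix_u = [(0.40, 4), (0.20, 4), (0.40, 5)]

t_u = CC_U_NS * average_cpi(mix_u)  # average instruction time on U
t_p = CC_P_NS * 1.0                 # ideal pipeline: CPI = 1
speedup = t_u / t_p

print(f"TU = {t_u:.1f} ns, TP = {t_p:.1f} ns, speedup = {speedup:.2f}")
```

Changing `CC_P_NS` shows how quickly the per-cycle overhead eats into the speedup.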
  46. Pipelining → Hazards
      • Ideally, pipelines should continuously “crunch” instructions without being interrupted
      • This way, the speedup is maximum
      • In reality there exist three classes of impediments that prevent the next instruction from being executed:
          - Structural Hazards
          - Data Hazards
          - Control Hazards
        to be described in what follows
  47. Pipelining → Hazards
      • Hazards are a problem because they force the pipeline to stall (see later)
      • Later on we will show some techniques for hazard prevention
  48. Pipelining → Structural Hazards
      • Structural Hazards are resource conflicts
      • Not every combination of instructions is allowed
          - because not every functional unit is fully pipelined
          - or because of other resource conflicts
          - A problem of cost-effectiveness
          - Consequence: a stall (“bubble”) floats through the pipeline
  49. Pipelining → Structural Hazards
      [Figure: pipeline diagram over cycles 1 to 8 with rows LOAD, Instr2, Instr3, Instr4; “Mem” marks the memory accesses of LOAD and of Instr4]
      If the machine has just one memory port, this is a structural hazard
  50. Pipelining → Structural Hazards
      [Figure: the same pipeline diagram (cycles 1 to 8; rows LOAD, Instr2, Instr3, Instr4) with a bubble inserted]
  51. Pipelining → Structural Hazards
      • One of the key principles of computer design: make the common case fast, and the rare case correct
      • If a particular structural hazard does not occur very frequently, it may not be worth the cost to avoid it
  52. Pipelining → Structural Hazards
      • Avoiding a conflict has a cost due to the extra redundancy, but also a cost due to extra control
      • Compare for instance fig. 3.1 and fig. 3.4 of P&H
      • One must be careful that this overhead does not force a longer clock cycle → lower clock rate
          - Recall: CPUTIME(p) = IC(p) × CPI(p) / clock rate
  53. Pipelining → Data Hazards
      • Pipelining overlaps the execution of a set of instructions
      • Data Hazards are hazards due to data dependencies between these overlapped executions
          ADD R1, R2, R3
          SUB R4, R5, R1
          AND R6, R1, R7
          OR  R8, R1, R9
          XOR R10, R1, R11
      ADD requires 5 cycles to complete! SUB may use the wrong value!
  54. Pipelining → Data Hazards
      • Pipelining overlaps the execution of a set of instructions
      • Data Hazards are hazards due to data dependencies between these overlapped executions
          ADD R1, R2, R3
          SUB R4, R5, R1
          AND R6, R1, R7
          OR  R8, R1, R9
          XOR R10, R1, R11
      ADD requires 5 cycles to complete! SUB, AND, and OR require R1 sooner; XOR is “far” enough
  55. Pipelining → Data Hazards
      [Figure: pipeline diagram over cycles 1 to 8 for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11; labels: DATA HAZARDS, NOT A DATA HAZARD]
  56. Pipelining → Minimizing or Avoiding Data Hazards
      • Let us consider again ADD R1, R2, R3
      • “ADD requires 5 cycles to complete” means “the sum of R2 and R3 will be stored into R1 only at the 5th cycle”
          - Why should we wait for this to happen?
          - Forwarding: using a pipeline register that holds the right value
            SUB R4, R1, R5 becomes SUB R4, bwEXandMEM.ALUOUT, R5
  57. Pipelining → Minimizing or Avoiding Data Hazards
      • How is forwarding realized?
      • By propagating the result of the ALU directly to an input latch of the ALU
      • A custom circuit selects the right value to be input to the ALU: the named register or the propagated value
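The selection between “the named register” and “the propagated value” can be sketched in software as follows. This is a simplified model, not the DLX forwarding logic itself: the function name, the register contents, and the single forwarding source (the EX/MEM pipeline register) are assumptions made for illustration.

```python
# Sketch of the operand selection done by a forwarding circuit.
# Assumption: only one forwarding source is modelled (bwEXandMEM.ALUOUT);
# all names and register values below are hypothetical.

def select_alu_operand(src_reg, regfile, exmem_dest, exmem_aluout):
    """Return the value the ALU should use for source register src_reg."""
    if exmem_dest is not None and exmem_dest == src_reg:
        return exmem_aluout   # the propagated (forwarded) value
    return regfile[src_reg]   # the named register

regfile = {"R1": 0, "R2": 7, "R3": 5, "R5": 2}

# ADD R1, R2, R3 has left EX: its result sits in the pipeline register,
# but R1 in the register file has not been updated yet.
aluout = regfile["R2"] + regfile["R3"]

# SUB R4, R1, R5 enters EX one cycle later and needs R1.
a = select_alu_operand("R1", regfile, "R1", aluout)
b = select_alu_operand("R5", regfile, "R1", aluout)
sub_result = a - b

print(a, b, sub_result)
```

Without the first branch of the `if`, SUB would read the stale `R1` from the register file.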
  58. Pipelining → Minimizing or Avoiding Data Hazards
      • Sometimes forwarding can be avoided by very simple techniques
      • For instance, let us assume that, during each cycle, writes into the register file occur in the first half of the cycle, while reads occur in the second half
      [Figure: one clock cycle split into a write half (W) and a read half (R)]
  59. Pipelining → Minimizing or Avoiding Data Hazards
      [Figure: pipeline diagram over cycles 1 to 8 for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11; annotations: “3, 4: Forwarding”, “5: F. Avoidance”]
  60. Pipelining → Classification of Data Hazards
      • Let (Ik), 1 ≤ k ≤ IC(p), be the ordered series of instructions executed during the run of program p
      • Let i and j be two integers, 1 ≤ i < j ≤ IC(p)
      • So Ii occurs before Ij
      • Let us represent predicate “instruction i writes in memory location v” as Ii → v
      • Let us represent predicate “instruction i reads from location v” as Ii ← v
  61. Pipelining → Classification of Data Hazards
      1. RAW HAZARD (Read-After-Write hazard)
         Order in time t: Ii → v (write), then Ij ← v (read)
      • RAW: data dependency on an operand that needs first to be written by Ii, and then read by Ij
      • If, due to pipelining, Ij reads v before Ii writes it, a RAW hazard occurs: Ij erroneously gets a stale value
  62. Pipelining → Classification of Data Hazards
      2. WAW HAZARD (Write-After-Write hazard)
         Order in time t: Ii → v (write), then Ij → v (write)
      • WAW: data dependency on an operand that must be written in a certain order while it is written in the wrong one
      • If, due to pipelining, Ij writes v before Ii writes it, a WAW hazard occurs: the wrong value gets stored in v
  63. Pipelining → Classification of Data Hazards
      • WAW hazards may happen in pipelines where the write-back stage occurs at different positions
          LW  R1, 0(R2)     IF ID EX MEM1 MEM2 WB
          ADD R1, R2, R3       IF ID EX   WB
      • This cannot happen with instruction sets such as, e.g., DLX, where each instruction takes the same number of cycles
      • Less tricky design → less complexity to handle → fewer pitfalls
  64. Pipelining → Classification of Data Hazards
      3. WAR HAZARD (Write-After-Read hazard)
         Order in time t: Ii ← v (read), then Ij → v (write)
      • WAR: data dependency on an operand that needs first to be read by Ii, and then written by Ij
      • If, due to pipelining, Ij writes v before Ii reads it, a WAR hazard occurs: the wrong value is read from v
      • Ii erroneously gets the NEW value of v, the one produced by Ij
  65. Pipelining → Classification of Data Hazards
      • WAR hazards occur when there are instructions that write results early in the instruction pipeline, as well as instructions that read a source late in the pipeline
      • For instance, this may happen with the autoincrement addressing mode
      • This cannot happen with instruction sets such as, e.g., DLX, where all reads are early (ID stage) and all writes are late (WB stage)
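The three classes can be told apart mechanically from the read and write sets of an earlier instruction Ii and a later instruction Ij. The sketch below only identifies the dependence; whether it actually becomes a hazard depends on the pipeline, as the slides explain. The dictionary layout and the function name are assumptions made for illustration.

```python
# Sketch: classify the data dependences between an earlier instruction
# (Ii) and a later one (Ij) from their read/write sets.
# The representation of an instruction as a dict is hypothetical.

def classify_dependences(earlier, later):
    """Return the set of dependence classes that Ij has on Ii."""
    found = set()
    if earlier["writes"] & later["reads"]:
        found.add("RAW")  # Ij reads what Ii writes
    if earlier["writes"] & later["writes"]:
        found.add("WAW")  # both write the same location
    if earlier["reads"] & later["writes"]:
        found.add("WAR")  # Ij writes what Ii reads
    return found

add = {"writes": {"R1"}, "reads": {"R2", "R3"}}  # ADD R1, R2, R3
sub = {"writes": {"R4"}, "reads": {"R1", "R5"}}  # SUB R4, R1, R5
lw = {"writes": {"R1"}, "reads": {"R2"}}         # LW  R1, 0(R2)

print(classify_dependences(add, sub))  # ADD followed by SUB
print(classify_dependences(lw, add))   # LW followed by ADD
```

The second pair is the LW/ADD example used for WAW hazards on the slide with the MEM1/MEM2 pipeline.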
  66. Pipelining → Hazards
      • In some cases, forwarding and subcycling can prevent a stall
      [Figure: pipeline diagram over cycles 1 to 8 for ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9]
  67. Pipelining → Hazards
      • In some cases, forwarding and subcycling cannot prevent a stall
      [Figure: pipeline diagram over cycles 1 to 8 for LW R1, 0(R2); SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; label: IMPOSSIBLE!]
  68. Pipelining → Hazards
      • A special HW, called the pipeline interlock, detects the hazard and stalls the pipeline until the hazard is cleared
      [Figure: pipeline diagram over cycles 1 to 8 for LW R1, 0(R2); SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, with a bubble in each of SUB, AND, and OR]
  69. Pipelining → Hazards
      • Pipeline interlock penalty: one or more clock cycles
      • Consequences: the CPI for the stalled instruction increases by the length of the stall
  70. Pipelining → Pipeline Scheduling
      • Classical solution: pipeline scheduling
      • The compiler re-arranges the instructions in order to (try to) avoid stalls
      • Example: the compiler tries to avoid generating code like
          LW    x, …
          INSTR …, x
        that is, a load followed by the immediate use of the load destination register
  71. Pipelining → Pipeline Scheduling
      1. Generate DLX code for the expressions
           a = b + c
           d = e – f
         Basic block:
           LW  R1, b
           LW  R2, c
           ADD R3, R1, R2
           SW  a, R3
           LW  R4, e
           LW  R5, f
           SUB R6, R4, R5
           SW  d, R6
  72. Pipelining → Pipeline Scheduling
      2. We make a graph of the dependences among the instructions and we order the instructions so as to minimize the stalls
         Original order:
           LW  R1, b
           LW  R2, c
           ADD R3, R1, R2
           SW  a, R3
           LW  R4, e
           LW  R5, f
           SUB R6, R4, R5
           SW  d, R6
         Scheduled order:
           LW  R1, b
           LW  R2, c
           LW  R4, e
           ADD R3, R1, R2
           LW  R5, f
           SW  a, R3
           SUB R6, R4, R5
           SW  d, R6
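To see what the reordering buys, one can count the load-use stalls in both orders. The sketch below assumes a one-cycle load delay with forwarding everywhere else, so the only stall it counts is a load immediately followed by a use of its destination register. The parser is a hypothetical helper that understands only the simplified syntax of this example.

```python
# Sketch: count load-use stalls in a straight-line DLX-like sequence.
# Assumptions (not from the slides): one stall when an instruction uses
# the destination of the LW that immediately precedes it; no other stalls.

def parse(line):
    """Split 'OP a, b, c' into ('OP', ['a', 'b', 'c'])."""
    op, rest = line.split(None, 1)
    return op, [arg.strip() for arg in rest.split(",")]

def load_use_stalls(program):
    stalls = 0
    prev_load_dest = None
    for line in program:
        op, args = parse(line)
        if op == "SW":
            sources = [args[1]]  # SW mem, Rsrc
        elif op == "LW":
            sources = []         # LW Rdest, mem (symbolic address)
        else:
            sources = args[1:]   # OP Rdest, Rsrc1, Rsrc2
        if prev_load_dest in sources:
            stalls += 1
        prev_load_dest = args[0] if op == "LW" else None
    return stalls

original = ["LW R1, b", "LW R2, c", "ADD R3, R1, R2", "SW a, R3",
            "LW R4, e", "LW R5, f", "SUB R6, R4, R5", "SW d, R6"]
scheduled = ["LW R1, b", "LW R2, c", "LW R4, e", "ADD R3, R1, R2",
             "LW R5, f", "SW a, R3", "SUB R6, R4, R5", "SW d, R6"]

print(load_use_stalls(original), load_use_stalls(scheduled))
```

In the original order both ADD and SUB directly follow the load of one of their operands; in the scheduled order an independent instruction sits in between.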
  73. Pipelining → Control Hazards
      • Control hazards are hazards due to the execution of branches
      • Let us call TAKEN BRANCH a branch that sets the PC to its target address
      • Let us call UNTAKEN BRANCH a branch that does not force the PC to be set; as far as PC is concerned, it behaves like a NOP
  74. Pipelining → Control Hazards
      • The problem with branches is that their nature is only known at run-time
      • Simplest method to deal with branches: as soon as we detect a branch, we stall the pipeline
      • What exactly does “as soon as” mean?
  75. DLX Branch 1: IF (1/2)
      [Figure: memory contents at addresses 100 to 118, including BEQ R1, R3, eq3; BEQ R1, R5, eq5; BGT R1, #0, positive; and the registers PC, IR, NPC, IMM, TMP1, TMP2, ALUOUT, COND, LMD]
      PC = 00 00 01 00; IR = 52 71 73 10
  76. DLX Branch 1: IF (2/2)
      [Figure: same memory and register layout]
      PC = 00 00 01 00; IR = 52 71 73 10; NPC = PC + 4 = 00 00 01 04
      At this point, we’ve just fetched an instruction; but we don’t know yet WHICH ONE!
  77. DLX Branch 2: ID
      [Figure: same memory and register layout]
      PC = 00 00 01 00; IR = 52 71 73 10; NPC = 00 00 01 04; IMM = 00 00 00 10; TMP1 = (R1); TMP2 = (R3)
      At this point, we’ve decoded the instruction and found that it’s indeed a branch
  78. DLX Branch 3: EX (1/2)
      [Figure: same memory and register layout]
      PC = 00 00 01 00; IR = 52 71 73 10; NPC = 00 00 01 04; IMM = 00 00 00 10; TMP1 = (R1); TMP2 = (R3); ALUOUT = NPC + IMM = 00 00 01 14
      Here we get the next PC of the taken branch
  79. DLX Branch 3: EX (2/2)
      [Figure: same memory and register layout]
      PC = 00 00 01 00; IR = 52 71 73 10; NPC = 00 00 01 04; IMM = 00 00 00 10; TMP1 = (R1); TMP2 = (R3); ALUOUT = 00 00 01 14; COND = ((R1) == (R3))
      Only at this point do we know the nature of the branch: brnch = (cond) ? Taken : Untaken;
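The IF, ID, and EX steps of this branch can be replayed with the values shown on the slides (PC = 0x100, IMM = 0x10). This is a sketch only: the contents of R1 and R3 are not given on the slides and are chosen arbitrarily here, and the variable names merely mirror the pipeline register names.

```python
# Sketch: replay BEQ R1, R3, eq3 through IF, ID and EX.
# PC and IMM come from the slides; the register contents are hypothetical.

PC = 0x100
IMM = 0x10
regs = {"R1": 7, "R3": 7}  # assumption: equal values, so the branch is taken

# IF: fetch and compute the fall-through address
npc = PC + 4

# ID: read the operands and the immediate
tmp1, tmp2, imm = regs["R1"], regs["R3"], IMM

# EX: compute the taken address and evaluate the condition
aluout = npc + imm
cond = (tmp1 == tmp2)

# Only now is the next PC known
next_pc = aluout if cond else npc

print(hex(npc), hex(aluout), cond, hex(next_pc))
```

Setting `regs["R3"]` to a different value makes `cond` false and the next PC falls through to `npc`.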
  80. Pipelining → Control Hazards
      • The problem with branches is that their nature is only known at run-time
      • Simplest method to deal with branches: as soon as we detect a branch, we stall the pipeline
          1. “As soon as” means after the IF stage, during stage ID → IF: first stall
          2. Then we need to reach the EX stage to know the address where to branch to → ID: second stall
          3. The nature of a branch is revealed at the end of EX, in MEM → EX: third stall
      • At this point, the pipeline restarts
  81. Pipelining → Control Hazards
      • With a 30% branch frequency and an ideal CPI of 1, three clock cycles of penalty mean that the machine only achieves about HALF the ideal speedup from pipelining
      • What can we do to reduce the three-cycle penalty?
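The “about half” figure follows from adding the branch stall cycles to the ideal CPI. A minimal sketch, assuming branch stalls are the only stalls and using illustrative names:

```python
# Sketch: effect of a fixed branch penalty on CPI and on the fraction of
# the ideal pipelining speedup that is actually achieved.
# Assumption: branch stalls are the only source of stalls.

def cpi_with_branch_stalls(ideal_cpi, branch_freq, penalty_cycles):
    return ideal_cpi + branch_freq * penalty_cycles

cpi_three = cpi_with_branch_stalls(1.0, 0.30, 3)  # three-cycle penalty
cpi_one = cpi_with_branch_stalls(1.0, 0.30, 1)    # one-cycle penalty

fraction_three = 1.0 / cpi_three
fraction_one = 1.0 / cpi_one

print(f"3-cycle penalty: CPI = {cpi_three:.2f}, {fraction_three:.0%} of ideal")
print(f"1-cycle penalty: CPI = {cpi_one:.2f}, {fraction_one:.0%} of ideal")
```

With a three-cycle penalty the CPI is 1.9, so roughly 53% of the ideal speedup remains; with a one-cycle penalty about 77% remains.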
  82. Pipelining → Control Hazards
      1. Uncover the nature of the branch earlier in the pipeline: in DLX, this means adding a test to the ID stage
      2. Compute the taken PC earlier: at the cost of an additional adder, we can anticipate the addition that gives the taken PC
      3. (For untaken branches): do not repeat the IF stage
      • These strategies can reduce the branch penalty to one clock cycle
  83. Pipelining → Control Hazards
      • How to deal with branch penalties
      • Four simple compile-time schemes
          - Static, fixed, per-branch predictions
          - Compile-time guesses
      • Simplest: freezing or flushing the pipeline
          - Penalty: one clock cycle
      • Predict not taken:
          - The HW continues as if the branch was not taken (next IR = *(PC + 4))
          - If the branch is taken, the fetched instruction is invalidated (turned into a NOP)
          - Penalty: no penalty if untaken, one cycle if taken
  84. Pipelining → Control Hazards → Predict not taken
      Untaken branch       IF   ID   EX   MEM  WB
      i+1                       IF   ID   EX   MEM  WB
      i+2                            IF   ID   EX   MEM  WB
      i+3                                 IF   ID   EX   MEM  WB
      i+4                                      IF   ID   EX   MEM  WB

      Taken branch         IF   ID   EX   MEM  WB
      i+1                       IF   idle idle idle idle
      Branch target                  IF   ID   EX   MEM  WB
      Branch target + 1                   IF   ID   EX   MEM  WB
      Branch target + 2                        IF   ID   EX   MEM  WB
  85. Pipelining → Control Hazards
      • Predict taken:
          - Hypothesis: the taken branch address is known very early, long before the outcome of the branch is known
          - The HW assumes the branch is taken
          - Penalty: no penalty if taken, one cycle if untaken
          - Because of loops, taken branches outnumber untaken branches
  86. Pipelining → Control Hazards
      • Delayed branch
      • Hypothesis: a branch implies a delay that adds up to the time required to execute n instructions
      • The branch delay slot is then filled in with instructions that would be executed whatever the outcome of the branch test
      • In DLX, n = 1
  87. Pipelining → Control Hazards → Delayed branch
      Untaken branch       IF   ID   EX   MEM  WB
      Branch delay              IF   ID   EX   MEM  WB
      i+2                            IF   ID   EX   MEM  WB
      i+3                                 IF   ID   EX   MEM  WB
      i+4                                      IF   ID   EX   MEM  WB

      Taken branch         IF   ID   EX   MEM  WB
      Branch delay              IF   ID   EX   MEM  WB
      Branch target                  IF   ID   EX   MEM  WB
      Branch target + 1                   IF   ID   EX   MEM  WB
      Branch target + 2                        IF   ID   EX   MEM  WB
  88. Pipelining → Control Hazards → Slot schedule
      • Problem: how to schedule the branch-delay slot
      • Three ways
      • Best choice: an independent instruction from before the branch
          Before:                After:
            INSTR1                 IF TEST THEN
            IF TEST THEN           INSTR1
            Delay slot             INSTR2
            INSTR2                 …
            …                      INSTR N
            INSTR N
      • Penalty: none
  89. Pipelining → Control Hazards → Slot schedule
      • If the best choice is not possible, e.g., due to a dependency, then one may choose between the following two methods:
      1. From target: if it is not possible to select an independent instruction from before the branch (a sure one!), then you must guess: if the branch is judged more likely to be taken, you fill the delay slot with an instruction from the target of the branch
  90. Pipelining → Control Hazards → Slot schedule
      • From target
          Before:                After:
            INSTR1                 INSTR1
            INSTR2                 INSTR2
            …                      …
            IF TEST THEN           IF TEST THEN
            Delay slot             INSTR 1
      • Penalty: none if the branch is a taken one, 1 clock cycle if it’s untaken
      • Assumption: no side effect from executing INSTR 1 when branch is mispredicted (no undo required!)
  91. Pipelining → Control Hazards → Slot schedule
      2. From fall through: if it is not possible to select an independent instruction from before the branch (a sure one!), and if the branch is judged more likely to be untaken, then you fill the delay slot with the instruction at PC+4
          Before:                After:
            IF TEST THEN           IF TEST THEN
            Delay slot             INSTR1
            INSTR1                 INSTR2
            INSTR2                 …
            …                      INSTR N
            INSTR N
  92. Pipelining → Control Hazards → Slot schedule
      • Again, the instruction selected to be placed in the delay slot must be side effect free
      • That instruction must be such that no undo is required if the branch goes in the unexpected direction
                BEQ R2, R3, Skip
                LW  R1, #100
                ...
          Skip: LW  R1, #200
                ...
        The second load overwrites the first one
  93. Pipelining → Control Hazards
      • The LW example is clearly an ideal one. In reality, it is very difficult to select an instruction for the delay slot
      • Furthermore, these schemes are compile-time predictions that may be found to be false at run-time
  94. Pipelining → Control Hazards
      • Improvements are possible: cancelling branches. The branch instructions include a prediction bit (taken vs. untaken). If the prediction bit is false, the branch instruction “cancels” the instruction in the delay slot by writing the NOP bit(s)
      • This makes it easier to select instructions for the delay slot: the side-effect free requirement can be relaxed
  95. Pipelining → Control Hazards
      • The use of delayed and cancelling branches resulted in no penalty 70% of the time on average with 10 programs of the SPEC92 benchmarks (5 int., 5 f.p.)
      • Delayed branches have an extra cost: an interrupt may also occur during the execution of the instruction in the branch delay slot (BDSI). If the branch was taken, then both the address of the BDSI and that of the branch target need to be preserved and restored when the interrupt has been served
  96. Pipelining → Control Hazards
      • The longer the pipeline, the more pipeline stages are required (1) to uncover the current branch target address and (2) to tell the nature of the current branch
      • In DLX, one clock cycle (very small)
      • In R4000, it is 3 clock cycles (1) and 1 clock cycle (2)
  97. Pipelining → Static Branch Prediction & Compiler Support
      • The effectiveness of delayed branch depends on whether our guess turns out to be correct
      • Static branch prediction: predicting the outcome of a branch at compile time (vs. dynamic prediction: prediction based on run-time program behaviour)
      • Static prediction method 1: observing and analysing the program behaviour
      • Static prediction method 2: using profile information collected from earlier runs of the program
  98. Pipelining → Static Branch Prediction & Compiler Support
      • Static prediction method 1: observing and analysing the program behaviour
      • Observations (10 SPEC92 benchmark programs) show that most branches are taken
          - On average, 62% in integer programs, 70% in f.p. programs (67% overall)
          - Of taken branches, backward branches are at least 1.5 times as frequent as forward branches
          - Loop unrolling is a reason for this
  99. Pipelining → Static Branch Prediction & Compiler Support
      • Simplest method: predict-as-taken (1.1)
      • In our benchmark, a minority of these predictions are wrong (34%)
      • Note: on average! The worst misprediction rate is 59%, the best is 9% (in the worst case, predict-as-untaken would give better performance!)
  100. Pipelining → Static Branch Prediction & Compiler Support
      • Method 1.2: predict-bw-as-taken, predict-fw-as-untaken
      • For some programs and compilers, the fraction of forward branches that are taken is below 50%
      • In this case only, M1.2 is better than M1.1
      • This is not the case for the 10 SPEC92 programs, nor in most cases
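The two static schemes can be compared on a branch trace. The sketch below uses a small made-up trace (it is not benchmark data) in which the forward branch is mostly untaken, which is the situation where predict-bw-as-taken/predict-fw-as-untaken wins. All names and addresses are hypothetical.

```python
# Sketch: compare predict-as-taken (M1.1) with
# predict-bw-as-taken / predict-fw-as-untaken (M1.2) on a toy trace.
# The trace is invented for illustration; it is not measured data.

def predict_all_taken(branch_pc, target_pc):
    return True

def predict_bw_taken_fw_untaken(branch_pc, target_pc):
    # A backward branch jumps to a lower address (typically a loop).
    return target_pc < branch_pc

def misprediction_rate(predictor, trace):
    wrong = sum(1 for pc, target, taken in trace
                if predictor(pc, target) != taken)
    return wrong / len(trace)

# (branch address, target address, actually taken?)
trace = [
    (0x120, 0x100, True), (0x120, 0x100, True),    # loop branch, taken
    (0x120, 0x100, True), (0x120, 0x100, False),   # loop exit
    (0x108, 0x118, False), (0x108, 0x118, False),  # forward branch
    (0x108, 0x118, True), (0x108, 0x118, False),
]

rate_m11 = misprediction_rate(predict_all_taken, trace)
rate_m12 = misprediction_rate(predict_bw_taken_fw_untaken, trace)
print(rate_m11, rate_m12)
```

If the forward branch in the trace were taken most of the time, the ranking of the two schemes would flip.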
  101. Pipelining → Static Branch Prediction & Compiler Support
      • Static prediction method 2: using profile information collected from earlier runs of the program
      • You see what happened in the past and consider this as a good model for the future
      • Per branch prediction
      • Key observation and principle: “often,” a given branch has a high-probability behaviour
          - A privileged attribute
          - It is most likely a taken or an untaken branch
  102. Pipelining → Static Branch Prediction & Compiler Support
      • Average # of instructions between mispredictions: 20 vs. 110
      • St.dev: 27 vs. 85 (very large)
  103. Performance of the DLX Integer Pipelining System
      • Assumptions:
          - No misses
          - No clock overhead
          - Basic delayed branch + cancelling delayed branch (1 cycle delay each)
      • Results:
          - (Exercising five SPECint92 programs:)
          - 9% – 23% of the instructions cause a 1 cycle loss
  104. Performance of the DLX Integer Pipelining System
      [Figure: chart whose colors distinguish branch stalls from load stalls]
      • DLX average CPI: 1.11
      • Speedup (5 SPECint92 prgs) = 5 / 1.11 ≈ 4.5
  105. Pipelining → Exceptions
      • An exception is an event that is triggered at run time due to the interaction with the environment and results in a (temporary or permanent) suspension of the current application so as to manage the event
      • Examples:
          - A key has been pressed (interrupt)
          - The user invokes a service of the OS
          - A breakpoint is encountered
          - A division-by-zero condition is encountered
          - An overflow or underflow condition
          - A NaN float
          - Misalignments
          - Access to protected or non-existing memory areas
          - Power failures…
  106. Pipelining → Exceptions
      • What happens to the pipeline when an exception takes place?
      • With pipelining, instructions are no longer “atomic”
      • An instruction is further subdivided into “stages”
      • The instruction is only completed at the end of the last stage
      • If an interrupt occurs in the middle of a committed instruction, the result may be a half-finished instruction
  107. Pipelining → Exceptions
      • Interrupt
          - An external event asks for immediate attention (service) by raising an input line (the INT line)
          - The main program is interrupted wherever it is
          - A jump is made to the interrupt service routine (ISR)
          - After processing the ISR, the main program resumes where it was broken off
      • A pipeline (or machine) is said to be restartable if it can handle an exception (e.g. an interrupt), save the state, and restart without affecting the execution of the program being interrupted
  108. Pipelining → Exceptions
      • Precise exceptions: a property of a pipelined machine such that instructions just before the exception are completed and instructions after the exception can be restarted from scratch
      • Often precise exceptions imply a huge penalty
      • The IBM PowerPC and others adopt two modes:
          - Precise exceptions mode (slow, for debugging)
          - Performance mode (imprecise, fast)
  109. Pipelining → Exceptions
      • In the DLX integer pipeline no instruction updates the machine state before the end of the MEM stage
      • This makes realising precise exceptions very easy
      • The instructions later in the pipeline have not committed yet
      • This is not true, e.g., for the autodecrement mode instructions of the VAX, which cause the update of registers in the middle of the execution of an instruction
  110. Pipelining → Exceptions
      • If such an instruction is aborted due to an exception, the machine state would be left altered
      • Machines with these instructions often have the ability to back out any state change before the instruction has committed
      • If an exception occurs, the machine uses this feature to reset the state of the machine to its value before the interrupted instruction started
  111. Pipelining → Exceptions
      • On VAX and the 360 family, special instructions use the general purpose registers as working storage
      • In such machines, g.p. registers are always saved on exception and restored after the exception
      • The state of partially completed instructions lies in these registers, which makes the exceptions precise
