Advanced Computer
Architectures
– HB49 –
Part 2.3
Vincenzo De Florio
K.U.Leuven / ESAT / ELECTA
© V. De Florio
KULeuven 2002

Basic
Concepts

Computer
Design

Computer
Architectures
for AI

Computer
Architectures
In Practice

2.3/2

Course contents
• Basic Concepts
• Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice

Computer Design
• Quantitative assessments
• Instruction sets
• Pipelining
• Parallelism

Parallelism
• Introduction to parallel processing
• Instruction level parallelism
• (Data level parallelism) -> Part 3
• (Task level parallelism) -> Part 3


Parallelism
• Introduction to parallel processing
  - Basic concepts: granularity, program, process, thread, language aspects
  - Types of parallelism
• Instruction level parallelism

Granularity
• Definition:
  - granularity is the complexity/grain size of some item
  - e.g. computation item (instruction), data item (scalar, array, struct), communication item (token granularity), hardware building block (gate, RTL component)
[Figure: granularity scale from low to high: RISC (e.g. add r1,r2,r4), CISC (e.g. ld *a0++,r1), High Level Languages, HLLs (e.g. x = sin(y)), Application-specific (e.g. edge-det.invert.image)]

Granularity
• Deciding the granularity is an important design choice
• E.g. grain size for the communication tokens in a parallel computer:
  - coarse grain: less communication overhead
  - fine grain: less time penalty when two communication packets compete for transmission over the same channel and collide

Types of parallelism
• Functional parallelism (important for the exam!)
  - Different computations have to be performed on the same or different data
  - E.g. multiple users submit jobs to the same computer, or a single user submits multiple jobs to the same computer
    - this is functional parallelism at the process level
    - taken care of at run-time by the OS

Types of parallelism
• Data parallelism (important for the exam!)
  - The same computations have to be performed on a whole set of data
  - E.g. 2D convolution of an image
    - This is data parallelism at the loop level: consecutive loop iterations are candidates for parallel execution, subject to inter-iteration data dependencies
  - Often leads to a massive amount of parallelism
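The 2D convolution example can be sketched as follows. This is a minimal illustration, not code from the course: the function names are hypothetical, the kernel is applied without flipping, and only the "valid" part of the image is computed. Each output row depends only on the input image, so the rows (loop iterations) can be handed to workers in any order.

```python
from concurrent.futures import ThreadPoolExecutor

def convolve_row(image, kernel, r):
    """Compute output row r; it reads only the input image, never other output rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    return [
        sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
        for c in range(out_w)
    ]

def convolve_sequential(image, kernel):
    """Plain loop over the output rows, one after the other."""
    out_h = len(image) - len(kernel) + 1
    return [convolve_row(image, kernel, r) for r in range(out_h)]

def convolve_parallel(image, kernel, workers=4):
    """Same loop, but the independent iterations are distributed over a pool."""
    out_h = len(image) - len(kernel) + 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: convolve_row(image, kernel, r), range(out_h)))

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(convolve_sequential(image, kernel))
print(convolve_parallel(image, kernel))
```

In CPython the thread pool will not speed up this pure-Python arithmetic; it only demonstrates that the iterations are independent and give the same result when run concurrently.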

Levels of parallelism
• Instruction level parallelism (ILP)
  - Functional parallelism at the instruction level
  - Example: pipelining
• Data level parallelism (DLP)
  - Data parallelism at the loop level
• Process & thread level parallelism (TLP)
  - Functional parallelism at the thread and process level

Parallelism
• Introduction to parallel processing
• Instruction level parallelism
  - Introduction
  - VLIW
  - Advanced pipelining techniques
  - Super scalar

Type of Instruction Level Parallelism utilization
• Sequential instruction issuing, sequential instruction execution
  - von Neumann processors
[Figure: one instruction word feeding a single execution unit (EU)]

Type of Instruction Level Parallelism utilization
• Sequential instruction issuing, parallel instruction execution
  - pipelined processors
[Figure: one instruction word and the execution units EU1, EU2, EU3, EU4]

Type of Instruction Level Parallelism utilization
• Parallel instruction issuing (determined at compile time by the compiler), parallel instruction execution
  - VLIW processors: Very Long Instruction Word
[Figure: one instruction word feeding EU1, EU2, EU3, EU4]

Type of Instruction Level Parallelism utilization
• Parallel instruction issuing (determined at run time by a HW dispatch unit), parallel instruction execution
  - super-scalar processors (to be seen later)
[Figure: instruction window feeding EU1, EU2, EU3, EU4]

Type of Instruction Level Parallelism utilization
• Most processors provide sequential execution semantics
  - regardless of how the processor actually executes the instructions (sequentially or in parallel, in order or out of order), the result is the same as that of sequential execution in the order in which they were written
• VLIW and IA-64 provide parallel execution semantics
  - the assembly code explicitly indicates which instructions are executed in parallel

VLIW
[Figure: VLIW datapath. Main instruction memory -> 128 bit -> Instruction Cache -> 128 bit -> Instruction Register (four instructions, 32 bit each) -> four decoders (Dec), 256 decoded bits each -> four execution units (EU).
Register file: 32 bit each; 8 read ports, 4 write ports.
Cache/RAM: 32 bit each; 2 read ports, 1 write port.
Main data memory: 32 bit; 1 bi-directional port.]

VLIW
• Properties
  - Multiple Execution Units: multiple instructions issued in one clock cycle
  - Every EU requires 2 operands and delivers one result every clock cycle: high data memory bandwidth needed
  - Careful design of data memory hierarchy
    - Register file with many ports
    - Large register file: 64-256 registers
    - Carefully balanced cache/RAM hierarchy with decreasing number of ports and increasing memory size and access time for the higher levels (IMEC research: DTSE)

VLIW
• Properties
  - The compiler should determine which instructions can be issued in a single cycle without control dependency or data dependency conflicts
    - Deterministic utilization of parallelism: good for hard real-time
    - Compile-time analysis of source code: worst-case analysis instead of actual case
    - Very sophisticated compilers, especially when the EUs are pipelined! Perform well since early 2000
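What "determining which instructions can be issued in a single cycle" means can be sketched with a toy bundle packer. This is an illustrative simplification, not a real VLIW compiler: the function name and instruction encoding are hypothetical, every operation is assumed to take one cycle, memory dependencies are ignored, and reads are assumed to happen before writes within a bundle.

```python
def pack_bundles(instructions, width=4):
    """Greedily pack (name, dest, sources) tuples into bundles of at most `width` slots."""
    bundles = []      # bundles[k] = names of the instructions issued in cycle k
    last_write = {}   # register -> bundle index of its latest writer
    last_read = {}    # register -> latest bundle index that reads it
    for name, dest, sources in instructions:
        earliest = 0
        for reg in sources:
            if reg in last_write:                  # RAW: strictly after the producer
                earliest = max(earliest, last_write[reg] + 1)
        if dest in last_write:                     # WAW: strictly after the previous writer
            earliest = max(earliest, last_write[dest] + 1)
        if dest in last_read:                      # WAR: not before an earlier reader
            earliest = max(earliest, last_read[dest])
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1                              # bundle full: try the next cycle
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        if dest is not None:
            last_write[dest] = slot
        for reg in sources:
            last_read[reg] = max(last_read.get(reg, 0), slot)
    return bundles

def loop_body(f_in, f_out):
    """One copy of a load / add / store body, as (name, dest, sources) tuples."""
    return [
        ("LD " + f_in, f_in, ["R1"]),
        ("ADDD " + f_out, f_out, [f_in, "F2"]),
        ("SD " + f_out, None, ["R1", f_out]),
    ]

# Four copies with distinct registers versus four copies reusing F0 and F4.
renamed = [i for a, b in [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
           for i in loop_body(a, b)]
reused = [i for _ in range(4) for i in loop_body("F0", "F4")]

for bundle in pack_bundles(renamed):
    print(" || ".join(bundle))
print(len(pack_bundles(reused)), "bundles when F0 and F4 are reused")
```

With distinct registers the twelve operations fit into three 4-wide bundles; reusing the same two registers forces the packer to spread them over six.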

VLIW
• Properties
  - The compiler should determine which instructions can be issued in a single cycle without control dependency or data dependency conflicts
    - Very difficult to write assembly: the programmer should resolve
      - all control flow conflicts
      - all data flow conflicts
      - all pipelining conflicts
      - and at the same time fit data accesses into the available data memory bandwidth
      - and all program accesses into the available program memory bandwidth
      - e.g. 2 weeks for a sum-of-products (3 lines of C code)
    - All high end DSP processors since 1999 are VLIW processors (examples: Philips Trimedia -- high end TV, TI TMS320C6x -- GSM base stations and ISP modem arrays)

Low power DSP
[Figure: the VLIW datapath. Main instruction memory -> 128 bit -> Instruction Cache -> 128 bit -> Instruction Register (32 bit each) -> four Dec, 256 decoded bits each -> four EU.
Register file: 32 bit each; 8 read ports, 4 write ports. Next memory level: 32 bit each; 2 read ports, 1 write port.
Annotation: too much power dissipation in fetching wide instructions.]

Low power DSP
[Figure: Main IMem -> 24 bit -> ICache -> 24 bit -> Instruction expansion -> 128 bit -> Instruction Register (32 bit each) -> four Dec, 256 decoded bits each -> four EU.
Register file: 32 bit each; 8 read ports, 4 write ports. Next memory level: 32 bit each; 2 read ports, 1 write port.
Annotation: e.g. ADD4 is expanded into ADD || ADD || ADD || ADD.]

Low power DSP
• Properties
  - Power consumption in program memory is reduced by specializing the instructions for the application
  - Not all combinations of all instructions for the EUs are possible, but only a limited set, i.e. those combinations that lead to a substantial speed-up of the application
  - Those relevant combinations are represented by the smallest possible number of bits, to reduce program memory width and hence program memory power consumption
  - Can only be done for embedded DSP applications: the processor is specialized for 1 application (examples: TI TMS320C54x -- GSM mobile phones, TI TMS320C55x -- UMTS mobile phones)

Low power DSP for interactive multimedia
[Figure: Main IMem -> 24 bit -> ICache -> 24 bit -> Reconfigurable instruction expansion -> 128 bit -> Instruction Register (32 bit each) -> four Dec, 256 decoded bits each -> four reconfigurable execution units (REU).
Register file: 32 bit each; 8 read ports, 4 write ports. Next memory level: 32 bit each; 2 read ports, 1 write port.]
• Run-time reconfiguration allows the specialization to be adapted to changing application requirements

Advanced Pipelining
• Pipeline CPI is the result of many components

  $\mathrm{CPUTIME}(p) = \dfrac{IC(p) \times CPI(p)}{\text{clock rate}}$

• A number of techniques act on one or more of these components:
  - Loop unrolling
  - Scoreboarding
  - Dynamic branch prediction
  - Speculation
  - …
• To be seen later
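As a small worked example of the CPU time relation, the sketch below evaluates it for two CPI values. The instruction count, CPI values and clock rate are hypothetical numbers chosen only for illustration.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: one million instructions on a 100 MHz clock.
base = cpu_time(1_000_000, 2.0, 100e6)       # CPI with many stalls
improved = cpu_time(1_000_000, 1.5, 100e6)   # CPI after removing some stalls
speedup = base / improved

print(base, improved, speedup)
```

With IC and clock rate unchanged, the speed-up is simply the ratio of the two CPI values, which is why the techniques listed above concentrate on the stall components of the CPI.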

Advanced Pipelining
• Until now, instruction-level parallelism was searched for within the boundaries of a basic block (BB)
• A BB is 6-7 instructions on average
  - too small to reach the expected performance
• What is worse, there's a big chance that these instructions have dependencies
  - Even less performance can be expected

Advanced Pipelining
• To obtain more, we need to go beyond the BB limitation:
• We must exploit ILP across multiple BBs
• Simplest way: loop level parallelism (LLP):
  - Exploiting the parallelism among iterations of a loop
• Converting LLP into ILP
  - Loop unrolling
    - Statically (compiler-based)
    - Dynamically (HW-based)
• Using vector instructions
  - Does not require LLP -> ILP conversion

Advanced Pipelining
• The efficiency of the conversion depends
  - On the amount of ILP available
  - On the latencies of the functional units in the pipeline
  - On the ability to avoid pipeline stalls by separating dependent instructions by a "distance" (in terms of stages) equal to the latency of the source instruction
• Example: a load must not be followed by the immediate use of the load destination register

    LW     x, …
    INSTR  …, x

Advanced Pipelining: Loop unrolling
Assumptions and steps
1. We assume the following latencies

    Producer instruction | Consumer instruction | Latency (cycles)
    FP ALU OP            | FP ALU OP            | 3
    FP ALU OP            | STORE DBL            | 2
    LOAD DBL             | FP ALU OP            | 1
    LOAD DBL             | STORE DBL            | 0

Advanced Pipelining: Loop unrolling
2. We assume we work with a simple loop such as

    for (I=1; I<=1000; I++)
        x[I] = x[I] + s;

• Note: each iteration is independent of the others
  - Very simple case

Advanced Pipelining: Loop unrolling
3. Translated into DLX, this simple loop looks like this:

    ; assumptions: R1 = &x[1000]
    ;              F2 = s
    Loop: LD    F0, 0(R1)      ; F0 = x[I]                       (W)
          ADDD  F4, F0, F2     ; F4 = F0 + s                     (W)
          SD    0(R1), F4      ; store result                    (W)
          SUBI  R1, R1, #8     ; R1 = R1 - 8, i.e. one element   (O)
          BNEZ  R1, Loop       ; if (R1) goto Loop               (O)

  W = work, O = overhead

Advanced Pipelining: Loop unrolling
4. Tracing the loop (no scheduling!):

    Loop: LD    F0, 0(R1)      ; 1
          stall                ; 2
          ADDD  F4, F0, F2     ; 3
          stall                ; 4
          stall                ; 5
          SD    0(R1), F4      ; 6
          SUBI  R1, R1, #8     ; 7
          BNEZ  R1, Loop       ; 8
          stall                ; 9

• 9 clock cycles per iteration, with 4 stalls

Advanced Pipelining: Loop unrolling
5. With scheduling, we move from

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

to

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
          SD    8(R1), F4

whose trace shows that fewer cycles are wasted:

Advanced Pipelining: Loop unrolling
6. Tracing the loop (with scheduling!):

    Loop: LD    F0, 0(R1)      ; 1
          stall                ; 2
          ADDD  F4, F0, F2     ; 3
          SUBI  R1, R1, #8     ; 4  (O)
          BNEZ  R1, Loop       ; 5  (O)
          SD    8(R1), F4      ; 6

• 6 clock cycles per iteration, with 1 stall
• 3 stalls fewer!
• Still, the useful cycles are just 3
• How to gain more?

Advanced Pipelining: Loop unrolling
7. With loop unrolling: replicating the body of the loop multiple times

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4       ; skip SUBI and BNEZ
          LD    F6, -8(R1)      ; F6 vs. F0
          ADDD  F8, F6, F2      ; F8 vs. F4
          SD    -8(R1), F8      ; skip SUBI and BNEZ
          LD    F10, -16(R1)    ; F10 vs. F0
          ADDD  F12, F10, F2    ; F12 vs. F4
          SD    -16(R1), F12    ; skip SUBI and BNEZ
          LD    F14, -24(R1)    ; F14 vs. F0
          ADDD  F16, F14, F2    ; F16 vs. F4
          SD    -24(R1), F16    ; skip SUBI and BNEZ
          SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements)
          BNEZ  R1, Loop

• Spared 3 x (SUBI + BNEZ)

Advanced Pipelining: Loop unrolling
• Loop unrolling: replicating the body of the loop multiple times
  - Some branches are eliminated
  - The ratio w/o (work/overhead) increases
  - The BB artificially increases its size
    - Higher probability of optimal scheduling
  - Requires a wider set of registers and adjusting the address offsets of the loads and stores
  - (In the given example,) every operation is followed by a dependent instruction
    - Will cause a stall
    - Trace of the unscheduled unrolled loop: 27 cycles
      - 2 per LD, 3 per ADDD, 2 per branch, 1 per any other
    - 6.8 clock cycles per iteration
    - Pure scheduling is better! (6 cycles)

Advanced Pipelining: Loop unrolling
• Unrolled loop plus scheduling

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4       ; skip SUBI and BNEZ
          LD    F6, -8(R1)      ; F6 vs. F0
          ADDD  F8, F6, F2      ; F8 vs. F4
          SD    -8(R1), F8      ; skip SUBI and BNEZ
          LD    F10, -16(R1)    ; F10 vs. F0
          ADDD  F12, F10, F2    ; F12 vs. F4
          SD    -16(R1), F12    ; skip SUBI and BNEZ
          LD    F14, -24(R1)    ; F14 vs. F0
          ADDD  F16, F14, F2    ; F16 vs. F4
          SD    -24(R1), F16    ; skip SUBI and BNEZ
          SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements)
          BNEZ  R1, Loop

Advanced Pipelining: Loop unrolling
• Unrolled loop plus scheduling

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)      ; F6 vs. F0
          ADDD  F4, F0, F2
          SD    0(R1), F4       ; skip SUBI and BNEZ
          ADDD  F8, F6, F2      ; F8 vs. F4
          SD    -8(R1), F8      ; skip SUBI and BNEZ
          LD    F10, -16(R1)    ; F10 vs. F0
          ADDD  F12, F10, F2    ; F12 vs. F4
          SD    -16(R1), F12    ; skip SUBI and BNEZ
          LD    F14, -24(R1)    ; F14 vs. F0
          ADDD  F16, F14, F2    ; F16 vs. F4
          SD    -24(R1), F16    ; skip SUBI and BNEZ
          SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements)
          BNEZ  R1, Loop

Advanced Pipelining: Loop unrolling
• Unrolled loop plus scheduling

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)      ; F6 vs. F0
          LD    F10, -16(R1)    ; F10 vs. F0
          ADDD  F4, F0, F2
          SD    0(R1), F4       ; skip SUBI and BNEZ
          ADDD  F8, F6, F2      ; F8 vs. F4
          SD    -8(R1), F8      ; skip SUBI and BNEZ
          ADDD  F12, F10, F2    ; F12 vs. F4
          SD    -16(R1), F12    ; skip SUBI and BNEZ
          LD    F14, -24(R1)    ; F14 vs. F0
          ADDD  F16, F14, F2    ; F16 vs. F4
          SD    -24(R1), F16    ; skip SUBI and BNEZ
          SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements)
          BNEZ  R1, Loop

Advanced Pipelining: Loop unrolling
• Unrolled loop plus scheduling

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)      ; F6 vs. F0
          LD    F10, -16(R1)    ; F10 vs. F0
          LD    F14, -24(R1)    ; F14 vs. F0
          ADDD  F4, F0, F2
          SD    0(R1), F4       ; skip SUBI and BNEZ
          ADDD  F8, F6, F2      ; F8 vs. F4
          SD    -8(R1), F8      ; skip SUBI and BNEZ
          ADDD  F12, F10, F2    ; F12 vs. F4
          SD    -16(R1), F12    ; skip SUBI and BNEZ
          ADDD  F16, F14, F2    ; F16 vs. F4
          SD    -24(R1), F16    ; skip SUBI and BNEZ
          SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements)
          BNEZ  R1, Loop

Advanced Pipelining: Loop unrolling
• Unrolled loop plus scheduling

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)      ; F6 vs. F0
          LD    F10, -16(R1)    ; F10 vs. F0
          LD    F14, -24(R1)    ; F14 vs. F0
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2      ; F8 vs. F4
          ADDD  F12, F10, F2    ; F12 vs. F4
          ADDD  F16, F14, F2    ; F16 vs. F4
          SD    0(R1), F4       ; skip SUBI and BNEZ
          SD    -8(R1), F8      ; skip SUBI and BNEZ
          SD    -16(R1), F12    ; skip SUBI and BNEZ
          SD    -24(R1), F16    ; skip SUBI and BNEZ
          SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements)
          BNEZ  R1, Loop

• Enough distance between each producer and its consumer to prevent the dependency from turning into a hazard
• 14 clock cycles, or 3.5 clock cycles / iteration (this count assumes the branch delay slot is filled, e.g. by moving the last SD after BNEZ as in step 5, rather than left as a stall)
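The cycle counts quoted for this loop (9, 6, 27 and 14) can be reproduced with a small stall counter. This is a simplified sketch, not a DLX simulator: the helper names are hypothetical, one instruction issues per cycle in program order, address offsets are not modelled, latencies come from the producer/consumer table of step 1 (any other pair is assumed to need no stall, e.g. SUBI to BNEZ via forwarding), and a branch that ends the listing costs one extra cycle for its empty delay slot.

```python
# Stall cycles needed between a producer and a dependent consumer (table of step 1).
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}

def ld(dest, base): return ("load", dest, [base])
def addd(dest, a, b): return ("fp_alu", dest, [a, b])
def sd(base, src): return ("store", None, [base, src])
def subi(reg): return ("int_alu", reg, [reg])
def bnez(reg): return ("branch", None, [reg])

def cycles(program):
    """Clock cycles for one pass through the listing, stalls included."""
    last_write = {}   # register -> (issue cycle, kind of the producer)
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1
        for reg in sources:
            if reg in last_write:
                produced_at, producer = last_write[reg]
                issue = max(issue, produced_at + LATENCY.get((producer, kind), 0) + 1)
        cycle = issue
        if dest is not None:
            last_write[dest] = (cycle, kind)
    if program[-1][0] == "branch":
        cycle += 1    # nothing useful in the branch delay slot
    return cycle

def body(f_in, f_out):
    return [ld(f_in, "R1"), addd(f_out, f_in, "F2"), sd("R1", f_out)]

REGS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
tail = [subi("R1"), bnez("R1")]

unscheduled = body("F0", "F4") + tail
scheduled = [ld("F0", "R1"), addd("F4", "F0", "F2"), subi("R1"), bnez("R1"), sd("R1", "F4")]
unrolled = [i for a, b in REGS for i in body(a, b)] + tail
loads = [ld(a, "R1") for a, _ in REGS]
adds = [addd(b, a, "F2") for a, b in REGS]
stores = [sd("R1", b) for _, b in REGS]
# Listing order as shown above (BNEZ last) versus last SD moved into the delay slot.
unrolled_bnez_last = loads + adds + stores + tail
unrolled_slot_filled = loads + adds + stores[:3] + tail + stores[3:]

for name, prog in [("unscheduled", unscheduled), ("scheduled", scheduled),
                   ("unrolled", unrolled), ("unrolled, BNEZ last", unrolled_bnez_last),
                   ("unrolled, delay slot filled", unrolled_slot_filled)]:
    print(name, cycles(prog))
```

Under these assumptions the scheduled unrolled loop reaches 14 cycles only when the last SD fills the branch delay slot; with BNEZ as the final instruction the model charges one more cycle.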

Advanced Pipelining: Loop unrolling
• Unrolling the loop exposes more computation that can be scheduled to minimize the stalls
• Unrolling increases the size of the BB; as a result, a better choice can be made for scheduling
• A useful technique with two key requirements:
  - Understanding how an instruction depends on another
  - Understanding how to change or reorder the instructions, given the dependencies
• In what follows we concentrate on the first requirement: dependencies

Loop unrolling: dependencies
• Again, let $(I_k)_{1 \le k \le IC(p)}$ be the ordered series of instructions executed during the run of program p
• Given two instructions, $I_i$ and $I_j$, with $i<j$, we say that $I_j$ is dependent on $I_i$ ($I_i \rightarrow I_j$) iff
  - $R(I_i) \cap D(I_j) \neq \emptyset$
    - R is the range and D the domain of a given instruction
    - $I_i$ produces a result which is consumed by $I_j$
  or
  - $\exists\, n \in \{1,\dots,IC(p)\}$ and $\exists\, k_1 < k_2 < \dots < k_n$ such that $I_i \rightarrow I_{k_1} \rightarrow I_{k_2} \rightarrow \dots \rightarrow I_{k_n} \rightarrow I_j$
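The definition can be turned into a small check. The sketch below is a literal reading of it under stated assumptions: the dictionary keys `writes` and `reads` are hypothetical stand-ins for the range R and the domain D, only registers are tracked (memory is ignored), and, as in the definition itself, no attention is paid to whether a register is overwritten in between.

```python
def depends_directly(producer, consumer):
    """R(producer) intersected with D(consumer) is non-empty."""
    return bool(producer["writes"] & consumer["reads"])

def depends(trace, i, j):
    """True if trace[j] depends on trace[i] (i < j), directly or through a chain."""
    reachable = {i}                      # instructions already known to depend on trace[i]
    for k in range(i + 1, j + 1):
        if any(depends_directly(trace[m], trace[k]) for m in reachable):
            reachable.add(k)
    return j in reachable

# One iteration of the loop used in these slides.
trace = [
    {"name": "LD F0, 0(R1)",    "writes": {"F0"}, "reads": {"R1"}},
    {"name": "ADDD F4, F0, F2", "writes": {"F4"}, "reads": {"F0", "F2"}},
    {"name": "SD 0(R1), F4",    "writes": set(),  "reads": {"R1", "F4"}},
    {"name": "SUBI R1, R1, #8", "writes": {"R1"}, "reads": {"R1"}},
    {"name": "BNEZ R1, Loop",   "writes": set(),  "reads": {"R1"}},
]

print(depends(trace, 0, 1))   # LD -> ADDD, direct
print(depends(trace, 0, 2))   # LD -> ADDD -> SD, through a chain
print(depends(trace, 0, 3))   # LD and SUBI are unrelated
```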

Loop unrolling: dependencies
• $(I_i, I_{k_1}, I_{k_2}, \dots, I_{k_n}, I_j)$ is called a (transitive) dependency chain
• Note that a dependency chain can be as long as the entire execution of p
• A hazard implies dependency
• Dependency does not imply a hazard!
• Scheduling tries to place dependent instructions in places where no hazard can occur

Loop unrolling: dependencies
• For instance:

    SUBI  R1, R1, #8
    BNEZ  R1, Loop

• This is clearly a dependence, but it does not result in a hazard
  - Forwarding eliminates the hazard
• Another example:

    LD    F0, 0(R1)
    ADDD  F4, F0, F2

• This is a data dependency which does lead to a hazard and a stall

Loop unrolling: dependencies
• Dealing with data dependencies
• Two classes of methods:
  1. Keeping the dependence while avoiding the hazard (via scheduling)
  2. Eliminating a dependence by transforming the code

Loop unrolling: dependencies
• Class 2 implies more work
• These are optimization methods used by the compilers
• Detecting dependencies when only using registers is easy; the difficulties come from detecting dependencies in memory:
• For instance 100(R4) and 20(R6) may point to the same memory location
• Also the opposite situation may take place:

    LD   20(R4), R2
    …
    ADD  R3, R1, 20(R4)

• If R4 changes between the two instructions, the two occurrences of 20(R4) refer to different locations and there is no dependency
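The two situations can be checked numerically once the register contents are known, which is exactly the information a compiler usually lacks. The register values below are hypothetical, and `same_address` is an illustrative helper, not part of any tool.

```python
def same_address(offset_a, base_a, offset_b, base_b, registers):
    """Two displacement(base) operands refer to the same location iff base + offset match."""
    return registers[base_a] + offset_a == registers[base_b] + offset_b

# Different text, same location: 100(R4) and 20(R6).
registers = {"R4": 1000, "R6": 1080}
print(same_address(100, "R4", 20, "R6", registers))

# Same text, different location: 20(R4) before and after R4 is modified.
before = {"R4": 1000}
after = {"R4": 1008}
print(before["R4"] + 20 == after["R4"] + 20)
```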

Loop unrolling: dependencies
• $I_i \rightarrow I_j$ means that $I_i$ produces a result that is consumed by $I_j$
• When there is no such production, e.g., $I_i$ and $I_j$ are both loads or stores, we call this a name dependency
• Two types of name dependencies:
  - Antidependence
    - Corresponds to WAR hazards
    - $I_i$ reads x; $I_j$ writes x (reordering implies an error)
  - Output dependence
    - Corresponds to WAW hazards
    - $I_i$ writes x; $I_j$ writes x (reordering implies an error)
• No value is transferred between the instructions
• Register renaming solves the problem
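A minimal sketch of static renaming on straight-line code follows. It assumes a hypothetical `(op, dest, sources)` encoding and a caller-supplied pool of free register names; the pool must not contain registers that are only read (here F2 and R1). Every write gets a fresh name and later reads are redirected to it, so only true (producer to consumer) dependencies remain.

```python
def rename(program, fresh_names):
    """Give every register write a fresh name and redirect later reads to it."""
    current = {}                 # original register name -> name currently holding its value
    fresh = iter(fresh_names)
    renamed = []
    for op, dest, sources in program:
        new_sources = [current.get(reg, reg) for reg in sources]
        new_dest = dest
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed

# Two copies of the loop body, both using F0 and F4 (WAR and WAW name dependencies).
program = [
    ("LD",   "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD",   None, ["R1", "F4"]),
    ("LD",   "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD",   None, ["R1", "F4"]),
]

for instruction in rename(program, ["F0", "F4", "F6", "F8"]):
    print(instruction)
```

With this pool the second copy ends up using F6 and F8, so the two copies no longer share any register they write.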

Loop unrolling: dependencies
• Register renaming: if the register name is changed, the conflict disappears
• This technique can be either static (and done by the compiler) or dynamic (done by the HW)
• Let us consider again the following loop:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

• Let us perform unrolling w/o renaming:

Loop unrolling: dependencies

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

• The reuse of F0 and F4 in each copy of the body (yellow arrows in the original figure) creates name dependencies. To solve them, we perform renaming

Loop unrolling: dependencies

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

Loop unrolling: dependencies

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

Loop unrolling: dependencies

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

• The remaining dependencies (yellow arrows in the original figure) are data dependencies. To solve them, we reorder the instructions

Loop unrolling: dependencies
• A third class of dependencies is that of control dependencies
• Examples:

    if (p1) s1;
    if (p2) s2;

  then
  $p_1 \rightarrow_c s_1$ (s1 is control dependent on p1)
  $p_2 \rightarrow_c s_2$ (s2 is control dependent on p2)
• Clearly $\neg(p_1 \rightarrow_c s_2)$, that is, s2 is not control dependent on p1

Loop unrolling: dependencies
• Two properties are critical to control dependency:
  - Exception behaviour
  - Data flow
• Exception behaviour: suppose we have the following excerpt:

        BEQZ  R2, L1
        DIVI  R1, 8(R2)
    L1: …

• We may be able to move the DIVI to before the BEQZ without violating the sequential semantics of the program
• Suppose the branch is taken. Normally one would simply need to undo the DIVI
• What if DIVI triggers a DIVBYZERO exception?

Loop unrolling: dependencies
• Two properties are critical to control dependency:
  - Exception behaviour
  - Data flow
• Data flow must be preserved
• Let us consider the following excerpt:

        ADD   R1, R2, R3
        BEQZ  R4, L
        SUB   R1, R5, R6
    L:  OR    R7, R1, R8

• Value of R1 depends on the control flow
• The OR depends on both ADD and SUB
• Also depends on the nature of the branch
• R1 = (taken)? ADD.. : SUB..

Loop Level Parallelism
• Let us consider the following loop:

    for (I=1; I<=100; I++) {
        A[I+1] = A[I] + C[I];    /* S1 */
        B[I+1] = B[I] + A[I+1];  /* S2 */
    }

• S1 is a loop-carried dependency (LCD): iteration I+1 is dependent on iteration I: A' = f(A)
• S2 is B' = f(B, A')
• If a loop has only non-LCDs, then it is possible to execute more than one loop iteration in parallel, as long as the dependencies within each iteration are not violated

Loop Level Parallelism
• What to do in the presence of LCDs?
• Loop transformations. Example:

    for (I=1; I<=100; I++) {
        A[I+1] = A[I] + B[I];  /* S1 */
        B[I+1] = C[I] + D[I];  /* S2 */
    }

• A' = f(A, B)
  B' = f(C, D)
• Note: no dependencies except LCDs. Instructions can be swapped!

Loop Level Parallelism
• What to do in the presence of LCDs?
• Loop transformations. Example:

    for (I=1; I<=100; I++) {
        A[I+1] = A[I] + B[I];  /* S1 */
        B[I+1] = C[I] + D[I];  /* S2 */
    }

• Note: the flow can be changed
[Figure: the original flow (A0 B0, C0 D0, A1 B1, C1 D1, A2 B2, C2 D2, ...) next to the transformed flow (A0 B0, C0 D0, A1 B1, C1 D1, A2 B2, ...)]

Loop Level Parallelism

    for (i=1; i <= 100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

becomes

    A[1] = A[1] + B[1];
    for (i=1; i <= 99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
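That the two versions compute the same arrays can be checked directly. The sketch below transcribes both loops into Python under stated assumptions: the function names and input data are made up for the test, and the arrays have 102 entries so that indices 1 to 101 are valid as in the C code.

```python
def original(A, B, C, D):
    """Loop as first written: S1 in iteration i+1 reads the B[i+1] written by S2 in iteration i."""
    A, B = A[:], B[:]
    for i in range(1, 101):
        A[i] = A[i] + B[i]          # S1
        B[i + 1] = C[i] + D[i]      # S2
    return A, B

def transformed(A, B, C, D):
    """Transformed loop: the dependence from S2 to S1 now stays inside one iteration."""
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]
    return A, B

# Arbitrary test data, indices 0..101.
A = [3 * k for k in range(102)]
B = [k * k for k in range(102)]
C = [7 - k for k in range(102)]
D = [2 * k + 1 for k in range(102)]

print(original(A, B, C, D) == transformed(A, B, C, D))
```

In the transformed version no iteration reads a value produced by a previous iteration, so the iterations of the new loop can be executed in any order or in parallel.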

Loop Level Parallelism
• A' = f(A, B); B' = f(C, D)
  becomes
  B' = f(C, D); A' = f(A', B')
• Now we have dependencies but no more LCDs!
  It is possible to execute more than one loop iteration in parallel, as long as the dependencies within each iteration are not violated

Dependency avoidance
1. "Batch" approaches: at compile time, the compiler schedules the instructions in order to minimize the dependencies (static scheduling)
2. "Interactive" approaches: at run-time, the HW rearranges the instructions in order to minimize the stalls (dynamic scheduling)
• Advantages of 2:
  - It is the only possible approach when dependencies are only known at run-time (pointers etc.)
  - The compiler can be simpler
  - Given an executable compiled for a machine with machine level X and pipeline organization Y, it can run efficiently on another machine with the same machine level but a different pipeline organization Z

Dynamic Scheduling
• Static scheduling: compiler techniques for scheduling (rearranging) the instructions
  - so as to separate dependent instructions
  - and hence minimize unsolvable hazards causing unavoidable stalls
• Dynamic scheduling: HW-based, run-time techniques
• A dynamically scheduled processor does not try to remove true data dependencies (which would be impossible): it tries to avoid stalling when dependencies are present
• The two techniques can be used together

2.3/81

Dynamic Scheduling: General Idea
• If an instruction is stalled in the pipeline,
no later instruction can proceed
• A dependence between two instructions
close to each other causes a stall
• A stall means that, even though there
may be idle functional units that could
potentially serve other instructions, those
units have to stay idle
• Example:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
• ADDD depends on DIVD; but SUBD does
not. Despite this, it is not issued!

2.3/82

Dynamic Scheduling: General Idea
• So SUBD is not issued even though there might
be a functional unit ready to perform the
requested operation
• Big performance limitation!
• What are the reasons that lead to this
problem?
• In-order instruction issuing and
execution: instructions issue and execute
one at a time, one after the other

2.3/83

Dynamic Scheduling: General Idea
• Example: in DLX, the issue of an
instruction occurs at ID (instruction
decode)
• In DLX, ID checks for absence of
structural hazards and waits for the
absence of data hazards
• These two steps may be made distinct

2.3/84

Dynamic Scheduling: General Idea
• The issue process gets divided into two
parts:
1. Checking the presence of structural
hazards
2. Waiting for the absence of a data hazard
• Instructions are issued in order, but they
execute and complete as soon as their
data operands are available
• Data flow approach

2.3/85

Dynamic Scheduling: General Idea
• The ID pipeline stage is divided into two
sub-stages:
• ID.1 (Issue) : decode the instruction,
check for structural hazards
• ID.2 (read operands) : wait until no data
hazards, then read operands

2.3/86

Dynamic Scheduling: General Idea
• In the DLX floating point pipeline, the EX
stage of instructions may take multiple
cycles
• For each issued instruction I, depending
on the resolution of structural and data
hazards, I may be waiting for
resources or data, or in execution, or
completed
• More than a single instruction can be in
execution at the same time

Scoreboarding
• Scoreboard (CDC 6600, 1964): a technique
to allow instructions to execute out of
order when there are sufficient resources
and no data dependencies
• Goal: execution rate of 1 instruction per
clock cycle in the absence of structural
hazards
• Large set of FUs:
 4 FPUs,
 5 units for memory references
 7 integer FUs

 Highly redundant (parallel) system

• Four steps replace the ID, EX, WB stages
2.3/87

2.3/88

Scoreboarding

• IF (a FU is available && no active
instruction has same destination reg) {   /* avoids WAWs */
issue I to the FU; update state;
}
• ASA, i.e. as soon as (the two source operands are
available in the registers) {
read operands;
manage RAW stalls;
}
• For each FU: ASA (operands are available)
{ start EX; at end of execution (EOX), alert scoreboard; }
• When at WB:
{ wait for (no WAR hazards);   /* avoids WARs */
store output to destination reg; }

2.3/89

Scoreboarding
• In eliminating stalls, a scoreboard is
limited by several factors:
 Amount of parallelism available among the
instructions
 (in the presence of many dependencies there’s
not much that one can do…)
 Number of scoreboard entries
 (How far ahead the pipeline can look for
independent instructions)
 Number and types of FUs
 Number of WAR’s and WAW’s

2.3/90

Scoreboarding
• The effectiveness of the scoreboard
heavily depends on the register file
• All operands are read from registers, all
outputs go to destination registers
 The availability of registers influences the
capability to eliminate stalls

2.3/91

Tomasulo’s approach
• Tomasulo’s approach (IBM 360/91, 1967) :
An improvement of scoreboarding when a
limited number of registers is allowed by
a machine architecture
• Based on virtual registers
• The IBM 360/91 had two key design goals:
 To be faster than its predecessors
 To be machine level compatible with its
predecessors

• Problem: the 360 family had only 4 FP
registers

• Tomasulo combined the key ideas of
scoreboarding with register renaming

Tomasulo’s approach
• IBM 360/91 FUs:
 3 ADDD/SUBD, 2 MULD, 6 LD, 6 SD

• Key element: the reservation station (RS):
a buffer which holds the operands of the
instructions waiting to issue
• Key concept:
 A RS fetches and buffers an operand as soon as
it is available, eliminating the need to get that
operand from a register
 Instead of tracking the source and destination
registers, we track source and destination RS’s
[Figure: reservation stations RSa, RSb and RSc connected through an operation OP]

2.3/92

2.3/93

Tomasulo’s approach
• A reservation station represents either:
 Static data, read from a register
 “Live” data (future data) that will be
produced by another RS and FU

• Hazard detection and execution control
are not centralised into a scoreboard
• They are distributed in each RS, which,
independently:
 Controls a FU attached to it,
 And starts that FU the moment the operands
become available

Tomasulo’s approach
• The operands go to the FUs through the
(wide set of) RS’s, not through the (small)
register file
• This is managed through a broadcast that
makes use of a
common result-or-data bus
• All units waiting for an operand can load
it at the same time
[Figure: reservation stations RSa to RSe connected to the operations OP and OP2]

2.3/94

2.3/95

Tomasulo’s approach
• The execution is driven by a graph of
dependencies
[Figure: dependency graph in which reservation stations RSa to RSg feed two SUBD operations and one MULTD operation]
• A “live data structure” approach (similar
to LINDA): a tuple is made available in the
future, when a thread will have finished
producing it

2.3/100

Major Advantages of Tomasulo’s
• Distributed approach: the RS’s
independently control the FU’s
• Distributed hazard detection logic
• The CDB broadcasts results -> all pending
instructions depending on that result are
unblocked simultaneously
 The CDB, being a bus, reaches many
destinations in a single clock cycle
 If the waiting instructions get their missing
operand in that clock cycle, they can all begin
execution on the next clock cycle

• WAR and WAW are eliminated by
renaming registers using the RS’s

2.3/101

Reducing branch penalties
• Static Approaches
• Dynamic Approaches

2.3/102

Reducing branch penalties:
Dynamic Branch Prediction
• A branch history table

Address      Branch   Nature
...
0xA0B2DF37   BNEZ …   taken
...
0xA0B2F02A   BEQ …    taken
...
0xA0B30504   BNEZ …   untaken
...
0xA0B30537   BGT …    untaken
...

BHT entry    Prediction
...
04           untaken
...
2A           taken
...
37           untaken
...

2.3/103

Dynamic Branch Prediction 
Branch History Table  Algorithm
/* before the branch is evaluated */
If (Current instruction is a branch) {
entry = PC & 0x000000FF;
predict branch as ( BHT [ entry ] );
}
/* after the branch */
If (branch was mispredicted)
BHT [ entry ] = 1 – BHT [ entry ]
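
The algorithm above can be sketched in C as follows. This is only an illustration under stated assumptions: a 256-entry table (matching the 0x000000FF mask), one bit per entry, and hypothetical names (bht1_t, bht1_predict, bht1_update) that do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define BHT_ENTRIES 256          /* indexed by the low 8 bits of the PC */

typedef struct {
    uint8_t taken[BHT_ENTRIES];  /* 1 = predict taken, 0 = predict untaken */
} bht1_t;

/* Index function from the slide: entry = PC & 0x000000FF. */
static unsigned bht1_index(uint32_t pc)
{
    return pc & 0x000000FFu;
}

/* Before the branch is evaluated: read the prediction. */
int bht1_predict(const bht1_t *t, uint32_t pc)
{
    assert(t != 0);
    return t->taken[bht1_index(pc)];
}

/* After the branch: flip the bit on a misprediction.
 * Returns 1 if the branch was mispredicted, 0 otherwise. */
int bht1_update(bht1_t *t, uint32_t pc, int was_taken)
{
    assert(t != 0);
    unsigned e = bht1_index(pc);
    int mispredicted = (t->taken[e] != (was_taken != 0));
    if (mispredicted)
        t->taken[e] = (uint8_t)(1 - t->taken[e]);
    return mispredicted;
}
```

Two branches whose addresses share the same low byte (for instance 0xA0B2DF37 and 0xA0B30537) read and update the same entry in this sketch, which is the aliasing effect of a non-injective index function.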

2.3/104

Dynamic Branch Prediction 
Branch History Table  Algorithm
• Just one bit is enough for coding the
Boolean value “taken” vs. “untaken”
• Note: the function associating addresses
to entries in the BHT is not guaranteed to
be a bijection (one-to-one relationship):
• The algorithm records the most recent
behaviour of one or more branches
 For instance, entry 37 corresponds to two branches

• Despite this, the scheme works well…
• …though in some cases, the performance
of the scheme is not that satisfactory:

2.3/105

Dynamic Branch Prediction 
Branch History Table  Accuracy
• for (i=0; i<BIGN; i++)
      for (j=0; j<9; j++)
          { do_something(); }
• Loop is
 taken nine times in a row
 then not taken once

• Taken 90%, Untaken 10%

• What is the prediction accuracy?
Dynamic Branch Prediction 
Branch History Table  Accuracy

2.3/106

[Figure: outcome sequence of the loop branch (nine Taken, then one Untaken, repeated three times) next to the state of the one-bit BHT entry; every period of ten branches shows 8 successful predictions and 2 mispredictions]
Steady-state (S.S.) prediction accuracy is just 80%!

Dynamic Branch Prediction 
Branch History Table  Accuracy
• Loop branches (taken n-1 times in a row,
untaken once)
• Performance of this dynamic branch
predictor (based on a single-bit prediction
entry):
 Misprediction rate: 2 x 1 / n
 That is, twice the rate of untaken branches


2.3/107
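
The 2 x 1/n figure for the single-bit predictor can be checked with a small simulation. This is a sketch under stated assumptions: the entry starts as "predict untaken", no other branch aliases to it, and the function name is hypothetical.

```c
#include <assert.h>

/* Mispredictions of a one-bit predictor during the last of `periods` runs
 * of a loop branch that is taken n-1 times in a row and then untaken once. */
int one_bit_mispredictions_last_period(int n, int periods)
{
    assert(n >= 2 && periods >= 1);
    int bit = 0;    /* 0 = predict untaken, 1 = predict taken (assumed start) */
    int last = 0;
    for (int p = 0; p < periods; p++) {
        last = 0;
        for (int k = 0; k < n; k++) {
            int taken = (k < n - 1);
            if (bit != taken) {
                last++;
                bit = taken;   /* one-bit scheme: remember the latest outcome */
            }
        }
    }
    return last;
}
```

For n = 10 the function reports 2 mispredictions per period of 10 branches: the final untaken branch is mispredicted, and so is the first taken branch of the next run.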

Dynamic Branch Prediction 
Two-bit Prediction Scheme
• Use a two bit field as a “branch behaviour
recorder”
• Allow a state to change only when two
mispredictions in a row occur:

[Figure: four-state diagram with two “Predict taken” states and two “Predict not taken” states; the Taken / Not taken outcomes move between the states, and the prediction changes only after two mispredictions in a row]

2.3/108

Dynamic Branch Prediction 
Branch History Table  Accuracy
[Figure: outcome sequence of the loop branch (runs of Taken closed by one Untaken) next to the state of the two-bit entry; 2 mispredictions first, then 7 successful predictions; in the steady state each run shows 9 successful predictions]
Steady-state (S.S.) prediction accuracy is now 90%
2.3/109
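
The 90% figure can be reproduced with a two-bit entry. The sketch below assumes the common saturating-counter encoding (states 0..3, predict taken when the counter is 2 or 3), which may differ in detail from the state diagram on the slide, and an initial state of "strongly not taken"; the function name is hypothetical.

```c
#include <assert.h>

/* Mispredictions of a two-bit saturating counter during the last of
 * `periods` runs of a loop branch taken n-1 times, then untaken once. */
int two_bit_mispredictions_last_period(int n, int periods)
{
    assert(n >= 2 && periods >= 1);
    int counter = 0;   /* assumed initial state: strongly "not taken" */
    int last = 0;
    for (int p = 0; p < periods; p++) {
        last = 0;
        for (int k = 0; k < n; k++) {
            int taken = (k < n - 1);
            int predicted_taken = (counter >= 2);
            if (predicted_taken != taken)
                last++;
            if (taken) {
                if (counter < 3) counter++;   /* saturate at 3 */
            } else {
                if (counter > 0) counter--;   /* saturate at 0 */
            }
        }
    }
    return last;
}
```

With n = 10 the first run pays for the warm-up (two mispredicted taken branches plus the final untaken one), while every later run mispredicts only the untaken branch, i.e. 1 misprediction in 10.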

Dynamic Branch Prediction 
Branch History Table  Accuracy
Prediction accuracy with programs from
SPEC89 – 2-bit prediction buffer of 4096
entries

SPEC89 benchmark   Frequency of mispredictions
nasa7              1%
matrix300          0%
tomcatv            1%
doduc              5%
spice              9%
fpppp              9%
gcc                12%
espresso           5%
eqntott            18%
li                 10%

2.3/110

2.3/111

Dynamic Branch Prediction 
General Scheme
• In the general case, one could use an
n-bit branch behaviour recorder and a
branch history table of 2^m entries
• In this case
 A change occurs every 2^(n-1) mispredictions
 There is a higher chance that not too many
branch addresses be associated with the same
BHT entry
 Larger memory penalty

2.3/112

D.B.P.  Comparing the 2-bit with
the General Case

2.3/113

Dynamic Branch Prediction
Schemes
• One-bit prediction buffer
 Good, but with limited accuracy
• Two-bit prediction buffer
 Very good, greater accuracy, slightly higher
overhead
• Infinite-bit prediction buffer
 As good as the two-bit one, but with a very
large overhead
• Correlating predictors

2.3/114

Dynamic Branch Prediction 
Correlated predictors
• Two-level predictors
• If the behaviour of a branch is correlated
to the behaviour of another branch,
no single-level predictor would be able to
capture its behaviour
• Example:
if (aa == 2)
aa = 0;
if (bb == 2)
bb = 0;
if (aa != bb) {
…
• If we keep track of the recent behaviour
of other previous branches, our accuracy
may increase

2.3/115

Dynamic Branch Prediction 
Correlated predictors
• A simpler example:
if (d == 0) d = 1;
if (d == 1) …
• In DLX, this is
      BNEZ R1, L1       ; b1 ( d != 0 )
      MOV  R1, #1
L1:   SUBI R3, R1, #1
      BNEZ R3, L2       ; b2 ( d != 1 )
      ...
L2:   ...

2.3/116

Dynamic Branch Prediction 
Correlated predictors
• In DLX, this is
      BNEZ R1, L1       ; b1 ( d != 0 )
      MOV  R1, #1
L1:   SUBI R3, R1, #1
      BNEZ R3, L2       ; b2 ( d != 1 )
      ...
L2:   ...
• Let us assume that d is 0, 1 or 2

Initial value of d   d==0?   b1        Value of d before b2   d==1?   b2
0                    Yes     Untaken   1                      Yes     Untaken
1                    No      Taken     1                      Yes     Untaken
2                    No      Taken     2                      No      Taken

2.3/117

Dynamic Branch Prediction 
Correlated predictors
• The table of outcomes for d = 0, 1, 2 means that
(b1 == untaken) => (b2 == untaken)
• A one-bit predictor may not be able to
capture this property and behave very
badly

2.3/118

Dynamic Branch Prediction 
Correlated predictors
• Let us suppose that d alternates between 2 and 0
• This is the table for the one-bit predictor:

d    b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
2    NT        T           T             NT        T           T
0    T         NT          NT            T         NT          NT
2    NT        T           T             NT        T           T
0    T         NT          NT            T         NT          NT

• ALL branches are mispredicted!

Dynamic Branch Prediction 
Correlated predictors
• Correlated predictor: example:
• Every branch, say branch number j>1, has
two separate prediction bits
 First bit: predictor used if branch j-1 was NT
 Second bit: otherwise

• At the end of branch j-1:
Behaviour_j_min_1 = taken ? 1 : 0;
• At the beginning of branch j:
predict branch as (
BHT [ Behaviour_j_min_1 ] [ entry ] );
• At the end of branch j
If (branch was mispredicted)
BHT [ Behaviour_j_min_1 ] [ entry ] = 1 - BHT [ Behaviour_j_min_1 ] [ entry ]

2.3/119
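
The difference between the plain one-bit scheme and this correlated scheme can be seen on the fragment "if (d == 0) d = 1; if (d == 1) ..." with d alternating between 2 and 0. The sketch below is an illustration only: all predictor bits and the "previous branch" flag are assumed to start as not taken, and the names (count_mispredictions, B1, B2) are hypothetical.

```c
#include <assert.h>

enum { B1 = 0, B2 = 1 };   /* hypothetical BHT entries of the two branches */

/* Counts mispredictions over `iterations` runs of the fragment, with
 * d = 2, 0, 2, 0, ...  correlated = 0: one bit per branch;
 * correlated = 1: the bit is selected by the previous branch outcome. */
int count_mispredictions(int correlated, int iterations)
{
    assert(iterations >= 0);
    int bht[2][2] = {{0, 0}, {0, 0}};  /* [previous outcome][entry], 0 = NT */
    int prev = 0;                      /* outcome of the previous branch */
    int misses = 0;
    for (int it = 0; it < iterations; it++) {
        int d = (it % 2 == 0) ? 2 : 0;
        for (int entry = B1; entry <= B2; entry++) {
            int taken;
            if (entry == B1) {
                taken = (d != 0);      /* b1: BNEZ R1, L1 */
                if (!taken)
                    d = 1;             /* fall-through: MOV R1, #1 */
            } else {
                taken = (d != 1);      /* b2: BNEZ R3, L2 */
            }
            int sel = correlated ? prev : 0;
            if (bht[sel][entry] != taken) {
                misses++;
                bht[sel][entry] = 1 - bht[sel][entry];
            }
            prev = taken;
        }
    }
    return misses;
}
```

Over four iterations (eight branches) the one-bit scheme mispredicts every branch, while the correlated scheme mispredicts only the two branches of the first iteration.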

2.3/120

Dynamic Branch Prediction 
Correlated predictors
• The behaviour of the previous branch
selects a one-bit branch predictor
• If the prediction is not OK, its state is
flipped

2.3/121

Dynamic Branch Prediction 
Correlated predictors
• We may also consider the last TWO
branches
 The behaviour of these two branches selects,
e.g., a one-bit predictor
 (NT NT, NT T, T NT, T T) -> (0-3) -> BHT [0..3]
 This is called a (2,1) predictor

 Or, the behaviour of the last two branches
selects an n-bit predictor
 This is a (2, n) predictor

2.3/122

Dynamic Branch Prediction 
Correlated predictors
A (2,2) predictor: A 2-bit branch history entry selects
a 2-bit predictor

2.3/123

Dynamic Branch Prediction 
Correlated predictors
• General case: (m, n) predictors
 Consider the last m branches and their 2^m
possible values
 This m-tuple selects an n-bit predictor
 A change in the prediction only occurs after 2^(n-1)
mispredictions
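
The memory cost of an (m, n) predictor follows directly from this description: each branch-selected entry holds 2^m predictors of n bits each. A minimal sketch, with a hypothetical function name and a hypothetical `entries` parameter for the number of entries selected by the branch address:

```c
#include <assert.h>

/* Bits of state in an (m, n) predictor: 2^m predictors of n bits
 * for each of the `entries` rows selected by the branch address. */
unsigned long predictor_bits(unsigned m, unsigned n, unsigned long entries)
{
    assert(m < 16 && n >= 1);   /* the sketch keeps m small to avoid overflow */
    return (1UL << m) * n * entries;
}
```

For example, a plain two-bit buffer with 4096 entries is a (0,2) predictor and needs 8192 bits; a (2,2) predictor with 1024 entries needs the same 8192 bits.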

2.3/124

Dynamic Branch Prediction 
Branch-Target Buffer
• A run-time technique to reduce the
branch penalty
• In DLX, it is possible to “predict” the new
PC, via a branch prediction buffer, during
the second stage of the pipeline
• With a Branch-Target Buffer (BTB), the
new PC can be derived during the first
stage of the pipeline

2.3/125

Dynamic Branch Prediction 
Branch-Target Buffer
• The BTB is a branch-prediction cache that
stores the target addresses of taken branches
• An associative array which works as
follows:
(instruction address) -> (branch target address)
• In case of a hit, we know the predicted
instruction address one cycle earlier w.r.t.
the branch prediction buffer
• Fetching begins immediately at the
predicted PC

2.3/126

Dynamic Branch Prediction 
Branch-Target Buffer
• Design issues:
 The entire address must be used
(correspondence must be one-to-one)
 Limited number of entries in the BTB
 Most frequently used

 BTB requires a number of actions to be
executed during the first pipeline stage, also in
order to update the state of the buffer
 The pipeline management gets more complex and
the clock cycle duration may have to be
increased
Dynamic Branch Prediction 
Branch-Target Buffer

2.3/127

• Total branch penalty for a BTB
• Assumptions: penalties are as follows

Instruction is in buffer   Prediction   Actual branch   Penalty cycles
Yes                        Taken        Taken           0
Yes                        Taken        Untaken         2
No                         *            Taken           2

• Prediction accuracy: 90%
• Hit rate in buffer: 90%
• Taken branch frequency: 60%

Dynamic Branch Prediction 
Branch-Target Buffer
• Branch penalty =
Percent buffer hit rate (90%) x
Percent incorrect predictions (10%) x
Penalty (2)
+ (1 - Percent buffer hit rate) (10%) x
Percent taken branches (60%) x
Penalty (2) =
90%x10%x2 + 10%x60%x2 = 0.18+0.12=
0.30 clock cycles (vs. 0.50 for delayed br.)

2.3/128
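
The branch-penalty formula can be written as a small function to try other hit rates or penalties. This is a sketch; the function and parameter names are hypothetical, and all rates are given as fractions between 0 and 1.

```c
#include <assert.h>

/* Expected branch penalty (clock cycles per branch) with a BTB:
 * hit and mispredicted, plus miss and taken. */
double btb_branch_penalty(double hit_rate, double mispredict_rate,
                          double taken_freq, double penalty_cycles)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * mispredict_rate * penalty_cycles
         + (1.0 - hit_rate) * taken_freq * penalty_cycles;
}
```

Calling btb_branch_penalty(0.90, 0.10, 0.60, 2.0) evaluates 0.9 x 0.1 x 2 + 0.1 x 0.6 x 2, i.e. the 0.30 clock cycles of the slide, up to floating-point rounding.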

Dynamic Branch Prediction 
Branch-Target Buffer
• The same approach can be applied to
procedure return addresses
• Example:
0x4ABC CALL 0x30A0
0x4AC0 …
…
0x4CF4 CALL 0x30A0
0x4CF8 …
…
[Figure: the procedure at 0x30A0 associated with the return addresses 0x4AC0 and 0x4CF8]
• Associative arrays of stacks
• If cache is large enough, all return
addresses are predicted correctly

2.3/129
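
A return-address predictor can be sketched as a small stack: pushed on every CALL, popped on every return. The code below is a simplification (a single stack rather than the associative array of stacks mentioned above); the depth and all names (ras_t, ras_push, ras_pop) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RAS_DEPTH 8   /* hypothetical depth of the stack */

typedef struct {
    uint32_t addr[RAS_DEPTH];
    int top;          /* number of valid entries */
} ras_t;

/* On a CALL: remember the address of the instruction after the call. */
void ras_push(ras_t *s, uint32_t return_addr)
{
    assert(s != 0);
    if (s->top == RAS_DEPTH) {            /* full: drop the oldest entry */
        for (int i = 1; i < RAS_DEPTH; i++)
            s->addr[i - 1] = s->addr[i];
        s->top--;
    }
    s->addr[s->top++] = return_addr;
}

/* On a return: predict the most recent return address
 * (0 if the stack is empty, i.e. no prediction). */
uint32_t ras_pop(ras_t *s)
{
    assert(s != 0);
    return (s->top > 0) ? s->addr[--s->top] : 0;
}
```

With the example above, the call at 0x4ABC pushes 0x4AC0 and the call at 0x4CF4 pushes 0x4CF8, so each return from 0x30A0 is predicted to the correct caller as long as the nesting depth does not exceed the stack depth.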

2.3/130

Parallelism
• Introduction to parallel processing
• Instruction level parallelism
 Introduction
 VLIW
 Advanced pipelining techniques
 Superscalar

Superscalar architectures
• So far, the goal was reaching the ideal
CPI = 1
• Further increasing performance by having
CPI < 1 is the goal of
superscalar processors (SP)
• To reach this goal, SPs issue multiple
instructions in the same clock cycle
• Multiple-issue processors
 VLIW (seen already)
 SP
 Statically scheduled (compiler)
 Dynamically scheduled (HW;
Scoreboarding/Tomasulo)

• In SP, a varying # of instructions is
issued, depending on structural limits and
dependencies
2.3/131

2.3/132

Superscalar architectures
• Superscalar version of DLX
• At most two instructions per clock cycle
can be issued:
1. One of: load, store (integer or FP), branch,
integer ALU operation
2. A FP ALU operation
• IF and ID operate on 64 bits of
instructions
• Multiple independent FPU are available

2.3/133

Superscalar architectures
• The superscalar DLX is indeed a sort of
“bidimensional pipeline”:
Instruction type   Pipe stages
Integer Instr.     IF   ID   EX   MEM  WB
FP Instr.          IF   ID   EX   MEM  WB
Integer Instr.          IF   ID   EX   MEM  WB
FP Instr.               IF   ID   EX   MEM  WB
Integer Instr.               IF   ID   EX   MEM  WB
FP Instr.                    IF   ID   EX   MEM  WB
Integer Instr.                    IF   ID   EX   MEM  WB
FP Instr.                         IF   ID   EX   MEM  WB

Superscalar architectures
• Every new solution breeds new problems..
• Latencies!
• When the latency of the load is 1:
 In the “monodimensional pipeline”, one cannot
use the result of the load in the current and
next cycle:
[Figure: single pipeline P issuing LD, NOP, LDc]
 In the bidimensional pipeline of SP, this means
a loss of three instruction slots:
[Figure: pipeline Pi issuing LD, NOP, LDc’ and pipeline Pfp issuing NOP, NOP, LDc]

2.3/134

2.3/135

Superscalar architectures
• Let us consider again the following loop:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop
• Let us perform unrolling (x5) + scheduling
on the Superscalar DLX:

2.3/136

Superscalar architectures

        Integer              FP                 Cycle
Loop:   LD F0, 0(R1)                            1
        LD F6, -8(R1)                           2
        LD F10, -16(R1)      ADDD F4,F0,F2      3
        LD F14, -24(R1)      ADDD F8,F6,F2      4
        LD F18, -32(R1)      ADDD F12,F10,F2    5
        SD 0(R1), F4         ADDD F16,F14,F2    6
        SD -8(R1), F8        ADDD F20,F18,F2    7
        SD -16(R1), F12                         8
        SD -24(R1), F16                         9
        SUBI R1, R1, #40                        10
        BNEZ R1, Loop                           11
        SD 8(R1), F20                           12   ; offset 8 = -32 + 40, since R1 was already decremented

• 12 clock cycles per 5 iterations = 2.4 cc/i

Superscalar architectures
• Superscalar = 2.4 cc/i vs normal = 3.5 cc/i
• But in the example there were not enough
FP instructions to keep the FP pipeline in
use
 From cycle 8 to cycle 12 and for the first two
cycles, each cycle holds just one instruction
• How to get more?
 Dynamic scheduling for SP
 Multicycle extension of the Tomasulo algorithm

2.3/137

Superscalar architectures and the
Tomasulo algorithm
• Idea: employing separate data structures
for the Integer and the FP registers
 Integer Reservation Stations (IRS)
 FP Reservation Stations (FRS)
• In the same cycle, issue a FP (to a FRS)
and an integer instruction (to a IRS)
• Note: issuing does not mean executing!
 Possible dependencies might serialize the two
instructions issued in parallel
• Dual issue is obtained by
pipelining the instruction-issue stage
so that it runs twice as fast

2.3/138

Superscalar architectures
• Multiple issue strategy’s inherent
limitations:
 The amount of ILP may be limited (see loop
p.134)
 Extra HW is required
 Multiple FPU and IU
 More complex (-> slower) design
 Extra need for large memory and register-file
bandwidth
 Increase in code size due to hard loop unrolling
 Recall: $\mathrm{CPUTIME}(p) = \dfrac{IC(p) \times CPI(p)}{\text{clock rate}}$

2.3/139

Superscalar architectures:
compiler support
• Symbolic loop unrolling
 The loop is not physically unrolled, though
reorganized, so to eliminate dependencies

• Software pipelining:
 Dependencies are eliminated by interleaving
instructions from different iterations of the loop
 Loop is not unrolled
Original loop (RAW: problematic):
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop

Software-pipelined loop (WAR: HW removable):
<startup>
Loop: SD   0(R1), F4
      ADDD F4, F0, F2
      LD   F0, -16(R1)
      SUBI R1, R1, #8
      BNEZ R1, Loop
<clean-up>

2.3/140
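
The same idea can be sketched at source level. The code below is an illustration, not the DLX code above: it walks the array upwards instead of downwards, and the names (add_scalar, add_scalar_swp) are hypothetical. Each iteration of the pipelined loop stores element i, adds for element i+1 and loads element i+2, with explicit start-up and clean-up code.

```c
#include <assert.h>

/* Plain loop: load, add and store for each element in turn. */
void add_scalar(double *x, int n, double s)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined version (this sketch requires n >= 2). */
void add_scalar_swp(double *x, int n, double s)
{
    assert(n >= 2);
    /* start-up: fill the software pipeline */
    double loaded = x[0];          /* LD for element 0 */
    double sum = loaded + s;       /* ADDD for element 0 */
    loaded = x[1];                 /* LD for element 1 */
    for (int i = 0; i < n - 2; i++) {
        x[i] = sum;                /* SD: store element i */
        sum = loaded + s;          /* ADDD: element i+1 */
        loaded = x[i + 2];         /* LD: element i+2 */
    }
    /* clean-up: drain the software pipeline */
    x[n - 2] = sum;
    x[n - 1] = loaded + s;
}
```

Inside the pipelined loop no statement depends on a value produced in the same iteration, which is what removes the RAW chain LD -> ADDD -> SD of the original loop body.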

2.3/141

Superscalar architectures:
compiler support
• Trace scheduling
• Aim: tackling the problem of too short
basic blocks
• Method:
 Trace selection
 Trace compaction

Superscalar architectures:
compiler support
• Trace selection:
 A number of contiguous basic blocks are put
together into a “trace”
 Using static branch prediction, the conditional
branches are chosen as taken/untaken, while
loop branches are considered as taken
[Figure: control-flow graph with blocks A, B, X and C and a test; the selected trace is A, B, C, followed by bookkeeping code]

2.3/142

Superscalar architectures:
compiler support
• Trace compaction:
 The resulting trace is a longer straight-line of
code
 Trace compaction: global code scheduling
[Figure: the trace A, B, C with its bookkeeping code]
Code scheduling with
a basic block whose size
is that of A + B + C

• Speculative movement of code
2.3/143

2.3/144

Superscalar architectures:
HW support
• Conditional instructions: instructions like
CMOVZ R2, R3, R1
which means
if (R1 == 0) R2 = R3;
or
(R1 == 0) ? R2 = R3 : /* NOP */;
• The instruction turns into a NOP if the
condition is not met
 This also means that no exceptions are raised!

• Using conditional instructions we convert
a control dependence (due to a branch)
into a data dependence
• Speculative transformation in a two-issue
superscalar with conditional instructions:

2.3/145

Superscalar architectures: HW
support : conditional instructions

Before:
Integer               FP                Cycle
LW R1, 40(R2)         ADDD R3,R4,R5     1
                      ADDD R6,R3,R7     2
BEQZ R10, L                             3
LW R8, 20(R10)                          4
LW R9,0(R8)                             5

After (using the conditional load LWC):
Integer               FP                Cycle
LW R1, 40(R2)         ADDD R3,R4,R5     1
LWC R8,20(R10),R10    ADDD R6,R3,R7     2
BEQZ R10, L                             3
LW R9,0(R8)                             4

We speculate on the outcome of the branch. If the
condition is not met, we don’t slow down the execution,
because we had used a slot that would otherwise be lost

Superscalar architectures: HW
support : conditional instructions
• Conditional instructions are useful to
implement short alternative control flows
• Their usefulness though is limited by
several factors:
 Conditional instructions that are annulled
still take execution time, unless they are
scheduled into waste slots
 They are good only in limited cases, when
there’s a simple alternative sequence
 Moving an instruction across multiple branches
would require double-conditional instructions!
LWCC R1, R2, R10, R12
(makes no sense)
 They require extra work w.r.t. their
“regular” version
 The extra time required for the test may require
more cycles than the regular versions

2.3/146

Superscalar architectures: HW
support : conditional instructions
• Most architectures support a few
conditional instructions (conditional
move)
• The HP PA architecture allows any
register-register instruction to turn the
next instruction into a NOP – which
makes that a conditional instruction


2.3/147

• Exceptions

2.3/148

Superscalar architectures: HW
support : conditional instructions
• Exceptions:
 Fatal (normally causing termination; e.g.,
memory protection violation)
 Resumable exceptions (causing a delay, but no
termination; e.g., page fault exception)

• Resumable exceptions can be processed
for speculative instructions just as if they
were normal instructions
 Corresponding time penalty is not considered
as incorrect

• Fatal exceptions cannot be handled by
speculative instructions, hence must be
deferred to the next non-speculative
instructions
Superscalar architectures: HW
support : conditional instructions

2.3/149

• Moving instructions across a branch
must not affect
 The (fatal) exception behaviour
 The data dependences
• How to obtain this?
1. Fatal exceptions triggered by speculative
instructions are ignored by HW and OS:
the HW and OS do handle all resumable exceptions, but
return an undefined value for any fatal
exception. The program is allowed to continue,
though this will almost certainly lead to
incorrect results
Note: scheme 1. can never cause a correct
program to fail, whether or not speculation
is used
Superscalar architectures: HW
support : conditional instructions

2.3/150

2. Poison bits: A speculative instruction does
not trigger any exception, but turns a bit on in
the involved result registers. The next “normal”
(non-speculative) instruction using those
registers will be “poisoned” -> it will cause an
exception
3. Boosting: Renaming and buffering in the HW
(similar to the Tomasulo approach)
• Speculation can be used, e.g., to
optimize an if-then-else such as
if (a==0) a = b; else a = a + 4;
or, equivalently,
a = (a==0)? b : a + 4;
Superscalar architectures: HW support: conditional instructions
• Suppose A is in 0(R3) and B in 0(R2)
• Example:
    LW   R1, 0(R3)   ; load A
    BNEZ R1, L1      ; A != 0 ? GOTO L1
    LW   R1, 0(R2)   ; load B
    J    L2          ; skip ELSE
L1: ADD  R1, R1, 4   ; ELSE part
L2: SW   0(R3), R1   ; store A
• Speculation:
    LW   R1, 0(R3)   ; load A
    LW   R9, 0(R2)   ; load B speculatively
    BEQZ R1, L3      ; A == 0 ? keep B in R9
    ADD  R9, R1, 4   ; ELSE part: here R9 is A+4
L3: SW   0(R3), R9   ; here R9 is B or A+4
• In this case, a temporary register (R9) is used
• Method 1: speculation is transparent
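The equivalence of the branching and the speculative sequence can be checked with a small model. The Python sketch below is illustrative only: the function names are hypothetical, registers are modelled as local variables, and the speculative version assumes that the ADD is skipped when A is zero, so that the speculatively loaded B is the value stored.

```python
def if_then_else(a, b):
    """Original sequence: branch around the THEN or the ELSE part."""
    r1 = a                # LW R1, 0(R3)   ; load A
    if r1 != 0:           # branch to L1 when A != 0
        r1 = r1 + 4       # L1: ADD R1, R1, 4
    else:
        r1 = b            # LW R1, 0(R2)   ; load B
    return r1             # L2: SW 0(R3), R1


def speculated(a, b):
    """Speculative sequence: B is loaded before the branch is resolved."""
    r1 = a                # LW R1, 0(R3)   ; load A
    r9 = b                # speculative load of B into a temporary register
    if r1 != 0:           # the ADD is skipped when A == 0
        r9 = r1 + 4       # ELSE part: R9 becomes A+4
    return r9             # L3: SW 0(R3), R9
```

Both functions return the value finally stored into A, so comparing them over a few inputs is enough to see that the speculative load does not change the result.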

Superscalar architectures: HW support: conditional instructions
• Method 2 applied to the previous code fragment:
    LW   R1, 0(R3)   ; load A
    LW*  R9, 0(R2)   ; load B speculatively
    BEQZ R1, L3      ; A == 0 ? keep B in R9
    ADD  R9, R1, 4   ; ELSE part: here R9 is A+4
L3: SW   0(R3), R9   ; here R9 is B or A+4
• LW* is a speculative version of LW
• LW* is an opcode that, instead of raising an exception, turns on the poison bit of register R9
• The next non-speculative instruction using R9 will be “poisoned”: it will cause an exception
• If another speculative instruction uses R9, the poison bit will be inherited
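The poison-bit rules can be sketched as a tiny register-file model. This is a minimal Python illustration under stated assumptions: the class and method names are hypothetical, memory is a dictionary, and an access to a missing address stands in for a fatal exception.

```python
class PoisonFault(Exception):
    """Raised when a normal instruction reads a poisoned register."""


class Registers:
    def __init__(self):
        self.value = {}
        self.poisoned = set()

    def load_speculative(self, dest, memory, addr):
        # LW*: a faulting access is not reported, it only poisons dest.
        if addr in memory:
            self.value[dest] = memory[addr]
            self.poisoned.discard(dest)
        else:
            self.value[dest] = None
            self.poisoned.add(dest)

    def add_speculative(self, dest, src, imm):
        # A speculative instruction inherits the poison bit of its source.
        if src in self.poisoned:
            self.value[dest] = None
            self.poisoned.add(dest)
        else:
            self.value[dest] = self.value[src] + imm
            self.poisoned.discard(dest)

    def read(self, src):
        # Normal (non-speculative) use: this is where the exception surfaces.
        if src in self.poisoned:
            raise PoisonFault(src)
        return self.value[src]
```

In this sketch the exception is deferred exactly as far as the first non-speculative read; if that read never happens (the speculation was wrong), the fault is never seen.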

Superscalar architectures: HW support: conditional instructions
• Combining speculation with dynamic scheduling
  - An attribute bit is added to each instruction (1: speculative, 0: normal)
  - When that bit is 1, the instruction is allowed to execute, but cannot enter the commit (WB) stage
  - The instruction then has to wait until the end of the speculated code
  - It will be allowed to modify the register file / memory only at the end of speculative mode

• Hence: instructions execute out-of-order, but are forced to commit in order
• A special set of buffers holds the results that have finished execution but have not committed yet (reorder buffers)

Superscalar architectures: HW
support : conditional instructions
• As neither the register values nor the
memory values are actually WRITTEN
until an instruction commits,
the processor can easily undo its
speculative actions when a branch is
found to be mispredicted
• If a speculated instruction raises an
exception, this is recorded in the reorder
buffer
• In case of branch misprediction such that
a certain speculative instruction should
not have been executed, the exception is
flushed along with the instruction when
the reorder buffer is cleared

Superscalar architectures: HW support: conditional instructions
• Reorder buffers:
  - An additional set of virtual registers that hold the results of the instructions
    - That have finished execution, but
    - Have not committed yet
  - Issue: only when both a Reservation Station and a reorder buffer are available
  - As soon as an instruction completes, its output goes into its reorder buffer
  - Until the instruction has committed, dependent instructions receive their input from the reorder buffer (the Reservation Station is freed, the reorder buffer is not)
  - The actual updating of registers takes place when the instruction reaches the top of the list of reorder buffers

Superscalar architectures: HW support: conditional instructions
• At this point the commit phase takes place:
  - Either the result is written into the register file,
  - Or, in case of a mispredicted branch, the reorder buffer is flushed and execution restarts at the correct successor of the branch

• Assumption: when a branch with incorrect prediction reaches the head of the buffer, it means that the speculation was wrong
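The in-order commit and the flush on a mispredicted branch can be sketched as follows. This is a simplified Python model, not a description of any real processor: the class, its fields and the returned strings are hypothetical, one instruction commits per call, and only register results are tracked.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # oldest instruction at the left (head)
        self.regs = {}           # architectural register file

    def issue(self, dest, is_branch=False):
        if len(self.entries) >= self.size:
            return None          # no free reorder buffer: issue stalls
        entry = {"dest": dest, "value": None, "done": False,
                 "branch": is_branch, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, mispredicted=False):
        # Execution may finish out of order; the result waits in the buffer.
        entry["value"] = value
        entry["done"] = True
        entry["mispredicted"] = mispredicted

    def commit(self):
        # Only the head of the list may commit, hence in-order commit.
        if not self.entries or not self.entries[0]["done"]:
            return "wait"
        head = self.entries.popleft()
        if head["branch"]:
            if head["mispredicted"]:
                self.entries.clear()   # undo all speculative work
                return "flush"
            return "commit"
        self.regs[head["dest"]] = head["value"]
        return "commit"
```

Because results reach `regs` only in `commit`, flushing the buffer is all that is needed to undo the instructions issued after a mispredicted branch.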

Superscalar architectures: HW support: conditional instructions
• This technique also allows tackling situations like
    if (cond) do_this ; else do_that ;
• One may “bet” on the outcome of the branch and say, e.g., it will be a taken one
• Even unlikely events do happen, so sooner or later a misprediction occurs
• Idea: let the instructions in the else part (do_that) issue and execute, with a separate list of reorder buffers (list2)
• This second list is simpler: we don’t check for the current head-of-list. Elements in there need to be explicitly removed
• In case of a misprediction, the second list already holds the executed do_that part, and we just need to perform its commit
• In case of a correct prediction, the ELSE part is purged off list2

Superscalar architectures
• If a processor A has a lower CPI than another processor B, will A always run faster than B?
• Not always!
  - A higher clock rate is indeed a deterministic measure of the performance improvement
  - A multiple issue (superscalar) architecture cannot guarantee its improvements (stochastic improvements)
  - Pushing towards a low CPI means adopting sophisticated (= complex) techniques… which slow down the clock rate!
  - Improving one aspect of a multiple-issue processor (M.I.P.) does not necessarily lead to overall performance improvements
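A worked example makes the point concrete, using the standard relation CPU time = instruction count × CPI / clock rate. The figures below are made up for illustration and do not describe any real processor.

```python
def cpu_time(instructions, cpi, clock_hz):
    """Execution time in seconds: instruction count * CPI / clock rate."""
    return instructions * cpi / clock_hz


# Hypothetical processors running the same 10^9-instruction program.
time_a = cpu_time(1e9, cpi=1.0, clock_hz=500e6)   # lower CPI, slower clock
time_b = cpu_time(1e9, cpi=1.5, clock_hz=1e9)     # higher CPI, faster clock
```

With these assumed numbers A needs 2.0 s and B needs 1.5 s: B is faster even though its CPI is 50% higher.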

Superscalar architectures
• A simple question:
  “how much ILP exists in a program?”
  or, in other words, “how much can we expect from techniques that are based on the exploitation of the ILP?”
• How to proceed:
  - Making a set of very optimistic assumptions and measuring how much parallelism is available under those assumptions

Superscalar architectures
• Assumptions (HW model of an ideal processor):
  1. Infinite # of virtual registers (-> no WAW or WAR can suspend the pipeline)
  2. All conditional branches are predicted exactly (!!)
  3. All computed jumps and returns are perfectly predicted
  4. All memory addresses are known exactly, so a store can be moved before a load, provided that the addresses are not identical
  5. Infinite issue processor
  6. No restriction about the types of instructions to be executed in a cycle (no structural hazards)
  7. All latencies are 1

Superscalar architectures
• How to match these assumptions??
• Gambling!
• We run a program and produce a trace with the outcomes of all the instances of each branch
  - Taken, Taken, Taken, Untaken, Taken, …
  - Each corresponding target address is recorded and assumed to be available
  - Then we use a simulator to mimic, e.g., an infinite virtual registers machine etc.

• Results are depicted in the next picture
• Parallelism is expressed in IPC: instruction issues per clock cycle
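The measurement idea can be sketched on a toy trace. The Python function below is an illustrative simplification, not the simulator used for the published figures: it assumes a trace given as hypothetical (destination, sources) register tuples, unit latencies, unlimited issue and unlimited renaming, so that only true (RAW) dependences delay an instruction.

```python
def ideal_ipc(trace):
    """trace: list of (dest, sources) register names, in program order."""
    ready = {}   # register -> cycle in which its latest value is produced
    last = 0
    for dest, sources in trace:
        # With unlimited renaming, WAW and WAR hazards never delay issue:
        # an instruction waits only for the producers of its sources.
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        if dest is not None:
            ready[dest] = cycle
        last = max(last, cycle)
    return len(trace) / last


# Hypothetical six-instruction trace.
trace = [
    ("r1", []),            # load
    ("r2", []),            # load
    ("r3", ["r1", "r2"]),  # add
    ("r4", []),            # load
    ("r5", ["r3", "r4"]),  # multiply
    ("r6", ["r1", "r4"]),  # subtract
]
```

The longest dependence chain of this trace is three instructions long, so the six instructions issue in three cycles.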
Superscalar architectures
[Figure: ILP available in an ideal processor for six SPEC benchmarks, in instruction issues per cycle]
  gcc        54.8
  espresso   62.6
  li         17.9
  fpppp      75.2
  doduc     118.7
  tomcatv   150.1
• Tomcatv reaches 150 IPC (for a particular run)

Superscalar architectures
• Then we can relax the above assumptions and introduce limitations that represent our current possibilities with computer design techniques for ILP
  - Window size: the actual range of instructions we inspect when looking for candidates for simultaneous issue
  - Realistic branch prediction
  - Finite # of registers

• See images 4-39 and 4-40
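The effect of a finite window can be sketched with a deliberately crude model. In the Python function below, which is only an assumption-laden illustration, instruction i may enter the window once instruction i - window has issued, and it issues at the earliest one cycle later; everything else is idealised (unit latencies, unlimited issue and renaming). The trace format of (destination, sources) tuples is hypothetical.

```python
def windowed_ipc(trace, window=None):
    """IPC of a trace when at most `window` instructions are inspected."""
    ready = {}     # register -> cycle in which its latest value is produced
    issued = []    # issue cycle of each instruction, in program order
    for i, (dest, sources) in enumerate(trace):
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        if window is not None and i >= window:
            # Assumed window rule: wait for instruction i - window to leave.
            cycle = max(cycle, issued[i - window] + 1)
        if dest is not None:
            ready[dest] = cycle
        issued.append(cycle)
    return len(trace) / max(issued)


# Eight mutually independent instructions (hypothetical trace).
independent = [("r%d" % i, []) for i in range(8)]
```

For this fully parallel trace the model yields an IPC equal to the window size, which mirrors the general trend: the smaller the window, the less of the available parallelism can be found.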
Superscalar architectures
[Figure: instruction issues per cycle (0 to 160) versus window size (Infinite, 2k, 512, 128, 32, 8, 4) for gcc, espresso, li, fpppp, doduc and tomcatv]

Superscalar architectures
[Figure: instruction issues per cycle for each benchmark as the window size shrinks]
  Window size   Infinite   512   128    32     8     4
  gcc                 55    10    10     8     4     3
  espresso            63    15    13     8     4     3
  li                  18    12    11     9     4     3
  fpppp               75    49    35    14     5     3
  doduc              119    16    15     9     4     3
  tomcatv            150    45    34    14     6     3

Superscalar architectures: concluding notes
• In the next 10 years it is realistic to reach an architecture that looks like this:
  - 64 instruction issues per clock cycle
  - Selective predictor, 1K entries, 16-entry return predictor
  - Perfect disambiguation of memory references
  - Register renaming with 64 + 64 extra registers

• Computer architectures in practice: Section 4.8 (PowerPC 620)
Superscalar architectures: concluding notes
• Reachable performance
[Figure: instruction issues per cycle (0 to 60) versus window size (Infinite, 256, 128, 64, 32, 16, 8, 4) for gcc, espresso, li, fpppp, doduc and tomcatv]


Pipelining and communications
• Suppose that N+1 processes need to
communicate a private value to all the
others
• They use all the values to produce next
output (e.g., for voting)
• Communication is fully synchronous and
needs to be repeated m times, m large

...

Pipelining and communications
• Let us assume that no bus is available
• Point-to-point communication
• Processes are numbered p0…pN
• Two instructions are available
  - Send (pj, value)
  - Receive (pj, &value)

• Blocking functions
• If the receiver is ready to receive, they last one stage time, otherwise they block the caller for a multiple of the stage time
• Sending and receiving occur at discrete time steps

Pipelining and communications
• At each time t, processor pi may be
  - Sending data (next stage pi is unblocked)
  - Receiving data (next stage pi is unblocked)
  - Blocked in a Receive()
  - Blocked in a Send()

• Slot = time corresponding to an entire stage time
• At each time t we have N+1 slots (a slot per process)
• If pi is blocked, its slot is wasted (it’s a “bubble”)
• Otherwise the slot is used

Pipelining and communications
• At each time t, processor pi may be in
  - State S(j): sending data to processor pj
  - State R(j): receiving data from pj
  - State WR(j): blocked in a Receive(pj, …)
  - State WS(j): blocked in a Send(pj, …)

• We use the formalism
    proc st proc’
  to indicate that, at time t, proc is in state s with proc’
• For instance
    p1 WR(4)21 p3
  means that the 21st slot of p1 is wasted waiting for p3 to send its value to it
Pipelining and communications
• The following algorithm is executed by process j:
  - Before gaining the right to broadcast, process j needs to go through j couples of states (WR, R)
  - Ordered broadcast: the k-th message to be sent goes to process pk
  - Finally, process j goes through N-j couples of states (WR, R)
Pipelining and communications
• p is a vector of indices
• For process j, p can be any arrangement of the integers 0, 1, …, j-1, j+1, …, N
• Whatever the arrangement, the algorithm works correctly
• For instance, if N = 4 (5 processes) and j = 1, then p can be any permutation of 0, 2, 3, and 4
• p determines the order in which process j sends its value to its neighbours
• Example: p[] = [ 3, 2, 0, 4 ]. Then p1 executes:
    send(p3), send(p2), send(p0), send(p4)
Pipelining and communications
• Example: p[] = ordered permutation
  - Ex: N=5 and pj uses p = [ 0, …, j-1, j+1, …, N ]
[Figure: slot usage over time, showing the duration, the frequencies of used slots, the slots wasted in send and the slots wasted in receive]
Pipelining and communications
• Case N = 20, p[] = ordered permutation
• Gray = wasted slots
• Black = used slots
• In general, duration is
• Used slots / total # of slots
• Average # used slots during one stage time
• This image reminds us of another one:
Pipelining and communications
[Figure: timeline from 6 PM to 2 AM in 30-minute slots, with tasks A, B, C and D executed one after the other]
No pipelining: Many slots are wasted!
Pipelining and communications
• Let us now consider the case in which processor k uses
    p[] = [ k+1, k+2, …, N, 0, 1, …, k-1 ]

Pipelining and communications
• Duration: first case vs. second case

Pipelining and communications
• Efficiency: first case vs. second case
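The two cases can be compared with a small simulation. The Python sketch below rests on several assumptions that are not spelled out in the slides: process j receives from p0…p(j-1) in index order, then sends according to its arrangement, then receives from p(j+1)…pN; a Send and the matching Receive complete together in one slot; and efficiency is taken as used slots over (N+1) × duration. All function names are hypothetical.

```python
def ordered(j, n):
    """First case: 0, 1, ..., j-1, j+1, ..., n."""
    return [k for k in range(n + 1) if k != j]


def pipelined(j, n):
    """Second case: j+1, ..., n, 0, ..., j-1."""
    return [(j + d) % (n + 1) for d in range(1, n + 1)]


def script(j, n, arrangement):
    """Operations of process j: j receives, n sends, n-j receives."""
    before = [("R", k) for k in range(j)]
    sends = [("S", k) for k in arrangement(j, n)]
    after = [("R", k) for k in range(j + 1, n + 1)]
    return before + sends + after


def simulate(n, arrangement):
    """Return (duration in slots, efficiency) for n+1 processes."""
    scripts = [script(j, n, arrangement) for j in range(n + 1)]
    pc = [0] * (n + 1)          # next operation of each process
    steps = used = 0
    while any(pc[j] < len(scripts[j]) for j in range(n + 1)):
        current = [scripts[j][pc[j]] if pc[j] < len(scripts[j]) else None
                   for j in range(n + 1)]
        matched = []
        for i, op in enumerate(current):
            # A send from i to k succeeds only if k is receiving from i.
            if op is not None and op[0] == "S" and current[op[1]] == ("R", i):
                matched += [i, op[1]]
        if not matched:
            raise RuntimeError("deadlock")
        for j in matched:
            pc[j] += 1          # every other process wastes this slot
        used += len(matched)
        steps += 1
    return steps, used / ((n + 1) * steps)
```

Under these assumptions, four processes (N = 3) need 11 slots with the ordered arrangement and 9 with the rotated one, and the gap widens as N grows.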
Pipelining and communications
• Algorithm of pipelined broadcast
  - Beginning of steady state
  - Every 10 slots, 5 mark the completion of a broadcast
  - Throughput = 1 / (2 t) (t = 1 slot): a full broadcast is finished every 2 t
• The image may remind us of another one…
Pipelining (slide P2.2/20)
[Figure: pipelined timeline from 6 PM to 2 AM in 30-minute slots, with tasks A, B, C and D overlapped]
• Between 7.30 and 9.30pm, a whole job is completed every 30’
• During that period, each worker is permanently at work…
• …but a new input must arrive within 30’

More Related Content

What's hot

Advanced computer architecture unit 5
Advanced computer architecture  unit 5Advanced computer architecture  unit 5
Advanced computer architecture unit 5Kunal Bangar
 
Pl9ch1
Pl9ch1Pl9ch1
Pl9ch1Ved Ed
 
[2016/2017] Architectural languages
[2016/2017] Architectural languages[2016/2017] Architectural languages
[2016/2017] Architectural languagesIvano Malavolta
 
Comso c++
Comso c++Comso c++
Comso c++Mi L
 
Top schools in noida
Top schools in noidaTop schools in noida
Top schools in noidaEdhole.com
 
07 software design
07   software design07   software design
07 software designkebsterz
 
Programming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwaresProgramming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwaresNisarg Amin
 
The Concurrency Challenge : Notes
The Concurrency Challenge : NotesThe Concurrency Challenge : Notes
The Concurrency Challenge : NotesSubhajit Sahu
 
Software Patterns
Software PatternsSoftware Patterns
Software Patternskim.mens
 
Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-Krishna Sai
 
Session01 basics programming
Session01 basics programmingSession01 basics programming
Session01 basics programmingHarithaRanasinghe
 
Software Design 1: Coupling & cohesion
Software Design 1: Coupling & cohesionSoftware Design 1: Coupling & cohesion
Software Design 1: Coupling & cohesionAttila Magyar
 
Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...
Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...
Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...Università degli Studi dell'Aquila
 
Oose unit 5 ppt
Oose unit 5 pptOose unit 5 ppt
Oose unit 5 pptDr VISU P
 

What's hot (19)

Advanced computer architecture unit 5
Advanced computer architecture  unit 5Advanced computer architecture  unit 5
Advanced computer architecture unit 5
 
Pl9ch1
Pl9ch1Pl9ch1
Pl9ch1
 
[2016/2017] Architectural languages
[2016/2017] Architectural languages[2016/2017] Architectural languages
[2016/2017] Architectural languages
 
Comso c++
Comso c++Comso c++
Comso c++
 
Top schools in noida
Top schools in noidaTop schools in noida
Top schools in noida
 
07 software design
07   software design07   software design
07 software design
 
WEBSITE DEVELOPMENT
WEBSITE DEVELOPMENTWEBSITE DEVELOPMENT
WEBSITE DEVELOPMENT
 
C 1
C 1C 1
C 1
 
Programming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwaresProgramming languages,compiler,interpreter,softwares
Programming languages,compiler,interpreter,softwares
 
The Concurrency Challenge : Notes
The Concurrency Challenge : NotesThe Concurrency Challenge : Notes
The Concurrency Challenge : Notes
 
Assembly language
Assembly languageAssembly language
Assembly language
 
Software Patterns
Software PatternsSoftware Patterns
Software Patterns
 
Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-Principles of-programming-languages-lecture-notes-
Principles of-programming-languages-lecture-notes-
 
Session01 basics programming
Session01 basics programmingSession01 basics programming
Session01 basics programming
 
Software Design 1: Coupling & cohesion
Software Design 1: Coupling & cohesionSoftware Design 1: Coupling & cohesion
Software Design 1: Coupling & cohesion
 
Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...
Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...
Domain-specific Modeling and Code Generation for Cross-platform Mobile and Io...
 
Languages in computer
Languages in computerLanguages in computer
Languages in computer
 
01 overview
01 overview01 overview
01 overview
 
Oose unit 5 ppt
Oose unit 5 pptOose unit 5 ppt
Oose unit 5 ppt
 

Similar to Advanced Computer Architectures – Part 2.3

Advanced Computer Architectures – Part 1
Advanced Computer Architectures – Part 1Advanced Computer Architectures – Part 1
Advanced Computer Architectures – Part 1Vincenzo De Florio
 
Advanced Computer Architectures – Part 2.1
Advanced Computer Architectures – Part 2.1Advanced Computer Architectures – Part 2.1
Advanced Computer Architectures – Part 2.1Vincenzo De Florio
 
Computer System Architecture Lecture Note 1: introduction
Computer System Architecture Lecture Note 1: introductionComputer System Architecture Lecture Note 1: introduction
Computer System Architecture Lecture Note 1: introductionBudditha Hettige
 
Intro to Microsoft.NET
Intro to Microsoft.NET Intro to Microsoft.NET
Intro to Microsoft.NET rchakra
 
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...Design World
 
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.pptL14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.pptAronBalais1
 
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.pptL14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.pptAxmedMaxamuud6
 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Marcirio Chaves
 
Software engineering
Software engineeringSoftware engineering
Software engineeringFahe Em
 
Software engineering
Software engineeringSoftware engineering
Software engineeringFahe Em
 
Multicore_Architecture Book.pdf
Multicore_Architecture Book.pdfMulticore_Architecture Book.pdf
Multicore_Architecture Book.pdfSwatantraPrakash5
 
Computer and multimedia Week 1 Windows Architecture.pptx
Computer and multimedia Week 1 Windows Architecture.pptxComputer and multimedia Week 1 Windows Architecture.pptx
Computer and multimedia Week 1 Windows Architecture.pptxfatahozil
 
Unit 2 computer software
Unit 2 computer softwareUnit 2 computer software
Unit 2 computer softwareHardik Patel
 

Similar to Advanced Computer Architectures – Part 2.3 (20)

Advanced Computer Architectures – Part 1
Advanced Computer Architectures – Part 1Advanced Computer Architectures – Part 1
Advanced Computer Architectures – Part 1
 
Advanced Computer Architectures – Part 2.1
Advanced Computer Architectures – Part 2.1Advanced Computer Architectures – Part 2.1
Advanced Computer Architectures – Part 2.1
 
slides8.ppt
slides8.pptslides8.ppt
slides8.ppt
 
Computer System Architecture Lecture Note 1: introduction
Computer System Architecture Lecture Note 1: introductionComputer System Architecture Lecture Note 1: introduction
Computer System Architecture Lecture Note 1: introduction
 
Ch06lect1 ud
Ch06lect1 udCh06lect1 ud
Ch06lect1 ud
 
Pyconuk2011
Pyconuk2011Pyconuk2011
Pyconuk2011
 
SS UI Lecture 1
SS UI Lecture 1SS UI Lecture 1
SS UI Lecture 1
 
Ss ui lecture 1
Ss ui lecture 1Ss ui lecture 1
Ss ui lecture 1
 
Intro to Microsoft.NET
Intro to Microsoft.NET Intro to Microsoft.NET
Intro to Microsoft.NET
 
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
 
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.pptL14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
 
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.pptL14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
L14_DesignGoalsSubsystemDecompositionc_ch06lect1.ppt
 
Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1Tutorial on Parallel Computing and Message Passing Model - C1
Tutorial on Parallel Computing and Message Passing Model - C1
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
Lecture 10
Lecture 10Lecture 10
Lecture 10
 
Ch05
Ch05Ch05
Ch05
 
Multicore_Architecture Book.pdf
Multicore_Architecture Book.pdfMulticore_Architecture Book.pdf
Multicore_Architecture Book.pdf
 
Computer and multimedia Week 1 Windows Architecture.pptx
Computer and multimedia Week 1 Windows Architecture.pptxComputer and multimedia Week 1 Windows Architecture.pptx
Computer and multimedia Week 1 Windows Architecture.pptx
 
Unit 2 computer software
Unit 2 computer softwareUnit 2 computer software
Unit 2 computer software
 

More from Vincenzo De Florio

Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...Vincenzo De Florio
 
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...Vincenzo De Florio
 
Service-oriented Communities: A Novel Organizational Architecture for Smarter...
Service-oriented Communities: A Novel Organizational Architecture for Smarter...Service-oriented Communities: A Novel Organizational Architecture for Smarter...
Service-oriented Communities: A Novel Organizational Architecture for Smarter...Vincenzo De Florio
 
On codes, machines, and environments: reflections and experiences
On codes, machines, and environments: reflections and experiencesOn codes, machines, and environments: reflections and experiences
On codes, machines, and environments: reflections and experiencesVincenzo De Florio
 
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...Vincenzo De Florio
 
How Resilient Are Our Societies? Analyses, Models, Preliminary Results
How Resilient Are Our Societies?Analyses, Models, Preliminary ResultsHow Resilient Are Our Societies?Analyses, Models, Preliminary Results
How Resilient Are Our Societies? Analyses, Models, Preliminary ResultsVincenzo De Florio
 
Advanced C Language for Engineering
Advanced C Language for EngineeringAdvanced C Language for Engineering
Advanced C Language for EngineeringVincenzo De Florio
 
A framework for trustworthiness assessment based on fidelity in cyber and phy...
A framework for trustworthiness assessment based on fidelity in cyber and phy...A framework for trustworthiness assessment based on fidelity in cyber and phy...
A framework for trustworthiness assessment based on fidelity in cyber and phy...Vincenzo De Florio
 
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015Vincenzo De Florio
 
A behavioural model for the discussion of resilience, elasticity, and antifra...
A behavioural model for the discussion of resilience, elasticity, and antifra...A behavioural model for the discussion of resilience, elasticity, and antifra...
A behavioural model for the discussion of resilience, elasticity, and antifra...Vincenzo De Florio
 
Considerations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali AnaniConsiderations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali AnaniVincenzo De Florio
 
A Behavioral Interpretation of Resilience and Antifragility
A Behavioral Interpretation of Resilience and AntifragilityA Behavioral Interpretation of Resilience and Antifragility
A Behavioral Interpretation of Resilience and AntifragilityVincenzo De Florio
 
Community Resilience: Challenges, Requirements, and Organizational Models
Community Resilience: Challenges, Requirements, and Organizational ModelsCommunity Resilience: Challenges, Requirements, and Organizational Models
Community Resilience: Challenges, Requirements, and Organizational ModelsVincenzo De Florio
 
On the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
On the Behavioral Interpretation of System-Environment Fit and Auto-ResilienceOn the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
On the Behavioral Interpretation of System-Environment Fit and Auto-ResilienceVincenzo De Florio
 
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...Vincenzo De Florio
 
Service-oriented Communities and Fractal Social Organizations - Models and co...
Service-oriented Communities and Fractal Social Organizations - Models and co...Service-oriented Communities and Fractal Social Organizations - Models and co...
Service-oriented Communities and Fractal Social Organizations - Models and co...Vincenzo De Florio
 
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013Vincenzo De Florio
 
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMINGTOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMINGVincenzo De Florio
 
A Formal Model and an Algorithm for Generating the Permutations of a Multiset
A Formal Model and an Algorithm for Generating the Permutations of a MultisetA Formal Model and an Algorithm for Generating the Permutations of a Multiset
A Formal Model and an Algorithm for Generating the Permutations of a MultisetVincenzo De Florio
 




Advanced Computer Architectures – Part 2.3

  • 1. Advanced Computer Architectures – HB49 – Part 2.3 Vincenzo De Florio K.U.Leuven / ESAT / ELECTA
  • 2. © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/2 Course contents • Basic Concepts Computer Design • Computer Architectures for AI • Computer Architectures in Practice
  • 3. Computer Design • Quantitative assessments • Instruction sets • Pipelining • Parallelism
  • 4. Parallelism • Introduction to parallel processing • Instruction level parallelism • (Data level parallelism): Part 3 • (Task level parallelism): Part 3
  • 5. Parallelism • Introduction to parallel processing - Basic concepts: granularity, program, process, thread, language aspects - Types of parallelism • Instruction level parallelism
  • 6. Parallelism • Introduction to parallel processing - Basic concepts: granularity, program, process, thread - Types of parallelism • Instruction level parallelism
  • 7. Granularity • Definition: - granularity is the complexity/grain size of some item - e.g. computation item (instruction), data item (scalar, array, struct), communication item (token granularity), hardware building block (gate, RTL component) • [Figure: granularity scale from low to high: RISC (e.g. add r1,r2,r4), CISC (e.g. ld *a0++,r1), High Level Languages, HLLs (e.g. x = sin(y)), application-specific (e.g. edge-det.invert.image)]
  • 8. Granularity • Deciding the granularity is an important design choice • E.g. grain size for the communication tokens in a parallel computer: - coarse grain: less communication overhead - fine grain: less time penalty when two communication packets compete for transmission over the same channel and collide
  • 9. Parallelism • Introduction to parallel processing - Basic concepts: granularity, program, process, thread - Types of parallelism • Instruction level parallelism
  • 10. Types of parallelism • Functional parallelism (important for the exam!) - Different computations have to be performed on the same or different data - E.g. multiple users submit jobs to the same computer, or a single user submits multiple jobs to the same computer: this is functional parallelism at the process level, taken care of at run-time by the OS
  • 11. Types of parallelism • Data parallelism (important for the exam!) - The same computations have to be performed on a whole set of data - E.g. 2D convolution of an image - This is data parallelism at the loop level: consecutive loop iterations are candidates for parallel execution, subject to inter-iteration data dependencies - Often leads to a massive amount of parallelism
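As a small illustration of data parallelism at the loop level, the sketch below (Python, not part of the original slides) computes the 2D convolution example pixel by pixel. The names `convolve_pixel` and `convolve2d` are hypothetical, and the kernel is applied without flipping (cross-correlation form), which is an assumption made to keep the code short. Each output pixel reads only the input image, so the iterations carry no inter-iteration dependencies and can be visited in any order, or in parallel.

```python
def convolve_pixel(image, kernel, row, col):
    """Compute one output pixel; it only reads the input image."""
    total = 0
    for i, kernel_row in enumerate(kernel):
        for j, weight in enumerate(kernel_row):
            total += image[row + i][col + j] * weight
    return total


def convolve2d(image, kernel, reverse=False):
    """'Valid' 2D convolution (no kernel flip, no padding).

    `reverse` only changes the order in which output pixels are visited,
    to show that the iterations are independent of each other.
    """
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    coords = [(r, c) for r in range(out_h) for c in range(out_w)]
    if reverse:
        coords.reverse()
    out = [[0] * out_w for _ in range(out_h)]
    for r, c in coords:
        out[r][c] = convolve_pixel(image, kernel, r, c)
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    # Same result whatever the visiting order of the output pixels.
    print(convolve2d(image, kernel))
    print(convolve2d(image, kernel, reverse=True))
```

Because the visiting order does not matter, the loop over `coords` is the kind of loop whose iterations a compiler or parallel machine may distribute freely.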
  • 12. Levels of parallelism • Instruction level parallelism (ILP) - Functional parallelism at the instruction level - Example: pipelining • Data level parallelism (DLP) - Data parallelism at the loop level • Process & thread level parallelism (TLP) - Functional parallelism at the thread and process level
  • 13. Parallelism • Introduction to parallel processing • Instruction level parallelism - Introduction - VLIW - Advanced pipelining techniques - Super scalar
  • 14. Parallelism • Introduction to parallel processing • Instruction level parallelism - Introduction - VLIW - Advanced pipelining techniques - Super scalar
  • 15. Type of Instruction Level Parallelism utilization • Sequential instruction issuing, sequential instruction execution: von Neumann processors • [Figure: one instruction word feeding a single execution unit (EU)]
  • 16. Type of Instruction Level Parallelism utilization • Sequential instruction issuing, parallel instruction execution: pipelined processors • [Figure: one instruction word feeding EU1, EU2, EU3, EU4]
  • 17. Type of Instruction Level Parallelism utilization • Parallel instruction issuing, determined at compile time by the compiler, and parallel instruction execution: VLIW (Very Long Instruction Word) processors • [Figure: one instruction word feeding EU1, EU2, EU3, EU4]
  • 18. Type of Instruction Level Parallelism utilization • Parallel instruction issuing, determined at run time by a HW dispatch unit, and parallel instruction execution: super-scalar processors (to be seen later) • [Figure: instruction window feeding EU1, EU2, EU3, EU4]
  • 19. Type of Instruction Level Parallelism utilization • Most processors provide sequential execution semantics - regardless of how the processor actually executes the instructions (sequentially or in parallel, in-order or out-of-order), the result is the same as sequential execution in the order in which they were written • VLIW and IA-64 provide parallel execution semantics - the assembly code indicates explicitly which instructions are executed in parallel
  • 20. Parallelism • Introduction to parallel processing • Instruction level parallelism - Introduction - VLIW - Advanced pipelining techniques - Super scalar
  • 21. VLIW • [Figure: VLIW architecture. Main instruction memory, 128 bit instruction cache, 128 bit instruction register, four decoders (Dec, 32 bit each) producing 256 decoded bits each for four EUs; register file (32 bit each; 8 read ports, 4 write ports); cache/RAM (32 bit each; 2 read ports, 1 write port); main data memory (32 bit; 1 bi-directional port)]
  • 22. VLIW • Properties - Multiple Execution Units: multiple instructions issued in one clock cycle - Every EU requires 2 operands and delivers one result every clock cycle: high data memory bandwidth needed - Careful design of data memory hierarchy - Register file with many ports - Large register file: 64-256 registers - Carefully balanced cache/RAM hierarchy with decreasing number of ports and increasing memory size and access time for the higher levels (IMEC research: DTSE)
  • 23. VLIW • Properties - The compiler should determine which instructions can be issued in a single cycle without control dependency or data dependency conflicts - Deterministic utilization of parallelism: good for hard real-time - Compile-time analysis of source code: worst case analysis instead of actual case - Very sophisticated compilers are needed, especially when the EUs are pipelined! They have performed well since early 2000
  • 24. VLIW • Properties - The compiler should determine which instructions can be issued in a single cycle without control dependency or data dependency conflicts - Very difficult to write assembly: the programmer should resolve all control flow conflicts, all data flow conflicts and all pipelining conflicts, and at the same time fit data accesses into the available data memory bandwidth and all program accesses into the available program memory bandwidth - e.g. 2 weeks for a sum-of-products (3 lines of C code) • All high end DSP processors since 1999 are VLIW processors (examples: Philips Trimedia for high end TV, TI TMS320C6x for GSM base stations and ISP modem arrays)
  • 25. Low power DSP • [Figure: the VLIW architecture (main instruction memory, 128 bit instruction cache, 128 bit instruction register, four 32 bit decoders, 256 decoded bits each, four EUs, register file with 8 read ports and 4 write ports, 2 read ports and 1 write port towards data memory), annotated: too much power dissipation in fetching wide instructions]
  • 26. Low power DSP • [Figure: main instruction memory (Main IMem) and instruction cache (ICache) are 24 bit wide; an instruction expansion stage widens each 24 bit instruction to the 128 bit instruction register, e.g. ADD4 is expanded into ADD || ADD || ADD || ADD; decoders, EUs and register file as in the VLIW architecture]
  • 27. Low power DSP • Properties - Power consumption in program memory is reduced by specializing the instructions for the application - Not all combinations of all instructions for the EUs are possible, but only a limited set, i.e. those combinations that lead to a substantial speed-up of the application - Those relevant combinations are represented by the smallest possible number of bits, to reduce program memory width and hence program memory power consumption - Can only be done for embedded DSP applications: the processor is specialized for 1 application (examples: TI TMS320C54x for GSM mobile phones, TI TMS320C55x for UMTS mobile phones)
  • 28. Low power DSP for interactive multimedia • Run-time reconfiguration allows the specialization to be adapted to changing application requirements • [Figure: as the low power DSP architecture (24 bit Main IMem and ICache, 128 bit instruction register, four 32 bit decoders, 256 decoded bits each), but with a reconfigurable instruction expansion stage and four reconfigurable EUs (REU)]
  • 29. Parallelism • Introduction to parallel processing • Instruction level parallelism - Introduction - VLIW - Advanced pipelining techniques - Super scalar
  • 30. Advanced Pipelining • Pipeline CPI is the result of many components - $\mathrm{CPUTIME}(p) = \dfrac{\mathrm{IC}(p) \times \mathrm{CPI}(p)}{\text{clock rate}}$ • A number of techniques act on one or more of these components: - Loop unrolling - Scoreboarding - Dynamic branch prediction - Speculation - … • To be seen later
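The CPU time relation on this slide can be tried out with a few lines of Python. This is only a sketch: the function name `cpu_time` and all numeric values (instruction count, CPI values, clock rate) are hypothetical and chosen for illustration, not taken from the course.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    ic = 1_000_000          # hypothetical instruction count
    clock_rate = 100e6      # hypothetical 100 MHz clock
    before = cpu_time(ic, 1.5, clock_rate)   # hypothetical CPI with many stalls
    after = cpu_time(ic, 1.2, clock_rate)    # hypothetical CPI with fewer stalls
    print("before:", before, "s")
    print("after: ", after, "s")
    print("speed-up:", before / after)
```

With IC and clock rate held constant, the speed-up equals the ratio of the two CPI values, which is why the techniques listed on the slide concentrate on removing stall cycles.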
  • 31. Advanced Pipelining • Until now, instruction-level parallelism was sought within the boundaries of a basic block (BB) • A BB is 6-7 instructions on average - too small to reach the expected performance • What is worse, there is a big chance that these instructions have dependencies - Even less performance can be expected
  • 32. Advanced Pipelining • To obtain more, we need to go beyond the BB limitation: • We must exploit ILP across multiple BBs • Simplest way: loop level parallelism (LLP): - Exploiting the parallelism among iterations of a loop • Converting LLP into ILP - Loop unrolling - Statically (compiler-based) - Dynamically (HW-based) • Using vector instructions - Does not require LLP -> ILP conversion
  • 33. Advanced Pipelining • The efficiency of the conversion depends - On the amount of ILP available - On latencies of the functional units in the pipeline - On the ability to avoid pipeline stalls by separating dependent instructions by a "distance" (in terms of stages) equal to the latency peculiar to the source instruction • E.g. LW x, … followed by INSTR …, x: a load must not be followed by the immediate use of the load destination register
  • 34. Advanced Pipelining: Loop unrolling • Assumptions and steps • 1. We assume the following latencies:
      Producer instruction   Consumer instruction   Latency
      FP ALU OP              FP ALU OP              3
      FP ALU OP              STORE DBL              2
      LOAD DBL               FP ALU OP              1
      LOAD DBL               STORE DBL              0
  • 35. Advanced Pipelining: Loop unrolling • 2. We assume we work with a simple loop such as for (I=1; I<=1000; I++) x[I] = x[I] + s; • Note: each iteration is independent of the others - Very simple case
  • 36. Advanced Pipelining: Loop unrolling • 3. Translated into DLX, this simple loop looks like this:
      ; assumptions: R1 = &x[1000]
      ;              F2 = s
      Loop: LD    F0, 0(R1)     ; F0 = x[I]
            ADDD  F4, F0, F2    ; F4 = F0 + s
            SD    0(R1), F4     ; store result
            SUBI  R1, R1, #8    ; R1 = R1 - 8 bytes (one double, i.e. I = I - 1)
            BNEZ  R1, Loop      ; if (R1) goto Loop
      [Annotation labels on the listing: W, O]
  • 37. Advanced Pipelining: Loop unrolling • 4. Tracing the loop (no scheduling!):
      Loop: LD    F0, 0(R1)     ; 1
            stall               ; 2
            ADDD  F4, F0, F2    ; 3
            stall               ; 4
            stall               ; 5
            SD    0(R1), F4     ; 6
            SUBI  R1, R1, #8    ; 7
            BNEZ  R1, Loop      ; 8
            stall               ; 9
      • 9 clock cycles per iteration, with 4 stalls
  • 38. Advanced Pipelining: Loop unrolling • 5. With scheduling, we move from
      Loop: LD    F0, 0(R1)
            ADDD  F4, F0, F2
            SD    0(R1), F4
            SUBI  R1, R1, #8
            BNEZ  R1, Loop
      to
      Loop: LD    F0, 0(R1)
            ADDD  F4, F0, F2
            SUBI  R1, R1, #8
            BNEZ  R1, Loop
            SD    8(R1), F4     ; offset is now 8: R1 has already been decremented
      whose trace shows that fewer cycles are wasted:
  • 39. Advanced Pipelining: Loop unrolling • 6. Tracing the loop (with scheduling!):
      Loop: LD    F0, 0(R1)     ; 1
            stall               ; 2
            ADDD  F4, F0, F2    ; 3
            SUBI  R1, R1, #8    ; 4
            BNEZ  R1, Loop      ; 5
            SD    8(R1), F4     ; 6
      [Annotation labels on the listing: O, O]
      • 6 clock cycles per iteration, with 1 stall • 3 fewer stalls! • Still, the useful cycles are just 3 • How to gain more?
  • 40. Advanced Pipelining: Loop unrolling • 7. With loop unrolling: replicating the body of the loop multiple times
      Loop: LD    F0, 0(R1)
            ADDD  F4, F0, F2
            SD    0(R1), F4       ; skip SUBI and BNEZ
            LD    F6, -8(R1)      ; F6 vs. F0
            ADDD  F8, F6, F2      ; F8 vs. F4
            SD    -8(R1), F8      ; skip SUBI and BNEZ
            LD    F10, -16(R1)    ; F10 vs. F0
            ADDD  F12, F10, F2    ; F12 vs. F4
            SD    -16(R1), F12    ; skip SUBI and BNEZ
            LD    F14, -24(R1)    ; F14 vs. F0
            ADDD  F16, F14, F2    ; F16 vs. F4
            SD    -24(R1), F16    ; skip SUBI and BNEZ
            SUBI  R1, R1, #32     ; R1 = R1 - 32 bytes (four doubles, i.e. I = I - 4)
            BNEZ  R1, Loop
      • Spared 3 x (SUBI + BNEZ)
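The same transformation can be sketched at source level. The Python below is an illustration only, not part of the original slides: the function names are hypothetical, and the cleanup loop for element counts that are not a multiple of 4 is an addition that the slide's 1000-element example does not need.

```python
def add_scalar_rolled(x, s):
    """One element per iteration: one loop test and index update per element."""
    result = list(x)
    for i in range(len(result)):
        result[i] = result[i] + s
    return result


def add_scalar_unrolled4(x, s):
    """Body replicated 4 times: one loop test and index update per 4 elements."""
    result = list(x)
    n = len(result)
    main = n - n % 4                  # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        result[i] = result[i] + s
        result[i + 1] = result[i + 1] + s
        result[i + 2] = result[i + 2] + s
        result[i + 3] = result[i + 3] + s
    for i in range(main, n):          # cleanup loop for the leftover elements
        result[i] = result[i] + s
    return result


if __name__ == "__main__":
    x = [float(i) for i in range(1000)]
    print(add_scalar_rolled(x, 2.5) == add_scalar_unrolled4(x, 2.5))
```

The four statements in the unrolled body touch different elements, which corresponds to the distinct registers (F0/F4, F6/F8, F10/F12, F14/F16) in the DLX listing.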
  • 41. Advanced Pipelining: Loop unrolling • Loop unrolling: replicating the body of the loop multiple times - Some branches are eliminated - The ratio w/o increases - The BB artificially increases its size - Higher probability of optimal scheduling - Requires a wider set of registers and adjusting the address offsets of the loads and stores • In the given example, every operation is followed by a dependent instruction - This will cause a stall • Trace of the unscheduled unrolled loop: 27 cycles - 2 per LD, 3 per ADDD, 2 per branch, 1 per any other instruction - 27 / 4 ≈ 6.8 clock cycles per iteration - Pure scheduling is better! (6 cycles)
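The cycle counts quoted in the traces (9, 6 and 27) can be reproduced with a small model. The Python sketch below is not from the original slides and rests on stated assumptions: an in-order, single-issue pipeline; the producer/consumer latencies of the table given earlier; integer results (SUBI) usable in the next cycle; and one branch delay slot that costs a stall when nothing is scheduled into it. Names such as `count_cycles` and `iteration` are hypothetical.

```python
# Extra cycles required between a producer and a dependent consumer
# (from the latency table in the slides; other pairs are assumed to be 0).
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}


def count_cycles(program):
    """Clock cycles for one pass over `program`.

    `program` is a list of (kind, dest, sources) tuples. Assumes in-order,
    single-issue execution and one branch delay slot.
    """
    produced = {}   # register -> (kind of producer, cycle it issued in)
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1
        for reg in sources:
            if reg in produced:
                p_kind, p_cycle = produced[reg]
                issue = max(issue, p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        cycle = issue
        if dest is not None:
            produced[dest] = (kind, cycle)
    if program and program[-1][0] == "branch":
        cycle += 1  # unfilled branch delay slot counts as a stall
    return cycle


def iteration(src, dst):
    """LD / ADDD / SD for one element, using registers `src` and `dst`."""
    return [
        ("load", src, ("R1",)),
        ("fp_alu", dst, (src, "F2")),
        ("store", None, (dst, "R1")),
    ]


LOOP_END = [("int_alu", "R1", ("R1",)), ("branch", None, ("R1",))]

unscheduled = iteration("F0", "F4") + LOOP_END
scheduled = [
    ("load", "F0", ("R1",)),
    ("fp_alu", "F4", ("F0", "F2")),
    ("int_alu", "R1", ("R1",)),
    ("branch", None, ("R1",)),
    ("store", None, ("F4", "R1")),   # SD moved into the branch delay slot
]
unrolled = (iteration("F0", "F4") + iteration("F6", "F8")
            + iteration("F10", "F12") + iteration("F14", "F16") + LOOP_END)

if __name__ == "__main__":
    print(count_cycles(unscheduled))     # 9
    print(count_cycles(scheduled))       # 6
    print(count_cycles(unrolled))        # 27
    print(count_cycles(unrolled) / 4)    # 6.75 cycles per original iteration
```

The model ignores structural hazards and register reuse conflicts, so it should be read as a way to check the slide arithmetic rather than as a description of a real DLX pipeline.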
  • 42. Advanced Pipelining → Loop unrolling
      • Unrolled loop plus scheduling
            Loop: LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4       ; skip SUBI and BNEZ
                  LD    F6, -8(R1)      ; F6 vs. F0
                  ADDD  F8, F6, F2      ; F8 vs. F4
                  SD    -8(R1), F8      ; skip SUBI and BNEZ
                  LD    F10, -16(R1)    ; F10 vs. F0
                  ADDD  F12, F10, F2    ; F12 vs. F4
                  SD    -16(R1), F12    ; skip SUBI and BNEZ
                  LD    F14, -24(R1)    ; F14 vs. F0
                  ADDD  F16, F14, F2    ; F16 vs. F4
                  SD    -24(R1), F16    ; skip SUBI and BNEZ
                  SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements x 8 bytes)
                  BNEZ  R1, Loop
  • 43. Advanced Pipelining → Loop unrolling
      • Unrolled loop plus scheduling
            Loop: LD    F0, 0(R1)
                  LD    F6, -8(R1)      ; F6 vs. F0
                  ADDD  F4, F0, F2
                  SD    0(R1), F4       ; skip SUBI and BNEZ
                  ADDD  F8, F6, F2      ; F8 vs. F4
                  SD    -8(R1), F8      ; skip SUBI and BNEZ
                  LD    F10, -16(R1)    ; F10 vs. F0
                  ADDD  F12, F10, F2    ; F12 vs. F4
                  SD    -16(R1), F12    ; skip SUBI and BNEZ
                  LD    F14, -24(R1)    ; F14 vs. F0
                  ADDD  F16, F14, F2    ; F16 vs. F4
                  SD    -24(R1), F16    ; skip SUBI and BNEZ
                  SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements x 8 bytes)
                  BNEZ  R1, Loop
  • 44. Advanced Pipelining → Loop unrolling
      • Unrolled loop plus scheduling
            Loop: LD    F0, 0(R1)
                  LD    F6, -8(R1)      ; F6 vs. F0
                  LD    F10, -16(R1)    ; F10 vs. F0
                  ADDD  F4, F0, F2
                  SD    0(R1), F4       ; skip SUBI and BNEZ
                  ADDD  F8, F6, F2      ; F8 vs. F4
                  SD    -8(R1), F8      ; skip SUBI and BNEZ
                  ADDD  F12, F10, F2    ; F12 vs. F4
                  SD    -16(R1), F12    ; skip SUBI and BNEZ
                  LD    F14, -24(R1)    ; F14 vs. F0
                  ADDD  F16, F14, F2    ; F16 vs. F4
                  SD    -24(R1), F16    ; skip SUBI and BNEZ
                  SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements x 8 bytes)
                  BNEZ  R1, Loop
  • 45. Advanced Pipelining → Loop unrolling
      • Unrolled loop plus scheduling
            Loop: LD    F0, 0(R1)
                  LD    F6, -8(R1)      ; F6 vs. F0
                  LD    F10, -16(R1)    ; F10 vs. F0
                  LD    F14, -24(R1)    ; F14 vs. F0
                  ADDD  F4, F0, F2
                  SD    0(R1), F4       ; skip SUBI and BNEZ
                  ADDD  F8, F6, F2      ; F8 vs. F4
                  SD    -8(R1), F8      ; skip SUBI and BNEZ
                  ADDD  F12, F10, F2    ; F12 vs. F4
                  SD    -16(R1), F12    ; skip SUBI and BNEZ
                  ADDD  F16, F14, F2    ; F16 vs. F4
                  SD    -24(R1), F16    ; skip SUBI and BNEZ
                  SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements x 8 bytes)
                  BNEZ  R1, Loop
  • 46. Advanced Pipelining → Loop unrolling
      • Unrolled loop plus scheduling
      • Enough distance to prevent the dependency from turning into a hazard
            Loop: LD    F0, 0(R1)
                  LD    F6, -8(R1)      ; F6 vs. F0
                  LD    F10, -16(R1)    ; F10 vs. F0
                  LD    F14, -24(R1)    ; F14 vs. F0
                  ADDD  F4, F0, F2
                  ADDD  F8, F6, F2      ; F8 vs. F4
                  ADDD  F12, F10, F2    ; F12 vs. F4
                  ADDD  F16, F14, F2    ; F16 vs. F4
                  SD    0(R1), F4       ; skip SUBI and BNEZ
                  SD    -8(R1), F8      ; skip SUBI and BNEZ
                  SD    -16(R1), F12    ; skip SUBI and BNEZ
                  SD    -24(R1), F16    ; skip SUBI and BNEZ
                  SUBI  R1, R1, #32     ; R1 = R1 - 32 (4 elements x 8 bytes)
                  BNEZ  R1, Loop
      • 14 clock cycles, or 3.5 clock cycles / iteration
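At source level, the same transformation can be pictured as follows. This is a minimal sketch, assuming (as the assembly suggests) that the loop walks down an array and adds the scalar held in F2 to every element; the function names are hypothetical, and the unrolled version only handles lengths that are a multiple of 4, mirroring the slide's example.

```python
def add_scalar_rolled(x, s):
    """One element per iteration: load, add, store, decrement, branch."""
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + s
        i -= 1            # plays the role of SUBI R1, R1, #8
    return out


def add_scalar_unrolled4(x, s):
    """Four elements per iteration: one decrement and one branch per four updates."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes a trip count that is a multiple of 4")
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + s
        out[i - 1] = out[i - 1] + s
        out[i - 2] = out[i - 2] + s
        out[i - 3] = out[i - 3] + s
        i -= 4            # plays the role of SUBI R1, R1, #32
    return out


data = [float(k) for k in range(8)]
print(add_scalar_rolled(data, 1.5) == add_scalar_unrolled4(data, 1.5))
```

The four updates in the unrolled body are independent of each other, which is what gives the scheduler room to separate each load from the addition that consumes it.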
  • 47. Advanced Pipelining → Loop unrolling
      • Unrolling the loop exposes more computation that can be scheduled to minimize the stalls
      • Unrolling increases the BB; as a result, a better choice can be made for scheduling
      • A useful technique with two key requirements:
         → Understanding how an instruction depends on another
         → Understanding how to change or reorder the instructions, given the dependencies
      • In what follows we concentrate on dependencies.
  • 48. Loop unrolling: dependencies
      • Again, let $(I_k)_{1 \le k \le IC(p)}$ be the ordered series of instructions executed during the run of program $p$
      • Given two instructions, $I_i$ and $I_j$, with $i < j$, we say that $I_j$ is dependent on $I_i$ ($I_i \to I_j$) iff
         → $R(I_i) \cap D(I_j) \neq \emptyset$
            ($R$ is the range and $D$ the domain of a given instruction: $I_i$ produces a result which is consumed by $I_j$)
           or
         → $\exists\, n \in \{1, \dots, IC(p)\}$ and $\exists\, k_1 < k_2 < \dots < k_n$ such that $I_i \to I_{k_1} \to I_{k_2} \to \dots \to I_{k_n} \to I_j$
  • 49. Loop unrolling: dependencies
      • $(I_i, I_{k_1}, I_{k_2}, \dots, I_{k_n}, I_j)$ is called a dependency (transitive) chain
      • Note that a dependency chain can be as long as the entire execution of $p$
      • A hazard implies dependency
      • Dependency does not imply a hazard!
      • Scheduling tries to place dependent instructions in places where no hazard can occur
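The definition of dependency and of dependency chains can be tried out on the loop body used throughout this part. The sketch below is an illustration under simplifying assumptions: each instruction is described only by the registers it writes (its range) and reads (its domain), memory is ignored, and, like the definition it follows, it does not account for a register being overwritten between producer and consumer. The names `Instr`, `direct_dependences` and `transitive_closure` are hypothetical.

```python
from collections import namedtuple

# writes ~ R(I), the results produced; reads ~ D(I), the operands consumed.
Instr = namedtuple("Instr", "text writes reads")


def direct_dependences(trace):
    """Pairs (i, j), i < j, such that I_i produces a register that I_j consumes."""
    deps = set()
    for i, producer in enumerate(trace):
        for j in range(i + 1, len(trace)):
            if producer.writes & trace[j].reads:
                deps.add((i, j))
    return deps


def transitive_closure(deps, n):
    """Add (i, j) whenever a chain I_i -> I_k1 -> ... -> I_j exists (Warshall's algorithm)."""
    reach = set(deps)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return reach


loop_body = [
    Instr("LD   F0, 0(R1)",   {"F0"}, {"R1"}),
    Instr("ADDD F4, F0, F2",  {"F4"}, {"F0", "F2"}),
    Instr("SD   0(R1), F4",   set(),  {"R1", "F4"}),
    Instr("SUBI R1, R1, #8",  {"R1"}, {"R1"}),
    Instr("BNEZ R1, Loop",    set(),  {"R1"}),
]

direct = direct_dependences(loop_body)
closure = transitive_closure(direct, len(loop_body))
print(sorted(direct))
print(sorted(closure - direct))
```

Here the store depends on the load only through the chain LD, ADDD, SD, so the pair (0, 2) appears in the closure but not among the direct dependences.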
  • 50. Loop unrolling: dependencies
      • For instance:
            SUBI  R1, R1, #8
            BNEZ  R1, Loop
      • This is clearly a dependence, but it does not result in a hazard
         → Forwarding eliminates the hazard
      • Another example:
            LD    F0, 0(R1)
            ADDD  F4, F0, F2
      • This is a data dependency which does lead to a hazard and a stall
  • 51. Loop unrolling: dependencies
      • Dealing with data dependencies
      • Two classes of methods:
         1. Keeping the dependence though avoiding the hazard (via scheduling)
         2. Eliminating a dependence by transforming the code
  • 52. Loop unrolling: dependencies
      • Class 2 implies more work
      • These are optimization methods used by the compilers
      • Detecting dependencies when only using registers is easy; the difficulties come from detecting dependencies in memory:
      • For instance, 100(R4) and 20(R6) may point to the same memory location
      • Also the opposite situation may take place:
            LD   20(R4), R2
            …
            ADD  R3, R1, 20(R4)
      • If R4 changes, this is no dependency
  • 53. Loop unrolling: dependencies
      • $I_i \to I_j$ means that $I_i$ produces a result that is consumed by $I_j$
      • When there is no such production, e.g., $I_i$ and $I_j$ are both loads or stores, we call this a name dependency
      • Two types of name dependencies (with $I_i$ preceding $I_j$):
         → Antidependence: corresponds to WAR hazards
            $I_i$ reads $x$; $I_j$ writes $x$ (reordering implies an error)
         → Output dependence: corresponds to WAW hazards
            $I_i$ writes $x$; $I_j$ writes $x$ (reordering implies an error)
      • No value is transferred between the instructions
      • Register renaming solves the problem
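The three kinds of register dependency can be told apart mechanically from the registers each instruction writes and reads. The following is a minimal sketch; the `Instr` record and the `classify` helper are illustrative names, and memory operands are ignored.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "text writes reads")


def classify(first, second):
    """Dependencies of `second` on `first`, where `first` comes earlier in program order."""
    kinds = []
    if first.writes & second.reads:
        kinds.append("RAW")   # true data dependency: a value is transferred
    if first.reads & second.writes:
        kinds.append("WAR")   # antidependence: only the register name is shared
    if first.writes & second.writes:
        kinds.append("WAW")   # output dependence: only the register name is shared
    return kinds


ld_a = Instr("LD   F0, 0(R1)",  {"F0"}, {"R1"})
addd = Instr("ADDD F4, F0, F2", {"F4"}, {"F0", "F2"})
ld_b = Instr("LD   F0, -8(R1)", {"F0"}, {"R1"})

print(classify(ld_a, addd))   # load feeds the addition
print(classify(addd, ld_b))   # the second load reuses F0 while ADDD still reads it
print(classify(ld_a, ld_b))   # both loads target F0
```

Only the first pair transfers a value; the other two conflicts disappear as soon as the second load is given a different destination register.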
  • 54. Loop unrolling: dependencies
      • Register renaming: if the register name is changed, the conflict disappears
      • This technique can be either static (and done by the compiler) or dynamic (done by the HW)
      • Let us consider again the following loop:
            Loop: LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4
                  SUBI  R1, R1, #8
                  BNEZ  R1, Loop
      • Let us perform unrolling w/o renaming:
  • 55. Loop unrolling: dependencies
            Loop: LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4
                  LD    F0, -8(R1)
                  ADDD  F4, F0, F2
                  SD    -8(R1), F4
                  LD    F0, -16(R1)
                  ADDD  F4, F0, F2
                  SD    -16(R1), F4
                  LD    F0, -24(R1)
                  ADDD  F4, F0, F2
                  SD    -24(R1), F4
                  SUBI  R1, R1, #32
                  BNEZ  R1, Loop
      • The reuse of F0 and F4 across the copies (yellow arrows on the slide) creates name dependencies. To solve them, we perform renaming
  • 56. Loop unrolling: dependencies
            Loop: LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4
                  LD    F6, -8(R1)
                  ADDD  F8, F6, F2
                  SD    -8(R1), F8
                  LD    F0, -16(R1)
                  ADDD  F4, F0, F2
                  SD    -16(R1), F4
                  LD    F0, -24(R1)
                  ADDD  F4, F0, F2
                  SD    -24(R1), F4
                  SUBI  R1, R1, #32
                  BNEZ  R1, Loop
  • 57. Loop unrolling: dependencies
            Loop: LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4
                  LD    F6, -8(R1)
                  ADDD  F8, F6, F2
                  SD    -8(R1), F8
                  LD    F10, -16(R1)
                  ADDD  F12, F10, F2
                  SD    -16(R1), F12
                  LD    F14, -24(R1)
                  ADDD  F16, F14, F2
                  SD    -24(R1), F16
                  SUBI  R1, R1, #32
                  BNEZ  R1, Loop
  • 58. Loop unrolling: dependencies
            Loop: LD    F0, 0(R1)
                  ADDD  F4, F0, F2
                  SD    0(R1), F4
                  LD    F6, -8(R1)
                  ADDD  F8, F6, F2
                  SD    -8(R1), F8
                  LD    F10, -16(R1)
                  ADDD  F12, F10, F2
                  SD    -16(R1), F12
                  LD    F14, -24(R1)
                  ADDD  F16, F14, F2
                  SD    -24(R1), F16
                  SUBI  R1, R1, #32
                  BNEZ  R1, Loop
      • The dependencies that remain (yellow arrows on the slide) are data dependencies. To solve them, we reorder the instructions
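The renaming steps shown on the last few slides can be mimicked by a small routine. This is a sketch under stated assumptions, not the compiler's actual algorithm: instructions are reduced to (opcode, destination, sources) triples, displacements are left out, the first write to a register keeps its name, and every later write takes the next name from a caller-supplied pool of registers assumed to be free. `rename_registers` is a hypothetical helper.

```python
def rename_registers(instrs, free_regs):
    """Give every re-definition of a register a fresh name and patch later reads."""
    free = iter(free_regs)   # registers assumed unused elsewhere in the loop
    current = {}             # original name -> name holding its latest value
    renamed = []
    for op, dest, srcs in instrs:
        # Sources read the most recent definition of each register.
        srcs = tuple(current.get(reg, reg) for reg in srcs)
        if dest is not None:
            # First definition keeps its name; later ones take a fresh register.
            current[dest] = next(free) if dest in current else dest
            dest = current[dest]
        renamed.append((op, dest, srcs))
    return renamed


# Unrolled body without renaming (FP part only).
body = [("LD", "F0", ("R1",)), ("ADDD", "F4", ("F0", "F2")), ("SD", None, ("R1", "F4"))]
unrolled = body * 4

renamed = rename_registers(unrolled, ["F6", "F8", "F10", "F12", "F14", "F16"])
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

With this pool the routine reproduces the register assignment of the slides (F0/F4, F6/F8, F10/F12, F14/F16); if the pool runs out, `next` raises `StopIteration`, which a fuller version would have to handle.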
  • 59. Loop unrolling: dependencies
      • A third class of dependencies is that of control dependencies
      • Example:
            if (p1) s1;
            if (p2) s2;
        then
            $p_1 \to_c s_1$ ($s_1$ is control dependent on $p_1$)
            $p_2 \to_c s_2$ ($s_2$ is control dependent on $p_2$)
      • Clearly $\neg (p_1 \to_c s_2)$, that is, $s_2$ is not control dependent on $p_1$
  • 60. Loop unrolling: dependencies
      • Two properties are critical to control dependency:
         → Exception behaviour
         → Data flow
      • Exception behaviour: suppose we have the following excerpt:
                  BEQZ  R2, L1
                  DIVI  R1, 8(R2)
            L1:   …
      • We may be able to move the DIVI to before the BEQZ without violating the sequential semantics of the program
      • Suppose the branch is taken. Normally one would simply need to undo the DIVI
      • What if DIVI triggers a DIVBYZERO exception?
  • 61. Loop unrolling: dependencies
      • Two properties are critical to control dependency:
         → Exception behaviour
         → Data flow
      • Data flow must be preserved
      • Let us consider the following excerpt:
                  ADD   R1, R2, R3
                  BEQZ  R4, L
                  SUB   R1, R5, R6
            L:    OR    R7, R1, R8
      • Value of R1 depends on the control flow
      • The OR depends on both ADD and SUB
      • Also depends on the nature of the branch
      • R1 = (taken)? ADD.. : SUB..
  • 62. Loop Level Parallelism
      • Let us consider the following loop:
            for (I=1; I<=100; I++) {
                A[I+1] = A[I] + C[I];    /* S1 */
                B[I+1] = B[I] + A[I+1];  /* S2 */
            }
      • S1 has a loop-carried dependency (LCD): iteration I+1 is dependent on iteration I: A’ = f(A)
      • S2 is B’ = f(B, A’)
      • If a loop has only non-LCDs, then it is possible to execute more than one loop iteration in parallel, as long as the dependencies within each iteration are not violated
  • 63. Loop Level Parallelism
      • What to do in the presence of LCDs?
      • Loop transformations. Example:
            for (I=1; I<=100; I++) {
                A[I+1] = A[I] + B[I];  /* S1 */
                B[I+1] = C[I] + D[I];  /* S2 */
            }
      • A’ = f(A, B)
        B’ = f(C, D)
      • Note: no dependencies except LCDs
        Instructions can be swapped!
  • 64. Loop Level Parallelism
      • What to do in the presence of LCDs?
      • Loop transformations. Example:
            for (I=1; I<=100; I++) {
                A[I+1] = A[I] + B[I];  /* S1 */
                B[I+1] = C[I] + D[I];  /* S2 */
            }
      • Note: the flow, i.e.,
            A0 B0
            C0 D0
            A1 B1
            C1 D1
            A2 B2
            C2 D2
            ...
        can be changed into
            A0 B0
            C0 D0
            A1 B1
            C1 D1
            A2 B2
            ...
        [Diagram: the two columns differ in how the operand pairs are grouped into loop iterations]
  • 65. Loop Level Parallelism
            for (i=1; i <= 100; i=i+1) {
                A[i] = A[i] + B[i];      /* S1 */
                B[i+1] = C[i] + D[i];    /* S2 */
            }
        becomes
            A[1] = A[1] + B[1];
            for (i=1; i <= 99; i=i+1) {
                B[i+1] = C[i] + D[i];
                A[i+1] = A[i+1] + B[i+1];
            }
            B[101] = C[100] + D[100];
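That the two versions compute the same arrays can be checked directly. The sketch below transcribes both loops into Python, keeping the 1-based indices of the slide by using lists of length 102 with element 0 unused; the input values are arbitrary test data, not taken from the course.

```python
def original(A, B, C, D):
    A, B = list(A), list(B)
    for i in range(1, 101):
        A[i] = A[i] + B[i]          # S1: uses B[i] written by the previous iteration
        B[i + 1] = C[i] + D[i]      # S2
    return A, B


def transformed(A, B, C, D):
    A, B = list(A), list(B)
    A[1] = A[1] + B[1]              # first S1, peeled off the loop
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]              # S2 of iteration i
        A[i + 1] = A[i + 1] + B[i + 1]      # S1 of iteration i + 1
    B[101] = C[100] + D[100]        # last S2, peeled off the loop
    return A, B


n = 102  # indices 0..101; index 0 is unused
A0 = [3 * k for k in range(n)]
B0 = [k * k for k in range(n)]
C0 = [7 - k for k in range(n)]
D0 = [2 * k + 1 for k in range(n)]

print(original(A0, B0, C0, D0) == transformed(A0, B0, C0, D0))
```

In the transformed loop every value consumed in an iteration is produced in that same iteration, so distinct iterations no longer depend on each other.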
  • 66. Loop Level Parallelism
      • Before:  A’ = f(A, B)       After:  B’ = f(C, D)
                 B’ = f(C, D)               A’ = f(A’, B’)
      • Now we have dependencies but no more LCDs!
        It is possible to execute more than one loop iteration in parallel, as long as the dependencies within each iteration are not violated
  • 67. Dependency avoidance
      1. “Batch” approaches: at compile time, the compiler schedules the instructions in order to minimize the dependencies (static scheduling)
      2. “Interactive” approaches: at run-time, the HW rearranges the instructions in order to minimize the stalls (dynamic scheduling)
      • Advantages of 2:
         → Only approach when dependencies are only known at run-time (pointers etc.)
         → The compiler can be simpler
         → Given an executable compiled for a machine with machine level X and pipeline organization Y, it can run efficiently on another machine with the same machine level but a different pipeline organization Z
  • 68. Dynamic Scheduling
      • Static scheduling: compiler techniques for scheduling (rearranging) the instructions
         → So as to separate dependent instructions
         → And hence minimize unsolvable hazards causing unavoidable stalls
      • Dynamic scheduling: HW-based, run-time techniques
      • A dynamically scheduled processor does not try to remove true data dependencies (which would be impossible): it tries to avoid stalling when dependencies are present
      • The two techniques can be used together
  • 69. Dynamic Scheduling: General Idea
      • If an instruction is stalled in the pipeline, no later instruction can proceed
      • A dependence between two instructions close to each other causes a stall
      • A stall means that, even though there may be idle functional units that could potentially serve other instructions, those units have to stay idle
      • Example:
            DIVD  F0, F2, F4
            ADDD  F10, F0, F8
            SUBD  F12, F8, F14
      • ADDD depends on DIVD; but SUBD does not. Despite this, it is not issued!
  • 70. Dynamic Scheduling: General Idea
      • So SUBD is not issued even though there might be a functional unit ready to perform the requested operation
      • Big performance limitation!
      • What are the reasons that lead to this problem?
      • In-order instruction issuing and execution: instructions issue and execute one at a time, one after the other
  • 71. Dynamic Scheduling: General Idea
      • Example: in DLX, the issue of an instruction occurs at ID (instruction decode)
      • In DLX, ID checks for absence of structural hazards and waits for the absence of data hazards
      • These two steps may be made distinct
  • 72. Dynamic Scheduling: General Idea
      • The issue process gets divided into two parts:
         1. Checking for structural hazards
         2. Waiting for the absence of a data hazard
      • Instructions are issued in order, but they execute and complete as soon as their data operands are available
      • Data flow approach
  • 73. Dynamic Scheduling: General Idea
      • The ID pipeline stage is divided into two sub-stages:
      • ID.1 (issue): decode the instruction, check for structural hazards
      • ID.2 (read operands): wait until no data hazards, then read operands
  • 74. Dynamic Scheduling: General Idea
      • In the DLX floating point pipeline, the EX stage of instructions may take multiple cycles
      • For each issued instruction I, depending on the resolution of structural and data hazards, I may be waiting for resources or data, or in execution, or completed
      • More than a single instruction can be in execution at the same time
  • 75. Scoreboarding
      • Scoreboard (CDC 6600, 1964): a technique to allow instructions to execute out of order when there are sufficient resources and no data dependencies
      • Goal: execution rate of 1 instruction per clock cycle in the absence of structural hazards
      • Large set of FUs:
         → 4 FPUs
         → 5 units for memory references
         → 7 integer FUs
         → Highly redundant (parallel) system
      • Four steps replace the ID, EX, WB stages
  • 76. Scoreboarding
      • Issue (avoids WAWs):
            IF (a FU is available && no active instruction has the same destination reg) {
                issue I to the FU; update state;
            }
      • Read operands (ASA = as soon as):
            ASA (the two source operands are available in the registers) {
                read operands; manage RAW stalls;
            }
      • For each FU:
            ASA (operands are available) {
                start EX; at end of execution (EOX), alert the scoreboard;
            }
      • When at WB (avoids WARs):
            { wait for (no WAR hazards); store output to destination reg; }
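The hazard checks of the four steps can be prototyped in a few dozen lines. The following is only a sketch under strong assumptions: it tracks which checks succeed but not clock cycles or execution latencies, the class and method names (`Scoreboard`, `Active`, `can_issue`, and so on) are made up for the illustration, and the third instruction is a variation on the earlier DIVD/ADDD/SUBD example (it writes F8 instead of F12) chosen so that a WAR hazard shows up.

```python
from dataclasses import dataclass


@dataclass(eq=False)          # identity comparison: two records are never "equal"
class Active:
    unit: str
    dest: str
    srcs: tuple
    operands_read: bool = False


class Scoreboard:
    def __init__(self, units):
        self.free_units = set(units)
        self.active = []      # issued, not yet written back, in issue order

    def can_issue(self, unit, dest):
        # Structural hazard: the FU is busy. WAW hazard: dest is already pending.
        no_waw = all(a.dest != dest for a in self.active)
        return unit in self.free_units and no_waw

    def issue(self, unit, dest, srcs):
        if not self.can_issue(unit, dest):
            return None
        self.free_units.remove(unit)
        instr = Active(unit, dest, tuple(srcs))
        self.active.append(instr)
        return instr

    def _earlier(self, instr):
        return self.active[: self.active.index(instr)]

    def read_operands(self, instr):
        # RAW hazard: wait while an earlier active instruction still has to write a source.
        if all(e.dest not in instr.srcs for e in self._earlier(instr)):
            instr.operands_read = True
        return instr.operands_read

    def can_write_back(self, instr):
        # WAR hazard: wait while an earlier instruction has not yet read our destination.
        return all(e.operands_read or instr.dest not in e.srcs
                   for e in self._earlier(instr))

    def write_back(self, instr):
        if not self.can_write_back(instr):
            return False
        self.active.remove(instr)
        self.free_units.add(instr.unit)
        return True


sb = Scoreboard(["div", "add1", "add2", "mul"])
divd = sb.issue("div", "F0", ("F2", "F4"))      # DIVD F0, F2, F4
addd = sb.issue("add1", "F10", ("F0", "F8"))    # ADDD F10, F0, F8
subd = sb.issue("add2", "F8", ("F8", "F14"))    # SUBD F8, F8, F14 (assumed variant)

print(sb.read_operands(divd), sb.read_operands(addd), sb.read_operands(subd))
print(sb.can_write_back(subd), sb.can_issue("mul", "F0"))
```

SUBD gets its operands while ADDD is still waiting for F0, which is the out-of-order behaviour the scoreboard is after; its write to F8, however, is held back until ADDD has read the old F8.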
  • 77. Scoreboarding
      • In eliminating stalls, a scoreboard is limited by several factors:
         → Amount of parallelism available among the instructions
            (in the presence of many dependencies there’s not much that one can do…)
         → Number of scoreboard entries
            (how far ahead the pipeline can look for independent instructions)
         → Number and types of FUs
         → Number of WARs and WAWs
  • 78. Scoreboarding
      • The effectiveness of the scoreboard heavily depends on the register file
      • All operands are read from registers, all outputs go to destination registers
         → The availability of registers influences the capability to eliminate stalls
  • 79. Tomasulo’s approach
      • Tomasulo’s approach (IBM 360/91, 1967): an improvement of scoreboarding when a limited number of registers is allowed by a machine architecture
      • Based on virtual registers
      • The IBM 360/91 had two key design goals:
         → To be faster than its predecessors
         → To be machine level compatible with its predecessors
      • Problem: the 360 family had only 4 FP registers
      • Tomasulo combined the key ideas of scoreboarding with register renaming
  • 80. Tomasulo’s approach
      • IBM 360/91 FUs:
         → 3 ADDD/SUBD, 2 MULD, 6 LD, 6 SD
      • Key element: the reservation station (RS): a buffer which holds the operands of the instructions waiting to issue
      • Key concept:
         → A RS fetches and buffers an operand as soon as it is available, eliminating the need to get that operand from a register
         → Instead of tracking the source and destination registers, we track source and destination RSs
      [Diagram: reservation stations RSa, RSb, RSc and an operation OP]
  • 81. Tomasulo’s approach
      • A reservation station represents:
         → A static value, read from a register
         → A “live” value (a future value) that will be produced by another RS and FU
      • Hazard detection and execution control are not centralised into a scoreboard
      • They are distributed in each RS, which, independently:
         → Controls a FU attached to it,
         → And starts that FU the moment the operands become available
  • 82. Tomasulo’s approach
      • The operands go to the FUs through the (wide set of) RSs, not through the (small) register file
      • This is managed through a broadcast that makes use of a common result-or-data bus
      • All units waiting for an operand can load it at the same time:
      [Diagram: reservation stations RSa to RSe connected to two operations, OP and OP2]
  • 83. Tomasulo’s approach
      • The execution is driven by a graph of dependencies
      [Diagram: dependency graph linking reservation stations RSa to RSg through two SUBD operations and one MULTD operation]
      • A “live data structure” approach (similar to LINDA): a tuple is made available in the future, when a thread will have finished producing it
  • 84. Major Advantages of Tomasulo’s
      • Distributed approach: the RSs independently control the FUs
      • Distributed hazard detection logic
      • The CDB broadcasts results -> all pending instructions depending on that result are unblocked simultaneously
         → The CDB, being a bus, reaches many destinations in a single clock cycle
         → If the waiting instructions get their missing operand in that clock cycle, they can all begin execution on the next clock cycle
      • WAR and WAW are eliminated by renaming registers using the RSs
  • 85. Reducing branch penalties
      • Static Approaches
      • Dynamic Approaches
  • 86. Reducing branch penalties: Dynamic Branch Prediction
      • A branch history table
            Address       Branch    Nature
            ...
            0xA0B2DF37    BNEZ …    taken
            ...
            0xA0B2F02A    BEQ …     taken
            ...
            0xA0B30504    BNEZ …    untaken
            ...
            0xA0B30537    BGT …     untaken

            BHT entry     Recorded behaviour
            ...
            2A            taken
            ...
            04            untaken
            37            untaken
            ...
  • 87. Dynamic Branch Prediction → Branch History Table → Algorithm
            /* before the branch is evaluated */
            If (Current instruction is a branch) {
                entry = PC & 0x000000FF;
                predict branch as ( BHT [ entry ] );
            }
            /* after the branch */
            If (branch was mispredicted)
                BHT [ entry ] = 1 – BHT [ entry ]
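The algorithm translates almost literally into executable form. The sketch below is an illustration, not the hardware: the class name `OneBitBHT` is invented, the table is assumed to start with every entry at 0 (untaken), and the two addresses are the ones from the table slide that share their low-order byte.

```python
class OneBitBHT:
    """Branch history table with one bit per entry: 1 = taken, 0 = untaken."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1          # 0xFF for 8 index bits
        self.table = [0] * (1 << index_bits)       # assumption: all entries start at 0

    def predict(self, pc):
        return self.table[pc & self.mask]

    def update(self, pc, taken):
        entry = pc & self.mask
        if self.table[entry] != taken:             # branch was mispredicted
            self.table[entry] = 1 - self.table[entry]


bht = OneBitBHT()
first, second = 0xA0B2DF37, 0xA0B30537             # both map to entry 0x37

print(hex(first & 0xFF), hex(second & 0xFF))
bht.update(first, 1)                               # the first branch is taken
print(bht.predict(second))                         # the second one inherits that bit
```

Because only the low 8 bits select the entry, the second branch is predicted from the history of the first one.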
  • 88. Dynamic Branch Prediction → Branch History Table → Algorithm
      • Just one bit is enough for coding the Boolean value “taken” vs. “untaken”
      • Note: the function associating addresses to entries in the BHT is not guaranteed to be a bijection (one-to-one relationship):
      • The algorithm records the most recent behaviour of one or more branches
         → For instance, entry 37 corresponds to two branches
      • Despite this, the scheme works well…
      • …though in some cases, the performance of the scheme is not that satisfactory:
  • 89. Dynamic Branch Prediction → Branch History Table → Accuracy
            for (i=0; i<BIGN; i++)
                for (j=0; j<9; j++) {
                    do_stg();
                }
      • The loop branch is
         → taken nine times in a row
         → then not taken once
      • Taken 90%, Untaken 10%
      • What is the prediction accuracy?
  • 90. Dynamic Branch Prediction → Branch History Table → Accuracy
      • Outcomes: 9 x Taken, Untaken, 9 x Taken, Untaken, 9 x Taken, Untaken, ...
      • In each group of ten branches: 2 mispredictions (the first Taken and the Untaken) and 8 successful predictions
      • Steady-state (S.S.) prediction accuracy is just 80%!
  • 91. Dynamic Branch Prediction → Branch History Table → Accuracy
      • Loop branches (taken n-1 times in a row, untaken once)
      • Performance of this dynamic branch predictor (based on a single-bit prediction entry):
         → Misprediction: $2 \times \frac{1}{n}$
         → Twice the rate of untaken branches
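The 2/n misprediction rate can be confirmed by replaying the outcome stream of a loop branch through a single prediction bit. This is a small sketch with invented function names; the bit is assumed to start at 0, i.e. in the state left behind by a previous loop exit, so that the count is the steady-state one.

```python
def loop_branch_outcomes(n, rounds):
    """Outcome stream of a loop branch: n-1 times taken (1), then once untaken (0)."""
    for _ in range(rounds):
        for k in range(n):
            yield 1 if k < n - 1 else 0


def one_bit_mispredictions(n, rounds):
    bit = 0        # assumed state after a previous loop exit
    missed = 0
    for taken in loop_branch_outcomes(n, rounds):
        if bit != taken:
            missed += 1
            bit = taken    # flip the bit on a misprediction
    return missed


n, rounds = 10, 100
missed = one_bit_mispredictions(n, rounds)
print(missed, "mispredictions out of", n * rounds, "branches")
```

For n = 10 this gives 2 mispredictions in every 10 branches, i.e. the 80% accuracy quoted earlier.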
  • 92. Dynamic Branch Prediction → Two-bit Prediction Scheme
      • Use a two-bit field as a “branch behaviour recorder”
      • Allow the prediction to change only when two mispredictions in a row occur:
      [State diagram: four states, two labelled “Predict taken” and two labelled “Predict not taken”, each with one “Taken” and one “Not taken” transition]
  • 93. Dynamic Branch Prediction → Branch History Table → Accuracy
      • Outcomes: 9 x Taken, Untaken, 9 x Taken, Untaken, 9 x Taken, ...
      • First group: 2 mispredictions first, then 7 successful predictions
      • Steady state: 9 successful predictions in each group of ten branches
      • S.S. prediction accuracy is now 90%
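The 90% figure can be reproduced with a two-bit recorder. The sketch below uses a saturating counter (0 to 3, predicting taken when the counter is 2 or 3), which is one common realisation of a two-bit scheme and may differ in detail from the state diagram of the slides; function and parameter names are illustrative.

```python
def two_bit_mispredictions(n, rounds, start=3):
    """Replay a loop branch (n-1 taken, 1 untaken, repeated) through a 2-bit counter."""
    counter = start                       # 0..3; values 2 and 3 predict "taken"
    missed = 0
    for _ in range(rounds):
        for k in range(n):
            taken = 1 if k < n - 1 else 0
            predicted = 1 if counter >= 2 else 0
            if predicted != taken:
                missed += 1
            # Saturating update: move towards 3 when taken, towards 0 when untaken.
            counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return missed


print(two_bit_mispredictions(10, 100))            # steady state: only the loop exits
print(two_bit_mispredictions(10, 100, start=0))   # cold start adds two early misses
```

Starting from the "strongly not taken" state, the first two taken branches are mispredicted and the following seven are predicted correctly, as in the trace above; after that only the loop exit is missed.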
  • 94. Dynamic Branch Prediction → Branch History Table → Accuracy
      • Prediction accuracy with programs from SPEC89, 2-bit prediction buffer of 4096 entries
            SPEC89 benchmark    Frequency of mispredictions
            nasa7               1%
            matrix300           0%
            tomcatv             1%
            doduc               5%
            spice               9%
            fpppp               9%
            gcc                 12%
            espresso            5%
            eqntott             18%
            li                  10%
  • 95. Dynamic Branch Prediction → General Scheme
      • In the general case, one could use an n-bit branch behaviour recorder and a branch history table of $2^m$ entries
      • In this case
         → A change occurs every $2^{n-1}$ mispredictions
         → With a larger table, there is a higher chance that not too many branch addresses are associated with the same BHT entry
         → Larger memory penalty
  • 96. D.B.P. → Comparing the 2-bit with the General Case
  • 97. Dynamic Branch Prediction Schemes
      • One-bit prediction buffer
         → Good, but with limited accuracy
      • Two-bit prediction buffer
         → Very good, greater accuracy, slightly higher overhead
      • Infinite-bit prediction buffer
         → As good as the two-bit one, but with a very large overhead
      • Correlating predictors
  • 98. Dynamic Branch Prediction → Correlated predictors
      • Two-level predictors
      • If the behaviour of a branch is correlated to the behaviour of another branch, no single-level predictor would be able to capture its behaviour
      • Example:
            if (aa == 2) aa = 0;
            if (bb == 2) bb = 0;
            if (aa != bb) { …
      • If we keep track of the recent behaviour of other previous branches, our accuracy may increase
  • 99. Dynamic Branch Prediction → Correlated predictors
      • A simpler example:
            if (d == 0) d = 1;
            if (d == 1) …
      • In DLX, this is
                  BNEZ  R1, L1        ; b1 ( d != 0 )
                  MOV   R1, #1
            L1:   SUBI  R3, R1, #1
                  BNEZ  R3, L2        ; b2 ( d != 1 )
                  ...
            L2:   ...
  • 100. Dynamic Branch Prediction → Correlated predictors
      • In DLX, this is
                  BNEZ  R1, L1        ; b1 ( d != 0 )
                  MOV   R1, #1
            L1:   SUBI  R3, R1, #1
                  BNEZ  R3, L2        ; b2 ( d != 1 )
                  ...
            L2:   ...
      • Let us assume that d is 0, 1 or 2
            Initial value of d   d==0?   b1        Value of d before b2   d==1?   b2
            0                    Yes     Untaken   1                      Yes     Untaken
            1                    No      Taken     1                      Yes     Untaken
            2                    No      Taken     2                      No      Taken
  • 101. Dynamic Branch Prediction → Correlated predictors
            Initial value of d   d==0?   b1        Value of d before b2   d==1?   b2
            0                    Yes     Untaken   1                      Yes     Untaken
            1                    No      Taken     1                      Yes     Untaken
            2                    No      Taken     2                      No      Taken
      • This means that (b1 == untaken) ⇒ (b2 == untaken)
      • A one-bit predictor may not be able to capture this property and may behave very badly
  • 102. Dynamic Branch Prediction → Correlated predictors
      • Let us suppose that d alternates between 2 and 0
      • This is the table for the one-bit predictor:
            d   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
            2   NT        T           T             NT        T           T
            0   T         NT          NT            T         NT          NT
            2   NT        T           T             NT        T           T
            0   T         NT          NT            T         NT          NT
      • ALL branches are mispredicted!
  • 103. Dynamic Branch Prediction → Correlated predictors
      • Correlated predictor: example:
      • Every branch, say branch number j>1, has two separate prediction bits
         → First bit: predictor used if branch j-1 was NT
         → Second bit: otherwise
      • At the end of branch j-1:
            Behaviour_j_min_1 = (taken?) 1 : 0;
      • At the beginning of branch j:
            predict branch as ( BHT [ Behaviour_j_min_1 ] [ entry ] );
      • At the end of branch j:
            If (branch was mispredicted)
                BHT [ Behaviour_j_min_1 ] [ entry ] = 1 – BHT [ Behaviour_j_min_1 ] [ entry ]
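The benefit of the extra level of history can be measured on the d = 2, 0, 2, 0, ... example. The sketch below compares a plain one-bit predictor with the two-bits-per-branch scheme just described; it assumes that all prediction bits start at 0 (not taken) and that the branch preceding the very first b1 was not taken, and all function names are invented for the illustration.

```python
def branch_outcomes(d_values):
    """Outcomes (1 = taken) of b1 and b2 for `if (d == 0) d = 1; if (d == 1) ...`."""
    for d in d_values:
        yield "b1", int(d != 0)      # BNEZ R1, L1
        if d == 0:
            d = 1
        yield "b2", int(d != 1)      # BNEZ R3, L2


def one_bit_misses(outcomes):
    bit = {}                         # one prediction bit per branch, initially 0
    missed = 0
    for name, taken in outcomes:
        if bit.get(name, 0) != taken:
            missed += 1
        bit[name] = taken            # same effect as flipping the bit on a miss
    return missed


def correlating_misses(outcomes):
    bht = [{}, {}]                   # bht[behaviour of previous branch][branch] -> bit
    previous = 0                     # assumption: the branch before the first b1 was NT
    missed = 0
    for name, taken in outcomes:
        if bht[previous].get(name, 0) != taken:
            missed += 1
        bht[previous][name] = taken
        previous = taken
    return missed


ds = [2, 0] * 4                      # d alternates between 2 and 0
print(one_bit_misses(branch_outcomes(ds)), correlating_misses(branch_outcomes(ds)))
```

Under these assumptions the one-bit predictor misses all 16 branches, while the correlating one misses only the two branches of the first iteration.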
  • 104. Dynamic Branch Prediction → Correlated predictors
      • The behaviour of a branch selects a one-bit branch predictor
      • If the prediction is not OK, its state is flipped
  • 105. Dynamic Branch Prediction → Correlated predictors
      • We may also consider the last TWO branches
         → The behaviour of these two branches selects, e.g., a one-bit predictor
         → (NT NT, NT T, T NT, T T) → (0-3) → BHT [0..3]
         → This is called a (2,1) predictor
         → Or, the behaviour of the last two branches selects an n-bit predictor
         → This is a (2, n) predictor
  • 106. Dynamic Branch Prediction → Correlated predictors
      • A (2,2) predictor: a 2-bit branch history entry selects a 2-bit predictor
  • 107. Dynamic Branch Prediction → Correlated predictors
      • General case: (m, n) predictors
         → Consider the last m branches and their $2^m$ possible values
         → This m-tuple selects an n-bit predictor
         → A change in the prediction only occurs after $2^{n-1}$ mispredictions
  • 108. Dynamic Branch Prediction: Branch-Target Buffer
      • A run-time technique to reduce the branch penalty
      • In DLX, it is possible to "predict" the new PC, via a branch prediction buffer, during the second stage of the pipeline
      • With a Branch-Target Buffer (BTB), the new PC can be derived during the first stage of the pipeline
  • 109. Dynamic Branch Prediction: Branch-Target Buffer
      • The BTB is a branch-prediction cache that stores the target addresses of taken branches
      • An associative array which works as follows: (instruction address) → (branch target address)
      • In case of a hit, we know the predicted instruction address one cycle earlier than with the branch prediction buffer
      • Fetching begins immediately at the predicted PC
  • 110. Dynamic Branch Prediction: Branch-Target Buffer
      • Design issues:
        - The entire address must be used (correspondence must be one-to-one)
        - Limited number of entries in the BTB → most frequently used
        - The BTB requires a number of actions to be executed during the first pipeline stage, also in order to update the state of the buffer
        - The pipeline management gets more complex and the clock cycle duration may have to be increased
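
A BTB can be modelled as a small associative array keyed by the full instruction address. The sketch below is hypothetical: the class and method names are invented, and the replacement policy (evict the least recently updated entry) is an assumption standing in for the "most frequently used" criterion.

```python
class BranchTargetBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}   # full instruction address -> predicted target (insertion ordered)

    def lookup(self, pc):
        """Return the predicted next PC on a hit, or None on a miss."""
        return self.entries.get(pc)

    def update(self, pc, taken, target):
        if taken:
            self.entries.pop(pc, None)          # re-insert so the entry becomes the newest
            if len(self.entries) >= self.capacity:
                oldest = next(iter(self.entries))
                del self.entries[oldest]        # assumed policy: evict the oldest entry
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)          # only taken branches are kept
```

A hit in `lookup` corresponds to knowing the next PC already in the fetch stage. A miss means fetching continues sequentially.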
  • 111. Dynamic Branch Prediction: Branch-Target Buffer
      • Total branch penalty for a BTB
      • Assumptions: penalties are as follows

          Instruction is in buffer   Prediction   Actual branch   Penalty cycles
          Yes                        Taken        Taken           0
          Yes                        Taken        Untaken         2
          No                         *            Taken           2

      • Prediction accuracy: 90%
      • Hit rate in buffer: 90%
      • Taken branch frequency: 60%
  • 112. Dynamic Branch Prediction: Branch-Target Buffer
      • Branch penalty = Percent buffer hit rate x Percent incorrect predictions x Penalty + (1 - Percent buffer hit rate) x Percent taken branches x Penalty
      • = 90% x 10% x 2 + 10% x 60% x 2 = 0.18 + 0.12 = 0.30 clock cycles (vs. 0.50 for delayed branches)
      • The penalties are those of the table on the previous slide
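
The arithmetic above can be checked with a short function. The function and parameter names are chosen for this example only. The 2-cycle penalty is taken from the penalty table.

```python
def btb_branch_penalty(hit_rate, accuracy, taken_freq, penalty_cycles=2):
    """Average branch penalty in clock cycles for a branch-target buffer."""
    mispredicted_hits = hit_rate * (1 - accuracy) * penalty_cycles   # in buffer, wrong prediction
    taken_misses = (1 - hit_rate) * taken_freq * penalty_cycles      # not in buffer, branch taken
    return mispredicted_hits + taken_misses


penalty = btb_branch_penalty(hit_rate=0.90, accuracy=0.90, taken_freq=0.60)
print(round(penalty, 2))
```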
  • 113. Dynamic Branch Prediction: Branch-Target Buffer
      • The same approach can be applied to procedure return addresses
      • Example:
          0x4ABC  CALL 0x30A0
          0x4AC0  …
          …
          0x4CF4  CALL 0x30A0
          0x4CF8  …
        Buffer entry for 0x30A0: return addresses 0x4AC0, 0x4CF8
      • Associative arrays of stacks
      • If the cache is large enough, all return addresses are predicted correctly
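
A simplified way to picture return-address prediction is a single stack that is pushed on every call and popped on every return. This sketch is an assumption-laden simplification of the "associative arrays of stacks" mentioned above: it uses one stack only, a hypothetical fixed instruction size of 4 bytes, and invented names.

```python
class ReturnAddressPredictor:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, instr_size=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                  # overflow: the oldest return address is lost
        self.stack.append(call_pc + instr_size)

    def predict_return(self):
        """Predicted target of the next return, or None if nothing is recorded."""
        return self.stack.pop() if self.stack else None
```

With the addresses of the example, a call at 0x4ABC predicts a return to 0x4AC0, and a call at 0x4CF4 predicts a return to 0x4CF8. Predictions stay correct as long as the call nesting does not exceed `depth`.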
  • 114. Parallelism
      • Introduction to parallel processing
      • Instruction level parallelism
        - Introduction
        - VLIW
        - Advanced pipelining techniques
        - Superscalar
  • 115. Superscalar architectures
      • So far, the goal was to reach the ideal CPI = 1
      • Further increasing performance by reaching CPI < 1 is the goal of superscalar processors (SP)
      • To reach this goal, SP issue multiple instructions in the same clock cycle
      • Multiple-issue processors
        - VLIW (seen already)
        - SP
          · Statically scheduled (compiler)
          · Dynamically scheduled (HW; scoreboarding/Tomasulo)
      • In SP, a varying number of instructions is issued, depending on structural limits and dependencies
  • 116. Superscalar architectures
      • Superscalar version of DLX
      • At most two instructions per clock cycle can be issued:
        1. One of: load, store (integer or FP), branch, integer ALU operation
        2. A FP ALU operation
      • IF and ID operate on 64 bits of instructions
      • Multiple independent FPUs are available
  • 117. Superscalar architectures
      • The superscalar DLX is indeed a sort of "bidimensional pipeline":

          Cycle             1    2    3    4    5    6    7    8
          Integer instr.    IF   ID   EX   MEM  WB
          FP instr.         IF   ID   EX   MEM  WB
          Integer instr.         IF   ID   EX   MEM  WB
          FP instr.              IF   ID   EX   MEM  WB
          Integer instr.              IF   ID   EX   MEM  WB
          FP instr.                   IF   ID   EX   MEM  WB
          Integer instr.                   IF   ID   EX   MEM  WB
          FP instr.                        IF   ID   EX   MEM  WB

  • 118. Superscalar architectures
      • Every new solution breeds new problems...
      • Latencies!
      • When the latency of the load is 1:
        - In the "monodimensional pipeline", one cannot use the result of the load in the current and next cycle:
            P:    LD    NOP   LDc
        - In the bidimensional pipeline of SP, this means a loss of three issue slots:
            Pi:   LD    NOP   LDc'
            Pfp:  NOP   NOP   LDc
  • 119. Superscalar architectures
      • Let us consider again the following loop:
          Loop: LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4
                SUBI  R1, R1, #8
                BNEZ  R1, Loop
      • Let us perform unrolling (x5) + scheduling on the superscalar DLX:
  • 120. Superscalar architectures

          Integer                      FP                    Cycle
          Loop: LD   F0, 0(R1)                               1
                LD   F6, -8(R1)                              2
                LD   F10, -16(R1)      ADDD F4,F0,F2         3
                LD   F14, -24(R1)      ADDD F8,F6,F2         4
                LD   F18, -32(R1)      ADDD F12,F10,F2       5
                SD   0(R1), F4         ADDD F16,F14,F2       6
                SD   -8(R1), F8        ADDD F20,F18,F2       7
                SD   -16(R1), F12                            8
                SD   -24(R1), F16                            9
                SUBI R1, R1, #40                             10
                BNEZ R1, Loop                                11
                SD   8(R1), F20        ; R1 already decremented by 40   12

      • 12 clock cycles per 5 iterations = 2.4 cc/i
  • 121. Superscalar architectures
      • Superscalar = 2.4 cc/i vs. normal = 3.5 cc/i
      • But in the example there were not enough FP instructions to keep the FP pipeline in use
        - From cycle 8 to cycle 12, and for the first two cycles, each cycle holds just one instruction
      • How to get more?
        - Dynamic scheduling for SP
        - Multicycle extension of the Tomasulo algorithm
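
The figures for the unrolled loop can be recomputed from the schedule. The helper below is only an illustration, and its name and fields are invented. The instruction count of 17 comes from counting 5 LD, 5 ADDD, 5 SD, one SUBI and one BNEZ in the table.

```python
def schedule_stats(instructions, cycles, iterations, issue_width=2):
    """Basic metrics of a statically scheduled loop body."""
    return {
        "cycles_per_iteration": cycles / iterations,
        "ipc": instructions / cycles,
        "slot_utilization": instructions / (cycles * issue_width),
    }


stats = schedule_stats(instructions=17, cycles=12, iterations=5)
speedup = 3.5 / stats["cycles_per_iteration"]   # versus 3.5 cc/i for the non-superscalar schedule
```

Only 17 of the 24 available issue slots are filled, which is the under-use of the FP pipeline noted above.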
  • 122. Superscalar architectures and the Tomasulo algorithm
      • Idea: employing separate data structures for the integer and the FP registers
        - Integer Reservation Stations (IRS)
        - FP Reservation Stations (FRS)
      • In the same cycle, issue a FP instruction (to a FRS) and an integer instruction (to an IRS)
      • Note: issuing does not mean executing!
        - Possible dependencies might serialize the two instructions issued in parallel
      • Dual issue is obtained by pipelining the instruction-issue stage so that it runs twice as fast
  • 123. Superscalar architectures
      • Inherent limitations of the multiple-issue strategy:
        - The amount of ILP may be limited (see loop p.134)
        - Extra HW is required
          · Multiple FPUs and IUs
          · More complex (-> slower) design
        - Extra need for large memory and register-file bandwidth
        - Increase in code size due to aggressive loop unrolling
      • Recall: CPUTIME(p) = IC(p) x CPI(p) / clock rate
  • 124. Superscalar architectures: compiler support
      • Symbolic loop unrolling
        - The loop is not physically unrolled, but reorganized so as to eliminate dependencies
      • Software pipelining:
        - Dependencies are eliminated by interleaving instructions from different iterations of the loop
        - The loop is not unrolled
      • Original loop (RAW: problematic):
          Loop: LD    F0, 0(R1)
                ADDD  F4, F0, F2
                SD    0(R1), F4
                SUBI  R1, R1, #8
                BNEZ  R1, Loop
      • Software-pipelined loop (WAR: HW removable):
          <startup>
          Loop: SD    0(R1), F4
                ADDD  F4, F0, F2
                LD    F0, -16(R1)
                SUBI  R1, R1, #8
                BNEZ  R1, Loop
          <clean-up>
  • 125. Superscalar architectures: compiler support
      • Trace scheduling
      • Aim: tackling the problem of too short basic blocks
      • Method:
        - Trace selection
        - Trace compaction
  • 126. Superscalar architectures: compiler support
      • Trace selection:
        - A number of contiguous basic blocks are put together into a "trace"
        - Using static branch prediction, the conditional branches are chosen as taken/untaken, while loop branches are considered as taken
      • [Figure: control-flow graph with blocks A (test), B, X and C, turned into the trace A, B, C plus bookkeeping code]
  • 127. Superscalar architectures: compiler support
      • Trace compaction:
        - The resulting trace is a longer straight-line sequence of code
        - Trace compaction: global code scheduling
        - Code scheduling with a basic block whose size is that of A + B + C, plus bookkeeping
      • Speculative movement of code
  • 128. Superscalar architectures: HW support
      • Conditional instructions: instructions like
          CMOVZ R2, R3, R1
        which means
          if (R1 == 0) R2 = R3;
        or
          (R1 == 0) ? R2 = R3 : /* NOP */;
      • The instruction turns into a NOP if the condition is not met
        - This also means that no exceptions are raised!
      • Using conditional instructions we convert a control dependence (due to a branch) into a data dependence
      • Speculative transformation in a two-issue superscalar with conditional instructions:
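
The semantics of the conditional move can be spelled out on a toy register file. This is an illustrative model only: the register file is a plain dictionary and the function name is made up for the example.

```python
def cmovz(regs, dst, src, cond):
    """CMOVZ dst, src, cond: copy src into dst only if cond holds zero, else act as a NOP."""
    if regs[cond] == 0:
        regs[dst] = regs[src]


regs = {"R1": 0, "R2": 5, "R3": 9}
cmovz(regs, "R2", "R3", "R1")      # R1 == 0, so R2 receives R3
moved = regs["R2"]

regs = {"R1": 7, "R2": 5, "R3": 9}
cmovz(regs, "R2", "R3", "R1")      # R1 != 0, so nothing changes
kept = regs["R2"]
```

In hardware the test is part of the instruction itself, so no branch is fetched: the outcome depends on the data in R1 rather than on control flow.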
  • 129. Superscalar architectures: HW support: conditional instructions
      • Original code:

          Integer                  FP                 Cycle
          LW   R1, 40(R2)          ADDD R3,R4,R5      1
                                   ADDD R6,R3,R7      2
          BEQZ R10, L                                 3
          LW   R8, 20(R10)                            4
          LW   R9, 0(R8)                              5

      • With a conditional load:

          Integer                  FP                 Cycle
          LW   R1, 40(R2)          ADDD R3,R4,R5      1
          LWC  R8,20(R10),R10      ADDD R6,R3,R7      2
          BEQZ R10, L                                 3
          LW   R9, 0(R8)                              4

      • We speculate on the outcome of the branch. If the condition is not met, we do not slow down the execution, because we used a slot that would otherwise be lost
  • 130. Superscalar architectures: HW support: conditional instructions
      • Conditional instructions are useful to implement short alternative control flows
      • Their usefulness, though, is limited by several factors:
        - Conditional instructions that are annulled still take execution time, unless they are scheduled into otherwise wasted slots
        - They are good only in limited cases, when there is a simple alternative sequence
        - Moving an instruction across multiple branches would require double-conditional instructions, e.g. LWCC R1, R2, R10, R12 (makes no sense)
        - They require extra work w.r.t. their "regular" version
        - The extra time required for the test may require more cycles than the regular versions
  • 131. Superscalar architectures: HW support: conditional instructions
      • Most architectures support a few conditional instructions (conditional move)
      • The HP PA architecture allows any register-register instruction to turn the next instruction into a NOP, which makes that a conditional instruction
      • Exceptions
  • 132. Superscalar architectures: HW support: conditional instructions
      • Exceptions:
        - Fatal exceptions (normally causing termination; e.g., memory protection violation)
        - Resumable exceptions (causing a delay, but no termination; e.g., page fault exception)
      • Resumable exceptions can be processed for speculative instructions just as if they were normal instructions
        - The corresponding time penalty is not considered as incorrect behaviour
      • Fatal exceptions cannot be handled by speculative instructions, hence must be deferred to the next non-speculative instructions
  • 133. Superscalar architectures: HW support: conditional instructions
      • Moving instructions across a branch must not affect
        - The (fatal) exception behaviour
        - The data dependences
      • How to obtain this?
        1. All the exceptions triggered by speculative instructions are ignored by HW and OS. The HW and OS do handle all exceptions, but return an undefined value for any fatal exception. The program is allowed to continue, though this will almost certainly lead to incorrect results
      • Note: scheme 1 can never cause a correct program to fail, regardless of whether speculation was used or not
  • 134. Superscalar architectures: HW support: conditional instructions
        2. Poison bits: a speculative instruction does not trigger any exception, but turns a bit on in the involved result registers. The next "normal" (non-speculative) instruction using those registers will be "poisoned" -> it will cause an exception
        3. Boosting: renaming and buffering in the HW (similar to the Tomasulo approach)
      • Speculation can be used, e.g., to optimize an if-then-else such as
          if (a==0) a = b; else a = a + 4
        or, equivalently,
          a = (a==0)? b : a + 4
  • 135. Superscalar architectures: HW support: conditional instructions
      • Suppose A is in 0(R3) and B in 0(R2)
      • Example:
              LW   R1, 0(R3)   ; load A
              BNEZ R1, L1      ; A != 0 ? GOTO L1
              LW   R1, 0(R2)   ; load B
              J    L2          ; skip ELSE
          L1: ADD  R1, R1, 4   ; ELSE part
          L2: SW   0(R3), R1   ; store A
      • Speculation:
              LW   R1, 0(R3)   ; load A
              LW   R9, 0(R2)   ; load speculatively B
              BEQZ R1, L3      ; A == 0 ? keep B in R9
              ADD  R9, R1, 4   ; here R9 is A+4
          L3: SW   0(R3), R9   ; here R9 is A+4 or B
      • In this case, a temporary register is used
      • Method 1: speculation is transparent
  • 136. Superscalar architectures: HW support: conditional instructions
      • Method 2 applied to the previous code fragment:
              LW   R1, 0(R3)   ; load A
              LW*  R9, 0(R2)   ; load speculatively B
              BEQZ R1, L3      ; A == 0 ? keep B in R9
              ADD  R9, R1, 4   ; here R9 is A+4
          L3: SW   0(R3), R9   ; here R9 is A+4 or B
      • LW* is a speculative version of LW
      • LW* is an opcode that, instead of raising an exception, turns on the poison bit of register R9
      • The next non-speculative instruction using R9 will be "poisoned": it will cause an exception
      • If another speculative instruction uses R9, the poison bit will be inherited
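
The poison-bit mechanism can be simulated with a toy register file. Everything in this sketch is an assumption made for illustration: memory is a dictionary, a missing address stands for a faulting load, and the class, method and exception names are invented.

```python
class PoisonedRegisterError(Exception):
    pass


class RegisterFile:
    def __init__(self):
        self.value = {}
        self.poison = set()

    def speculative_load(self, dst, memory, address):
        """LW*: never raises; a faulting load only marks dst as poisoned."""
        if address in memory:
            self.value[dst] = memory[address]
            self.poison.discard(dst)
        else:
            self.value[dst] = 0            # undefined value, 0 chosen arbitrarily
            self.poison.add(dst)

    def read(self, reg):
        """Non-speculative use: raises if the register is poisoned."""
        if reg in self.poison:
            raise PoisonedRegisterError(reg)
        return self.value[reg]

    def write(self, dst, value):
        """A normal write defines the register and clears its poison bit."""
        self.value[dst] = value
        self.poison.discard(dst)


def update_a(memory, addr_a, addr_b):
    """a = (a == 0) ? b : a + 4, with b loaded speculatively into R9."""
    regs = RegisterFile()
    regs.write("R1", memory[addr_a])               # load A
    regs.speculative_load("R9", memory, addr_b)    # load B speculatively
    if regs.read("R1") != 0:                       # the ADD is skipped when A == 0
        regs.write("R9", regs.read("R1") + 4)      # overwrites R9, clearing any poison
    memory[addr_a] = regs.read("R9")               # store: the first non-speculative use of R9
```

If the speculative load faults but A is non-zero, the poisoned value is overwritten and no exception is ever seen. If A is zero, the store is the first normal instruction to touch R9 and raises the deferred exception.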
  • 137. Superscalar architectures: HW support: conditional instructions
      • Combining speculation with dynamic scheduling
        - An attribute bit is added to each instruction (1: speculative, 0: normal)
        - When that bit is 1, the instruction is allowed to execute, but cannot enter the commit (WB) stage
        - The instruction then has to wait until the end of the speculated code
        - It will be allowed to modify the register file / memory only at the end of speculative mode
      • Hence: instructions execute out of order, but are forced to commit in order
      • A special set of buffers holds the results that have finished execution but have not committed yet (reorder buffers)
  • 138. Superscalar architectures: HW support: conditional instructions
      • As neither the register values nor the memory values are actually WRITTEN until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted
      • If a speculated instruction raises an exception, this is recorded in the reorder buffer
      • In case of a branch misprediction such that a certain speculative instruction should not have been executed, the exception is flushed along with the instruction when the reorder buffer is cleared
  • 139. Superscalar architectures: HW support: conditional instructions
      • Reorder buffers:
        - An additional set of virtual registers that hold the results of the instructions that have finished execution, but have not committed yet
        - Issue: only when both a Reservation Station and a reorder buffer are available
        - As soon as an instruction completes, its output goes into its reorder buffer
        - Until the instruction has committed, its result is read from the reorder buffer (the Reservation Station is freed, the reorder buffer is not)
        - The actual updating of registers takes place when the instruction reaches the head of the list of reorder buffers
  • 140. Superscalar architectures: HW support: conditional instructions
      • At this point the commit phase takes place:
        - Either the result is written into the register file,
        - Or, in case of a mispredicted branch, the reorder buffer is flushed and execution restarts at the correct successor of the branch
      • Assumption: when a branch with incorrect prediction reaches the head of the buffer, it means that the speculation was wrong
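
In-order commit through a reorder buffer can be sketched as a queue whose head is the oldest instruction. This is a simplified model under stated assumptions: entries are plain dictionaries, exceptions and memory writes are left out, and all names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()          # program order; the head is the oldest instruction

    def issue(self, dest=None, is_branch=False):
        entry = {"dest": dest, "value": None, "done": False,
                 "is_branch": is_branch, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, mispredicted=False):
        """Execution finished (possibly out of order); the result stays in the buffer."""
        entry["value"], entry["done"], entry["mispredicted"] = value, True, mispredicted

    def commit(self, registers):
        """Retire finished instructions from the head; flush on a mispredicted branch."""
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["is_branch"] and head["mispredicted"]:
                self.entries.clear()    # discard every younger, speculative result
                return "flush"
            if head["dest"] is not None:
                registers[head["dest"]] = head["value"]
        return "ok"
```

A result computed early stays invisible to the register file until every older instruction has committed, which is what makes undoing a wrong speculation cheap.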
  • 141. Superscalar architectures: HW support: conditional instructions
      • This technique also makes it possible to tackle situations like
          if (cond) do_this ; else do_that ;
      • One may "bet" on the outcome of the branch and say, e.g., it will be a taken one
      • Even unlikely events do happen, so sooner or later a misprediction occurs
      • Idea: let the instructions in the else part (do_that) issue and execute, with a separate list of reorder buffers (list2)
      • This second list is simpler: we do not check for the current head of the list. Elements in there need to be explicitly removed
      • In case of a misprediction, in the second list we have already executed the do_that part, and we just need to perform its commit
      • In case of a correct prediction, the ELSE part is purged from list2
  • 142. Superscalar architectures
      • If a processor A has a lower CPI than another processor B, will A always run faster than B?
      • Not always!
        - A higher clock rate is indeed a deterministic measure of the performance improvement
        - A multiple-issue (superscalar) architecture cannot guarantee its improvements (stochastic improvements)
        - Pushing towards a low CPI means adopting sophisticated (= complex) techniques… which slows down the clock rate!
        - Improving one aspect of a multiple-issue processor does not necessarily lead to overall performance improvements
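
A quick numeric check, using the CPU time relation CPUTIME = IC x CPI / clock rate, shows how a lower CPI can still lose. The two processors and all their figures below are hypothetical values picked for the illustration.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical processors running the same 10^9-instruction program
time_a = cpu_time(1e9, cpi=0.8, clock_rate_hz=400e6)   # lower CPI, slower clock
time_b = cpu_time(1e9, cpi=1.0, clock_rate_hz=600e6)   # higher CPI, faster clock
```

Here A needs 2.0 s while B needs about 1.67 s: the CPI advantage of A is more than offset by its slower clock.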
  • 143. Superscalar architectures
      • A simple question: "how much ILP exists in a program?" or, in other words, "how much can we expect from techniques that are based on the exploitation of ILP?"
      • How to proceed:
        - Stating a set of very optimistic assumptions and measuring how much parallelism is available under those assumptions
  • 144. Superscalar architectures
      • Assumptions (HW model of an ideal processor):
        1. Infinite number of virtual registers (-> no WAW or WAR hazard can stall the pipeline)
        2. All conditional branches are predicted exactly (!!)
        3. All computed jumps and returns are perfectly predicted
        4. All memory addresses are known exactly, so a store can be moved before a load, provided that the addresses are not identical
        5. Infinite-issue processor
        6. No restriction on the types of instructions to be executed in a cycle (no structural hazards)
        7. All latencies are 1
  • 145. Superscalar architectures
      • How to match these assumptions??
      • Gambling!
      • We run a program and produce a trace with all the values of all the instances of each branch
        - Taken, Taken, Taken, Untaken, Taken, …
        - Each corresponding target address is recorded and assumed to be available
        - Then we use a simulator to mimic, e.g., a machine with infinite virtual registers, etc.
      • Results are depicted in the next picture
      • Parallelism is expressed in IPC: instruction issues per clock cycle
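
Under the ideal-processor assumptions only true (RAW) dependences constrain the schedule, so the available parallelism of a trace can be estimated by placing every instruction one cycle after its latest producer. The function below is a minimal sketch of that idea, not the simulator used for the measurements: the trace format is invented, and memory dependences are ignored for brevity.

```python
def ideal_ipc(trace):
    """trace: list of (dest, sources) in program order; dest may be None.

    Assumes unlimited issue width, perfect prediction, unlimited renaming
    (no WAW/WAR hazards) and unit latencies.
    """
    ready = {}          # register -> cycle in which its latest value is produced
    last_cycle = 0
    for dest, sources in trace:
        cycle = 1 + max((ready.get(src, 0) for src in sources), default=0)
        if dest is not None:
            ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle if trace else 0.0


independent = [("r1", []), ("r2", []), ("r3", []), ("r4", [])]
chain = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]
```

Four independent instructions all fit in one cycle (IPC 4.0), whereas a chain of three dependent instructions cannot go beyond IPC 1.0, whatever the issue width.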
  • 146. Superscalar architectures
      • SPEC benchmarks, instruction issues per cycle on the ideal processor:

          gcc        54.8
          espresso   62.6
          li         17.9
          fpppp      75.2
          doduc     118.7
          tomcatv   150.1

      • Tomcatv reaches 150 IPC (for a particular run)
  • 147. Superscalar architectures
      • Then we can relax the above assumptions and introduce limitations that represent our current possibilities with computer design techniques for ILP
        - Window size: the actual range of instructions we inspect when looking for candidates for simultaneous issue
        - Realistic branch prediction
        - Finite number of registers
      • See images 4-39 and 4-40
  • 148. Superscalar architectures
      • [Figure: instruction issues per cycle (0 to 160) vs. window size (Infinite, 2k, 512, 128, 32, 8, 4) for gcc, espresso, li, fpppp, doduc, tomcatv]
  • 149. Superscalar architectures
      • Instruction issues per cycle for each benchmark and window size:

          Benchmark   Infinite   512   128   32   8   4
          gcc               55    10    10    8   4   3
          espresso          63    15    13    8   4   3
          li                18    12    11    9   4   3
          fpppp             75    49    35   14   5   3
          doduc            119    16    15    9   4   3
          tomcatv          150    45    34   14   6   3
  • 150. Superscalar architectures: concluding notes
      • In the next 10 years it is realistic to reach an architecture that looks like this:
        - 64 instruction issues per clock cycle
        - Selective predictor, 1K entries, 16-entry return predictor
        - Perfect disambiguation of memory references
        - Register renaming with 64 + 64 extra registers
      • Computer architectures in practice: Section 4.8 (PowerPC 620)
  • 151. Superscalar architectures: concluding notes
      • Reachable performance
      • [Figure: instruction issues per cycle (0 to 60) vs. window size (Infinite, 256, 128, 64, 32, 16, 8, 4) for gcc, espresso, li, fpppp, doduc, tomcatv]
  • 152. Pipelining and communications
      • Suppose that N+1 processes need to communicate a private value to all the others
      • They use all the values to produce the next output (e.g., for voting)
      • Communication is fully synchronous and needs to be repeated m times, m large ...
  • 153. Pipelining and communications
      • Let us assume that no bus is available
      • Point-to-point communication
      • Processes are numbered p0…pN
      • Two instructions are available
        - Send (pj, value)
        - Receive (pj, &value)
      • Blocking functions
      • If the receiver is ready to receive, they last one stage time, otherwise they block the caller for a multiple of the stage time
      • Sending and receiving occur at discrete time steps
  • 154. Pipelining and communications
      • At each time t, processor pi may be
        - Sending data (at the next stage pi is unblocked)
        - Receiving data (at the next stage pi is unblocked)
        - Blocked in a Receive()
        - Blocked in a Send()
      • Slot = time corresponding to an entire stage time
      • At each time t we have N+1 slots (one slot per process)
      • If pi is blocked, its slot is wasted (it is a "bubble")
      • Otherwise the slot is used
  • 155. Pipelining and communications
      • At each time t, processor pi may be in
        - State S(j): sending data to processor pj
        - State R(j): receiving data from pj
        - State WR(j): blocked in a Receive( pj, … )
        - State WS(j): blocked in a Send( pj, … )
      • We use the formalism: proc s^t proc' to indicate that, at time t, proc is in state s with proc'
      • For instance p1 WR(4)^21 p3 means that the 21st slot of p1 is wasted waiting for p3 to send its value to it
  • 156. Pipelining and communications
      • The following algorithm is executed by process j:
      • [Figure: algorithm listing, annotated as follows]
        - Before gaining the right to broadcast, process j needs to go through j couples of states (WR, R)
        - Ordered broadcast: the k-th message to be sent goes to process pk
        - Finally, process j goes through N-j couples of states (WR, R)
  • 157. Pipelining and communications
      • p is a vector of indices
      • For process j, p can be any arrangement of the integers 0, 1, …, j-1, j+1, …, N
      • Whatever the arrangement, the algorithm works correctly
      • For instance, if N = 4 (5 processes) and j = 1, then p can be any permutation of 0, 2, 3, and 4
      • p determines the order in which process j sends its value to its neighbours
      • Example: p[] = [ 3, 2, 0, 4 ]. Then p1 executes: send(p3), send(p2), send(p0), send(p4)
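
The behaviour of the algorithm can be explored with a small discrete-time simulation. The sketch below rests on assumptions of its own: a Send and the matching Receive complete in the same slot when both sides are ready, process j receives from p0…p(j-1) in index order before broadcasting and from p(j+1)…pN afterwards, and all function names are invented.

```python
def program(j, n, order):
    """Operation list of process j among p0..pn; order is j's send permutation p[]."""
    ops = [("recv", k) for k in range(j)]                # j couples of (WR, R) states
    ops += [("send", k) for k in order]                  # the broadcast itself
    ops += [("recv", k) for k in range(j + 1, n + 1)]    # n - j couples of (WR, R) states
    return ops


def simulate(n, order_of):
    """Return (duration in slots, number of used slots) for processes p0..pn."""
    progs = [program(j, n, order_of(j, n)) for j in range(n + 1)]
    pc = [0] * (n + 1)
    duration = used = 0
    while any(pc[j] < len(progs[j]) for j in range(n + 1)):
        matched = []
        for j in range(n + 1):
            if pc[j] < len(progs[j]) and progs[j][pc[j]][0] == "send":
                k = progs[j][pc[j]][1]
                # the transfer happens only if pk is currently receiving from pj
                if pc[k] < len(progs[k]) and progs[k][pc[k]] == ("recv", j):
                    matched += [j, k]
        if not matched:
            raise RuntimeError("deadlock")
        for j in matched:
            pc[j] += 1                                   # all other processes waste this slot
        duration += 1
        used += len(matched)
    return duration, used


def ordered(j, n):
    """p[] = 0, 1, ..., j-1, j+1, ..., n"""
    return [k for k in range(n + 1) if k != j]


def rotated(j, n):
    """p[] = j+1, ..., n, 0, ..., j-1"""
    return [(j + i) % (n + 1) for i in range(1, n + 1)]
```

Calling `simulate(n, ordered)` and `simulate(n, rotated)` for growing n gives the duration of each arrangement, and `used / (duration * (n + 1))` gives the fraction of slots that are not bubbles.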
  • 158. Pipelining and communications
      • Example: p[] = ordered permutation
        - Ex: N = 5 and pj → p = [ 0, …, j-1, j+1, …, N ]
      • [Figure: slot diagram showing the duration, the frequencies of used slots, the slots wasted in send and the slots wasted in receive]
  • 159. Pipelining and communications
      • Case N = 20, p[] = ordered permutation
      • Gray = wasted slots
      • Black = used slots
      • In general, duration is …
      • Used slots / total # of slots …
      • Average # of used slots during one stage time …
      • This image reminds us of another one:
  • 160. Pipelining and communications
      • [Figure: laundry timeline from 6 PM to 2 AM, tasks A, B, C and D executed one after the other in 30-minute steps]
      • No pipelining: many slots are wasted!
  • 161. Pipelining and communications
      • Let us now consider the case in which processor k uses p[] = [ k+1, k+2, …, N, 0, 1, …, k-1 ]
  • 162. Pipelining and communications
  • 163. Pipelining and communications
      • Duration: first case vs. second case
  • 164. Pipelining and communications
      • Efficiency: first case vs. second case
  • 165. Pipelining and communications
      • Algorithm of pipelined broadcast
      • Beginning of steady state
      • Every 10 slots, 5 mark the completion of a broadcast
      • Throughput = 1 / (2t) (t = 1 slot): a full broadcast is finished every 2t
      • The image may remind us of another one…
  • 166. Pipelining (slide P2.2/20)
      • [Figure: pipelined laundry timeline from 6 PM to 2 AM, tasks A, B, C and D overlapped in 30-minute stages]
      • Between 7.30 and 9.30 pm, a whole job is completed every 30'
      • During that period, each worker is permanently at work…
      • …but a new input must arrive within 30'