TDT 4260 – lecture 1 – 2011

• Course introduction
   – course goals
   – staff
   – contents
   – evaluation
   – web, ITSL
• Textbook
   – Computer Architecture, A Quantitative Approach, Fourth Edition
      • by John Hennessy & David Patterson (HP 90 – 96 – 03 – 06)
• Today: Introduction (Chapter 1)
   – Partly covered

Course goal

• To get a general and deep understanding of the organization of modern computers and the motivation for different computer architectures. Give a basis for understanding research themes within the field.
• High level
• Mostly HW and low-level SW
• HW/SW interplay
• Parallelism
• Principles, not details
   – inspire to learn more




Contents

• Computer architecture fundamentals, trends, measuring performance, quantitative principles. Instruction set architectures and the role of compilers. Instruction-level parallelism, thread-level parallelism, VLIW.
• Memory hierarchy design, cache. Multiprocessors, shared memory architectures, vector processors, NTNU/Notur supercomputers, distributed shared memory, synchronization, multithreading.
• Interconnection networks, topologies
• Multicores, homogeneous and heterogeneous, principles and product examples
• Green computing (introduction)
• Miniproject - prefetching

TDT-4260 / DT8803

• Recommended background
   – Course TDT4160 Computer Fundamentals, or equivalent.
• http://www.idi.ntnu.no/emner/tdt4260/
   – And It's Learning
• Friday 1215-1400
   – And/or some Thursdays 1015-1200
   – 12 lectures planned
   – some exceptions may occur
• Evaluation
   – Obligatory exercise (counts 20%). Written exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit examination, the examination form may change from written to oral.




Lecture plan (subject to change)

 Date and lecturer       Topic
 1:  14 Jan (LN, AI)     Introduction, Chapter 1 / Alex: PfJudge
 2:  21 Jan (IB)         Pipelining, Appendix A; ILP, Chapter 2
 3:  28 Jan (IB)         ILP, Chapter 2; TLP, Chapter 3
 4:  4 Feb (LN)          Multiprocessors, Chapter 4
 5:  11 Feb (MG?)        Prefetching + Energy Micro guest lecture
 6:  18 Feb (LN)         Multiprocessors continued
 7:  25 Feb (IB)         Piranha CMP + Interconnection networks
 8:  4 Mar (IB)          Memory and cache, cache coherence (Chap. 5)
 9:  11 Mar (LN)         Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
 10: 18 Mar (IB)         Memory consistency (4.6) + more on memory
 11: 25 Mar (JA, AI)     (1) Kongull and other NTNU and NOTUR supercomputers  (2) Green computing
 12: 1 Apr (IB/LN)       Wrap up lecture, remaining stuff
 13: 8 Apr               Slack – no lecture planned

EMECS, new European Master's Course in Embedded Computing Systems




Preliminary reading list, subject to change!!!

• Chap. 1: Fundamentals, sections 1.1 - 1.12 (pages 2-54)
• Chap. 2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), sections 2.11 - 2.12 (pages 138-141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course)
• Chap. 3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154-159), sections 3.5 - 3.8 (pages 172-185).
• Chap. 4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10
• Chap. 5: Memory hierarchy, sections 5.1 - 5.3 (pages 288-315).
• App. A: section A.1 (Expected to be repetition from other courses)
• Appendix E, interconnection networks, pages E2-E14, E20-E25, E29-E37 and E45-E51.
• App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F-44 - F-45)
• Data prefetch mechanisms (ACM Computing Survey)
• Piranha (To be announced)
• Multicores (New book chapter) (To be announced)
• (App. D: embedded systems?) See our new course TDT4258 Microcontroller system design

People involved

Lasse Natvig
Course responsible, lecturer
lasse@idi.ntnu.no

Ian Bratt
Lecturer (also at Tilera.com)
ianbra@idi.ntnu.no

Alexandru Iordan
Teaching assistant (also PhD student)
iordan@idi.ntnu.no

http://www.idi.ntnu.no/people/




research.idi.ntnu.no/multicore

A few highlights:
- Green computing, 2 x PhD + master students
- Multicore memory systems, 3 x PhD theses
- Multicore programming and parallel computing
- Cooperation with industry

Prefetching – pfjudge




”Computational computer architecture”

• Computational science and engineering (CSE)
   – Computational X, X = comp.arch.
• Simulates new multicore architectures
   – Last level, shared cache fairness (PhD-student M. Jahre)
   – Bandwidth aware prefetching (PhD-student M. Grannæs)
• Complex cycle-accurate simulators
   – 80 000 lines C++, 20 000 lines Python
   – Open source, Linux-based
• Design space exploration (DSE)
   – one dimension for each arch. parameter
   – DSE sample point = specific multicore configuration
   – performance of a selected set of configurations evaluated by simulating the execution of a set of workloads

Experiment Infrastructure

• Stallo compute cluster
   – 60 Teraflop/s peak
   – 5632 processing cores
   – 12 TB total memory
   – 128 TB centralized disk
   – Weighs 16 tons
• Multi-core research
   – About 60 CPU years allocated per year to our projects
   – Typical research paper uses 5 to 12 CPU years for simulation (extensive, detailed design space exploration)




The End of Moore’s law for single-core microprocessors

But Moore’s law still holds for FPGA, memory and multicore processors

Motivational background

• Why multicores
   – in all market segments from mobile phones to supercomputers
• The ”end” of Moore’s law
• The power wall
• The memory wall
• The bandwidth problem
• ILP limitations
• The complexity wall




Energy & Heat Problems

• Large power consumption
   – Costly
   – Heat problems
   – Restricted battery operation time
• Google ”Open House Trondheim 2006”
   – ”Performance/Watt is the only flat trend line”

The Memory Wall

[Figure: relative performance since 1980 (”Moore’s Law”): CPU improves about 60%/year, DRAM about 9%/year, so the processor-memory gap grows about 50%/year]

• The Processor Memory Gap
• Consequence: deeper memory hierarchies
   – P – Registers – L1 cache – L2 cache – L3 cache – Memory - - -
   – Complicates understanding of performance
      • cache usage has an increasing influence on performance (see the sketch below)
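To make that last point concrete, here is a minimal C sketch using the textbook's average memory access time (AMAT) formula for a two-level cache. The latencies and miss rates are assumed, illustrative values, not figures from the lecture.

```c
#include <stdio.h>

/* Average memory access time (AMAT) for a two-level cache hierarchy:
 * AMAT = hit_L1 + miss_L1 * (hit_L2 + miss_L2 * memory_penalty).
 * All latencies are in CPU cycles; the numbers are illustrative only. */
int main(void) {
    const double hit_l1 = 1.0, hit_l2 = 10.0, mem_penalty = 200.0;
    const double miss_l1 = 0.05;                  /* 5% of accesses miss in L1 */
    const double miss_l2[] = { 0.1, 0.3, 0.5 };   /* local L2 miss rates       */

    for (int i = 0; i < 3; i++) {
        double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2[i] * mem_penalty);
        printf("L2 miss rate %.0f%% -> AMAT = %.2f cycles\n",
               100.0 * miss_l2[i], amat);
    }
    return 0;
}
```

With a 200-cycle memory penalty, moving the L2 miss rate from 10% to 50% takes the average access time from 2.5 to 6.5 cycles, which is why the deeper hierarchy complicates performance reasoning.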




The I/O pin or Bandwidth problem

• # I/O signaling pins
   – limited by physical technology
   – speeds have not increased at the same rate as processor clock rates
• Projections
   – from ITRS (International Technology Roadmap for Semiconductors)

[Huh, Burger and Keckler 2001]

The limitations of ILP (Instruction Level Parallelism) in Applications

[Figure: fraction of total cycles (%) vs. number of instructions issued, and speedup vs. instructions issued per cycle]




Reduced Increase in Clock Frequency

Solution: Multicore architectures (also called Chip Multi-processors - CMP)

• More power-efficient
   – Two cores with clock frequency f/2 can potentially achieve the same speed as one at frequency f with 50% reduction in total energy consumption [Olukotun & Hammond 2005] (see the sketch below)
• Exploits Thread Level Parallelism (TLP)
   – in addition to ILP
   – requires multiprogramming or parallel programming
• Opens new possibilities for architectural innovations
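A minimal sketch of the arithmetic behind that bullet, assuming the standard dynamic-power model P ≈ C·V²·f and that the supply voltage can be lowered together with the clock. How much energy is saved depends on how far V can be reduced; none of the numbers below are taken from Olukotun & Hammond.

```c
#include <stdio.h>

/* Illustrative sketch (not from the lecture): dynamic power of CMOS logic
 * is roughly P = C * V^2 * f.  Compare one core at frequency f with two
 * cores at f/2, assuming the parallel workload scales perfectly and the
 * supply voltage can be lowered by the factor v_scale at the lower clock. */
static double dynamic_power(double c, double v, double f) {
    return c * v * v * f;
}

int main(void) {
    const double C = 1.0, V = 1.0, F = 1.0;     /* normalised units */
    double p_single = dynamic_power(C, V, F);   /* one core at f    */

    for (double v_scale = 1.0; v_scale >= 0.69; v_scale -= 0.05) {
        double p_dual = 2.0 * dynamic_power(C, V * v_scale, F / 2.0);
        /* Same work per second in both cases, so the power ratio is
         * also the energy ratio for a fixed task. */
        printf("V scaled to %.2f: dual-core energy = %.0f%% of single-core\n",
               v_scale, 100.0 * p_dual / p_single);
    }
    return 0;
}
```

Scaling V to roughly 0.7 of the original supply gives about the 50% total energy reduction quoted on the slide; with no voltage scaling the two-core design only matches, rather than beats, the single core's energy.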




Why heterogeneous multicores?

• Specialized HW is faster than general HW
   – Math co-processor
   – GPU, DSP, etc…
• Benefits of customization
   – Similar to ASIC vs. general purpose programmable HW
• Amdahl’s law
   – Parallel speedup limited by serial fraction
      • 1 super-core

[Figure: Cell BE processor]

CPU – GPU – convergence (Performance – Programmability)

• Processors: Larrabee, Fermi, …
• Languages: CUDA, OpenCL, …




Parallel processing – conflicting goals

The P6-model: Parallel Processing challenges: Performance, Portability, Programmability and Power efficiency

[Figure: Performance, Portability, Programmability and Power efficiency as conflicting goals]

• Examples:
   – Performance tuning may reduce portability
      • E.g. data structures adapted to cache block size
   – New languages for higher programmability may reduce performance and increase power consumption

Multicore programming challenges

• Instability, diversity, conflicting goals … what to do?
• What kind of parallel programming?
   – Homogeneous vs. heterogeneous
   – DSL vs. general languages
   – Memory locality
• What to teach?
   – Teaching should be founded on active research
• Two layers of programmers
   – The Landscape of Parallel Computing Research: A View from Berkeley [Asan+06]
      • Krste Asanovic presentation at ACACES Summer School 2007
   – 1) Programmability layer (Productivity layer) (80 - 90%)
      • ”Joe the programmer”
   – 2) Performance layer (Efficiency layer) (10 - 20%)
• Both layers involved in HPC
• Programmability an issue also at the performance layer




Parallel Computing Laboratory, U.C. Berkeley (Slide adapted from Dave Patterson)

Easy to write correct programs that run efficiently on manycore

[Par Lab software stack: applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) built on design patterns/motifs and a Composition & Coordination Language (C&CL); a productivity layer (C&CL compiler/interpreter, parallel libraries, parallel frameworks) over an efficiency layer (efficiency languages, sketching, autotuners, schedulers, communication & synchronization primitives, efficiency language compilers); legacy code, OS libraries & services, legacy OS and hypervisor on Multicore/GPGPU and RAMP Manycore hardware; diagnosing power/performance cuts across all layers]

Classes of computers

• Servers
   – storage servers
   – compute servers (supercomputers)
   – web servers
   – high availability
   – scalability
   – throughput oriented (response time of less importance)
• Desktop (price 3000 NOK – 50 000 NOK)
   – the largest market
   – price/performance focus
   – latency oriented (response time)
• Embedded systems
   – the fastest growing market (”everywhere”)
   – TDT 4258 Microcontroller system design
   – ATMEL, Nordic Semic., ARM, EM, ++




Falanx (Mali)
ARM Norway

Borgar, FXI Technologies

”An independent compute platform to gather the fragmented mobile space and thus help accelerate the proliferation of content and applications eco-systems (i.e. build an ARM based SoC, put it in a memory card, connect it to the web - and voila, you got iPhone for the masses).”

• http://www.fxitech.com/
   – ”Headquartered in Trondheim
      • But also an office in Silicon Valley …”




Trends

• For technology, costs, use
• Help predicting the future
• Product development time
   – 2-3 years
   – design for the next technology
   – Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach

• What really matters is the functioning of the complete system
   – hardware, runtime system, compiler, operating system, and application
   – In networking, this is called the “End to End argument”
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
   – E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions




Computer Architecture is Design and Analysis

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: iterative loop of Design, Analysis and Creativity, with cost/performance analysis sorting good, mediocre and bad ideas]

TDT4260 Course Focus

Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st Century

[Figure: Computer Architecture (organization, hardware/software boundary) at the centre of Technology, Parallelism, Programming Languages, Applications, Interface Design (ISA), Compilers, Operating Systems, Measurement & Evaluation, and History]




Holistic approach, e.g., to programmability

• Parallel & concurrent programming
• Operating System & system software
• Multicore, interconnect, memory

Moore’s Law: 2X transistors / “year”

• “Cramming More Components onto Integrated Circuits”
   – Gordon Moore, Electronics, 1965
• # of transistors / cost-effective integrated circuit double every N months (12 ≤ N ≤ 24)
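A small sketch of what that doubling rule implies over a decade; the ten-year horizon and the three values of N are example inputs only, not figures from the lecture.

```c
#include <stdio.h>
#include <math.h>   /* link with -lm */

/* Growth implied by "transistor count doubles every N months":
 * after m months the count has grown by a factor of 2^(m/N). */
int main(void) {
    const double months = 120.0;                 /* a ten-year span */
    const int doubling_period[] = { 12, 18, 24 };

    for (int i = 0; i < 3; i++) {
        double factor = pow(2.0, months / doubling_period[i]);
        printf("N = %2d months -> %.0fx more transistors in 10 years\n",
               doubling_period[i], factor);
    }
    return 0;
}
```

At the often-quoted 18-month doubling period this works out to roughly a 100x increase per decade.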




Tracking Technology Performance Trends

• 4 critical implementation technologies:
   – Disks,
   – Memory,
   – Network,
   – Processors
• Compare for Bandwidth vs. Latency improvements in performance over time
• Bandwidth: number of events per unit time
   – E.g., M bits/second over network, M bytes/second from disk
• Latency: elapsed time for a single event
   – E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)

[Figure: relative bandwidth improvement vs. relative latency improvement (log-log) for Processor, Network, Memory and Disk; CPU high, Memory low (“Memory Wall”); the reference line marks latency improvement = bandwidth improvement]

• Performance Milestones
   – Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
   – Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
   – Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
   – Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
   (Processor latency = typical # of pipeline stages * time per clock cycle)




COST and COTS

• Cost
   – to produce one unit
   – include (development cost / # sold units)
   – benefit of large volume
• COTS
   – commodity off the shelf

Speedup / Superlinear speedup?

• General definition:
   Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time
   – Speedup_fixed problem (p processors) = Time (1 processor) / Time (p processors)
• Note: use best sequential algorithm in the uni-processor solution, not the parallel algorithm with p = 1 (see the sketch below)
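A minimal sketch of the fixed-problem-size definition above. The timings are invented measurements used only to show the calculation; Time(1) is meant to come from the best sequential program, not the parallel program run with p = 1.

```c
#include <stdio.h>

/* Speedup for a fixed problem size:
 * Speedup(p) = Time(1 processor) / Time(p processors). */
int main(void) {
    const double time_seq   = 120.0;                  /* seconds, best sequential */
    const int    procs[]    = { 2, 4, 8 };
    const double time_par[] = { 63.0, 34.0, 19.5 };   /* seconds on p processors  */

    for (int i = 0; i < 3; i++) {
        double speedup = time_seq / time_par[i];
        printf("p = %d: speedup = %.2f (linear would be %d)\n",
               procs[i], speedup, procs[i]);
    }
    return 0;
}
```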




Amdahl’s Law (1967) (fixed problem size)

• “If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s”
• Total work in computation
   – serial fraction s
   – parallel fraction p
   – s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
   = (s + p) / [s + (p/n)]
   = 1 / [s + (1-s)/n]
   = n / [1 + (n-1)s]
• ”pessimistic and famous”

Gustafson’s “law” (1987) (scaled problem size, fixed execution time)

• Total execution time on parallel computer with n processors is fixed
   – serial fraction s’
   – parallel fraction p’
   – s’ + p’ = 1 (100%)
• S’(n) = Time’(1) / Time’(n)
   = (s’ + p’n) / (s’ + p’)
   = s’ + p’n = s’ + (1-s’)n
   = n + (1-n)s’
• Reevaluating Amdahl's law, John L. Gustafson, CACM May 1988, pp 532-533. ”Not a new law, but Amdahl’s law with changed assumptions”

(The two formulas are compared in the sketch below.)
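A small sketch comparing the two formulas above for an assumed 5% serial fraction; the value of s and the processor counts are examples only.

```c
#include <stdio.h>

/* Amdahl's law (fixed problem size):   S(n)  = 1 / (s + (1 - s)/n)
 * Gustafson's law (scaled problem):    S'(n) = s' + (1 - s') * n
 * s is the serial fraction of the sequential run, s' the serial fraction
 * of the (fixed) time on the parallel machine. */
int main(void) {
    const double s = 0.05, s_scaled = 0.05;   /* 5% serial fraction */
    const int n_values[] = { 4, 16, 64, 256 };

    for (int i = 0; i < 4; i++) {
        int n = n_values[i];
        double amdahl    = 1.0 / (s + (1.0 - s) / n);
        double gustafson = s_scaled + (1.0 - s_scaled) * n;
        printf("n = %3d: Amdahl %.1f, Gustafson %.1f\n", n, amdahl, gustafson);
    }
    return 0;
}
```

Amdahl's speedup saturates towards 1/s = 20 no matter how many processors are added, while Gustafson's scaled speedup keeps growing because the parallel part of the problem grows with n.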




How the serial fraction limits speedup

• Amdahl’s law
• Work hard to reduce the serial part of the application
   – remember IO
   – think different (than traditionally or sequentially)

[Figure: speedup vs. number of processors for different values of the serial fraction]








TDT4260 Computer architecture
Mini-project

PhD candidate Alexandru Ciprian Iordan
Department of Computer and Information Science (IDI)




    What is it…? How much…?
    • The mini-project is the exercise part of the TDT4260 course

    • This year the students will need to develop and
      evaluate a PREFETCHER

    • The mini-project accounts for 20 % of the final grade
      in TDT4260
          • 80 % for report
          • 20 % for oral presentation




    What will you work with…

    • Modified version of M5 (for development and
      evaluation)

    • Computing time on Kongull cluster (for
      benchmarking)

    • More at: http://dm-ark.idi.ntnu.no/
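For orientation, here is a minimal stride-prefetcher sketch in plain C. It is not the interface of the modified M5 simulator used in the mini-project (see the project page above for that); issue_prefetch() and the table layout are assumptions made for this illustration only. The idea is to track the stride of each load instruction (PC) and request the next cache block ahead of time.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 256

typedef struct {
    uint64_t pc;          /* address of the load instruction  */
    uint64_t last_addr;   /* last data address it touched     */
    int64_t  stride;      /* last observed stride             */
    bool     valid;
} entry_t;

static entry_t table[TABLE_SIZE];

/* Stand-in for the simulator call that queues a prefetch request. */
static void issue_prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called for every memory access, with the load's PC and data address. */
static void prefetcher_access(uint64_t pc, uint64_t addr) {
    entry_t *e = &table[pc % TABLE_SIZE];

    if (e->valid && e->pc == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->stride)
            issue_prefetch(addr + (uint64_t)stride);  /* stride seen twice */
        e->stride = stride;
    } else {
        e->pc = pc; e->stride = 0; e->valid = true;
    }
    e->last_addr = addr;
}

int main(void) {
    /* One load (PC 0x400100) streaming through an array with stride 64. */
    for (uint64_t i = 0; i < 6; i++)
        prefetcher_access(0x400100, 0x10000 + i * 64);
    return 0;
}
```

Design choices such as table size, prefetch degree and when to trigger a prefetch are the kind of parameters the mini-project report can explore and evaluate.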




    M5
    • Initially developed by the University of Michigan

    • Enjoys a large community of users and developers

    • Flexible object-oriented architecture

    • Has support for 3 ISAs: Alpha, SPARC and MIPS




    Team work…

    • You need to work in groups of 2-4 students



    • Grade is based on written paper AND oral presentation (choose your best speaker)




    Time Schedule and Deadlines




              More on It’s learning




    Web page presentation
TDT 4260
App A.1, Chap 2
Instruction Level Parallelism

Contents

• Instruction level parallelism              Chap 2
• Pipelining (repetition)                    App A
   ▫ Basic 5-step pipeline
• Dependencies and hazards                   Chap 2.1
   ▫ Data, name, control, structural
• Compiler techniques for ILP                Chap 2.2
• (Static prediction                         Chap 2.3)
   ▫ Read this on your own
• Project introduction
Instruction level parallelism (ILP)

• A program is a sequence of instructions, typically written to be executed one after the other
• Poor usage of CPU resources! (Why?)
• Better: Execute instructions in parallel
   ▫ 1: Pipeline
        Partial overlap of instruction execution
   ▫ 2: Multiple issue
        Total overlap of instruction execution
• Today: Pipelining

Pipelining (1/3)




Pipelining (2/3)

• Multiple different stages executed in parallel
   ▫ Laundry in 4 different stages
   ▫ Wash / Dry / Fold / Store
• Assumptions:
   ▫ Task can be split into stages
   ▫ Storage of temporary data
   ▫ Stages synchronized
   ▫ Next operation known before last finished?

Pipelining (3/3)

• Good Utilization: All stages are ALWAYS in use
   ▫ Washing, drying, folding, ...
   ▫ Great usage of resources!
• Common technique, used everywhere
   ▫ Manufacturing, CPUs, etc
• Ideal: time_stage = time_instruction / stages (see the sketch below)
   ▫ But stages are not perfectly balanced
   ▫ But transfer between stages takes time
   ▫ But pipeline may have to be emptied
   ▫ ...
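A minimal sketch of the ideal-vs-real point above, with five assumed (unbalanced) stage latencies and a latch overhead; none of the numbers come from the lecture.

```c
#include <stdio.h>

/* With k stages the ideal cycle time is time_instruction / k, but the real
 * cycle time is set by the slowest stage plus latch (register) overhead,
 * and the pipeline also needs k - 1 cycles to fill. */
int main(void) {
    const double stage_ns[] = { 10.0, 8.0, 12.0, 9.0, 11.0 };  /* 5 unbalanced stages */
    const double latch_ns   = 1.0;                             /* transfer overhead   */
    const int    k = 5, n_instr = 1000;

    double unpipelined = 0.0, slowest = 0.0;
    for (int i = 0; i < k; i++) {
        unpipelined += stage_ns[i];
        if (stage_ns[i] > slowest) slowest = stage_ns[i];
    }

    double cycle  = slowest + latch_ns;            /* real clock period        */
    double t_seq  = n_instr * unpipelined;         /* no pipelining            */
    double t_pipe = (n_instr + k - 1) * cycle;     /* fill, then one per cycle */

    printf("ideal speedup:  %d\n", k);
    printf("actual speedup: %.2f\n", t_seq / t_pipe);
    return 0;
}
```

The unbalanced stages and the latch delay alone pull the speedup from the ideal 5 down to below 4, before hazards and pipeline flushes are even considered.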
Example: MIPS64 (1/2)

• RISC
• Load/store
• Few instruction formats
• Fixed instruction length
• 64-bit
   ▫ DADD = 64 bits ADD
   ▫ LD = 64 bits L(oad)
• 32 registers (R0 = 0)
• EA = offset(Register)

• Pipeline
   ▫ IF: Instruction fetch
   ▫ ID: Instruction decode / register fetch
   ▫ EX: Execute / effective address (EA)
   ▫ MEM: Memory access
   ▫ WB: Write back (reg)

Example: MIPS64 (2/2)

[Figure: classic 5-stage pipeline timing diagram over clock cycles 1-7; successive instructions each pass through Ifetch, Reg, ALU, DMem and Reg, offset by one cycle]

Big Picture:

• What are some real world examples of pipelining?
• Why do we pipeline?
• Does pipelining increase or decrease instruction throughput?
• Does pipelining increase or decrease instruction latency?

Big Picture (continued):

• Computer Architecture is the study of design tradeoffs!!!!
• There is no “philosophy of architecture” and no “perfect architecture”. This is engineering, not science.
• What are the costs of pipelining?
• For what types of devices is pipelining not a good choice?




Improve speedup?
• Why not perfect speedup?
  ▫ Sequential programs
  ▫ One instruction dependent on another
  ▫ Not enough CPU resources
• What can be done?
  ▫ Forwarding (HW)
  ▫ Scheduling (SW / HW), see the scheduling sketch below
  ▫ Prediction (SW / HW)
• Both hardware (dynamic) and compiler (static) can help

Dependencies and hazards
• Dependencies
  ▫ Parallel instructions can be executed in parallel
  ▫ Dependent instructions are not parallel
      I1: DADD R1, R2, R3
      I2: DSUB R4, R1, R5
  ▫ A property of the instructions
• Hazards
  ▫ A situation where a dependency causes an instruction to give a wrong result
  ▫ A property of the pipeline
  ▫ Not all dependencies give hazards
      Dependencies must be close enough in the instruction stream to cause a hazard
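A hedged illustration of the software-scheduling point above (not from the slides): the compiler can move an independent statement between a load and its use, so the dependency is no longer close enough in the instruction stream to cause a stall. The function and variable names are made up for the example.

/* Illustration only (names are hypothetical): the dependent use of 'a'
 * immediately follows its load, which would stall a simple pipeline. */
int unscheduled(int *p, int *q) {
    int a = *p;          /* load */
    int b = a + 1;       /* uses 'a' right away -> load-use hazard */
    int c = *q;          /* independent load */
    return b + c;
}

/* The compiler can hoist the independent load of 'c' between the load
 * of 'a' and its use, hiding the load latency. */
int scheduled(int *p, int *q) {
    int a = *p;          /* load */
    int c = *q;          /* independent work fills the delay */
    int b = a + 1;       /* 'a' is now available without stalling */
    return b + c;
}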
Dependencies
• (True) data dependencies
  ▫ One instruction reads what an earlier one has written
• Name dependencies
  ▫ Two instructions use the same register / memory location
  ▫ But no flow of data between them
  ▫ Two types: anti- and output dependencies
• Control dependencies
  ▫ Instructions dependent on the result of a branch
• Again: independent of the pipeline implementation

Hazards
• Data hazards
  ▫ Overlap would give a different result than sequential execution
  ▫ RAW / WAW / WAR
• Control hazards
  ▫ Branches
  ▫ Ex: started executing the wrong instruction
• Structural hazards
  ▫ The pipeline does not support this combination of instructions
  ▫ Ex: a register file with one port, two stages want to read
Data dependency – Hazard?
(Figure A.6, Page A-16)
[Pipeline diagram: add r1,r2,r3 writes r1, while the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9 and xor r10,r1,r11 all read r1 while the add is still in the pipeline.]

Data Hazards (1/3)
• Read After Write (RAW)
  InstrJ tries to read operand before InstrI writes it
      I: add r1,r2,r3
      J: sub r4,r1,r3
• Caused by a true data dependency
• This hazard results from an actual need for communication.
Data Hazards (2/3)
• Write After Read (WAR)
  InstrJ writes operand before InstrI reads it
      I: sub r4,r1,r3
      J: add r1,r2,r3
• Caused by an anti-dependency
  This results from reuse of the name “r1”
• Can’t happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Reads are always in stage 2, and
  ▫ Writes are always in stage 5

Data Hazards (3/3)
• Write After Write (WAW)
  InstrJ writes operand before InstrI writes it.
      I: sub r1,r4,r3
      J: add r1,r2,r3
• Caused by an output dependency
• Can’t happen in the MIPS 5-stage pipeline because:
  ▫ All instructions take 5 stages, and
  ▫ Writes are always in stage 5
• WAR and WAW can occur in more complicated pipes
Forwarding
(Figure A.7, Page A-18)
[Pipeline diagram (IF, ID/RF, EX, MEM, WB): the result of add r1,r2,r3 is forwarded from its EX/MEM pipeline registers directly to the EX stage of the following sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9 and xor r10,r1,r11.]

Can all data hazards be solved via forwarding?
[Pipeline diagram: Ld r1,r2 followed by add r4,r1,r3, and r6,r1,r7, or r8,r1,r9 and xor r10,r1,r11; the loaded value is available only after the MEM stage, while the add needs it at the start of its EX stage.]
Structural Hazards (Memory Port)
(Figure A.4, Page A-14)
[Pipeline diagram over clock cycles 1-7: a Load followed by Instr 1-4; with a single memory port, the Load’s DMem access and a later instruction’s Ifetch need the memory in the same cycle.]

Hazards, Bubbles (similar to Figure A.5, Page A-15)
[Pipeline diagram over clock cycles 1-7: Load, Instr 1, Ld r1,r2, then a stall row of bubbles, then Add r1,r1,r1.]
How do you “bubble” the pipe? How can we avoid this hazard?
Control hazards (1/2)
• Sequential execution is predictable, (conditional) branches are not
• May have fetched instructions that should not be executed
• Simple solution (figure): stall the pipeline (bubble)
  ▫ Performance loss depends on the number of branches in the program and on the pipeline implementation
  ▫ Branch penalty (see the CPI sketch below)
[Figure: the possibly wrong instruction fetched after the branch is discarded and replaced by the correct instruction.]

Control hazards (2/2)
• What can be done?
  ▫ Always stall (previous slide)
      Also called freezing or flushing the pipeline
  ▫ Assume no branch (= assume sequential execution)
      Must not change state before the branch instruction completes
  ▫ Assume branch taken
      Only smart if the target address is ready early
  ▫ Delayed branch
      Execute a different instruction while the branch is evaluated
  Static techniques (fixed rule or compiler)
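To make the branch penalty concrete, here is a small C calculation (not from the slides; the 20% branch frequency and 1-cycle penalty are assumed numbers) of the effective CPI when every branch stalls the pipeline.

#include <stdio.h>

/* Sketch with assumed numbers: effective CPI = base CPI
 * + branch_frequency * branch_penalty (stall cycles per branch). */
int main(void) {
    double base_cpi = 1.0;        /* ideal pipelined CPI           */
    double branch_freq = 0.20;    /* assumed: 20% of instructions  */
    double branch_penalty = 1.0;  /* assumed: 1 bubble per branch  */

    double cpi = base_cpi + branch_freq * branch_penalty;
    printf("effective CPI = %.2f; branch stalls are %.0f%% of all cycles\n",
           cpi, 100.0 * branch_freq * branch_penalty / cpi);
    return 0;
}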
Example
• Assume branch conditionals are evaluated in the EX stage and determine the fetch address for the following cycle.
• If we always stall, how many cycles are bubbled?
• If we assume the branch is not taken, how many bubbles result from an incorrect assumption?
• Is stalling on every branch OK?
• What optimizations could be done to improve the stall penalty?

Dynamic scheduling
• So far: static scheduling
  ▫ Instructions are executed in program order
  ▫ Any reordering is done by the compiler
• Dynamic scheduling
  ▫ The CPU reorders instructions to get a more optimal order
      Fewer hazards, fewer stalls, ...
  ▫ Must preserve the order of operations where reordering could change the result
  ▫ Covered by TDT 4255 Hardware Design
Compiler techniques for ILP
• For a given pipeline and degree of superscalarity
  ▫ How can these be best utilized?
  ▫ As few stalls from hazards as possible
• Dynamic scheduling
  ▫ Tomasulo’s algorithm etc. (TDT4255)
  ▫ Makes the CPU much more complicated
• What can be done by the compiler?
  ▫ Has ”ages” to spend, but less knowledge
  ▫ Static scheduling, but what else?

Example
Source code:
for (i = 1000; i > 0; i = i - 1)
  x[i] = x[i] + s;

Notice:
• Lots of dependencies
• No dependencies between iterations
• High loop overhead
    Loop unrolling

MIPS:
Loop: L.D      F0,0(R1)       ; F0 = x[i]
      ADD.D    F4,F0,F2       ; F2 = s
      S.D      F4,0(R1)       ; Store x[i] + s
      DADDUI   R1,R1,#-8      ; x[i] is 8 bytes
      BNE      R1,R2,Loop     ; R1 = R2?
Static scheduling
Original (with stalls):              Scheduled:
Loop: L.D      F0,0(R1)              Loop: L.D      F0,0(R1)
      stall                                DADDUI   R1,R1,#-8
      ADD.D    F4,F0,F2                    ADD.D    F4,F0,F2
      stall                                stall
      stall                                stall
      S.D      F4,0(R1)                    S.D      F4,8(R1)
      DADDUI   R1,R1,#-8                   BNE      R1,R2,Loop
      stall
      BNE      R1,R2,Loop
Result: from 9 cycles per iteration to 7
(Delays from the table in Figure 2.2)

Loop unrolling
Original:                            Unrolled (4 iterations):
Loop: L.D      F0,0(R1)              Loop: L.D      F0,0(R1)
      ADD.D    F4,F0,F2                    ADD.D    F4,F0,F2
      S.D      F4,0(R1)                    S.D      F4,0(R1)
      DADDUI   R1,R1,#-8                   L.D      F6,-8(R1)
      BNE      R1,R2,Loop                  ADD.D    F8,F6,F2
                                           S.D      F8,-8(R1)
                                           L.D      F10,-16(R1)
                                           ADD.D    F12,F10,F2
                                           S.D      F12,-16(R1)
                                           L.D      F14,-24(R1)
                                           ADD.D    F16,F14,F2
                                           S.D      F16,-24(R1)
                                           DADDUI   R1,R1,#-32
                                           BNE      R1,R2,Loop
• Reduced loop overhead
• Requires the number of iterations to be divisible by n (here n = 4)
• Register renaming
• Offsets have changed
• Stalls not shown
Unrolled (4 iterations):             Unrolled + scheduled:
Loop: L.D      F0,0(R1)              Loop: L.D      F0,0(R1)
      ADD.D    F4,F0,F2                    L.D      F6,-8(R1)
      S.D      F4,0(R1)                    L.D      F10,-16(R1)
      L.D      F6,-8(R1)                   L.D      F14,-24(R1)
      ADD.D    F8,F6,F2                    ADD.D    F4,F0,F2
      S.D      F8,-8(R1)                   ADD.D    F8,F6,F2
      L.D      F10,-16(R1)                 ADD.D    F12,F10,F2
      ADD.D    F12,F10,F2                  ADD.D    F16,F14,F2
      S.D      F12,-16(R1)                 S.D      F4,0(R1)
      L.D      F14,-24(R1)                 S.D      F8,-8(R1)
      ADD.D    F16,F14,F2                  DADDUI   R1,R1,#-32
      S.D      F16,-24(R1)                 S.D      F12,16(R1)
      DADDUI   R1,R1,#-32                  S.D      F16,8(R1)
      BNE      R1,R2,Loop                  BNE      R1,R2,Loop
Avoids stall after: L.D (1), ADD.D (2), DADDUI (1)
(The last two store offsets become +16 and +8 because R1 has already been decremented by 32.)

Loop unrolling: Summary
• Original code        9 cycles per element
• Scheduling           7 cycles per element
• Loop unrolling       6.75 cycles per element
  ▫ Unrolled 4 iterations
• Combination          3.5 cycles per element
  ▫ Avoids stalls entirely (14 instructions for 4 elements)
Compiler reduced execution time by 61% ((9 - 3.5) / 9 ≈ 0.61)
Loop unrolling in practice
• We do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops (see the C sketch below):
  ▫ the 1st executes (n mod k) times and has a body that is the original loop
  ▫ the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
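A minimal C sketch (not from the slides) of this prologue-plus-unrolled-loop scheme for the running example x[i] = x[i] + s, assuming a 0-based array, an unknown trip count n and an unroll factor k = 4.

/* Sketch of unrolling with unknown trip count n (unroll factor k = 4):
 * a prologue loop runs the leftover n % k iterations with the original
 * body, then the main loop runs the unrolled body n / k times. */
void add_scalar(double *x, double s, int n) {
    int i = 0;

    /* 1st loop: executes n % 4 times with the original body */
    for (; i < n % 4; i++)
        x[i] = x[i] + s;

    /* 2nd loop: unrolled body, iterates n / 4 times */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}

For large n almost all iterations run in the unrolled loop, as the slide notes.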
Review
• Name real-world examples of pipelining
• Does pipelining lower instruction latency?
• What is the advantage of pipelining?
• What are some disadvantages of pipelining?
• What can a compiler do to avoid processor stalls?
• What are the three types of data dependences?
• What are the three types of pipeline hazards?

TDT 4260
Chap 2, Chap 3
Instruction Level Parallelism (cont)
Contents
• Very Large Instruction Word          Chap 2.7
  ▫ IA-64 and EPIC
• Instruction fetching                 Chap 2.9
• Limits to ILP                        Chap 3.1/2
• Multi-threading                      Chap 3.5

Getting CPI below 1
• CPI ≥ 1 if we issue only 1 instruction every clock cycle
• Multiple-issue processors come in 3 flavors:
1. Statically-scheduled superscalar processors
   • In-order execution
   • Varying number of instructions issued (compiler)
2. Dynamically-scheduled superscalar processors
   • Out-of-order execution
   • Varying number of instructions issued (CPU)
3. VLIW (very long instruction word) processors
   • In-order execution
   • Fixed number of instructions issued
VLIW: Very Large Instruction Word (1/2)
• Each VLIW has explicit coding for multiple operations
  ▫ Several instructions combined into packets (see the bundle sketch below)
  ▫ Possibly with parallelism indicated
• Tradeoff: instruction space for simple decoding
  ▫ Room for many operations
  ▫ Independent operations => execute in parallel
  ▫ E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch

VLIW: Very Large Instruction Word (2/2)
• Assume 2 load/store, 2 FP, 1 int/branch
  ▫ VLIW with 0-5 operations
  ▫ Why 0?
• Important to avoid empty instruction slots
  ▫ Loop unrolling
  ▫ Local scheduling
  ▫ Global scheduling
      Scheduling across branches
• Difficult to find all dependencies in advance
  ▫ Solution 1: block on memory accesses
  ▫ Solution 2: the CPU detects some dependencies
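As a hedged illustration (not the IA-64 encoding or any real product format): one “very long instruction” can be pictured as a packet of operation slots, one per functional unit, for the assumed 2 load/store + 2 FP + 1 int/branch machine; empty slots hold explicit no-ops.

/* Illustration only - not a real VLIW encoding. One "very long
 * instruction" is a packet with one slot per functional unit; the
 * compiler fills independent operations, or NOPs for empty slots. */
typedef struct {
    unsigned opcode;       /* NOP when the slot is empty */
    unsigned dest, src1, src2;
} Operation;

typedef struct {
    Operation mem[2];      /* 2 load/store slots    */
    Operation fp[2];       /* 2 FP slots            */
    Operation int_br;      /* 1 integer/branch slot */
} VLIWBundle;              /* all five ops issue in the same cycle */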
Loop Unrolling in VLIW
Recall: the unrolled loop that minimizes stalls for the scalar pipeline
(Source code: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
Register mapping: s = F2, i = R1)

Loop: L.D      F0,0(R1)
      L.D      F6,-8(R1)
      L.D      F10,-16(R1)
      L.D      F14,-24(R1)
      ADD.D    F4,F0,F2
      ADD.D    F8,F6,F2
      ADD.D    F12,F10,F2
      ADD.D    F16,F14,F2
      S.D      F4,0(R1)
      S.D      F8,-8(R1)
      DADDUI   R1,R1,#-32
      S.D      F12,16(R1)
      S.D      F16,8(R1)
      BNE      R1,R2,Loop

VLIW schedule (2 memory, 2 FP and 1 int/branch slot per instruction):

Clock | Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2    | Int. op / branch
  1   | L.D F0,0(R1)       | L.D F6,-8(R1)      |                   |                   |
  2   | L.D F10,-16(R1)    | L.D F14,-24(R1)    |                   |                   |
  3   | L.D F18,-32(R1)    | L.D F22,-40(R1)    | ADD.D F4,F0,F2    | ADD.D F8,F6,F2    |
  4   | L.D F26,-48(R1)    |                    | ADD.D F12,F10,F2  | ADD.D F16,F14,F2  |
  5   |                    |                    | ADD.D F20,F18,F2  | ADD.D F24,F22,F2  |
  6   | S.D 0(R1),F4       | S.D -8(R1),F8      | ADD.D F28,F26,F2  |                   |
  7   | S.D -16(R1),F12    | S.D -24(R1),F16    |                   |                   |
  8   | S.D -32(R1),F20    | S.D -40(R1),F24    |                   |                   | DSUBUI R1,R1,#48
  9   | S.D -0(R1),F28     |                    |                   |                   | BNEZ R1,LOOP

Unrolled 7 iterations to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: need more registers in VLIW (15 vs. 6 in SS)
Problems with 1st Generation VLIW
• Increase in code size
  ▫ Loop unrolling
  ▫ Partially empty VLIWs
• Operated in lock-step; no hazard detection HW
  ▫ A stall in any functional unit pipeline causes the entire processor to stall, since all functional units must be kept synchronized
  ▫ The compiler might predict functional units, but caches are hard to predict
  ▫ Modern VLIWs are “interlocked” (identify dependences between bundles and stall)
• Binary code compatibility
  ▫ Strict VLIW => different numbers of functional units and unit latencies require different versions of the code

VLIW Tradeoffs
• Advantages
  ▫ “Simpler” hardware because the HW does not have to identify independent instructions.
• Disadvantages
  ▫ Relies on a smart compiler
  ▫ Code incompatibility between generations
  ▫ There are limits to what the compiler can do (can’t move loads above branches, can’t move loads above stores)
• Common uses
  ▫ Embedded market, where hardware simplicity is important, applications exhibit plenty of ILP, and binary compatibility is a non-issue.
IA-64 and EPIC
• 64-bit instruction set architecture
  ▫ Not a CPU, but an architecture
  ▫ Itanium and Itanium 2 are CPUs based on IA-64
• Made by Intel and Hewlett-Packard (Itanium 2 and 3 designed in Colorado)
• Uses EPIC: Explicitly Parallel Instruction Computing
• A departure from the x86 architecture
• Meant to achieve out-of-order performance with in-order HW + compiler smarts
  ▫ Stop bits to help with code density
  ▫ Support for control speculation (moving loads above branches)
  ▫ Support for data speculation (moving loads above stores)
      Details in Appendix G.6

Instruction bundle (VLIW)
[Figure: IA-64 instruction bundle format.]
Functional units and template
• Functional units:
  ▫ I (Integer), M (Integer + Memory), F (FP), B (Branch), L + X (64-bit operands + special instructions)
• Template field:
  ▫ Maps instructions to functional units
  ▫ Indicates stops: limitations to ILP

Code example (1/2)
[Figure: IA-64 code example.]
Code example (2/2)
[Figure: IA-64 code example, continued.]

Control Speculation
• Can the compiler schedule an independent load above a branch?
      Bne R1, R2, TARGET
      Ld R3, R4(0)
• What are the problems?
• EPIC provides speculative loads
      Ld.s R3, R4(0)
      Bne R1, R2, TARGET
      Check R4(0)
Data Speculation
• Can the compiler schedule an independent load above a store?
      St R5, R6(0)
      Ld R3, R4(0)
• What are the problems?
• EPIC provides “advanced loads” and an ALAT (Advanced Load Address Table)
      Ld.a R3, R4(0)    creates an entry in the ALAT
      St R5, R6(0)      looks up the ALAT; on a match, jump to fixup code

EPIC Conclusions
• The goal of EPIC was to maintain the advantages of VLIW, but achieve the performance of out-of-order.
• Results:
  ▫ Complicated bundling rules save some space, but make the hardware more complicated
  ▫ Add special hardware and instructions for scheduling loads above stores and branches (new complicated hardware)
  ▫ Add special hardware to remove branch penalties (predication)
  ▫ The end result is a machine as complicated as an out-of-order, but now also requiring a super-sophisticated compiler.
Instruction fetching
• Want to issue >1 instruction every cycle
• This means fetching >1 instruction
  ▫ E.g., 4-8 instructions fetched every cycle
• Several problems
  ▫ Bandwidth / latency
  ▫ Determining which instructions
      Jumps
      Branches
• Integrated instruction fetch unit

Branch Target Buffer (BTB)
• Predicts the next instruction address and sends it out before decoding the instruction
• The PC of the branch is sent to the BTB
• When a match is found, the Predicted PC is returned
• If the branch is predicted taken, instruction fetch continues at the Predicted PC
  (see the lookup sketch below)
Branch Target Buffer (BTB): possible optimizations?
(Same BTB operation as on the previous slide.)

Return Address Predictor
• A small buffer of return addresses acts as a stack (see the stack sketch below)
• Caches the most recent return addresses
• Call ⇒ push a return address on the stack
• Return ⇒ pop an address off the stack & predict it as the new PC
[Figure: misprediction frequency (0-70%) vs. number of return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl and vortex.]
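A minimal C sketch (not from the slides) of the return address stack just described: calls push the return address, returns pop it and predict it as the new PC. The 16-entry size is taken from the largest buffer in the figure.

#include <stdint.h>

/* Sketch of a return address stack with a fixed number of entries.
 * Overflow simply wraps and overwrites the oldest entries. */
#define RAS_ENTRIES 16

static uint64_t ras[RAS_ENTRIES];
static unsigned ras_top;          /* index of the next free slot */

/* Call: push the return address (the instruction after the call). */
void ras_push(uint64_t return_pc) {
    ras[ras_top % RAS_ENTRIES] = return_pc;
    ras_top++;
}

/* Return: pop an address off the stack and predict it as the new PC.
 * Popping an empty stack yields a stale value, i.e. a misprediction. */
uint64_t ras_pop(void) {
    if (ras_top > 0)
        ras_top--;
    return ras[ras_top % RAS_ENTRIES];
}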




Integrated Instruction Fetch Units
• Recent designs have implemented the fetch stage as a separate, autonomous unit
  ▫ Multiple issue in one simple pipeline stage is too complex
• An integrated fetch unit provides:
  ▫ Branch prediction
  ▫ Instruction prefetch
  ▫ Instruction memory access and buffering

Limits to ILP (Chapter 3)
• Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies
• However, such advances, when coupled with realistic hardware, are unlikely to overcome these limits in the near future
• How much ILP is available using existing mechanisms with increasing HW budgets?
Ideal HW Model
1. Register renaming – infinite virtual registers
     All register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
     2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses known & a load can be moved before a store provided the addresses are not equal
     1 & 4 eliminate all but RAW
5. Perfect caches; 1-cycle latency for all instructions; unlimited instructions issued per clock cycle

Upper Limit to ILP: Ideal Machine (Figure 3.1)
[Bar chart, instructions per clock: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1 (Integer: 18 - 60, FP: 75 - 150).]
Instruction window
• The ideal HW would need to see the entire program
• Obviously not practical
  ▫ Register dependency checking scales quadratically with window size
• Window: the set of instructions examined for simultaneous execution
• How does the size of the window affect IPC?
  ▫ Too small a window => can’t see whole loops
  ▫ Too large a window => hard to implement

More Realistic HW: Window Impact (Figure 3.2)
[Bar chart, instructions per clock for window sizes infinite, 2048, 512, 128 and 32, for gcc, espresso, li, fpppp, doduc and tomcatv (Integer: 8 - 63, FP: 9 - 150).]
Thread Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or straight-line code segment
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel (see the threads sketch below)
• Use multiple instruction streams to improve:
  1. Throughput of computers that run many programs
  2. Execution time of a single application implemented as a multi-threaded program (parallel program)

Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
  ▫ Must duplicate independent state of each thread, e.g., a separate copy of the register file, PC and page table
  ▫ Memory shared through virtual memory mechanisms
  ▫ HW for fast thread switch; much faster than a full process switch ≈ 100s to 1000s of clocks
• When to switch?
  ▫ Alternate instructions per thread (fine grain)
  ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
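As a hedged software-level illustration of TLP (not from the slides): two POSIX threads each add s to their own half of an array; the threads are independent, so a multithreaded processor can overlap their execution. The names and sizes are made up for the example.

#include <pthread.h>
#include <stdio.h>

/* Illustration of thread-level parallelism: two independent threads
 * each add a constant to their own half of the array. */
#define N 1000
static double x[N];
static const double s = 3.0;

typedef struct { int lo, hi; } Range;

static void *add_range(void *arg) {
    Range *r = (Range *)arg;
    for (int i = r->lo; i < r->hi; i++)
        x[i] = x[i] + s;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    Range a = {0, N / 2}, b = {N / 2, N};

    pthread_create(&t1, NULL, add_range, &a);
    pthread_create(&t2, NULL, add_range, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("x[0] = %.1f, x[N-1] = %.1f\n", x[0], x[N - 1]);
    return 0;
}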
Fine-Grained Multithreading
• Switches between threads on each instruction
  ▫ Multiple threads interleaved
• Usually in round-robin fashion, skipping stalled threads
• The CPU must be able to switch threads every clock
• Hides both short and long stalls
  ▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads
  ▫ A thread ready to execute without stalls is still delayed by instructions from other threads
• Used on Sun’s Niagara

Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
  ▫ No need for very fast thread switching
  ▫ Doesn’t slow down a thread, since it switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  ▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
  ▫ The new thread must fill the pipeline before instructions can complete
• => Better for reducing the penalty of high-cost stalls, where pipeline refill << stall time
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Can a high-ILP processor also exploit TLP?
  ▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)
  ▫ Intel: Hyper-Threading

Simultaneous Multi-threading
[Figure: issue slots per cycle (cycles 1-9) across 8 units, comparing one thread vs. two threads. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.]
Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  ▫ Large set of virtual registers
      Virtual = not all visible at the ISA level
      Register renaming
  ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
  ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading; slots are filled by Threads 1-5 or left as idle slots.]
Design Challenges in SMT
• SMT makes sense only with fine-grained
  implementation
 ▫ How to reduce the impact on single thread performance?
 ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
 ▫ Instruction issue - more candidate instructions need to
   be considered
 ▫ Instruction completion - choosing which instructions to
   commit may be challenging
• Ensuring that cache and TLB conflicts generated
  by SMT do not degrade performance
TDT 4260 – lecture 4 – 2011
• Contents
  – Computer architecture introduction
      • Trends
      • Moore's law
      • Amdahl's law
      • Gustafson's law
  – Why multiprocessor?                         Chap 4.1
      • Taxonomy
      • Memory architecture
      • Communication
  – Cache coherence                             Chap 4.2
      • The problem
      • Snooping protocols

Updated lecture plan pr. 4/2
 Date and lecturer      Topic
 1: 14 Jan (LN, AI)     Introduction, Chapter 1 / Alex: PfJudge
 2: 21 Jan (IB)         Pipelining, Appendix A; ILP, Chapter 2
 3: 3 Feb (IB)          ILP, Chapter 2; TLP, Chapter 3
 4: 4 Feb (LN)          Multiprocessors, Chapter 4
 5: 11 Feb (MG)         Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
 6: 18 Feb (LN)         Multiprocessors continued
 7: 24 Feb (IB)         Memory and cache, cache coherence (Chap. 5)
 8: 4 Mar (IB)          Piranha CMP + Interconnection networks
 9: 11 Mar (LN)         Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
 10: 18 Mar (IB)        Memory consistency (4.6) + more on memory
 11: 25 Mar (JA, AI)    (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
 12: 7 Apr (IB/LN)      Wrap up lecture, remaining stuff
 13: 8 Apr              Slack – no lecture planned


                      1                                       Lasse Natvig                        2                                                             Lasse Natvig




Trends
• For technology, costs, use
• Help predicting the future
• Product development time
  – 2-3 years
  – → design for the next technology
  – Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system
  – hardware, runtime system, compiler, operating system, and application
  – In networking, this is called the "End to End argument"
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
  – E.g., original RISC projects replaced complex instructions with a compiler + simple instructions


                      3                                       Lasse Natvig                        4                                                             Lasse Natvig




Computer Architecture is Design and Analysis
Architecture is an iterative process:
• Searching the huge space of possible designs
• At all levels of computer systems
[Figure: design cycle – Design, Analysis, Creativity and Cost/Performance Analysis – sorting ideas into Good, Mediocre and Bad Ideas]

TDT4260 Course Focus
Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st Century
[Figure: Computer Architecture (Organization, Hardware/Software Boundary) at the centre of Technology, Parallelism, Programming Languages, Applications, Interface Design (ISA), Compilers, Operating Systems, Measurement & Evaluation, and History]
                      5                                       Lasse Natvig                        6                                                             Lasse Natvig




Holistic approach
• e.g., to programmability combined with performance
• Layers (top to bottom):
  – Energy aware task pool implementation
      (NTNU-principle: teaching based on research; example: PhD-project of Alexandru Iordan: TBP (Wool, TBB))
  – Parallel & concurrent programming
  – Operating System & system software
  – Multicore, interconnect, memory
      (Multicore memory systems: Dybdahl-PhD, Grannæs-PhD, Jahre-PhD, M5-sim, pfJudge)

Moore's Law: 2X transistors / "year"
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• # of transistors / cost-effective integrated circuit double every N months (12 ≤ N ≤ 24)
  (see the worked example below)


                     7                                                          Lasse Natvig                             8                                                                Lasse Natvig
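As a rough worked example of the doubling rule above (my own numbers, not from the slide), the growth over a decade is

\[ 2^{120/N}: \quad N = 12 \Rightarrow 2^{10} = 1024\times, \qquad N = 18 \Rightarrow 2^{6.7} \approx 100\times, \qquad N = 24 \Rightarrow 2^{5} = 32\times \]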




Tracking Technology Performance Trends
• 4 critical implementation technologies:
  – Disks,
  – Memory,
  – Network,
  – Processors
• Compare improvements in performance over time for Bandwidth vs. Latency
• Bandwidth: number of events per unit time
  – E.g., M bits/second over network, M bytes/second from disk
• Latency: elapsed time for a single event
  – E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)
[Figure: log-log plot of relative bandwidth improvement (10-10000) vs. relative latency improvement (1-100) for Processor, Network, Memory and Disk; all lie well above the line "latency improvement = bandwidth improvement". CPU high, memory low ("Memory Wall").]
• Performance Milestones
  – Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
  – Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  – Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  – Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
  – (Processor latency = typical # of pipeline stages * time per clock cycle)


                     9                                                          Lasse Natvig                             10                                                               Lasse Natvig




COST and COTS
• Cost
  – to produce one unit
  – include (development cost / # sold units)
  – benefit of large volume
• COTS
  – commodity off the shelf
      • much better performance/price per component
      • strong influence on the selection of components for building supercomputers for more than 20 years

Speedup
• General definition:
  Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
  – Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uni-processor solution, not the parallel algorithm with p = 1
• Superlinear speedup?
                     11                                                         Lasse Natvig                             12                                                               Lasse Natvig




Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in computation
  – serial fraction s
  – parallel fraction p
  – s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
       = (s + p) / [s + (p/n)]
       = 1 / [s + (1-s)/n]
       = n / [1 + (n-1)s]
• "pessimistic and famous"

Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on parallel computer with n processors is fixed
  – serial fraction s'
  – parallel fraction p'
  – s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
        = (s' + p'n) / (s' + p')
        = s' + p'n = s' + (1-s')n
        = n + (1-n)s'
• Reevaluating Amdahl's law, John L. Gustafson, CACM May 1988, pp 532-533. "Not a new law, but Amdahl's law with changed assumptions"
• (The two formulas are compared numerically in the sketch after this slide)
               13                                                Lasse Natvig                    14                                          Lasse Natvig
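Not part of the slides: a small, self-contained C sketch that evaluates both speedup formulas for n = 100 processors and a few example serial fractions (the fractions are arbitrary illustrative values).

/* Amdahl vs. Gustafson speedup for n = 100 processors (illustrative values).
 * Build and run: cc speedup.c -o speedup && ./speedup
 */
#include <stdio.h>

/* Amdahl (fixed problem size): S(n) = 1 / (s + (1 - s)/n) */
static double amdahl(double s, double n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

/* Gustafson (scaled problem size): S'(n) = s' + (1 - s') * n */
static double gustafson(double s, double n)
{
    return s + (1.0 - s) * n;
}

int main(void)
{
    const double n = 100.0;                      /* number of processors */
    const double serial[] = { 0.01, 0.05, 0.10, 0.25 };

    printf("serial fraction   Amdahl S(100)   Gustafson S'(100)\n");
    for (int i = 0; i < 4; i++)
        printf("     %.2f          %8.2f         %8.2f\n",
               serial[i], amdahl(serial[i], n), gustafson(serial[i], n));
    return 0;
}

For s = 0.05 this gives about 16.8 under Amdahl's fixed-size assumption but 95.05 under Gustafson's scaled-size assumption, which is exactly the contrast the two formulas are meant to show.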




How the serial fraction limits speedup
• Amdahl's law
  [Figure: speedup vs. number of processors for several values of the serial fraction]
• Work hard to reduce the serial part of the application
  – remember IO
  – think different (than traditionally or sequentially)

Single/ILP → Multi/TLP
• Uniprocessor trends
  – Getting too complex
  – Speed of light
  – Diminishing returns from ILP
• Multiprocessor
  – Focus in the textbook: 4-32 CPUs
  – Increased performance through parallelism
  – Multichip
  – Multicore ((Single) Chip Multiprocessors – CMP)
  – Cost effective
• Right balance of ILP and TLP is unclear today
  – Desktop vs. server?

               15                                                Lasse Natvig                    16                                          Lasse Natvig




Other Factors – Multiprocessors
• Growth in data-intensive applications
  – Databases, file servers, multimedia, …
• Growing interest in servers, server performance
• Increasing desktop performance less important
  – Outside of graphics
• Improved understanding of how to use multiprocessors effectively
  – Especially in servers with significant natural TLP
• Advantage of leveraging design investment by replication
  – Rather than unique design
• Power/cooling issues → multicore

Multiprocessor – Taxonomy
• Flynn's taxonomy (1966, 1972)
  – Taxonomy = classification
  – Widely used, but perhaps a bit coarse
• Single Instruction Single Data (SISD)
  – Common uniprocessor
• Single Instruction Multiple Data (SIMD)
  – " = Data Level Parallelism (DLP)"
• Multiple Instruction Single Data (MISD)
  – Not implemented?
  – Pipeline / Stream processing / GPU ?
• Multiple Instruction Multiple Data (MIMD)
  – Used today
  – " = Thread Level Parallelism (TLP)"
               17                                                Lasse Natvig                    18                                          Lasse Natvig




Flynn's taxonomy (1/2)
Single/Multiple Instruction/Data Stream
[Figure: block diagrams of SISD (uniprocessor), SIMD w/distributed memory, and MIMD w/shared memory]

Flynn's taxonomy (2/2), MISD
Single/Multiple Instruction/Data Stream
[Figure: block diagram of MISD (software pipeline)]


             19                                               Lasse Natvig                     20                                                             Lasse Natvig




Advantages to MIMD
• Flexibility
  – High single-user performance, multiple programs, multiple threads
  – High multiple-user performance
  – Combination
• Built using commercial off-the-shelf (COTS) components
  – 2 x Uniprocessor = Multi-CPU
  – 2 x Uniprocessor core on a single chip = Multicore

MIMD: Memory architecture
[Figure: Centralized Memory – processors P1..Pn, each with a cache ($), share memory modules through an interconnection network (IN); Distributed Memory – each processor has its own cache and local memory, and the nodes communicate through the IN]



             21                                               Lasse Natvig                     22                                                             Lasse Natvig




Centralized Memory Multiprocessor
• Also called
  – Symmetric Multiprocessors (SMPs)
  – Uniform Memory Access (UMA) architecture
• Shared memory becomes the bottleneck
• Large caches → a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and by using many memory banks
• Scaling beyond that is hard

Distributed (Shared) Memory Multiprocessor
• Pro: Cost-effective way to scale memory bandwidth
  – If most accesses are to local memory
• Pro: Reduces latency of local memory accesses
• Con: Communication becomes more complex
• Pro/Con: Possible to change software to take advantage of memory that is close, but this can also make SW less portable
  – Non-Uniform Memory Access (NUMA)


             23                                               Lasse Natvig                     24                                                             Lasse Natvig




MP (MIMD), cluster of SMPs
[Figure: two SMP nodes, each with processors, caches, a node interconnection network, memory and I/O, joined by a cluster interconnection network]
• Combination of centralized and distributed
• Like an early version of the kongull-cluster

Distributed memory
1. Shared address space
   • Logically shared, physically distributed
   • Distributed Shared Memory (DSM)
   • NUMA architecture
   [Conceptual model: processors P ... P connected through a network to a single memory M]
2. Separate address spaces
   • Every P-M module is a separate computer
   • Multicomputer
   • Clusters
   • Not a focus in this course
   [Implementation: P+M nodes connected by a network]
                       25                                                                          Lasse Natvig                      26                                                     Lasse Natvig




Communication models
• Shared memory
  – Centralized or Distributed Shared Memory
  – Communication using LOAD/STORE
  – Coordinated using traditional OS methods
      • Semaphores, monitors, etc.
  – Busy-wait more acceptable than for uniprocessors
• Message passing
  – Using send (put) and receive (get)
      • Asynchronous / Synchronous
  – Libraries, standards
      • …, PVM, MPI, …

Limits to parallelism
• We need separate processes and threads!
  – Can't split one thread among CPUs/cores
• Parallel algorithms needed
  – Separate field
  – Some problems are inherently serial
      • P-complete problems
          – Part of parallel complexity theory
      • See minicourse TDT6 - Heterogeneous and green computing
      • http://www.idi.ntnu.no/emner/tdt4260/tdt6
• Amdahl's law
  – Serial fraction of code limits speedup
  – Example: a speedup of 80 with 100 processors requires that at most 0.25% of the time is spent on serial code (worked check below)


                       27                                                                          Lasse Natvig                      28                                                     Lasse Natvig
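A quick check of the 0.25% example above, using Amdahl's formula from slide 13 (my own arithmetic):

\[ S(n) = \frac{n}{1 + (n-1)s} \;\Rightarrow\; 80 = \frac{100}{1 + 99s} \;\Rightarrow\; 1 + 99s = 1.25 \;\Rightarrow\; s = \frac{0.25}{99} \approx 0.0025 = 0.25\% \]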




SMP: Cache Coherence Problem
[Figure: P1, P2 and P3 with private caches share a memory that initially holds u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again (cache hit), (5) P2 reads u (miss).]
• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
  – correct value (if write-through caches)
  – old value (if write-back caches)
• Unacceptable to programs, and frequent!

Enforcing coherence
• Separate caches make multiple copies frequent
  – Migration
      • Moved from shared memory to local cache
      • Speeds up access, reduces memory bandwidth requirements
  – Replication
      • Several local copies when an item is read by several processors
      • Speeds up access, reduces memory contention
• Need coherence protocols to track shared data
  – Directory based
      • Status in shared location (Chap. 4.4)
  – (Bus) snooping
      • Each cache maintains local status
      • All caches monitor the broadcast medium
      • Write invalidate / Write update
                       29                                                                          Lasse Natvig                      30                                                     Lasse Natvig




Snooping: Write invalidate
• Several reads or one write: No change
• Writes require exclusive access
• Writes to shared data: All other cache copies invalidated
  – Invalidate command and address broadcasted
  – All caches listen (snoop) and invalidate if necessary
• Read miss:
  – Write-Through: Memory always up to date
  – Write-Back: Caches listen and any exclusive copy is put on the bus

Snooping: Write update
• Also called write broadcast
• Must know which cache blocks are shared
• Usually Write-Through
  – Write to shared data: Broadcast, all caches listen and update their copy (if any)
  – Read miss: Main memory is up to date
                31                                                     Lasse Natvig                32                                                 Lasse Natvig




Snooping: Invalidate vs. Update
• Repeated writes to the same address (no reads) require several updates, but only one invalidate
• Invalidates are done at cache block level, while updates are done on individual words
• Delay from when a word is written until it can be read is shorter for updates
• Invalidate most common
  – Less bus traffic
  – Less memory traffic
  – Bus and memory bandwidth are the typical bottleneck

An Example Snoopy Protocol
• Invalidation protocol, write-back cache
• Each cache block is in one state (see the state-machine sketch after this slide)
  – Shared: Clean in all caches and up-to-date in memory, the block can be read
  – Exclusive: One cache has the only copy, it is writeable and dirty
  – Invalid: The block contains no data




                33                                                     Lasse Natvig                34                                                 Lasse Natvig
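The following is a minimal, illustrative C model of the three-state write-invalidate protocol described above. It is my own simplification (one cache block, an atomic bus, and a dirty copy written back when another cache misses), not the complete protocol FSM from the textbook.

/* Single-block sketch of a 3-state (Invalid/Shared/Exclusive) write-invalidate
 * snooping protocol. Simplifications: one block, an atomic bus, and a dirty
 * (Exclusive) copy is written back to memory when another cache misses.
 */
#include <stdio.h>

#define NCACHES 3

typedef enum { INVALID, SHARED, EXCLUSIVE } State;

static State state[NCACHES];        /* all blocks start out Invalid */
static int   cached[NCACHES];
static int   memory = 0;            /* block x initially holds 0, as in the example */

static const char *name(State s)
{
    return s == INVALID ? "invalid" : s == SHARED ? "shared" : "exclusive";
}

static int cache_read(int p)
{
    if (state[p] == INVALID) {                    /* read miss goes on the bus */
        for (int q = 0; q < NCACHES; q++)
            if (state[q] == EXCLUSIVE) {          /* snoop hit on a dirty copy */
                memory = cached[q];               /* write it back ...          */
                state[q] = SHARED;                /* ... and downgrade to shared */
            }
        cached[p] = memory;
        state[p] = SHARED;
    }
    return cached[p];                             /* otherwise: read hit */
}

static void cache_write(int p, int value)
{
    for (int q = 0; q < NCACHES; q++)             /* broadcast invalidate */
        if (q != p)
            state[q] = INVALID;
    cached[p] = value;
    state[p] = EXCLUSIVE;                         /* only copy, dirty */
}

int main(void)
{
    printf("P0 reads x = %d\n", cache_read(0));   /* P0: shared                */
    printf("P2 reads x = %d\n", cache_read(2));   /* P0, P2: shared            */
    cache_write(0, 1);                            /* P0: exclusive, P2: invalid */
    printf("P2 reads x = %d\n", cache_read(2));   /* forces write-back, both shared */
    for (int p = 0; p < NCACHES; p++)
        printf("P%d: %s\n", p, name(state[p]));
    return 0;
}

The sequence in main follows the six-step walkthrough on the next slides: P0 reads x, P2 reads x, and P0 then writes x = 1, invalidating the other copy.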




Snooping: Invalidation protocol (1/6)
[Figure: Processors 0..N-1 with private caches on an interconnection network; main memory holds x = 0. Processor 0 issues "read x", which causes a read miss on the network.]

Snooping: Invalidation protocol (2/6)
[Figure: Memory supplies the block; Processor 0 now caches x = 0 in state shared.]
                35                                                     Lasse Natvig                36                                                 Lasse Natvig




Snooping: Invalidation protocol (3/6)
[Figure: Processor 2 issues "read x", which causes a read miss on the network.]

Snooping: Invalidation protocol (4/6)
[Figure: Memory supplies the block; Processor 0 and Processor 2 now both cache x = 0 in state shared.]
              37                                                     Lasse Natvig                 38                                              Lasse Natvig




Snooping: Invalidation protocol (5/6)
[Figure: Processor 0 issues "write x" and broadcasts an invalidate on the network while both Processor 0 and Processor 2 hold x = 0 in state shared.]

Snooping: Invalidation protocol (6/6)
[Figure: Processor 2's copy is invalidated; Processor 0 now holds x = 1 in state exclusive, while main memory still holds the stale value x = 0.]
              39                                                     Lasse Natvig                 40                                              Lasse Natvig




Prefetching

              Marius Grannæs

              Feb 11th, 2011




www.ntnu.no                    M. Grannæs, Prefetching
2

    About Me

          • PhD from NTNU in Computer Architecture in 2010
          • “Reducing Memory Latency by Improving Resource Utilization”
          • Supervised by Lasse Natvig
          • Now working for Energy Micro
          • Working on energy profiling, caching and prefetching
          • Software development




www.ntnu.no                                                  M. Grannæs, Prefetching
3

    About Energy Micro
          • Fabless semiconductor company
          • Founded in 2007 by ex-chipcon founders
          • 50 employees
          • Offices around the world
          • Designing the world's most energy friendly microcontrollers
          • Today: EFM32 Gecko
          • Next Friday: EFM32 Tiny Gecko (cache)
          • May(ish): EFM32 Giant Gecko (cache + prefetching)
          • Ambition: 1% marketshare...
          • of a $30 bn market.



www.ntnu.no                                                   M. Grannæs, Prefetching
4

    What is Prefetching?



       Prefetching
       Prefetching is a technique for predicting future memory accesses and
       fetching the data into the cache before it is referenced




www.ntnu.no                                                    M. Grannæs, Prefetching
5

    The Memory Wall
       [Figure: relative performance (log scale, 1 to 100,000) vs. year, 1980-2010; CPU performance grows far faster than memory performance, and the gap keeps widening]

       W. Wulf and S. McKee, "Hitting the Memory Wall: Implications of
       the Obvious"
www.ntnu.no                                                                 M. Grannæs, Prefetching
6

    A Useful Analogy
          • An Intel Core i7 can execute 147600 Million Instructions per
            second.
          • ⇒ A carpenter can hammer one nail per second.




          • DDR3-1600 RAM can perform 65 Million transfers per second.
          • ⇒ The carpenter must wait 38 minutes per nail (see the arithmetic below).



www.ntnu.no                                                    M. Grannæs, Prefetching
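The 38-minute figure follows directly from the two rates above (my arithmetic, not on the slide):

\[ \frac{147600 \cdot 10^{6}\ \text{instructions/s}}{65 \cdot 10^{6}\ \text{transfers/s}} \approx 2270\ \text{instructions per transfer} \;\Rightarrow\; 2270\ \text{s} \approx 38\ \text{minutes at one nail per second.} \]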
7

    Solution




       Solution outline:
        1 You bring an entire box of nails.
        2 Keep the box close to the carpenter
www.ntnu.no                                     M. Grannæs, Prefetching
8

    Analysis: Carpenting
       How long (on average) does it take to get one nail?

       Nail latency

                         LNail = LBox + pBox   is empty   · (LShop + LTraffic )


                LNail Time to get one nail.
                LBox Time to check and fetch one nail from the box.
        pBox is empty   Probability that the box you have is empty.
               LShop Time to go to the shop (38 minutes).
               LTraffic Time lost due to traffic.

www.ntnu.no                                                                      M. Grannæs, Prefetching
9

    Solution: (For computers)




          • Faster, but smaller memory closer to the processor.
          • Temporal locality
              • If you needed X in the past, you are probably going to need X
                in the near future.
          • Spatial locality
              • If you need X , you probably need X + 1
       ⇒ If you need X, put it in the cache, along with everything else
       close to it (cache line)


www.ntnu.no                                                       M. Grannæs, Prefetching
10

     Analysis: Caches
        System latency

                 LSystem = LCache + pMiss · (LMain Memory + LCongestion )


              LSystem Total system latency.
              LCache Latency of the cache.
                 pMiss Probability of a cache miss.
       LMain Memory Main memory latency.
          LCongestion Latency due to main memory congestion.




www.ntnu.no                                                         M. Grannæs, Prefetching
11

     DRAM in perspective
          • “Incredibly slow” DRAM has a response time of 15.37 ns.
          • Speed of light is 3 · 10^8 m/s.
          • Physical distance from processor to DRAM chips is typically
              20 cm.

                        (2 · 20 · 10^-3 m) / (3 · 10^8 m/s) = 0.13 ns                      (1)

          • Just 2 orders of magnitude!
          • Intel Core i7 - 147600 Million Instructions per second.
          • Ultimate laptop - 5 · 10^50 operations per second/kg.
            Lloyd, Seth, "Ultimate physical limits to computation"



www.ntnu.no                                                      M. Grannæs, Prefetching
12

     When does caching not work?
       The four Cs:
         • Cold/Compulsory:
              • The data has not been referenced before
          • Capacity
              • The data has been referenced before, but has been thrown out,
                because of the limited size of the cache.
          • Conflict
              • The data has been thrown out of a set-associative cache
                because it would not fit in the set.
          • Coherence
              • Another processor (in a multi-processor/core environment) has
                invalidated the cache line.
       We can buy our way out of Capacity and Conflict misses, but not
       Cold or Coherence misses!


www.ntnu.no                                                     M. Grannæs, Prefetching
13

     Cache Sizes
       [Figure: on-chip cache size (kB, log scale from 1 to 10,000) vs. year, 1985-2010, for Intel processors from the 80486DX and Pentium through Pentium Pro, Pentium II, Pentium 4, Core 2 and Core i7; cache sizes grow steadily over the period]




www.ntnu.no                                                                                                   M. Grannæs, Prefetching
14

     Core i7 (Lynnfield) - 2009




www.ntnu.no                      M. Grannæs, Prefetching
15

     Pentium M - 2003




www.ntnu.no             M. Grannæs, Prefetching
16

     Prefetching

       Prefetching increases cache performance by predicting which data will be
       needed and fetching it into the cache before it is referenced. We need to
       know:
          • What to prefetch?
          • When to prefetch?
          • Where to put the data?
          • How do we prefetch? (Mechanism)




www.ntnu.no                                                    M. Grannæs, Prefetching
17

     Prefetching Terminology

       Good Prefetch
       A prefetch is classified as Good if the prefetched block is
       referenced by the application before it is replaced.

       Bad Prefetch
       A prefetch is classified as Bad if the prefetched block is not
       referenced by the application before it is replaced.




www.ntnu.no                                                     M. Grannæs, Prefetching
18

     Accuracy


       The accuracy of a given prefetch algorithm that yields G good
       prefetches and B bad prefetches is calculated as:

       Accuracy = G / (G + B)




www.ntnu.no                                                  M. Grannæs, Prefetching
19

     Coverage


       If a conventional cache has M misses without using any prefetch
       algorithm, the coverage of a given prefetch algorithm that yields G
       good prefetches and B bad prefetches is calculated as:

       Coverage = G / M




www.ntnu.no                                                    M. Grannæs, Prefetching
20

     Prefetching
       System Latency

                 L_system = L_cache + p_miss · (L_main_memory + L_congestion)


          • If a prefetch is good:
               • p_miss is lowered
               • ⇒ L_system decreases
          • If a prefetch is bad:
               • p_miss becomes higher because useful data might be replaced
               • L_congestion becomes higher because of useless traffic
               • ⇒ L_system increases
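
       As a rough illustration of the formula above (all numbers are invented
       for the example, not taken from the lecture), a small C program:

          /* Illustrates L_system = L_cache + p_miss * (L_main_memory + L_congestion).
             All latencies and miss probabilities below are assumed example values. */
          #include <stdio.h>

          static double l_system(double l_cache, double p_miss,
                                 double l_mem, double l_congestion)
          {
              return l_cache + p_miss * (l_mem + l_congestion);
          }

          int main(void)
          {
              printf("baseline:       %.1f cycles\n", l_system(3, 0.050, 200,  0));
              printf("good prefetch:  %.1f cycles\n", l_system(3, 0.025, 200,  0));
              printf("bad prefetch:   %.1f cycles\n", l_system(3, 0.060, 200, 40));
              return 0;
          }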




www.ntnu.no                                                         M. Grannæs, Prefetching
21

     Prefetching Techniques
       Types of prefetching:
         • Software
              •   Special instructions.
              •   Most modern high performance processors have them.
              •   Very flexible.
              •   Can be good at pointer chasing.
              •   Requires compiler or programmer effort.
              •   Processor executes prefetches instead of computation.
              •   Static (performed at compile-time).
          • Hardware
          • Hybrid




www.ntnu.no                                                       M. Grannæs, Prefetching
21

     Prefetching Techniques
       Types of prefetching:
          • Software
          • Hardware
              • Dedicated hardware analyzes memory references.
              • Most modern high performance processors have them.
              • Fixed functionality.
              • Requires no effort by the programmer or compiler.
              • Off-loads prefetching to hardware.
              • Dynamic (performed at run-time)
          • Hybrid




www.ntnu.no                                                   M. Grannæs, Prefetching
21

     Prefetching Techniques

       Types of prefetching:
          • Software
          • Hardware
          • Hybrid
              • Dedicated hardware unit.
              • Hardware unit programmed by software.
              • Some effort required by the programmer or compiler.




www.ntnu.no                                                      M. Grannæs, Prefetching
22

     Software Prefetching
               for (i = 0; i < 10000; i++) {
                   acc += data[i];
               }

                      MOV    r1, #0          ;   acc = 0
                      MOV    r0, #0          ;   i = 0
               Label: LOAD   r2, r0(#data)   ;   cache miss! (~400 cycles)
                      ADD    r1, r2          ;   acc += data[i]
                      INC    r0              ;   i++
                      CMP    r0, #10000      ;   i < 10000
                      BL     Label           ;   branch if less




www.ntnu.no                                               M. Grannæs, Prefetching
23

     Software Prefetching II
               for (i = 0; i < 10000; i++) {
                   acc += data[i];
               }

       Simple optimization using __builtin_prefetch():
               for (i = 0; i < 10000; i++) {
                   __builtin_prefetch(&data[i + 10]);
                   acc += data[i];
               }

       Why add 10 (and not 1)?
       Prefetch distance: the memory latency is much larger than the computation
       latency of one iteration, so we have to fetch several iterations ahead
       (see the sketch below).
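
       One way to pick the distance is to cover the memory latency with enough
       loop iterations. A minimal sketch; the two latency constants are assumed
       example numbers, not measurements:

          /* Prefetch distance ≈ memory latency / work per iteration (rounded up).
             MEM_LATENCY_CYCLES and CYCLES_PER_ITERATION are assumptions. */
          #define MEM_LATENCY_CYCLES   400
          #define CYCLES_PER_ITERATION  40
          #define PREFETCH_DISTANCE \
              ((MEM_LATENCY_CYCLES + CYCLES_PER_ITERATION - 1) / CYCLES_PER_ITERATION)

          static int sum(const int *data, int n)
          {
              int acc = 0;
              for (int i = 0; i < n; i++) {
                  if (i + PREFETCH_DISTANCE < n)   /* do not prefetch past the array */
                      __builtin_prefetch(&data[i + PREFETCH_DISTANCE]);
                  acc += data[i];
              }
              return acc;
          }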


www.ntnu.no                                                                     M. Grannæs, Prefetching
24

     Software Prefetching III
               for (i = 0; i < 10000; i++) {
                   __builtin_prefetch(&data[i + 10]);
                   acc += data[i];
               }

       Note:
         • data[0] → data[9] will not be prefetched.
         • data[10000] → data[10009] will be prefetched, but not used.

                 Accuracy = G / (G + B) = 9990 / 10000 = 0.999 = 99.9%

                 Coverage = G / M = 9990 / 10000 = 0.999 = 99.9%



www.ntnu.no                                                                     M. Grannæs, Prefetching
25

     Complex Software
               for (i = 0; i < 10000; i++) {
                   __builtin_prefetch(&data[i + 10]);
                   if (someFunction(i) == True) {
                       acc += data[i];
                   }
               }

       Does prefetching pay off in this case?
         • How many times is someFunction(i) true?
         • How much memory bus traffic does someFunction(i) generate?
         • Does power matter?
       We have to profile the program to know!

www.ntnu.no                                                                     M. Grannæs, Prefetching
26

     Dynamic Data Structures I

           typedef struct node {
               int          data;
               struct node *next;
           } node_t;

               while ((node = node->next) != NULL) {
                   acc += node->data;
               }




www.ntnu.no                                                       M. Grannæs, Prefetching
27

     Dynamic Data Structures II

           typedef struct node {
               int          data;
               struct node *next;
               struct node *jump;
           } node_t;

               while ((node = node->next) != NULL) {
                   __builtin_prefetch(node->jump);
                   acc += node->data;
               }
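
       The slide assumes the jump pointers are already in place. One way to set
       them up is a second pass over the list; this is only a sketch, and the
       distance of 4 nodes is an arbitrary assumption:

          /* Make each node's jump pointer point JUMP_DIST nodes ahead, so that
             __builtin_prefetch(node->jump) fetches data needed a few iterations
             from now. JUMP_DIST = 4 is an example value. */
          #define JUMP_DIST 4

          static void build_jump_pointers(node_t *head)
          {
              node_t *ahead = head;
              for (int i = 0; i < JUMP_DIST && ahead != NULL; i++)
                  ahead = ahead->next;

              for (node_t *n = head; n != NULL; n = n->next) {
                  n->jump = ahead;            /* NULL near the end of the list */
                  if (ahead != NULL)
                      ahead = ahead->next;
              }
          }

       Prefetching a NULL pointer is harmless on common architectures; the
       prefetch is simply dropped.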




www.ntnu.no                                                           M. Grannæs, Prefetching
28

     Hardware Prefetching
       Software prefetching:
          • Needs programmer (or compiler) effort to implement
          • Prefetch instructions displace computation
          • Compile-time (static)
          • Very flexible
       Hardware prefetching:
          • No programmer effort
          • Does not displace compute instructions
          • Run-time (dynamic)
          • Not flexible



www.ntnu.no                                          M. Grannæs, Prefetching
29

     Sequential Prefetching

       The simplest prefetcher, but surprisingly effective due to spatial
       locality.

       Sequential Prefetching
       Miss on address X ⇒ fetch X+n, X+n+1, ..., X+n+j

       n  Prefetch distance
       j  Prefetch degree
       Collectively known as prefetch aggressiveness.
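
       A minimal software model of the idea; issue_prefetch() is a hypothetical
       hook into the memory system, not part of any real simulator API:

          /* Sequential prefetching: on a miss to block x, issue j prefetches
             starting n blocks ahead (X+n, X+n+1, ...). */
          typedef unsigned long block_addr_t;

          static void issue_prefetch(block_addr_t block)
          {
              (void)block;                  /* hook into the memory system */
          }

          static void on_cache_miss(block_addr_t x, int n, int j)
          {
              for (int k = 0; k < j; k++)
                  issue_prefetch(x + n + k);
          }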




www.ntnu.no                                                     M. Grannæs, Prefetching
30

     Sequential Prefetching II
       [Figure: speedup (1 to 5) of sequential prefetching on the benchmarks
       libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                M. Grannæs, Prefetching
31

     Reference Prediction Tables
       Tien-Fu Chen and Jean-Loup Baer (1995)
          • Builds upon sequential prefetching, stride directed prefetching.
          • Observation: Non-unit strides in many applications
              • 2, 4, 6, 8, 10 (stride 2)
          • Observation: Each load instruction has a distinct access
              pattern
       Reference Prediction Tables (RPT):
           • Table indexed by the load instruction
          • Simple state machine
          • Store a single delta of history.
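
       A much-simplified sketch of the table and its state machine (table size,
       the hashing by PC and the issue_prefetch() stub are assumptions; the real
       Chen & Baer design has more states):

          #include <stdint.h>

          #define RPT_SIZE 64

          enum rpt_state { RPT_INIT, RPT_TRAIN, RPT_PREFETCH };

          struct rpt_entry {
              uint64_t       pc;          /* load instruction that owns the entry */
              uint64_t       last_addr;   /* last address accessed by that load   */
              int64_t        delta;       /* single delta of history              */
              enum rpt_state state;
          };

          static struct rpt_entry rpt[RPT_SIZE];

          static void issue_prefetch(uint64_t addr) { (void)addr; /* memory hook */ }

          static void rpt_access(uint64_t pc, uint64_t addr)
          {
              struct rpt_entry *e = &rpt[pc % RPT_SIZE];

              if (e->pc != pc) {                       /* new load: (re)initialize */
                  e->pc = pc; e->last_addr = addr; e->delta = 0; e->state = RPT_INIT;
                  return;
              }

              int64_t delta = (int64_t)(addr - e->last_addr);
              if (e->state != RPT_INIT && delta == e->delta) {
                  e->state = RPT_PREFETCH;             /* stride seen twice: prefetch */
                  issue_prefetch(addr + delta);
              } else {
                  e->state = RPT_TRAIN;                /* remember the new candidate stride */
                  e->delta = delta;
              }
              e->last_addr = addr;
          }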




www.ntnu.no                                                     M. Grannæs, Prefetching
32

     Reference Prediction Tables
              Cache Miss:




                  PC        Last Addr.   Delta   State




                Initial           Training       Prefetch




www.ntnu.no                                         M. Grannæs, Prefetching
33

     Reference Prediction Tables
        Cache Miss:
                    1



                PC      Last Addr.   Delta   State




              Initial         Training       Prefetch




www.ntnu.no                                     M. Grannæs, Prefetching
34

     Reference Prediction Tables
        Cache Miss:
                    1

              100          1              --    Init
                PC      Last Addr.   Delta     State




              Initial          Training        Prefetch




www.ntnu.no                                       M. Grannæs, Prefetching
35

     Reference Prediction Tables
        Cache Miss:
                    1 3

              100          3          2       Train
                PC      Last Addr.   Delta    State




              Initial          Training            Prefetch




www.ntnu.no                                           M. Grannæs, Prefetching
36

     Reference Prediction Tables
        Cache Miss:
                    1 3 5

              100          5          2       Prefetch
                PC      Last Addr.   Delta    State




              Initial          Training           Prefetch




www.ntnu.no                                          M. Grannæs, Prefetching
37

     Reference Prediction Tables
       [Figure: speedup (1 to 5) of Sequential vs. RPT prefetching on libquantum,
       milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                M. Grannæs, Prefetching
38

     Global History Buffer

       K. Nesbit, A. Dhodapkar and J. Smith (2004)
          • Observation: Predicting more complex patterns requires more
              history
          • Observation: A lot of history in the RPT is very old
       Program Counter/Delta Correlation (PC/DC)
          • Store all misses in a FIFO called Global History Buffer (GHB)
          • Linked list of all misses from one load instruction
          • Traversing linked list gives a history for that load
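
       A sketch of the two structures (sizes, the PC hash and the handling of
       FIFO wrap-around are all simplifications/assumptions):

          #include <stdint.h>

          #define GHB_SIZE 256
          #define IT_SIZE   64

          struct ghb_entry {
              uint64_t addr;    /* miss address */
              int      prev;    /* index of the previous miss from the same PC, or -1 */
          };

          static struct ghb_entry ghb[GHB_SIZE];
          static int ghb_head;               /* next FIFO slot to overwrite */
          static int index_table[IT_SIZE];   /* PC hash -> most recent GHB index, or -1 */

          static void ghb_init(void)
          {
              for (int i = 0; i < IT_SIZE;  i++) index_table[i] = -1;
              for (int i = 0; i < GHB_SIZE; i++) ghb[i].prev    = -1;
          }

          static void ghb_insert(uint64_t pc, uint64_t addr)
          {
              int slot = ghb_head;
              ghb_head = (ghb_head + 1) % GHB_SIZE;

              ghb[slot].addr = addr;
              ghb[slot].prev = index_table[pc % IT_SIZE];  /* chain to previous miss */
              index_table[pc % IT_SIZE] = slot;
          }

          /* Walk the per-PC chain and return up to n deltas, most recent first.
             (A real GHB must also detect links that point at overwritten entries.) */
          static int ghb_deltas(uint64_t pc, int64_t *deltas, int n)
          {
              int count = 0;
              int i = index_table[pc % IT_SIZE];
              while (i >= 0 && ghb[i].prev >= 0 && count < n) {
                  deltas[count++] = (int64_t)(ghb[i].addr - ghb[ghb[i].prev].addr);
                  i = ghb[i].prev;
              }
              return count;
          }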




www.ntnu.no                                                        M. Grannæs, Prefetching
39

     Global History Buffer
       [Figure: an Index Table (PC → pointer) and the Global History Buffer
       (address, pointer). The GHB holds the miss addresses 1 and 3, and the
       entry for PC 100 points at the most recent of them. The Delta Buffer is
       still empty.]
www.ntnu.no                                         M. Grannæs, Prefetching
40

     Global History Buffer
       [Figure: a new miss to address 5 from PC 100 is inserted at the head of
       the GHB and chained to the earlier misses 3 and 1 for the same PC. The
       Delta Buffer is still empty.]
www.ntnu.no                                         M. Grannæs, Prefetching
41

     Global History Buffer
       [Figure: walking the per-PC chain yields the miss history 5, 3, 1; the
       first delta, 5 − 3 = 2, is written to the Delta Buffer.]
www.ntnu.no                                           M. Grannæs, Prefetching
46

     Global History Buffer
       [Figure: continuing the walk gives the second delta, 3 − 1 = 2, so the
       Delta Buffer now holds the deltas 2, 2.]
www.ntnu.no                                           M. Grannæs, Prefetching
47

     Delta Correlation


          • In the previous example, the delta buffer only contained two
              values (2,2).
          • Thus it is easy to guess that the next delta is also 2.
          • We can then prefetch: Current address + Delta = 5 + 2 = 7

       What if the pattern is repeating, but not regular?
       1, 2, 3, 4, 5, 1, 2, 3, 4, 5
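
       A sketch of the matching step (history length and the prediction limit
       are assumptions): find the most recent earlier occurrence of the last two
       deltas and replay what followed it.

          /* hist[0] is the oldest delta, hist[n-1] the newest.
             Returns the number of predicted deltas written to pred[]. */
          static int predict_next_deltas(const long *hist, int n,
                                         long *pred, int max_pred)
          {
              if (n < 3)
                  return 0;

              long d1 = hist[n - 2], d2 = hist[n - 1];     /* correlation key */

              for (int i = n - 3; i >= 1; i--) {           /* search backwards */
                  if (hist[i - 1] == d1 && hist[i] == d2) {
                      int count = 0;
                      for (int j = i + 1; j < n && count < max_pred; j++)
                          pred[count++] = hist[j];         /* replay what followed */
                      return count;
                  }
              }
              return 0;                                    /* no match: no prediction */
          }

       For the delta history 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 above, the last pair
       (4, 5) also occurred five deltas earlier, so the predicted deltas are
       1, 2, 3, 4, 5.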




www.ntnu.no                                                      M. Grannæs, Prefetching
48

     Delta Correlation


               Address stream:   10   11   13   16   17   19   22   23   25
               Deltas:              1    2    3    1    2    3    1    2

               The most recent delta pair (1, 2) is matched against its previous
               occurrence in the stream, and the deltas that followed it (3, 1, 2)
               become the prediction.




www.ntnu.no                                                                   M. Grannæs, Prefetching
49

     PC/DC
       [Figure: speedup (1 to 5) of Sequential, RPT and PC/DC prefetching on
       libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                M. Grannæs, Prefetching
58

     Data Prefetching Championships

          • Organized by JILP
          • Held in conjunction with HPCA’09
           • Modelled on the earlier branch prediction championships
          • Everyone uses the same API (six function calls)
          • Same set of benchmarks
          • Third party evaluates performance
          • 20+ prefetchers submitted

       http://www.jilp.org/dpc/




www.ntnu.no                                                   M. Grannæs, Prefetching
59

     Delta Correlating Prediction Tables
          • Our submission to DPC-1
          • Observation: GHB pointer chasing is expensive.
          • Observation: History doesn’t really get old.
          • Observation: History would reach a steady state.
          • Observation: Deltas are typically small, while the address
            space is large.
          • Table indexed by the PC of the load
          • Each entry holds the history of the load in the form of deltas.
          • Delta Correlation
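
       A sketch of a table entry and its update (table and buffer sizes are
       assumptions; the delta-correlation step is the one sketched earlier):

          #include <stdint.h>

          #define DCPT_ENTRIES 128
          #define DCPT_DELTAS    6

          struct dcpt_entry {
              uint64_t pc;
              uint64_t last_addr;
              uint64_t last_prefetch;        /* last address issued as a prefetch */
              int64_t  delta[DCPT_DELTAS];   /* circular buffer of recent deltas  */
              int      head;                 /* next position to write            */
          };

          static struct dcpt_entry dcpt[DCPT_ENTRIES];

          static void dcpt_access(uint64_t pc, uint64_t addr)
          {
              struct dcpt_entry *e = &dcpt[pc % DCPT_ENTRIES];

              if (e->pc != pc) {                         /* PC conflict: take over the entry */
                  e->pc = pc;
                  e->last_addr = addr;
                  e->last_prefetch = 0;
                  e->head = 0;
                  for (int i = 0; i < DCPT_DELTAS; i++)
                      e->delta[i] = 0;
                  return;
              }

              e->delta[e->head] = (int64_t)(addr - e->last_addr);
              e->head = (e->head + 1) % DCPT_DELTAS;
              e->last_addr = addr;

              /* Then: run delta correlation over e->delta[], issue the predicted
                 addresses, and record the last one in e->last_prefetch so the
                 same prefetches are not re-issued on the next access. */
          }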




www.ntnu.no                                                     M. Grannæs, Prefetching
60

     Delta Correlating Prefetch Tables




              PC   Last Addr.   Last Pref.   D   D   D   D   D     D     Ptr




www.ntnu.no                                                      M. Grannæs, Prefetching
61

     Delta Correlating Prefetch Tables


              10




                   100      10            -        -   -   -   -   -     -     -

                   PC    Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                            M. Grannæs, Prefetching
62

     Delta Correlating Prefetch Tables


              10     11




                   100       10            -        -   -   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
63

     Delta Correlating Prefetch Tables


              10     11




                   100       10            -        1   -   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
64

     Delta Correlating Prefetch Tables


              10     11




                   100       11            -        1   -   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
66

     Delta Correlating Prefetch Tables


              10     11   13




                   100         13          -        1   2   -   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
67

     Delta Correlating Prefetch Tables


              10     11   13    16




                   100         16          -        1   2   3   -   -     -     -

                   PC     Last Addr.   Last Pref.   D   D   D   D   D    D     Ptr




www.ntnu.no                                                             M. Grannæs, Prefetching
68

     Delta Correlating Prefetch Tables


              10     11   13    16     17   19       22




                   100         22           -             1   2   3   1   2    3      -

                   PC     Last Addr.    Last Pref.    D       D   D   D   D    D     Ptr




www.ntnu.no                                                                   M. Grannæs, Prefetching
69

     Delta Correlating Prefetch Tables
            [Figure: speedup (1 to 5) of Sequential, RPT, PC/DC and DCPT
            prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]




www.ntnu.no                                                                                     M. Grannæs, Prefetching
70

     DPC-1 Results
          1   Access Map Pattern Matching
          2   Global History Buffer - Local Delta Buffer
          3   Prefetching based on a Differential Finite Context Machine
          4   Delta Correlating Prediction Tables



       What did the winning entries do differently?
          • AMPM - Massive reordering to expose more patterns.
          • GHB-LDB and PDFCM - Prefetch into the L1.




www.ntnu.no                                                     M. Grannæs, Prefetching
71

     Access Map Pattern Matching
          •   Winning entry by Ishii et al.
          •   Divides memory into hot zones
           •   Each zone is tracked using a 2-bit vector
          •   Examines each zone for constant strides
          •   Ignores temporal information




       Lesson
       Modern processors and compilers reorder loads, so the temporal order of
       misses can be misleading.

www.ntnu.no                                                  M. Grannæs, Prefetching
72

     Global History Buffer - Local Delta Buffer

          • Second place by Dimitrov et al.
          • Somewhat similar to DCPT
          • Improves PC/DC prefetching by including global correlation
          • Most common stride
          • Prefetches directly into the L1

       Lesson
       Prefetching into the L1 gives an extra performance boost.
       Using the most common stride helps.




www.ntnu.no                                                  M. Grannæs, Prefetching
73
     Prefetching based on a Differential Finite
     Context Machine
          • Third place by Ramos et al.
          • Table with the most recent history for each load.
          • A hash of the history is computed and used to look up into a
              table containing the predicted stride
          • Repeat process to increase prefetching degree/distance
          • Separate prefetcher for L1

       Lesson
       Feedback to adjust prefetching degree/prefetching distance
       Prefetch into the L1



www.ntnu.no                                                     M. Grannæs, Prefetching
74

     Improving DCPT


       Partial Matching
       Technique for handling reordering, common strides, etc.

       L1 Hoisting
       Technique for handling L1 prefetching




www.ntnu.no                                                     M. Grannæs, Prefetching
75

     Partial Matching

          • AMPM ignores all temporal information
          • Reordering the delta history is very expensive:
              reordering 5 accesses gives 5! = 120 possibilities
          • Solution: Reduce spatial resolution by ignoring low bits

       Example delta stream
       8, 9, 10, 8, 10, 9
       ⇒ (ignore the lower 2 bits)
       8, 8, 8, 8, 8, 8
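
       A tiny sketch of the comparison with the low bits masked off (the 2-bit
       mask follows the example; everything else is an assumption):

          #define DELTA_MASK (~(long)0x3)      /* ignore the lower 2 bits */

          static int deltas_match(long a, long b)
          {
              return (a & DELTA_MASK) == (b & DELTA_MASK);
          }

       With this comparison the stream 8, 9, 10, 8, 10, 9 matches a constant
       stride of 8.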




www.ntnu.no                                                    M. Grannæs, Prefetching
76

     L1 Hoisting

           • All three top entries had mechanisms for prefetching into the L1
           • Problem: Pollution
           • Solution: Use the same highly accurate mechanism to
               prefetch into the L1.
           • In the steady state, only the last predicted delta will be used.
           • All other deltas have already been prefetched and are either in
               the L2 or on their way.
           • Hoist the first delta from the L2 to the L1 to increase
               performance.




www.ntnu.no                                                      M. Grannæs, Prefetching
77

     L1 Hoisting II


       Example delta stream
       2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3

       Steady state
       Prefetch the last delta into L2
       Hoist the first delta into L1
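
       A sketch of the steady-state issue step; the two issue_* functions are
       hypothetical hooks into the L1 and L2, and the predicted deltas come from
       the delta-correlation step:

          static void issue_prefetch_l2(unsigned long addr) { (void)addr; /* L2 hook */ }
          static void issue_prefetch_l1(unsigned long addr) { (void)addr; /* L1 hook */ }

          static void steady_state_prefetch(unsigned long addr,
                                            const long *pred, int n)
          {
              if (n == 0)
                  return;

              unsigned long last = addr;
              for (int i = 0; i < n; i++)
                  last += pred[i];
              issue_prefetch_l2(last);             /* only the last prediction is new to L2 */

              issue_prefetch_l1(addr + pred[0]);   /* hoist the next address into L1 */
          }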




www.ntnu.no                                       M. Grannæs, Prefetching
78

     DCPT-P

            [Figure: speedup (0 to 7) of DCPT-P, AMPM, GHB-LDB, PDFCM, RPT and
            PC/DC on milc, GemsFDTD, libquantum, leslie3d, lbm and sphinx3.]




www.ntnu.no                                                                                     M. Grannæs, Prefetching
79

     Interaction with the memory controller


          • So far we’ve talked about what to prefetch (address)
          • When and how to prefetch are equally important
          • Modern DRAM is complex
          • Modern DRAM controllers are even more complex
          • Bandwidth limited




www.ntnu.no                                                   M. Grannæs, Prefetching
80

     Modern DRAM

          • Can have multiple independent memory controllers
          • Can have multiple channels per controller
          • Typically multiple banks
          • Each bank contains several pages (rows) of data (typically 1k–8k)
          • An accessed page is placed in a single page buffer
          • Access time to the page buffer is much lower than a full access




www.ntnu.no                                                    M. Grannæs, Prefetching
81

     The 3D structure of modern DRAM




www.ntnu.no                        M. Grannæs, Prefetching
82

     Example



       Suppose a processor requires data at locations X1 and X2 that are
       located on the same page at times T1 and T2 .
       There are two separate outcomes:




www.ntnu.no                                                  M. Grannæs, Prefetching
87

     Case 1:
       The requests occur at roughly the same time:
          1   Read 1 (T1 ) enters the memory controller
          2   The page is opened
          3   Read 2 (T2 ) enters the memory controller
          4   Data X1 is returned from DRAM
          5   Data X2 is returned from DRAM
          6   The page is closed
       Although there are two separate reads, the page is only opened
       once.




www.ntnu.no                                                 M. Grannæs, Prefetching
88

     Case 2:
       The requests are separated in time:
         1 Read 1 (T1 ) enters the memory controller
         2 The page is opened
         3 Data X1 is returned from DRAM
         4 The page is closed
         5 Read 2 (T2 ) enters the memory controller
         6 The page is opened again
         7 Data X2 is returned from DRAM
         8 The page is closed
       The page is opened and closed twice. By prefetching X2 while the page is
       still open we can increase performance, both by reducing latency and by
       increasing memory throughput.


www.ntnu.no                                              M. Grannæs, Prefetching
89

     When does prefetching pay off?
       The break-even point:


       Prefetching Accuracy · Cost of Prefetching = Cost of Single Read


       What is the cost of prefetching?
         • Application dependent
        • Less than the cost of a single read, because:
              •   Able to utilize open pages
              •   Reduce latency
              •   Increase throughput
              •   Multiple banks
              •   Lower latency


www.ntnu.no                                                M. Grannæs, Prefetching
90

     Performance vs. Accuracy
            [Figure: prefetcher accuracy (0–100%) plotted against IPC improvement
            (−40% to +60%) for Sequential, Scheduled Region, CZone/Delta Correlation
            and Reference Prediction Tables prefetching, with a threshold line.]




www.ntnu.no                                                                           M. Grannæs, Prefetching
91

     Q&A



       Thank you for listening!




www.ntnu.no                       M. Grannæs, Prefetching
TDT 4260 – lecture 17/2

• Contents
   – Cache coherence                        Chap 4.2
      • Repetition
      • Snooping protocols
• SMP performance                           Chap 4.3
   – Cache performance
• Directory based cache coherence           Chap 4.4
• Synchronization                           Chap 4.5
• UltraSPARC T1 (Niagara)                   Chap 4.8

                   1                                                           Lasse Natvig

Updated lecture plan pr. 17/2

Date and lecturer        Topic
1: 14 Jan (LN, AI)       Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB)           Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB)            ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN)            Multiprocessors, Chapter 4
5: 11 Feb (MG)           Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN, MJ)       Multiprocessors continued // Writing a comp.arch. paper (relevant for miniproject, by MJ)
7: 24 Feb (IB)           Memory and cache, cache coherence (Chap. 5)
8: 3 Mar (IB)            Piranha CMP + Interconnection networks
9: 11 Mar (LN)           Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore ... Fedorova ... asymmetric multicore ...
10: 18 Mar (IB)          Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI)      (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN)        Wrap up lecture, remaining stuff
13: 8 Apr                Slack – no lecture planned

                   2                                                           Lasse Natvig




Miniproject groups, updates?

Rank        Prefetcher                 Group                      Score
1           rpt64k4_pf                 Farfetched                 1.089
2           rpt_prefetcher_rpt_seq     L2Detour                   1.072
3           teeest                     Group 6                    1.000

                   3                                                           Lasse Natvig

IDI Open, a challenge for you?

• http://events.idi.ntnu.no/open11/
• 2 April, programming contest, informal, fun, pizza,
  coke (?), party (?), 100–150 people, mostly
  students, low threshold
• Teams: 3 persons, one PC, Java, C/C++ ?
• Problems: Some simple, some tricky
• Our team ”DM-gruppas beskjedne venner” is
  challenging you students!
   – And we will challenge some of the ICT companies in
     Trondheim

                   4                                                           Lasse Natvig




SMP: Cache Coherence Problem

   [Figure: three processors P1, P2, P3 with private caches and a shared
   memory holding u:5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes
   u = 7, (4) P1 reads u again (cache hit), (5) P2 reads u (cache miss).]

• Processors see different values for u after event 3
• Old (stale) value read in event 4 (hit)
• Event 5 (miss) reads
   – correct value (if write-through caches)
   – old value (if write-back caches)
• Unacceptable to programs, and frequent!

                   5                                                           Lasse Natvig

Enforcing coherence (recap)

• Separate caches speed up access
   – Migration
      • Moved from shared memory to local cache
   – Replication
      • Several local copies when item is read by several
• Need coherence protocols to track shared data
   – (Bus) snooping
      • Each cache maintains local status
      • All caches monitor broadcast medium
      • Write invalidate / Write update

                   6                                                           Lasse Natvig
State Machine (1/3)

State machine for CPU requests, for each cache block
(states: Invalid, Shared (read only), Exclusive (read/write)):

• Invalid   – CPU read miss: place read miss on bus → Shared
            – CPU write: place write miss on bus → Exclusive
• Shared    – CPU read hit: no action
            – CPU read miss: place read miss on bus (stays Shared)
            – CPU write (miss ⇒ write miss on bus, hit ⇒ invalidate on bus) → Exclusive
• Exclusive – CPU read hit / CPU write hit: no action
            – CPU read miss: write back block, place read miss on bus → Shared
            – CPU write miss: write back cache block, place write miss on bus (stays Exclusive)

                   7                                                           Lasse Natvig

State Machine (2/3)

State machine for bus requests, for each cache block:

• Shared    – write miss / invalidate for this block → Invalid
• Exclusive – write miss for this block: write back block (abort memory access) → Invalid
            – read miss for this block: write back block (abort memory access) → Shared

                   8                                                           Lasse Natvig
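
A much-simplified sketch in C of the snooping (MSI) protocol in the two state
machines above; bus messages are only named in comments, not modelled:

   #include <stdbool.h>

   enum coh_state { INVALID, SHARED, EXCLUSIVE };

   /* CPU-side requests: returns the new state of the block. */
   static enum coh_state cpu_request(enum coh_state s, bool is_write, bool hit)
   {
       if (!is_write) {
           /* read miss: place read miss on bus (write back first if EXCLUSIVE) */
           return hit ? s : SHARED;
       }
       /* write: place write miss on the bus, or an invalidate on a SHARED hit */
       return EXCLUSIVE;
   }

   /* Bus-side requests: another cache's miss to this block is observed. */
   static enum coh_state bus_request(enum coh_state s, bool other_is_write)
   {
       if (s == EXCLUSIVE) {
           /* write back the block and abort the memory access */
       }
       if (other_is_write)
           return INVALID;                    /* write miss / invalidate for this block */
       return (s == EXCLUSIVE) ? SHARED : s;  /* read miss: downgrade EXCLUSIVE */
   }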




State Machine (3/3)

• Combined state machine, for each cache block, for both CPU requests and
  bus requests – the transitions of (1/3) and (2/3) in one diagram:
   – Invalid → Shared on a CPU read miss (place read miss on bus)
   – Invalid → Exclusive on a CPU write (place write miss on bus)
   – Shared → Exclusive on a CPU write (write miss on bus; invalidate on a hit)
   – Shared / Exclusive → Invalid on a write miss or invalidate from the bus
     (Exclusive first writes the block back, aborting the memory access)
   – Exclusive → Shared on a CPU read miss or a read miss seen on the bus
     (write back the block)
   – CPU read hit / CPU write hit in Exclusive, and CPU read hit in Shared:
     no state change

                   9                                                           Lasse Natvig

Directory based cache coherence (1/2)

• Large MP systems, lots of CPUs
• Distributed memory preferable
   – Increases memory bandwidth
• Snooping bus with broadcast?
   – A single bus becomes a bottleneck
   – Other ways of communicating needed
      • With these, broadcasting is hard/expensive
   – Can avoid broadcast if we know exactly which caches
     have a copy ⇒ Directory

                  10                                                           Lasse Natvig




Directory based cache coherence (2/2)
• Directory knows which blocks are in which cache and their state
• Directory can be partitioned and distributed
• Typical states:
   – Shared
   – Uncached
   – Modified
• Protocol based on messages
• Invalidate and update sent only where needed
   – Avoids broadcast, reduces traffic
(Fig 4.19)

SMP performance (shared memory)
• Focus on cache performance
• 3 types of cache misses in uniprocessor (3 C's)
   – Capacity     (too small for working set)
   – Compulsory   (cold-start)
   – Conflict     (placement strategy)
• Multiprocessors also give coherence misses
   – True sharing
      • Misses because of sharing of data
   – False sharing
      • Misses because of invalidates that would not have happened with cache block size = one word
        (a C sketch of false sharing follows below)
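A minimal C sketch of false sharing (my own illustration, not from the slides): two threads update
different counters that happen to sit in the same cache block, so each write invalidates the block in
the other core's cache even though no data is logically shared. The 64-byte block size used for the
padding is an assumption.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 50000000UL

    /* Two counters in the same cache block: writes by one thread invalidate
     * the block in the other thread's cache (false sharing). */
    struct { long a, b; } shared_block;

    /* Padded version: each counter gets its own (assumed) 64-byte block. */
    struct { long a; char pad[56]; long b; } padded_block;

    static void *bump(void *arg) {
        long *p = (long *)arg;
        for (unsigned long i = 0; i < ITERS; i++) (*p)++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        /* Swap in &padded_block.a / &padded_block.b to compare running times. */
        pthread_create(&t1, NULL, bump, &shared_block.a);
        pthread_create(&t2, NULL, bump, &shared_block.b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", shared_block.a, shared_block.b);
        return 0;
    }

With the padded version, both threads keep their counter's block in the Exclusive state and the
coherence misses disappear, which is exactly the "block size = one word" argument above.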
Example: L3 cache size (fig 4.11)
• AlphaServer 4100
   – 4 x Alpha @ 300 MHz
   – L1: 8 KB I + 8 KB D
   – L2: 96 KB
   – L3: off-chip, 2 MB
• Figure: normalized execution time for L3 cache sizes of 1, 2, 4 and 8 MB, broken down into
  instruction execution, L2/L3 cache access, memory access, PAL code and idle time

Example: L3 cache size (fig 4.12)
• Figure: memory cycles per instruction (0 to 3.25) for cache sizes of 1, 2, 4 and 8 MB, broken down
  into instruction, capacity/conflict, cold, false sharing and true sharing misses




Example: Increasing parallelism (fig 4.13)
• Figure: memory cycles per instruction for processor counts of 1, 2, 4, 6 and 8, broken down into
  instruction, conflict/capacity, cold, false sharing and true sharing misses

Example: Increased block size (fig 4.14)
• Figure: misses per 1,000 instructions for block sizes of 32, 64, 128 and 256 bytes, broken down into
  instruction, capacity/conflict, cold, false sharing and true sharing misses







How to Write a Computer Architecture Paper
TDT4260 Computer Architecture
18. February 2011
Magnus Jahre

2nd Branch Prediction Championship
• International competition similar to our prefetching exercise system
• Task: Implement your best possible branch predictor and write a paper about it
• Submission deadline: 15. April 2011
• More info: http://www.jilp.org/jwac-2/








How does pfJudge work?
• Each submitted file is one Kongull job
   – Contains 12 M5 instances since there are 12 CPUs per node
   – Each M5 instance runs a different SPEC 2000 benchmark
• The Kongull job is added to the job queue
   – Status "Running" can mean running or queued, be patient
   – Running a job can take a long time depending on load
   – Kongull is usually able to empty the queue during the night
• We can give you a regular user account on Kongull
   – Remember that Kongull is a shared resource!
   – Always calculate the expected CPU-hour demand of your experiment before submitting

Storage Estimation
• We impose a storage limit of 8 KB on your prefetchers
   – This limit is not checked by the exercise system
• This is realistic: hardware components are usually designed with an area budget in mind
• Estimating storage is simple
   – Table-based prefetcher: add up the bits used in each entry and multiply by the number of entries
     (a worked sketch follows below)
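A small C sketch of the storage estimate described above, for a hypothetical table-based prefetcher.
All field widths and the entry count are made-up example values, not the exercise's actual parameters.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical table-based prefetcher; every width below is an example value. */
        int entries        = 256;  /* number of table entries   */
        int tag_bits       = 16;   /* partial tag per entry     */
        int last_addr_bits = 32;   /* last address seen         */
        int stride_bits    = 12;   /* signed stride             */
        int state_bits     = 2;    /* 2-bit confidence counter  */

        int bits_per_entry = tag_bits + last_addr_bits + stride_bits + state_bits;
        long total_bits    = (long)bits_per_entry * entries;

        printf("%d bits/entry * %d entries = %ld bits = %ld bytes (budget: 8 KB = %d bytes)\n",
               bits_per_entry, entries, total_bits, total_bits / 8, 8 * 1024);
        return 0;
    }

For these example values the table needs 62 bits/entry * 256 entries = 1984 bytes, well inside the
8 KB budget; the same arithmetic applies to whatever fields your own prefetcher actually stores.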








HOW TO USE A SIMULATOR

Research Workflow
• Figure: the research workflow shown as a cycle, including "Evaluate Solution on Compute Cluster" and
  ending with "Receive PhD (get a real job)"












Why simulate?
• Model of a system
   – Model the interesting parts with high accuracy
   – Model the rest of the system with sufficient accuracy
• "All models are wrong but some are useful" (G. Box, 1979)
• The model does not necessarily have a one-to-one correspondence with the actual hardware
   – Try to model behavior
   – Simplify your code wherever possible

Know your model
• You need to figure out which system is being modeled!
• Pfsys helps you get started, but to draw conclusions from your work you need to understand what you
  are modeling








HOW TO WRITE A PAPER

Find Your Story
• A good computer architecture paper tells a story
   – All good stories have a bad guy: the problem
   – All good stories have a hero: the scheme
• Writing a good paper is all about finding and identifying your story
• Note that this story has to be told within the strict structure of a scientific article








Paper Format
• You will be pressed for space
• Try to say things as precisely as possible
   – Your first write-up can be as much as 3x the page limit and it's still easy (possible) to get it
     under the limit
• Think about your plots/figures
   – A good plot/figure gives a lot of information
   – Is this figure the best way of conveying this idea?
   – Is this plot the best way for visualizing this data?
   – Plots/figures need to be area efficient (but readable!)

Typical Paper Outline
• Abstract
• Introduction
• Background/Related Work
• The Scheme (substitute with a descriptive title)
• Methodology
• Results
• Discussion
• Conclusion (with optional further work)












Abstract
• An experienced reader should be able to understand exactly what you have done from only reading the
  abstract
   – This is different from a summary
• Should be short; the maximum varies from 150 to 200 words
• Should include a description of the problem, the solution and the main results
• Typically the last thing you write

Introduction
• Introduces the larger research area that the paper is a part of
• Introduces the problem at hand
• Explains the scheme
• Level of abstraction: "20 000 feet"








Related Work
• Reference the work that other researchers have done that is related to your scheme
• Should be complete (i.e. contain all relevant work)
   – Remember: you define the scope of your work
• Can be split into two sections: Background and Related Work
   – Background is an informative introduction to the field (often section 2)
   – Related work is a very dense section that includes all relevant references (often section n-1)

The Scheme
• Explain your scheme in detail
   – Choose an informative title
• Trick: Add an informative figure that helps explain your scheme
• If your scheme is complex, an informative example may be in order








Methodology
• Explains your experimental setup
• Should answer the following questions:
   – Which simulator did you use?
   – How have you extended the simulator?
   – Which parameters did you use for your simulations? (aim: reproducibility)
   – Which benchmarks did you use?
   – Why did you choose these benchmarks?
• Important: should be realistic
• If you are unsure about a parameter, run a simulation to check its impact

Results
• Show that your scheme works
• Compare to other schemes that do the same thing
   – Hopefully you are better, but you need to compare anyway
• Trick: "Oracle Scheme"
   – Uses "perfect" information to create an upper bound on the performance of a class of schemes
   – Prefetching: Best case is that all L2 accesses are hits
• Sensitivity analysis
   – Check the impact of model assumptions on your scheme












Discussion
• Only include this if you need it
• Can be used if:
   – You have weaknesses in your model that you have not accounted for
   – You tested improvements to your scheme that did not give good enough results to be included in
     "The Scheme" section

Conclusion
• Repeat the main results of your work
• Remember that the abstract, introduction and conclusion are usually read before the rest of the paper
• Can include Further Work:
   – Things you thought about doing that you did not have time to do








     Thank You




     Visit our website:
     http://research.idi.ntnu.no/multicore/




TDT 4260
Chap 5
TLP & Memory Hierarchy

Review on ILP
• What is ILP?
• Let the compiler find the ILP
   ▫ Advantages?
   ▫ Disadvantages?
• Let the HW find the ILP
   ▫ Advantages?
   ▫ Disadvantages?




Contents
• Multi-threading                        Chap 3.5
• Memory hierarchy                       Chap 5.1
   ▫ 6 basic cache optimizations
• 11 advanced cache optimizations        Chap 5.2

Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
   ▫ Must duplicate independent state of each thread, e.g., a separate copy of register file, PC and
     page table
   ▫ Memory shared through virtual memory mechanisms
   ▫ HW for fast thread switch; much faster than full process switch ≈ 100s to 1000s of clocks
• When to switch?
   ▫ Alternate instruction per thread (fine grain)
   ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)




Fine-Grained Multithreading
• Switches between threads on each instruction
   ▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads (sketch below)
• CPU must be able to switch threads every clock
• Hides both short and long stalls
   ▫ Other threads executed when one thread stalls
• But slows down execution of individual threads
   ▫ Thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun's Niagara

Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
   ▫ No need for very fast thread-switching
   ▫ Doesn't slow down thread, since switches only when thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
   ▫ Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or
     frozen
   ▫ New thread must fill pipeline before instructions can complete
• => Better for reducing penalty of high cost stalls, where pipeline refill << stall time
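A rough C sketch (my own illustration, not from the slides) of the fine-grained selection policy:
each cycle, pick the next thread in round-robin order, skipping threads that are currently stalled.

    #include <stdbool.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Round-robin thread selection that skips stalled threads (fine-grained MT).
     * Returns the chosen thread id, or -1 if every thread is stalled this cycle. */
    static int pick_thread(const bool stalled[NTHREADS], int last) {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (!stalled[t])
                return t;
        }
        return -1;            /* all threads stalled: the issue slot stays idle */
    }

    int main(void) {
        bool stalled[NTHREADS] = { false, true, false, false }; /* thread 1 waits on a miss */
        int last = 0;
        for (int cycle = 0; cycle < 6; cycle++) {
            int t = pick_thread(stalled, last);
            printf("cycle %d: issue from thread %d\n", cycle, t);
            if (t >= 0) last = t;
        }
        return 0;
    }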
Simultaneous Multi-threading: Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a system
• Can a high-ILP processor also exploit TLP?
   ▫ Functional units often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)
   ▫ Intel: Hyper-Threading
• Figure: issue slots per cycle (cycles 1-9) for one thread vs. two threads on an 8-unit machine
  (M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes)




Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
   ▫ Large set of virtual registers
      Virtual = not all visible at ISA level
      Register renaming
   ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
   ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each
     thread

Multi-threaded categories
• Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained,
  Multiprocessing and Simultaneous Multithreading, showing threads 1-5 and idle slots




Design Challenges in SMT
• SMT makes sense only with fine-grained
  implementation
  ▫ How to reduce the impact on single thread performance?
  ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
  ▫ Instruction issue - more candidate instructions need to
    be considered
  ▫ Instruction completion - choosing which instructions to
    commit may be challenging
• Ensuring that cache and TLB conflicts generated
  by SMT do not degrade performance
Why memory hierarchy? (fig 5.2)
• Figure: performance of processor vs. memory, 1980-2010 (log scale); the processor-memory performance
  gap keeps growing

Why memory hierarchy?
• Principle of Locality
   ▫ Spatial Locality
      Addresses near each other are likely referenced close together in time
   ▫ Temporal Locality
      The same address is likely to be reused in the near future
• Idea: Store recently used elements in fast memories close to the processor (example below)
   ▫ Managed by software or hardware?
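As a concrete illustration of both kinds of locality (my own example, not from the slides): summing a
matrix row by row touches addresses that are adjacent in memory (spatial locality), while reusing the
same accumulator every iteration is temporal locality.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];

    int main(void) {
        double sum = 0.0;      /* 'sum' is reused every iteration: temporal locality */

        /* Row-major traversal: consecutive accesses fall in the same cache block
         * (spatial locality). Swapping the two loops strides N*8 bytes per access
         * and misses far more often, even though it computes the same sum. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }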




Memory hierarchy
• We want large, fast and cheap at the same time
• Figure: processor (control + datapath) connected to a hierarchy of successively larger memories;
  speed from fastest to slowest, capacity from smallest to largest, cost from most expensive to
  cheapest

Cache block placement
• Block 12 placed in a cache with 8 cache lines:
   ▫ Fully associative: block 12 can go anywhere
   ▫ Direct mapped: block 12 can go only into block 4 (12 mod 8)
   ▫ Set associative: block 12 can go anywhere in set 0 (12 mod 4)
• Figure: the 8 cache block frames, the 4 sets (set 0-3), and block 12 within the 32-block memory
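The placement rules above reduce to a modulo computation; a minimal C sketch mirroring the block-12
example (the 2-way organisation is the assumption behind the 4 sets):

    #include <stdio.h>

    int main(void) {
        int block  = 12;     /* memory block address from the example */
        int frames = 8;      /* cache holds 8 blocks                  */

        /* Direct mapped: one frame per set, index = block mod #frames. */
        printf("direct mapped   -> frame %d\n", block % frames);          /* 12 mod 8 = 4 */

        /* 2-way set associative: 8 frames / 2 ways = 4 sets, index = block mod #sets. */
        int sets = frames / 2;
        printf("2-way set assoc -> set %d (either way)\n", block % sets); /* 12 mod 4 = 0 */

        /* Fully associative: a single set, so block 12 can go into any frame. */
        return 0;
    }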




Cache performance
Average access time = Hit time + Miss rate * Miss penalty   (sketch below)
• Miss rate alone is not an accurate measure
• Cache performance is important for CPU perf.
   • More important with higher clock rate
• Cache design can also affect instructions that don't access memory!
   • Example: A set associative L1 cache on the critical path requires extra logic which will increase
     the clock cycle time
   • Trade off: Additional hits vs. cycle time reduction

6 Basic Cache Optimizations
Reducing Hit Time
1. Giving Reads Priority over Writes
      Writes in write-buffer can be handled after a newer read if not causing dependency problems
2. Avoiding Address Translation during Cache Indexing
      E.g. use Virtual Memory page offset to index the cache
Reducing Miss Penalty
3. Multilevel Caches
      Both small and fast (L1) and large (& slower) (L2)
Reducing Miss Rate
4. Larger Block size (Compulsory misses)
5. Larger Cache size (Capacity misses)
6. Higher Associativity (Conflict misses)
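A minimal C sketch of the average access time formula from the "Cache performance" slide above; the
hit time, miss rate and miss penalty are made-up illustration values.

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty. */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Assumed example values: 1-cycle hit, 5% miss rate, 100-cycle miss penalty. */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 100.0));  /* 1 + 0.05*100 = 6 cycles */
        return 0;
    }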
1: Giving Reads Priority over Writes
• Caches typically use a write buffer
   ▫ CPU writes to cache and write buffer
   ▫ Cache controller transfers from buffer to RAM
   ▫ Write buffer usually FIFO with N elements
   ▫ Works well as long as buffer does not fill faster than it can be emptied
• Figure: Processor connected to Cache and to DRAM, with a Write Buffer between processor and DRAM
• Optimization
   ▫ Handle read misses before write buffer writes
   ▫ Must check for conflicts with write buffer first (sketch below)

Virtual memory
• Processes use a large virtual memory
• Virtual addresses are dynamically mapped to physical addresses using HW & SW
• Page, page frame, page fault, translation lookaside buffer (TLB) etc.
• Figure: virtual address spaces of process 1 and process 2 (addresses 0 to 2^n - 1) mapped by address
  translation to physical addresses (0 to 2^m - 1)
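A rough C sketch of the conflict check in optimization 1 (my own illustration): before a read miss is
allowed to bypass buffered writes, the write buffer is searched for an entry with the same block
address. The buffer layout and the 64-byte block size are assumptions.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 8
    #define BLOCK_BITS 6        /* assumed 64-byte blocks */

    struct wb_entry { bool valid; uint64_t block_addr; uint64_t data[8]; };
    static struct wb_entry write_buffer[WB_ENTRIES];

    /* A read miss may be serviced ahead of buffered writes only if no buffered
     * write targets the same block; otherwise that entry must be forwarded or drained. */
    static bool read_conflicts_with_write_buffer(uint64_t addr) {
        uint64_t block = addr >> BLOCK_BITS;
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].block_addr == block)
                return true;
        return false;
    }

    int main(void) {
        write_buffer[0] = (struct wb_entry){ .valid = true, .block_addr = 0x1000 >> BLOCK_BITS };
        printf("conflict at 0x1000: %d\n", read_conflicts_with_write_buffer(0x1000));
        printf("conflict at 0x2000: %d\n", read_conflicts_with_write_buffer(0x2000));
        return 0;
    }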




2: Avoiding Address Translation during Cache Indexing
• Virtual cache: Use virtual addresses in caches
   ▫ Saves time on translation VA -> PA
   ▫ Disadvantages
      Must flush cache on process switch
         Can be avoided by including PID in tag
      Alias problem: OS and a process can have two VAs pointing to the same PA
• Compromise: "virtually indexed, physically tagged"
   ▫ Use page offset to index cache
   ▫ The same for VA and PA
   ▫ At the same time as data is read from cache, the VA -> PA translation is done for the tag
   ▫ Tag comparison using PA
   ▫ But: Page size restricts cache size (sketch below)
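The last point can be made concrete: in a virtually indexed, physically tagged cache the index and
block-offset bits must come from the untranslated page offset, so cache size is at most page size
times associativity. A small C sketch with assumed numbers (not from the slides):

    #include <stdio.h>

    int main(void) {
        /* Assumed example parameters. */
        long page_size = 4 * 1024;   /* 4 KB pages: 12 untranslated offset bits */
        int  ways      = 8;          /* 8-way set associative L1                */

        /* Each way can be at most one page, since the index must fit in the page offset. */
        long max_cache = page_size * ways;
        printf("max VIPT cache size = %ld KB\n", max_cache / 1024);  /* 32 KB */
        return 0;
    }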




3: Multilevel Caches (1/2)
• Make cache faster to keep up with CPU, or larger to reduce misses?
• Why not both?
• Multilevel caches
      Small and fast L1
      Large (and cheaper) L2

3: Multilevel Caches (2/2)
Average access time = L1 Hit time + L1 Miss rate *
                      (L2 Hit time + L2 Miss rate * L2 Miss penalty)   (sketch below)
• Local miss rate
   ▫ #cache misses / #cache accesses
• Global miss rate
   ▫ #cache misses / #CPU memory accesses
• L1 cache speed affects CPU clock rate
• L2 cache speed affects only L1 miss penalty
   ▫ Can use more complex mapping for L2
   ▫ L2 can be large
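A minimal C sketch of the two-level formula above, together with the local vs. global L2 miss rate
distinction; all numbers are assumed illustration values.

    #include <stdio.h>

    int main(void) {
        /* Assumed example values. */
        double l1_hit = 1.0,  l1_miss_rate = 0.05;
        double l2_hit = 10.0, l2_local_miss_rate = 0.40, l2_miss_penalty = 200.0;

        /* Average access time = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * L2 miss penalty) */
        double amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * l2_miss_penalty);

        /* Global L2 miss rate = L2 misses / CPU memory accesses
         *                     = L1 miss rate * L2 local miss rate */
        double l2_global = l1_miss_rate * l2_local_miss_rate;

        printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n", amat, 100.0 * l2_global);
        /* 1 + 0.05 * (10 + 0.4*200) = 5.5 cycles; global miss rate = 2% */
        return 0;
    }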
4: Larger Block size
• Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes 1K to 256K; compulsory misses
  fall with larger blocks, while conflict and capacity misses grow for the smaller caches, so block
  size is a trade-off
• 32 and 64 byte blocks are common

5: Larger Cache size
• Simple method
• Square-root Rule (quadrupling the size of the cache will halve the miss rate), sketch below
• Disadvantages
   ▫ Longer hit time
   ▫ Higher cost
• Mostly used for L2/L3 caches
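A small C sketch of the square-root rule from "5: Larger Cache size": if quadrupling the cache halves
the miss rate, the miss rate scales roughly as 1/sqrt(cache size). The 64 KB / 4% starting point is an
assumed example.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double base_size = 64;    /* KB, assumed example         */
        double base_miss = 0.04;  /* 4% miss rate, assumed       */

        /* Square-root rule: miss_rate(new) = miss_rate(base) * sqrt(base_size / new_size),
         * so quadrupling the size halves the miss rate. */
        for (double size = base_size; size <= 1024; size *= 4)
            printf("%6.0f KB -> %.2f%% misses\n",
                   size, 100.0 * base_miss * sqrt(base_size / size));
        return 0;
    }

For the assumed values this prints 4% at 64 KB, 2% at 256 KB and 1% at 1 MB, which is the pattern the
rule describes (and why the technique is mostly used for the large L2/L3 caches).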




6: Higher Associativity
• Lower miss rate
• Disadvantages
   ▫ Can increase hit time
   ▫ Higher cost
• 8-way has similar performance to fully associative

11 Advanced Cache Optimizations
Reducing hit time
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth
4. Pipelined caches
5. Non-blocking caches
6. Multibanked caches
Reducing Miss Penalty
7. Critical word first
8. Merging write buffers
Reducing Miss Rate
9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism
10. Hardware prefetching
11. Compiler prefetching




1: Small and simple caches
 • Comparing the address against the tag memory takes time
 • ⇒ A small cache can help hit time
   ▫ E.g., the L1 caches stayed the same size for 3 generations of AMD
     microprocessors: K6, Athlon, and Opteron
   ▫ Keeping the L2 cache small enough to fit on chip with the processor also
     avoids the time penalty of going off chip
 • Simple ⇒ direct mapped
   ▫ The tag check can be overlapped with data transmission, since there is no
     choice of way
 • Access time estimate for 90 nm using the CACTI 4.0 model
   ▫ Median ratios of access time relative to the direct-mapped caches are 1.32,
     1.39, and 1.43 for 2-way, 4-way, and 8-way caches
 [Figure: access time (ns) versus cache size (16 KB - 1 MB) for 1-way, 2-way,
  4-way and 8-way caches]

2: Way prediction
 • Extra bits are kept in the cache to predict which way (block) in a set the
   next access will hit
   ▫ The tag of the predicted way can be retrieved early for comparison
   ▫ Achieves a fast hit even with just one comparator
   ▫ Several extra cycles are needed to check the other blocks when the
     prediction misses
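A minimal sketch of the way-prediction idea (all structures, names and latencies
below are illustrative assumptions, not the hardware described in the lecture):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Sketch of way prediction for a 2-way set-associative cache.
  struct Line { bool valid = false; std::uint64_t tag = 0; };

  struct Set {
      Line way[2];
      int predicted_way = 0;               // the extra prediction bits kept per set
  };

  class WayPredictedCache {
  public:
      explicit WayPredictedCache(std::size_t num_sets) : sets_(num_sets) {}

      // Returns an assumed cycle count: 1 on a correctly predicted hit,
      // a couple of extra cycles when the other way must still be checked.
      int access(std::uint64_t addr) {
          std::uint64_t block = addr / kBlockBytes;
          std::size_t index   = block % sets_.size();
          std::uint64_t tag   = block / sets_.size();
          Set& s = sets_[index];

          int p = s.predicted_way;                              // probe predicted way first
          if (s.way[p].valid && s.way[p].tag == tag) return 1;  // fast hit, one comparator

          int o = p ^ 1;                                        // then check the other way
          if (s.way[o].valid && s.way[o].tag == tag) {
              s.predicted_way = o;                              // update the prediction bits
              return 1 + kExtraCheckCycles;                     // slower hit on misprediction
          }
          return kMissCycles;                                   // miss (refill not modelled)
      }

  private:
      static constexpr std::uint64_t kBlockBytes = 64;          // assumed block size
      static constexpr int kExtraCheckCycles = 2;               // assumed
      static constexpr int kMissCycles = 20;                    // assumed
      std::vector<Set> sets_;
  };

The point of the sketch is that the common case reads and compares only one tag,
which is what makes the hit as fast as in a direct-mapped cache.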
3: Trace caches
 • It is increasingly hard to feed modern superscalar processors with enough
   instructions
 • Trace cache
   ▫ Stores dynamic instruction sequences rather than "bytes of data"
   ▫ An instruction sequence may include branches
      Branch prediction is integrated with the cache
   ▫ Complex and relatively little used
   ▫ Used in the Pentium 4: the trace cache stores up to 12K micro-ops decoded
     from x86 instructions (also saves decode time)

4: Pipelined caches
 • Pipeline technology applied to cache lookups
   ▫ Several lookups in progress at once
   ▫ Results in a faster cycle time
   ▫ Examples: Pentium (1 cycle), Pentium III (2 cycles), Pentium 4 (4 cycles)
   ▫ L1: increases the number of pipeline stages needed to execute an instruction
   ▫ L2/L3: increases throughput
      Nearly for free, since the hit latency is on the order of 10-20 processor
      cycles and caches are easy to pipeline




5: Non-blocking caches (1/2)
 • A non-blocking (lockup-free) cache allows the data cache to continue to supply
   cache hits during a miss
 • "Hit under miss" reduces the effective miss penalty by working during a miss
   instead of ignoring CPU requests
 • "Hit under multiple miss" or "miss under miss" may further lower the effective
   miss penalty by overlapping multiple misses
   ▫ Requires that the lower-level memory can service multiple concurrent misses
   ▫ Significantly increases the complexity of the cache controller, since there
     can be multiple outstanding memory accesses
   ▫ The Pentium Pro allows 4 outstanding memory misses

5: Non-Blocking Cache Implementation
 • The cache can handle as many concurrent misses as there are MSHRs
 • The cache must block when all valid bits (V) are set
 • Very common
   MSHR = Miss information/Status Holding Register
   MHA  = Miss Handling Architecture
   DMHA = Dynamic Miss Handling Architecture
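A toy sketch of an MSHR file (field names and sizes are assumptions for
illustration, not a real cache controller):

  #include <array>
  #include <cstddef>
  #include <cstdint>

  // One MSHR tracks one outstanding miss; a real entry also records which
  // words and destination registers to forward when the refill arrives.
  struct MSHR {
      bool valid = false;              // the "V" bit
      std::uint64_t block_addr = 0;    // which block the miss is for
  };

  template <std::size_t N>
  class MSHRFile {
  public:
      // On a miss: returns true if the miss can be tracked (or merged with an
      // already-pending miss to the same block); false means all V bits are
      // set and the cache must block.
      bool handle_miss(std::uint64_t block_addr) {
          for (auto& m : mshrs_)
              if (m.valid && m.block_addr == block_addr) return true;   // merge
          for (auto& m : mshrs_)
              if (!m.valid) { m.valid = true; m.block_addr = block_addr; return true; }
          return false;                                                 // cache blocks
      }

      // When the refill for block_addr returns, free the entry.
      void complete(std::uint64_t block_addr) {
          for (auto& m : mshrs_)
              if (m.valid && m.block_addr == block_addr) m.valid = false;
      }

  private:
      std::array<MSHR, N> mshrs_;
  };

With, say, MSHRFile<4>, the cache can overlap up to four misses, matching the
"as many concurrent misses as there are MSHRs" statement above.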




5: Non-blocking Cache Performance

6: Multibanked caches
 • Divide the cache into independent banks that can support simultaneous accesses
   ▫ E.g., the T1 ("Niagara") L2 has 4 banks
 • Banking works best when accesses naturally spread themselves across the banks
   ⇒ the mapping of addresses to banks affects the behavior of the memory system
 • A simple mapping that works well is "sequential interleaving"
   ▫ Spread block addresses sequentially across banks
   ▫ E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0,
     bank 1 has all blocks whose address modulo 4 is 1, and so on
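Sequential interleaving is just a modulo on the block address; a small sketch
(64-byte blocks and 4 banks are assumptions chosen for illustration):

  #include <cstdint>

  constexpr std::uint64_t kBlockSize = 64;   // assumed cache block size
  constexpr std::uint64_t kNumBanks  = 4;    // assumed number of banks

  std::uint64_t bank_for(std::uint64_t byte_addr) {
      std::uint64_t block_addr = byte_addr / kBlockSize;   // which cache block
      return block_addr % kNumBanks;        // bank 0 holds blocks ≡ 0 (mod 4), etc.
  }

  // bank_for(0) == 0, bank_for(64) == 1, bank_for(128) == 2, bank_for(256) == 0:
  // consecutive blocks land in different banks, so streaming accesses use all banks.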
7: Critical word first
 • Don't wait for the full block before restarting the CPU
 • Early restart: as soon as the requested word of the block arrives, send it to
   the CPU and let the CPU continue execution
 • Critical word first: request the missed word first from memory and send it to
   the CPU as soon as it arrives; let the CPU continue execution while filling
   the rest of the words in the block
   ▫ Long blocks are more popular today ⇒ critical word first is widely used

8: Merging write buffers
 • A write buffer allows the processor to continue while waiting to write to memory
   ▫ If the buffer contains modified blocks, the addresses can be checked to see
     whether the address of the new data matches the address of a valid write
     buffer entry
   ▫ If so, the new data are combined with that entry
 • Multiword writes are more efficient to memory
 • The Sun T1 (Niagara) processor, among many others, uses write merging
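A sketch of the merging idea (entry layout and sizes are assumptions; stores are
assumed not to cross a block boundary):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  constexpr std::size_t kBlockSize = 64;   // assumed block size

  struct WriteBufferEntry {
      bool valid = false;
      std::uint64_t block_addr = 0;        // block-aligned address
      std::uint8_t data[kBlockSize] = {};
      std::uint64_t byte_valid = 0;        // one bit per byte written
  };

  class MergingWriteBuffer {
  public:
      explicit MergingWriteBuffer(std::size_t entries) : buf_(entries) {}

      // Returns false if the buffer is full and the processor must stall.
      bool write(std::uint64_t addr, const void* src, std::size_t len) {
          std::uint64_t block = addr & ~std::uint64_t(kBlockSize - 1);
          std::size_t   off   = addr &  std::uint64_t(kBlockSize - 1);
          for (auto& e : buf_)             // merge into an existing entry if possible
              if (e.valid && e.block_addr == block) return fill(e, off, src, len);
          for (auto& e : buf_)             // otherwise allocate a new entry
              if (!e.valid) { e.valid = true; e.block_addr = block; return fill(e, off, src, len); }
          return false;                    // buffer full
      }

  private:
      static bool fill(WriteBufferEntry& e, std::size_t off, const void* src, std::size_t len) {
          std::memcpy(e.data + off, src, len);
          for (std::size_t i = 0; i < len; ++i) e.byte_valid |= (1ull << (off + i));
          return true;
      }
      std::vector<WriteBufferEntry> buf_;
  };

Merging means a burst of small stores to the same block drains to memory as one
multiword write instead of many single-word writes.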




9: Compiler optimizations
 • Instruction order can often be changed without affecting correctness
   ▫ May reduce conflict misses
   ▫ Profiling may help the compiler
 • The compiler generates instructions grouped in basic blocks
   ▫ If the start of a basic block is aligned to a cache block, misses will be
     reduced
   ▫ Important for larger cache block sizes
 • Data is even easier to move
   ▫ Lots of different compiler optimizations

10: Hardware prefetching
 • Prefetching relies on having extra memory bandwidth that can be used without
   penalty
 • Instruction prefetching
   ▫ Typically, the CPU fetches 2 blocks on a miss: the requested block and the
     next consecutive block
   ▫ The requested block is placed in the instruction cache when it returns, and
     the prefetched block is placed in an instruction stream buffer
 • Data prefetching
   ▫ The Pentium 4 can prefetch data into the L2 cache from up to 8 streams
   ▫ Prefetching is invoked after 2 successive L2 cache misses to a page
 [Figure: performance improvement from hardware prefetching on a Pentium 4 for
  SPECint2000 and SPECfp2000 benchmarks; the speedups for the 11 benchmarks shown
  range from 1.16 to 1.97]
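A minimal sketch of next-line instruction prefetching into a stream buffer
(buffer depth and policy details are assumptions, not the Pentium 4 mechanism;
blocks are identified by block number):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <deque>

  class NextLinePrefetcher {
  public:
      // Called on an instruction-cache miss for block 'b'.  Returns true if the
      // block was already waiting in the stream buffer (the prefetch was useful).
      bool on_icache_miss(std::uint64_t b) {
          auto it = std::find(stream_.begin(), stream_.end(), b);
          bool hit = (it != stream_.end());
          if (hit) stream_.erase(it);      // block moves into the I-cache (not modelled)
          prefetch(b + 1);                 // always prefetch the next consecutive block
          return hit;
      }

  private:
      void prefetch(std::uint64_t b) {
          if (stream_.size() == kDepth) stream_.pop_front();
          stream_.push_back(b);            // a real buffer issues the memory read here
      }
      static constexpr std::size_t kDepth = 4;   // assumed stream buffer depth
      std::deque<std::uint64_t> stream_;
  };

Sequential code streams hit in the buffer almost every time, which is why this
simple scheme already captures much of the benefit for instruction fetch.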




11: Compiler prefetching
 • Data prefetch
   ▫ Register prefetch: load data into a register (HP PA-RISC loads)
   ▫ Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
   ▫ Special prefetching instructions cannot cause faults; a form of speculative
     execution
 • Issuing prefetch instructions takes time
   ▫ Is the cost of issuing the prefetches < the savings from reduced misses?
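As an illustration (not from the lecture), a compiler or programmer can insert
cache-prefetch instructions ahead of the loads that will need the data; the
sketch below uses the GCC/Clang __builtin_prefetch intrinsic with an assumed
prefetch distance:

  #include <cstddef>

  // Sum an array while prefetching ahead.  The distance of 16 elements is an
  // assumption; it should roughly cover the memory latency divided by the time
  // per loop iteration.
  double sum(const double* a, std::size_t n) {
      constexpr std::size_t kDist = 16;
      double s = 0.0;
      for (std::size_t i = 0; i < n; ++i) {
          if (i + kDist < n)
              __builtin_prefetch(&a[i + kDist]);   // non-faulting cache prefetch
          s += a[i];
      }
      return s;
  }

Each prefetch still occupies an issue slot, which is exactly the cost/benefit
question raised in the last bullet above.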
Cache Coherency
 • Consider the following case: I have two processors that are sharing address X
 • Both cores read address X
 • Address X is brought from memory into the caches of both processors
 • Now one of the processors writes to address X and changes the value
 • What happens? How does the other processor get notified that address X has
   changed?

Two types of cache coherence schemes
 • Snooping
   ▫ Broadcast writes, so all copies in all caches will be properly invalidated
     or updated
 • Directory
   ▫ In a structure, keep track of which cores are caching each address
   ▫ When a write occurs, query the directory and properly handle any other
     cached copies
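A toy sketch of the directory idea (not the protocol of any specific machine in
the lecture; at most 32 cores assumed so a 32-bit sharer vector suffices):

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  class Directory {
  public:
      explicit Directory(int num_cores) : num_cores_(num_cores) {}

      void on_read(std::uint64_t block, int core) {
          sharers_[block] |= (1u << core);          // record the new sharer
      }

      // Returns the cores whose copies must be invalidated before the write.
      std::vector<int> on_write(std::uint64_t block, int writer) {
          std::vector<int> to_invalidate;
          std::uint32_t& bits = sharers_[block];
          for (int c = 0; c < num_cores_; ++c)
              if (c != writer && (bits & (1u << c))) to_invalidate.push_back(c);
          bits = (1u << writer);                    // writer becomes the only holder
          return to_invalidate;
      }

  private:
      int num_cores_;
      std::unordered_map<std::uint64_t, std::uint32_t> sharers_;
  };

A snooping scheme reaches the same result without the directory structure: the
write is broadcast on the shared bus and every cache checks its own tags.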
TDT 4260
Appendix E
Interconnection Networks

Contents
 • Introduction                       App E.1
 • Two devices                        App E.2
 • Multiple devices                   App E.3
 • Topology                           App E.4
 • Routing, arbitration, switching    App E.5

Conceptual overview

Motivation
 • Basic network technology assumed known
 • Motivation
   ▫ Increased importance
      System-to-system connections
      Intra-system connections
   ▫ Increased demands
      Bandwidth, latency, reliability, ...
   ▫ Vital part of system design




Types of networks
 • Number of devices and distance
   ▫ OCN - On-chip network: functional units, register files, caches, ...
     Also known as: Network on Chip (NoC)
   ▫ SAN - System/storage area network: multiprocessors and multicomputers, storage
   ▫ LAN - Local area network
   ▫ WAN - Wide area network
 • Trend: switches replace buses

E.2: Connecting two devices
 [Figure: two devices connected by a dedicated link; the destination is implicit]
Software to Send and Receive
 • SW send steps
   1: Application copies data to an OS buffer
   2: OS calculates the checksum, starts a timer
   3: OS sends the data to the network interface HW and says start
 • SW receive steps
   3: OS copies the data from the network interface HW to an OS buffer
   2: OS calculates the checksum; if it matches, send an ACK; if not, delete the
      message (the sender resends when its timer expires)
   1: If OK, OS copies the data to the user address space and signals the
      application to continue
 • The sequence of steps the SW follows is the protocol

Network media
 • Twisted pair: copper, 1 mm thick, twisted to avoid the antenna effect (telephone)
 • Coaxial cable: copper core, insulator, braided outer conductor, plastic
   covering; used by cable companies: high BW, good noise immunity
 • Fiber optics: 3 parts - the cable (silica), a light source (LED or laser
   diode) and a light detector (photodiode); relies on total internal reflection
   ▫ Multimode fiber disperses the light (LED); single mode carries a single
     wave (laser)




Basic Network Structure and Functions
 • Media and form factor
   [Figure: media type versus distance - metal layers and printed circuit boards
    for OCNs (~0.01 m), InfiniBand, Ethernet and Myrinet connectors for SANs
    (~1-10 m), Cat5E twisted pair and coaxial cables for LANs (~100 m), fiber
    optics for WANs (>1,000 m)]

Packet latency
 [Figure: timeline with sender overhead (processor busy), time of flight,
  transmission time (size/bandwidth) and receiver overhead (processor busy);
  transport latency versus total latency]

 Total latency = Sender overhead + Time of flight
                 + Message size / Bandwidth + Receiver overhead
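A quick worked example (the numbers are assumptions chosen only to illustrate
the formula): with 1 µs sender overhead, 0.5 µs time of flight, a 1000-byte
message over a 1 Gbit/s (125 MB/s) link, and 1 µs receiver overhead:

  Total latency = 1 µs + 0.5 µs + 1000 B / 125 MB/s + 1 µs
                = 1 + 0.5 + 8 + 1 = 10.5 µs

For small messages the fixed overheads dominate; link bandwidth only matters
once messages get large.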




E.3: Connecting multiple devices (1/3)
 • New issues
   ▫ Topology - what paths are possible for packets?
   ▫ Routing - which of the possible paths are allowable (valid) for packets?
   ▫ Arbitration - when are paths available for packets?
   ▫ Switching - how are paths allocated to packets?
 [Figure: shared media (Ethernet) - nodes on a common bus - versus switched
  media (CM-5, ATM) - nodes connected through a switch]

E.3: Connecting multiple devices (2/3)
 • Two types of topology
   ▫ Shared media
   ▫ Switched media
 • Shared media (bus)
   ▫ Arbitration
      Carrier sensing
      Collision detection
   ▫ Routing is simple
      Only one possible path
E.3: Connecting multiple devices (3/3)
 • Switched media
   ▫ "Point-to-point" connections
   ▫ Routing for each packet
   ▫ Arbitration for each connection
 • Comparison
   ▫ Much higher aggregate BW in a switched network than in a shared media network
   ▫ Shared media is cheaper
   ▫ Distributed arbitration is simpler for switched media

E.4: Interconnection Topologies
 • One switch or bus can connect only a limited number of devices
   ▫ Complexity, cost, technology, ...
 • Interconnected switches are needed for larger networks
 • Topology: connection structure
   ▫ What paths are possible for packets?
   ▫ All pairs of devices must have path(s) available
 • A network is partitioned by a set of links if their removal disconnects the graph
   ▫ Bisection bandwidth
   ▫ Important for performance




Crossbar
 • Common topology for connecting CPUs and I/O units
 • Also used for interconnecting CPUs
 • Fast and expensive (O(N²) switch points)
 • Non-blocking
 [Figure: crossbar connecting processors/caches and I/O units to memory modules]

Omega network
 • Built from 2x2 switches, each able to route straight, crossover, upper
   broadcast or lower broadcast
 • Example of a multistage network
 • Usually log2 n stages for n inputs - O(N log N) switches
 • Can block
 [Figure: 8x8 omega network connecting sources 000-111 to destinations 000-111]
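One way to route in such a multistage network is destination-tag (self-)routing:
each stage looks at one bit of the destination address. The sketch below assumes
an 8-input network with three stages, and that bit 0 selects the upper output
and bit 1 the lower one (switch numbering details vary between drawings):

  #include <cstdio>

  // Print the path a packet takes through a 3-stage 8x8 omega network.
  void route(unsigned src, unsigned dst) {
      std::printf("packet %u -> %u:", src, dst);
      for (int stage = 0; stage < 3; ++stage) {
          unsigned bit = (dst >> (2 - stage)) & 1u;   // MSB first
          std::printf("  stage %d: %s output", stage, bit ? "lower" : "upper");
      }
      std::printf("\n");
  }

  int main() {
      route(0b010, 0b110);   // one example path through the three stages
      return 0;
  }

Because each switch output is shared by several source/destination pairs, two
packets may need the same output at the same stage, which is why the network
can block.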




Linear Arrays and Rings
 • Distributed switched networks
 • Node = switch + 1..n end nodes
   [Figure: each switch has a processor with cache, memory, memory controller
    and network interface attached, plus an external I/O port]
 • Linear array = 1D grid
 • 2D grid
 • A torus adds wrap-around connections
 • CRAY machines use a 3D torus

Trees
 • Diameter and average distance are logarithmic
   ▫ k-ary tree, height d = log_k N
   ▫ address = d-vector of radix-k coordinates describing the path down from the root
 • Fixed number of connections per node (i.e. fixed degree)
 • Bisection bandwidth = 1 near the root
E.5: Routing, Arbitration, Switching
 • Routing
   ▫ Which of the possible paths are allowable for packets?
   ▫ The set of operations needed to compute a valid path
   ▫ Executed at source, intermediate, or even at destination nodes
 • Arbitration
   ▫ When are paths available for packets?
   ▫ Resolves packets requesting the same resources at the same time
   ▫ For every arbitration, there is a winner and possibly many losers
      Losers are buffered (lossless) or dropped on overflow (lossy)
 • Switching
   ▫ How are paths allocated to packets?
   ▫ The winning packet (from arbitration) proceeds towards its destination
   ▫ Paths can be established one fragment at a time or in their entirety

Routing
 • Shared media
   ▫ Broadcast to everyone
 • Switched media needs real routing. Options:
   ▫ Source-based routing: the message specifies the path to the destination
     (changes of direction)
   ▫ Virtual circuit: a circuit is established from source to destination, and
     the message picks the circuit to follow
   ▫ Destination-based routing: the message specifies the destination; the
     switch must pick the path
      Deterministic: always follow the same path
      Adaptive: pick different paths to avoid congestion or failures
      Randomized routing: pick between several good paths to balance network load




Routing mechanism
 • Need to select an output port for each input packet
   ▫ And fast...
 • Simple arithmetic in regular topologies
   ▫ Ex: ∆x, ∆y routing in a grid (first ∆x, then ∆y)
      west (-x)    ∆x < 0
      east (+x)    ∆x > 0
      south (-y)   ∆x = 0, ∆y < 0
      north (+y)   ∆x = 0, ∆y > 0
 • Unidirectional links are sufficient for a torus (+x, +y)
 • Dimension-order routing
   ▫ Reduce the relative address of each dimension in order, to avoid deadlock

Deadlock
 • How can it arise?
   ▫ Necessary conditions:
      shared resources
      incrementally allocated
      non-preemptible
 • How do you handle it?
   ▫ Constrain how channel resources are allocated (deadlock avoidance)
   ▫ Add a mechanism that detects likely deadlocks and fixes them (deadlock recovery)
 [Figure: 4x4 grid of routers (TRC (0,0) ... TRC (3,3)) with a cyclic dependency
  among blocked packets]
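The ∆x, ∆y table above translates directly into code; a minimal sketch of
dimension-order (XY) routing in a 2D grid (port names are assumptions):

  // Route first in x until the x offset is zero, then in y.
  enum class Port { West, East, South, North, Local };

  Port route_xy(int cur_x, int cur_y, int dst_x, int dst_y) {
      int dx = dst_x - cur_x;
      int dy = dst_y - cur_y;
      if (dx < 0) return Port::West;     // ∆x < 0
      if (dx > 0) return Port::East;     // ∆x > 0
      if (dy < 0) return Port::South;    // ∆x = 0, ∆y < 0
      if (dy > 0) return Port::North;    // ∆x = 0, ∆y > 0
      return Port::Local;                // arrived at the destination
  }

Completing the x dimension before starting on y imposes a fixed order on the
channels a packet can request, which is why dimension-order routing avoids
deadlock in a mesh.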




Arbitration (1/2)
 • Several simultaneous requests to a shared resource
 • Ideal: maximize the usage of network resources
 • Problem: starvation
   ▫ Fairness needed
 • Figure: two-phase arbitration
   ▫ Request, Grant
   ▫ Poor usage

Arbitration (2/2)
 • Three phases
 • Multiple requests
 • Better usage
 • But: increased latency
Switching
 • Allocating paths for packets
 • Two techniques:
   ▫ Circuit switching (connection-oriented)
      A communication channel is allocated before the first packet is sent
      Packet headers don't need routing info
      Wastes bandwidth
   ▫ Packet switching (connectionless)
      Each packet is handled independently
      Can't guarantee response time
      Two types - next slide

Store & Forward vs Cut-Through Routing
 [Figure: a 4-flit packet traversing several hops; with store & forward routing
  the whole packet is received at every switch before it is forwarded, while
  with cut-through routing the flits are forwarded as soon as the header has
  been routed, so the packet is pipelined along the path]
 • Cut-through (on blocking)
   ▫ Virtual cut-through (spools the rest of the packet into a buffer)
   ▫ Wormhole (buffers only a few flits, leaves the tail along the route)
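A rough comparison under simplifying assumptions (no contention, packet of L
bytes, link bandwidth B, h hops, per-hop routing time ignored): store & forward
takes about h × L/B, because the whole packet must arrive at every switch
before it moves on, while cut-through takes about L/B plus a small per-hop
delay, since the flits are pipelined along the path. With, say, L/B = 1 µs and
h = 4, that is roughly 4 µs versus a little over 1 µs.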
Piranha:
Designing a Scalable CMP-based System for Commercial Workloads

Luiz André Barroso
Western Research Laboratory

April 27, 2001                      Asilomar Microcomputer Workshop

What is Piranha?
 • A scalable shared-memory architecture based on chip multiprocessing (CMP)
   and targeted at commercial workloads
 • A research prototype under development by Compaq Research and Compaq NonStop
   Hardware Development Group
 • A departure from ever-increasing processor complexity and system
   design/verification cycles
Importance of Commercial Applications
 Worldwide server customer spending (IDC 1999):
   Infrastructure 29%, Business processing 22%, Decision support 14%,
   Software development 14%, Collaborative 12%, Scientific & engineering 6%,
   Other 3%

 • Total server market size in 1999: ~$55-60B
   - technical applications: less than $6B
   - commercial applications: ~$40B
Price Structure of Servers
 • IBM eServer 680 (220 KtpmC; $43/tpmC)
   § 24 CPUs
   § 96GB DRAM, 18 TB Disk
   § $9M price tag
 • Compaq ProLiant ML570 (32 KtpmC; $12/tpmC)
   § 4 CPUs
   § 8GB DRAM, 2TB Disk
   § $240K price tag

 [Figure: normalized breakdown of HW cost into Base, CPU, DRAM and I/O for the
  two systems]

 Price per component:
   System                     $/CPU      $/MB DRAM   $/GB Disk
   IBM eServer 680            $65,417    $9          $359
   Compaq ProLiant ML570      $6,048     $4          $64

 - Storage prices dominate (50%-70% in customer installations)
 - Software maintenance/management costs are even higher (up to $100M)
 - The price of expensive CPUs/memory system is amortized
Outline
 • Importance of Commercial Workloads
 • Commercial Workload Requirements
 • Trends in Processor Design
 • Piranha
 • Design Methodology
 • Summary
Studies of Commercial Workloads
 • Collaboration with Kourosh Gharachorloo (Compaq WRL)
   - ISCA'98: Memory System Characterization of Commercial Workloads
     (with E. Bugnion)
   - ISCA'98: An Analysis of Database Workload Performance on Simultaneous
     Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
   - ASPLOS'98: Performance of Database Workloads on Shared-Memory Systems with
     Out-of-Order Processors (with P. Ranganathan and S. Adve)
   - HPCA'00: Impact of Chip-Level Integration on Performance of OLTP Workloads
     (with A. Nowatzyk and B. Verghese)
   - ISCA'01: Code Layout Optimizations for Transaction Processing Workloads
     (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)

Studies of Commercial Workloads: summary
 • The memory system is the main bottleneck
   - astronomically high CPI
   - dominated by memory stall times
   - instruction stalls as important as data stalls
   - fast/large L2 caches are critical
 • Very poor instruction-level parallelism (ILP)
   - frequent hard-to-predict branches
   - large L1 miss ratios
   - Ld-Ld dependencies
   - disappointing gains from wide-issue out-of-order techniques!
Outline
 • Importance of Commercial Workloads
 • Commercial Workload Requirements
 • Trends in Processor Design
 • Piranha
 • Design Methodology
 • Summary
Increasing Complexity of Processor Designs
 • Pushing the limits of instruction-level parallelism
   - multiple instruction issue
   - speculative out-of-order (OOO) execution
 • Driven by applications such as SPEC
 • Increasing design time and team size

   Processor     Year      Transistor count   Design team   Design time   Verification team
   (SGI MIPS)    shipped   (millions)         size          (months)      size (% of total)
   R2000         1985       0.10               20           15            15%
   R4000         1991       1.40               55           24            20%
   R10000        1996       6.80              >100          36           >35%

                              courtesy: John Hennessy, IEEE Computer, 32(8)

 • Yielding diminishing returns in performance
Exploiting Higher Levels of Integration
 [Figure: Alpha 21364 - a 1GHz 21264 CPU core with 64KB I$ and 64KB D$, 1.5MB
  L2$, memory controllers, coherence engine and network interface integrated on
  a single chip; multiple 21364 chips connect directly to each other, to memory
  (M) and to I/O (IO) in a 2D network]
 • Lower latency, higher bandwidth
 • Reuse of the existing CPU core addresses complexity issues
 • Incrementally scalable, glueless multiprocessing
Exploiting Parallelism in Commercial Apps
 • Simultaneous Multithreading (SMT)
   [Figure: one wide core whose issue slots are filled over time by instructions
    from threads 1-4]
   Example: Alpha 21464
 • Chip Multiprocessing (CMP)
   [Figure: two CPUs, each with private I$ and D$, sharing an L2$, memory
    controllers, coherence logic, network and I/O on one chip]
   Example: IBM Power4
 • SMT is superior in single-thread performance
 • CMP addresses complexity by using simpler cores
Outline
 • Importance of Commercial Workloads
 • Commercial Workload Requirements
 • Trends in Processor Design
 • Piranha
   - Architecture
   - Performance
 • Design Methodology
 • Summary
Piranha Project
 • Explore chip multiprocessing for scalable servers
 • Focus on parallel commercial workloads
 • Small team, modest investment, short design time
 • Address complexity by using:
   - simple processor cores
   - standard ASIC methodology

          Give up on ILP, embrace TLP
Piranha Team Members
 Research                              NonStop Hardware Development ASIC Design Center
   - Luiz André Barroso (WRL)            - Tom Heynemann
   - Kourosh Gharachorloo (WRL)          - Dan Joyce
   - David Lowell (WRL)                  - Harland Maxwell
   - Joel McCormack (WRL)                - Harold Miller
   - Mosur Ravishankar (WRL)             - Sanjay Singh
   - Rob Stets (WRL)                     - Scott Smith
   - Yuan Yu (SRC)                       - Jeff Sprouse
                                         - ... several contractors

 Former contributors: Robert McNamara, Basem Nayfeh, Andreas Nowatzyk,
 Joan Pendleton, Shaz Qadeer, Brian Robinson, Barton Sano, Daniel Scales,
 Ben Verghese
Piranha Processing Node
 [Figure: a single chip with 8 Alpha CPUs, each with its own I$ and D$, 8 L2$
  banks, 8 memory controllers, the home and remote protocol engines (HE, RE)
  and a router, all connected through the intra-chip switch (ICS)]
 • Alpha core: 1-issue, in-order, 500MHz
 • L1 caches: I & D, 64KB, 2-way
 • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
 • L2 cache: shared, 1MB, 8-way
 • Memory controller (MC): RDRAM, 12.8GB/sec
 • Protocol engines (HE & RE): µprogrammed, 1K µinstructions, even/odd interleaving
 • System interconnect: 4-port crossbar router, topology independent,
   32GB/sec total bandwidth
 • Single chip
Piranha I/O Node
 [Figure: I/O node chip with a router (2 links @ 8GB/s), one Alpha CPU with I$
  and D$, home and remote protocol engines (HE, RE), an L2$, a memory controller
  and a PCI-X interface, connected by the ICS]
 • The I/O node is a full-fledged member of the system interconnect
   - its CPU is indistinguishable from Processing Node CPUs
   - it participates in the global coherence protocol
Example Configuration
 [Figure: several processing nodes (P) and two I/O nodes (P-I/O) connected in an
  arbitrary topology]
 • Arbitrary topologies
 • Match the ratio of Processing to I/O nodes to the application requirements
L2 Cache and Intra-Node Coherence
 • No inclusion between the L1s and the L2 cache
   - total L1 capacity equals L2 capacity
   - L2 misses go directly to L1
   - L2 is filled by L1 replacements
 • L2 keeps track of all lines in the chip
   - sends invalidates and forwards
   - orchestrates L1-to-L2 write-backs to maximize on-chip memory utilization
   - cooperates with the protocol engines to enforce system-wide coherence
Inter-Node Coherence Protocol
 • 'Stealing' ECC bits for the memory directory
   [Figure: with wider ECC granularity fewer check bits are needed per data bit -
    8x(64+8), 4x(128+9+7), 2x(256+10+22), 1x(512+11+53); the freed bits hold
    directory state]
 • Directory (2b state + 40b sharing info)
   [Figure: two fields, each 2b state + 20b info on sharers]
 • Dual representation: limited pointer + coarse vector
 • "Cruise Missile" Invalidations (CMI)
   - limit fan-out/fan-in serialization with the coarse vector (CV)
 • Several new protocol optimizations
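The general idea behind a dual limited-pointer/coarse-vector representation can
be sketched as below; the field widths, pointer count and node grouping are
assumptions for illustration only, not the Piranha encoding:

  #include <cstdint>
  #include <vector>

  constexpr int kPointerCount  = 3;    // assumed: a few exact node pointers
  constexpr int kNodesPerGroup = 32;   // assumed coarse-vector grouping

  struct SharingInfo {
      bool coarse = false;
      std::vector<std::uint16_t> pointers;   // exact node IDs (limited pointers)
      std::uint64_t group_bits = 0;          // coarse vector, one bit per node group

      void add_sharer(std::uint16_t node) {
          if (!coarse) {
              if ((int)pointers.size() < kPointerCount) { pointers.push_back(node); return; }
              // Too many sharers for exact pointers: switch to the coarse vector.
              coarse = true;
              for (std::uint16_t p : pointers) group_bits |= 1ull << (p / kNodesPerGroup);
              pointers.clear();
          }
          group_bits |= 1ull << (node / kNodesPerGroup);
      }
  };

With the coarse vector, an invalidation must be sent to every node in each
marked group, which is the fan-out that the "cruise missile" invalidations are
designed to limit.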
Simulated Architectures
 [Table not extracted; the configurations compared on the following slides are
  P1 (one 500MHz, 1-issue Piranha core), INO (1GHz, 1-issue in-order),
  OOO (1GHz, 4-issue out-of-order) and P8 (a single-chip Piranha with eight
  500MHz, 1-issue cores)]

Single-Chip Piranha Performance
 Normalized execution time (lower is better; OOO = 100), broken down into CPU,
 L2 hit and L2 miss time:

            P1          INO        OOO        P8
            500MHz      1GHz       1GHz       500MHz
            1-issue     1-issue    4-issue    1-issue
   OLTP     233         145        100         34
   DSS      350         191        100         44

 • Piranha's performance margin is about 3x for OLTP and 2.2x for DSS
 • Piranha has more outstanding misses ⇒ it better utilizes the memory system
Single-Chip Performance (Cont.)
 [Figure 1: speedup versus number of cores (1-8) at 500 MHz, 1-issue - close to
  linear]
 [Figure 2: normalized breakdown of L1 misses (%) into L2 Hit, L2 Fwd and
  L2 Miss for P1, P2, P4 and P8 (500 MHz, 1-issue)]
 • Near-linear scalability
   - low memory latencies
   - effectiveness of the highly associative L2 and non-inclusive caching
Potential of a Full-Custom Piranha
[figure: normalized execution time (CPU / L2 Hit / L2 Miss) for OOO (1 GHz, 4-issue), P8 (500 MHz, 1-issue) and P8F (1.25 GHz, 1-issue) on OLTP and DSS; OOO = 100 in both, P8 = 34 / 43, P8F = 20 / 19]
l 5x margin over OOO for OLTP and DSS
l Full-custom design benefits substantially from boost in core speed

Outline
l Importance   of Commercial Workloads
l Commercial   Workload Requirements
l Trends   in Processor Design
l Piranha

l Design   Methodology
l Summary
Managing Complexity in the Architecture
l Use   of many simpler logic modules
   –   shorter design
   –   easier verification
   –   only short wires*
   –   faster synthesis
   –   simpler chip-level layout
l Simplify   intra-chip communication
   – all traffic goes through ICS (no backdoors)
l Use of microprogrammed protocol engines
l Adoption of large VM pages
l Implement sub-set of Alpha ISA
   – no VAX floating point, no multimedia instructions, etc.
Methodology Challenges
l Isolated   sub-module testing
   – need to create robust bus functional models (BFM)
   – sub-modules’ behavior highly inter-dependent
   – not feasible with a small team

l System-level    (integrated) testing
   –   much easier to create tests
   –   only one BFM at the processor interface
   –   simpler to assert correct operation
   –   Verilog simulation is too slow for comprehensive testing
Our Approach:
l Design   in stylized C++ (synthesizable RTL level)
   – use mostly system-level, semi-random testing
   – simulations in C++ (faster & cheaper than Verilog)
        § simulation speed ~1000 clocks/second
   – employ directed tests to fill test coverage gaps
l Automatic    C++ to Verilog translation
   –   single design database
   –   reduce translation errors
   –   faster turnaround of design changes
   –   risk: untested methodology
l Using   industry-standard synthesis tools
l IBM   ASIC process (Cu11)
Piranha Methodology: Overview

[diagram: C++ RTL Models –(CLevel)→ Verilog Models → Physical Design;
 cxx compiles the C++ models into PS1, and PS1V co-simulates the C++ and Verilog versions]

• C++ RTL Models: cycle accurate and “synthesizeable”
• PS1: Fast (C++) Logic Simulator
• Verilog Models: machine translated from C++ models
• Physical Design: leverages industry-standard Verilog-based tools
• PS1V: can “co-simulate” C++ and Verilog module versions and check correspondence
• cxx: C++ compiler
• CLevel: C++-to-Verilog Translator
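
For flavour, a minimal sketch of the modeling discipline, written here in C for consistency with the other examples (the real Piranha flow used stylized, machine-translatable C++, which this does not reproduce): explicit registered state plus one function evaluated per clock edge, so the model is cycle accurate and maps naturally onto RTL.

  #include <stdint.h>
  #include <stdio.h>

  struct counter_rtl {
      uint8_t count_q;                    /* registered state ("flip-flops")      */
      uint8_t enable_in;                  /* input port, sampled every cycle      */
  };

  /* next-state logic + register update, called once per simulated clock edge */
  void counter_clock_edge(struct counter_rtl *m)
  {
      if (m->enable_in)
          m->count_q = (uint8_t)(m->count_q + 1);
  }

  int main(void)
  {
      struct counter_rtl m = { .count_q = 0, .enable_in = 1 };
      for (int cycle = 0; cycle < 5; cycle++) {           /* drive 5 clocks       */
          counter_clock_edge(&m);
          printf("cycle %d: count = %u\n", cycle, m.count_q);
      }
      return 0;
  }
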
Summary
l CMP    architectures are inevitable in the near future
l Piranha   investigates an extreme point in CMP design
   – many simple cores
l Piranha has a large architectural advantage over complex
 single-core designs (> 3x) for database applications
l Piranha   methodology enables faster design turnaround
l Key   to Piranha is application focus:
   – One-size-fits-all solutions may soon be infeasible
Reference
l Papers   on commercial workload performance & Piranha
   research.compaq.com/wrl/projects/Database
TDT 4260 – lecture 11/3 - 2011
• Miniproject status, update, presentation
• Synchronization, Textbook Chap 4.5
  – And a short note on BSP (with excellent timing …)
• Short presentation of NUTS, NTNU Test Satellite System
  http://nuts.iet.ntnu.no/
• UltraSPARC T1 (Niagara), Chap 4.8
• And more on multicores

Miniproject – after the first deadline

  Implementing 1           Comparison of 2 or            Improving on
  existing prefetcher      more existing prefetchers     existing prefetcher
  ---------------------    --------------------------    -------------------------------
  Sequential prefetcher    RPT and DCPT                  Improving sequential prefetcher
  RPT prefetcher           Sequential (tagged or         Improving DCPT
                           adaptive), RPT and DCPT

Miniproject – after the first deadline
• Feedback
  – RPT and DCPT are a popular choice; the report should properly motivate
    each group's choice of prefetcher (the motivation should not be:
    “The code was easily available”)
  – Several groups work on similar methods
     • “find your story”
  – too much focus on getting the highest result in the PfJudge ranking;
    as stated in section 2.3 of the guidelines, the miniproject will be
    evaluated based on the following criteria:
     • good use of language
     • clarity of the problem statement
     • overall document structure
     • depth of understanding for the field of prefetching
     • quality of presentation

Miniproject presentations
• Friday 15/4 at 1415-1700 (max)
• OK for all?
  – No … we are working on finding a time schedule that is OK for all

IDI Open, a challenge for you?

Synchronization
• Important concept
  – Synchronize access to shared resources
  – Order events from cooperating processes correctly
• Smaller MP systems
  – Implemented by uninterrupted instruction(s) atomically accessing a value
  – Requires special hardware support
  – Simplifies construction of OS / parallel apps
• Larger MP systems → Appendix H (not in course)

Atomic exchange (swap)
• Swaps value in register for value in memory
  – Mem = 0 means not locked, Mem = 1 means locked
  – How does this work?
     • Register <= 1 ; Processor wants to lock
     • Exchange(Register, Mem)
  – If Register = 0 → Success
     • Mem was = 0 → Was unlocked
     • Mem is now = 1 → Now locked
  – If Register = 1 → Fail
     • Mem was = 1 → Was locked
     • Mem is now = 1 → Still locked
• Exchange must be atomic!

Implementing atomic exchange (1/2)
• One alternative: Load Linked (LL) and Store Conditional (SC)
  – Used in sequence
     • If memory location accessed by LL changes, SC fails
     • If context switch between LL and SC, SC fails
  – Implemented using a special link register
     • Contains address used in LL
     • Reset if matching cache block is invalidated or if we get an interrupt
     • SC checks if link register contains the same address. If so, we have
       atomic execution of LL & SC

Implementing atomic exchange (2/2)
• Example code EXCH (R4, 0(R1)):

  try:  MOV   R3, R4       ; mov exchange value
        LL    R2, 0(R1)    ; load linked
        SC    R3, 0(R1)    ; store conditional
        BEQZ  R3, try      ; branch if SC failed
        MOV   R4, R2       ; put load value in R4

• This can now be used to implement e.g. spin locks

           DADDUI  R2, R0, #1     ; R0 always = 0
  lockit:  EXCH    R2, 0(R1)      ; atomic exchange
           BNEZ    R2, lockit     ; already locked?

Barrier sync. in BSP
• The BSP-model
  – Leslie G. Valiant, A bridging model for parallel computation, [CACM 1990]
  – Computations organised in supersteps
  – Algorithms adapt to compute platform represented through 4 parameters
  – Helps the combination of portability & performance
  http://www.seas.harvard.edu/news-events/press-releases/valiant_turing
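
The same spin-lock idea as the EXCH loop above, as a minimal sketch in C11 (an illustration, not code from the textbook): atomic_exchange is the portable equivalent of the atomic exchange primitive, and on an LL/SC machine the compiler lowers it to a retry loop much like the assembly shown.

  #include <stdatomic.h>

  typedef atomic_int spinlock_t;          /* 0 = unlocked, 1 = locked            */

  void lock(spinlock_t *l)
  {
      /* keep swapping in 1 until the old value comes back as 0 */
      while (atomic_exchange(l, 1) != 0)
          ;                               /* spin: the lock was already held     */
  }

  void unlock(spinlock_t *l)
  {
      atomic_store(l, 0);                 /* release the lock                    */
  }
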




Multicore
• Important and early example: UltraSPARC T1
• Motivation (See lecture 1)
  – In all market segments from mobile phones to supercomputers
  – End of Moore's law for single-core
  – The power wall
  – The memory wall
  – The bandwidth problem
  – ILP limitations
  – The complexity wall

Why multicores?

Chip Multithreading
                                                               Opportunities and challenges
                                                               • Paper by Spracklen & Abraham, HPCA-11 (2005)
                                                                 [SA05]
                                                               • CMT processors = Chip Multi-Threaded processors
                                                               • A spectrum of processor architectures
                                                                  – Uni-processors with SMT (one core)
                                                                   – (pure) Chip Multiprocessors (CMP) (one thread per core)
                                                                  – Combination of SMT and CMP (They call it CMT)
                                                                      • Best suited to server workloads (with high TLP)








Offchip Bandwidth
• A bottleneck
• Bandwidth increasing, but also latency [Patt04]
• Need more than 100 in-flight requests to fully utilize the available bandwidth

Sharing processor resources
• SMT
  – Hardware strand
     • ”HW for storing the state of a thread of execution”
     • Several strands can share resources within the core, such as execution
       resources
        – This improves utilization of processor resources
        – Reduces applications' sensitivity to off-chip misses
           • Switch between threads can be very efficient
• (pure) CMP
  – Multiple cores can share chip resources such as memory controller,
    off-chip bandwidth and L2 cache
  – No sharing of HW resources between strands within core
• Combination (CMT)




1st generation CMT
• 2 cores per chip
• Cores derived from earlier uniprocessor designs
• Cores do not share any resources, except off-chip data paths
• Examples: Sun’s Gemini, Sun’s UltraSPARC IV (Jaguar), AMD dual-core Opteron,
  Intel dual-core Itanium (Montecito), Intel dual-core Xeon (Paxville, server)

2nd generation CMT
• 2 or more cores per chip
• Cores still derived from earlier uniprocessor designs
• Cores now share the L2 cache
  – Speeds inter-core communication
  – Advantageous as most commercial applications have significant instruction
    footprints
• Examples: Sun’s UltraSPARC IV+, IBM’s Power 4/5

3rd generation CMT
 • CMT processors are best
   designed from the
   ground-up, optimized for a
   CMT design point
    – Lower power consumption
 • Multiple cores per chip
 • Examples:
    – Sun’s Niagara (T1)
        • 8 cores, each is 4-way SMT
        • Each core single-issue, short
          pipeline
        • Shared 3MB L2-cache
    – IBM’s Power-5
        • 2 cores, each 2-way SMT

Multicore generations (?)




CMT/Multicore design space
• Number of cores
  – Multiple simple or few complex?
     • Recent paper of Hill & Marty …
        – See http://www.youtube.com/watch?v=KfgWmQpzD74
  – Heterogeneous cores
     • Serial fraction of parallel application
        – Remember Amdahl’s law
     • One powerful core for single-threaded applications
• Resource sharing
  – L2 cache! (and L3)
     • (Terminology: LL = Last Level cache)
  – Floating point units
  – New more expensive resources (amortized over multiple cores)
     • Shadow tags, more advanced cache techniques, HW accelerators,
       cryptographic and OS functions (e.g. memcopy), XML parsing, compression
        – Your innovation !!!

CMT/Multicore challenges
• Multiple threads (strands) share resources
  – Maximize overall performance
     • Good resource utilization
     • Avoid ”starvation” (units without work to do)
  – Cores must be ”good neighbours”
     • Fairness, research by Magnus Jahre
     • See http://research.idi.ntnu.no/multicore/pub
• Prefetching
  – Aggressive prefetching is OK in a single-thread system since the entire
    system is idle on a miss
  – CMT/Multicore requires more careful prefetching
     • Prefetch operation may take resources used by other threads
  – See research by Marius Grannæs (same link as above)
• Speculative operations
  – OK if using idle resources (delay until resource is idle)
  – More careful (just as prefetching) / seldom power efficient
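
The serial-fraction point above is just Amdahl's law (the standard formula, not from the slide): speedup = 1 / ((1 - f) + f/n) for parallel fraction f on n cores. With f = 0.95 and n = 16: 1 / (0.05 + 0.95/16) = 1 / 0.109 ≈ 9.1, which is why one powerful core for the serial part can matter more than a few extra simple cores.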




UltraSPARC T1 (“Niagara”)
• Target: Commercial server applications
  – High thread level parallelism (TLP)
     • Large numbers of parallel client requests
  – Low instruction level parallelism (ILP)
     • High cache miss rates
     • Many unpredictable branches
• Power, cooling, and space are major concerns for data centers
• Metric: (Performance / Watt) / Sq. Ft.
• Approach: Multicore, Fine-grain multithreading, Simple pipeline, Small L1
  caches, Shared L2

T1 processor – ”logical” overview
[figure: T1 block diagram; 1.2 GHz at 72W typical, 79W peak power consumption]

T1 Architecture
• Also ships with 6 or 4 cores

T1 pipeline / 4 threads
• Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
• Shared units:
  – L1 cache, L2 cache
  – TLB
  – Exec. units
  – pipe registers
• Separate units:
  – PC
  – instruction buffer
  – reg file
  – store buffer
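
A minimal sketch of the fine-grain multithreading idea behind the thread-select (S) stage, not Sun's implementation: each cycle the core picks the next ready thread round-robin and skips threads that are stalled (e.g. waiting on a miss).

  #include <stdbool.h>

  #define THREADS 4

  struct core {
      bool ready[THREADS];                /* false while waiting on a miss etc.   */
      int  last;                          /* thread issued in the previous cycle  */
  };

  /* return the thread to issue from this cycle, or -1 if none is ready */
  int thread_select(struct core *c)
  {
      for (int i = 1; i <= THREADS; i++) {
          int t = (c->last + i) % THREADS;      /* rotate the starting point      */
          if (c->ready[t]) {
              c->last = t;
              return t;
          }
      }
      return -1;                                /* every thread is "not ready"    */
  }
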





Miss Rates: L2 Cache Size, Block Size (fig. 4.27)
[figure: T1 L2 miss rate (0–2.5%) for TPC-C and SPECJBB, for L2 configurations
 of 1.5 MB, 3 MB and 6 MB with 32B and 64B blocks]

Miss Latency: L2 Cache Size, Block Size (fig. 4.28)
[figure: T1 L2 miss latency (0–200) for TPC-C and SPECJBB, for the same L2
 configurations of 1.5 MB, 3 MB and 6 MB with 32B and 64B blocks]

                                                                                                                Average thread status (fig 4.30)
CPI Breakdown of Performance

Benchmark     Per-thread CPI   Per-core CPI   Effective CPI   Effective IPC
                                                for 8 cores     for 8 cores
TPC-C                   7.20           1.80            0.23             4.4
SPECJBB                 5.60           1.40            0.18             5.7
SPECWeb99               6.60           1.65            0.21             4.8
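
How the columns relate (a quick check, assuming the T1's 4 threads per core and 8 cores): per-core CPI ≈ per-thread CPI / 4, effective CPI for 8 cores ≈ per-core CPI / 8, and effective IPC is the reciprocal. For TPC-C: 7.20 / 4 = 1.80; 1.80 / 8 = 0.225 ≈ 0.23; 1 / 0.225 ≈ 4.4.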








Not Ready Breakdown (fig 4.31)
[figure: fraction of cycles a thread is not ready (0–100%) for TPC-C, SPECJBB and
 SPECWeb99, broken into L1 I miss, L1 D miss, L2 miss, pipeline delay and Other]
• Other = ?
  – TPC-C - store buffer full is largest contributor
  – SPEC-JBB - atomic instructions are largest contributor
  – SPECWeb99 - both factors contribute

Performance Relative to Pentium D
[figure: performance relative to Pentium D (0–6.5) for +Power5, Opteron and Sun T1
 on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05 and TPC-like]

Performance/mm2, Performance/Watt
[figure: efficiency normalized to Pentium D (0–5.5) for +Power5, Opteron and
 Sun T1, shown as SPECIntRate, SPECFPRate, SPECJBB05 and TPC-C per mm^2 and
 per Watt]

Cache Coherency
     And
 Memory Models
Review
● Does pipelining help instruction latency?
● Does pipelining help instruction throughput?

● What is Instruction Level Parallelism?

● What are the advantages of OoO machines?

● What are the disadvantages of OoO machines?

● What are the advantages of VLIW?

● What are the disadvantages of VLIW?

● What is an example of Data Spatial Locality?

● What is an example of Data Temporal Locality?

● What is an example of Instruction Spatial Locality?

● What is an example of Instruction Temporal Locality?

● What is a TLB?

● What is a packet switched network?
Memory Models (Memory Consistency)

  Memory Model: The system supports a given model if
  operations on memory follow specific rules. The data
  consistency model specifies a contract between
  programmer and system, wherein the system guarantees
  that if the programmer follows the rules, memory will be
  consistent and the results of memory operations will be
  predictable.
Memory Models (Memory Consistency)

  Memory Model: The system supports a given model if
  operations on memory follow specific rules. The data
  consistency model specifies a contract between
  programmer and system, wherein the system guarantees
  that if the programmer follows the rules, memory will be
  consistent and the results of memory operations will be
  predictable.


                    Huh??????
Sequential Consistency?
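
A hedged illustration (not from the slides) of the kind of question a memory model answers: the classic "store buffering" test, written with pthreads. The program is deliberately racy; the point is which (r1, r2) outcomes a given model allows.

  #include <pthread.h>
  #include <stdio.h>

  /* deliberately racy: which (r1, r2) results may a run produce? */
  int x = 0, y = 0;
  int r1, r2;

  void *t0(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
  void *t1(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

  int main(void)
  {
      pthread_t a, b;
      pthread_create(&a, NULL, t0, NULL);
      pthread_create(&b, NULL, t1, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      /* Sequential consistency: some interleaving of the four accesses must put
       * one store before the other thread's load, so r1 == 0 && r2 == 0 is
       * impossible. Weaker models (and real hardware with store buffers, or an
       * optimizing compiler) do allow the 0/0 outcome. */
      printf("r1=%d r2=%d\n", r1, r2);
      return 0;
  }
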
Simple Case
●
    Consider a simple two processor system

                          Memory


                         Interconnect


                      CPU 0        CPU 1

●   The two processors are coherent
    ● Programs running in parallel may communicate via

      memory addresses
    ● Special hardware is required in order to enable

      communication via memory addresses.
    ● Shared memory addresses are the standard form of

      communication for parallel programming
Simple Case
●
    CPU 0 wants to send a data word to CPU 1

                          Memory


                         Interconnect


                      CPU 0        CPU 1




●   What does the code look like ???
Simple Case
●
    CPU 0 wants to send a data word to CPU 1

                          Memory


                         Interconnect


                      CPU 0        CPU 1




●   What does the code look like ???

    ● Code on CPU0 writes a value to an address
    ● Code on CPU1 reads the address to get the new value
Simple Case
int shared_flag = 0;
int shared_value = 0;
                                       Memory
void sender_thread()
{
   shared_value = 42;                 Interconnect
   shared_flag = 1;
}                                  CPU 0        CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                                                      Memory
void sender_thread()
{
   shared_value = 42;                                Interconnect
   shared_flag = 1;
}                                               CPU 0          CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                           Sender writes to           Memory
void sender_thread()       the shared data,
                           then sets a
{                          shared data flag
   shared_value = 42;      that the receiver         Interconnect
   shared_flag = 1;        is polling
}                                               CPU 0          CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                           Sender writes to             Memory
void sender_thread()       the shared data,
                           then sets a
{                          shared data flag
   shared_value = 42;      that the receiver           Interconnect
   shared_flag = 1;        is polling
}                                                 CPU 0           CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }            Receiver is polling on the flag. When
                                           the flag is no longer zero, the
   int new_value = shared_value;           receiver reads the shared_value and
   printf("%i\n", new_value);              prints it out.
}
Simple Case
                            Global variables are shared when using
                            pthreads. This means all threads within
int shared_flag = 0;        this process may access these variables
int shared_value = 0;
                           Sender writes to             Memory
void sender_thread()       the shared data,
                           then sets a
{                          shared data flag
   shared_value = 42;      that the receiver           Interconnect
   shared_flag = 1;        is polling
}                                                 CPU 0           CPU 1
               Any Problems???
void receiver_thread()
{
   while (shared_flag == 0) { }            Receiver is polling on the flag. When
                                           the flag is no longer zero, the
   int new_value = shared_value;           receiver reads the shared_value and
   printf("%i\n", new_value);              prints it out.
}
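
One possible fix, as a minimal sketch assuming C11 atomics and pthreads (names mirror the slide, but this is not the course's code): making the flag an atomic with release/acquire ordering guarantees that the write of shared_value is visible before the flag is seen as set.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  int shared_value = 0;
  atomic_int shared_flag = 0;

  void *sender_thread(void *arg)
  {
      (void)arg;
      shared_value = 42;                               /* ordinary write           */
      atomic_store_explicit(&shared_flag, 1,
                            memory_order_release);     /* publish it               */
      return NULL;
  }

  void *receiver_thread(void *arg)
  {
      (void)arg;
      while (atomic_load_explicit(&shared_flag,
                                  memory_order_acquire) == 0)
          ;                                            /* poll the flag            */
      printf("%i\n", shared_value);                    /* guaranteed to print 42   */
      return NULL;
  }

  int main(void)
  {
      pthread_t s, r;
      pthread_create(&r, NULL, receiver_thread, NULL);
      pthread_create(&s, NULL, sender_thread, NULL);
      pthread_join(s, NULL);
      pthread_join(r, NULL);
      return 0;
  }
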
Simple CMP Cache Coherency

Directory   Directory   Directory   Directory   ●
                                                 Four core machine supporting
                                                cache coherency
L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                ●Each core has a local L1 Data
                                                and Instruction cache.
               Interconnect                     ●The L2 cache is shared
                                                amongst all cores, and
                                                physically distributed into 4
   L1          L1          L1          L1
                                                disparate banks
  CPU 0       CPU 1       CPU 2       CPU 3
                                                ●The interconnect sends
                                                memory requests and
                                                responses back and forth
                                                between the caches
The Coherency Problem
  Directory   Directory   Directory   Directory

  L2 Bank     L2 Bank     L2 Bank     L2 Bank



                 Interconnect


     L1          L1          L1          L1

    CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X
The Coherency Problem
        Directory   Directory   Directory   Directory
                                                        ●
                                                            Misses in Cache
        L2 Bank     L2 Bank     L2 Bank     L2 Bank



                       Interconnect

Miss!
           L1          L1          L1          L1

          CPU 0       CPU 1       CPU 2       CPU 3


        Ld R1,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●
                                                      Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                  ● Goes to “home”
                                                  l2 (home often
                                                  determined by
                 Interconnect                     hash of address)

     L1          L1          L1          L1

    CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X
To
The Coherency Problem                                            Memory


  Directory   Directory   Directory   Directory
                                                  ●
                                                      Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                  ● Goes to “home”
                                                  l2 (home often
                                                  determined by
                 Interconnect                     hash of address)

                                                  ●If miss at home
     L1          L1          L1          L1       L2, read data from
    CPU 0       CPU 1       CPU 2       CPU 3      memory


  Ld R1,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●
                                                      Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                  ● Goes to “home”
                                                  l2 (home often
                                                  determined by
                 Interconnect                     hash of address)

                                                  ●If miss at home
     L1          L1          L1          L1       L2, read data from
    CPU 0       CPU 1       CPU 2       CPU 3      memory

                                                  ●Deposit data in
  Ld R1,X                                         both home L2 and
                                                  Local L1
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                   ●
                                                       Misses in Cache
  L2 Bank     L2 Bank     L2 Bank     L2 Bank
                                                   ● Goes to “home”
                                                   l2 (home often
                                                   determined by
                 Interconnect                      hash of address)

                                                   ●If miss at home
     L1          L1          L1          L1        L2, read data from
    CPU 0       CPU 1       CPU 2       CPU 3       memory

                                                   ●Deposit data in
  Ld R1,X                                          both home L2 and
                                                   Local L1


                       Mem(X) is now in both the
                       L2 and ONE L1 cache
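
A hedged sketch (invented helper names, not the actual hardware) of the home-L2 side of the read-miss path just walked through: fetch from memory if the home bank also misses, fill the requesting L1, and record the new sharer in the directory.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct dir_entry {
      bool    valid;                      /* line present in this home L2 bank    */
      uint8_t sharers;                    /* bit i set => CPU i's L1 holds a copy */
  };

  /* stand-ins for the hardware actions in the walkthrough */
  void fetch_from_memory(uint64_t line) { printf("fetch %llx from memory\n", (unsigned long long)line); }
  void fill_l1(int cpu, uint64_t line)  { printf("fill CPU %d's L1 with %llx\n", cpu, (unsigned long long)line); }

  void dir_handle_read_miss(struct dir_entry *d, int requester, uint64_t line)
  {
      if (!d->valid) {                          /* miss at the home L2 as well    */
          fetch_from_memory(line);              /* read the block from memory     */
          d->valid   = true;
          d->sharers = 0;
      }
      fill_l1(requester, line);                 /* data into the requesting L1    */
      d->sharers |= (uint8_t)(1u << requester); /* remember the new sharer        */
  }
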
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     same address



                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                     ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank        same address

                                                     ●    Miss in L1

                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3      Miss!


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     same address

                                                  ●   Miss in L1

                 Interconnect
                                                  ●   Sends request to L2

                                                  ●   Hits in L2
     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ●CPU 3 reads the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     same address

                                                  ●   Miss in L1

                 Interconnect
                                                  ●   Sends request to L2

                                                  ●   Hits in L2
     L1          L1          L1          L1       ●Data is placed in L1
   CPU 0       CPU 1       CPU 2       CPU 3      cache for CPU 3


  Ld R1,X                             Ld R2,X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● CPU now STORES
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     to address X



                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X
                              What happens?????
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● CPU now STORES
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     to address X



                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X                 Special hardware is needed in
                              order to either update or
                              invalidate the data in CPU 3's
                              cache
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● For this example, we
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     will assume a
                                                  directory based
                                                  invalidate protocol,
                                                  with write-thru L1
                 Interconnect                     caches


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory   Directory   Directory
                                                  ● Store updates the
  L2 Bank     L2 Bank     L2 Bank     L2 Bank     local L1 and writes-
                                                  thru to the L2


                 Interconnect


     L1          L1          L1          L1

   CPU 0       CPU 1       CPU 2       CPU 3


  Ld R1,X                             Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory    0, 3     Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1

   CPU 0       CPU 1      CPU 2      CPU 3


  Ld R1,X                           Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory    0, 3     Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1       ●The data in CPU3's
   CPU 0       CPU 1      CPU 2      CPU 3      cache is invalidated


  Ld R1,X                           Ld R2,X


  Store R2, X
The Coherency Problem
  Directory   Directory     0       Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1       ●The data in CPU3's
   CPU 0       CPU 1      CPU 2      CPU 3      cache is invalidated

                                                ●The L2 cache is
                                                updated with the new
  Ld R1,X                           Ld R2,X
                                                value


  Store R2, X
The Coherency Problem
  Directory   Directory     0       Directory
                                                ● Store updates the
  L2 Bank     L2 Bank     L2 Bank   L2 Bank     local L1 and writes-
                                                thru to the L2

                                                ●At the L2, the
                 Interconnect                   directory is inspected,
                                                showing CPU3 is
                                                sharing the line
     L1          L1         L1         L1       ●The data in CPU3's
   CPU 0       CPU 1      CPU 2      CPU 3      cache is invalidated

                                                ●The L2 cache is
                                                updated with the new
  Ld R1,X                           Ld R2,X
                                                value

                                                ● The system is now
  Store R2, X                                   “coherent”

                                                ● Note that CPU3 was
                                                removed from the
                                                directory
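
The sequence above (directory lookup at the home L2, invalidation of the other sharer, L2 update) can be summarized in C. This is only a sketch of a directory-based invalidate protocol with write-through L1 caches; the structure and helper names are invented for the example, not taken from a real design:

#include <stdint.h>

#define NUM_CPUS 4

/* One directory entry per L2 line: which L1 caches hold a copy. */
struct dir_entry {
    uint8_t sharers;     /* bit i set => CPU i's L1 has the line */
};

/* Stand-in hooks; in real hardware these are coherence messages. */
static void l1_invalidate(int cpu, uint64_t addr) { (void)cpu; (void)addr; }
static void l2_write(uint64_t addr, uint64_t data) { (void)addr; (void)data; }

/* Handle a write-through store arriving at the home L2 bank:
 * invalidate every other sharer, update the L2 copy, and keep
 * only the writer in the sharer list (as in the slides). */
static void home_l2_handle_store(struct dir_entry *e, int writer,
                                 uint64_t addr, uint64_t data)
{
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        if (cpu != writer && (e->sharers & (1u << cpu))) {
            l1_invalidate(cpu, addr);     /* e.g. CPU 3 loses its copy    */
            e->sharers &= ~(1u << cpu);   /* remove it from the directory */
        }
    }
    l2_write(addr, data);                 /* L2 gets the new value        */
    e->sharers |= (1u << writer);         /* writer still holds it in L1  */
}

int main(void)
{
    struct dir_entry x = { .sharers = (1u << 0) | (1u << 3) }; /* CPUs 0 and 3 share X */
    home_l2_handle_store(&x, /*writer=*/0, /*addr=*/0x1000, /*data=*/7);
    /* x.sharers now contains only CPU 0, matching the slides */
    return 0;
}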
Ordering
Directory   Directory   Directory   Directory
                                                ● Our protocol relies on
L2 Bank     L2 Bank     L2 Bank     L2 Bank     stores writing through
                                                to the L2 cache.


               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory
                                                ● Our protocol relies on
L2 Bank     L2 Bank     L2 Bank     L2 Bank     stores writing through
                                                to the L2 cache.

                                                ● If the stores are to
               Interconnect                     different addresses,
                                                there are multiple
                                                points within the
                                                system where the
   L1          L1          L1          L1
                                                stores may be
 CPU 0       CPU 1       CPU 2       CPU 3      reordered.


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory   Purple leaves the
                                                network first!
L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory   Stores are written
                                                to the shared L2
L2 Bank     L2 Bank     L2 Bank     L2 Bank     out-of-order
                                                (purple first, then
                                                red) !!!

               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory        Stores are written
                                                     to the shared L2
L2 Bank     L2 Bank     L2 Bank     L2 Bank          out-of-order
                                                     (purple first, then
                                                     red) !!!

               Interconnect


   L1          L1          L1          L1
                                                Interconnect is not
 CPU 0       CPU 1       CPU 2       CPU 3      the only cause for
                                                out-of-order!

Store R1,X

Store R2, Y
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank      L2 Bank



               Interconnect


   L1          L1          L1          L1

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X
                                Processor core may issue instructions
Store R2, Y                     out-of-order (remember out-of-order
                                machines??)
Ordering
Directory   Directory   Directory   Directory

L2 Bank     L2 Bank     L2 Bank     L2 Bank



               Interconnect
                                                L2 pipeline may also
                                                reorder requests to
   L1          L1          L1          L1       different addresses

 CPU 0       CPU 1       CPU 2       CPU 3


Store R1,X

Store R2, Y
L2 Pipeline Ordering

 Retry Fifo       Resource
                  Allocation
                               L2 Tag   L2 Data   Coherence
                     And
                               Access   Access     Control
From Network       Conflict
                  Detection
L2 Pipeline Ordering

 Retry Fifo        Resource
                   Allocation
                                 L2 Tag   L2 Data   Coherence
                      And
                                 Access   Access     Control
From Network        Conflict
                   Detection




         Two Memory Requests
         arrive on the network
L2 Pipeline Ordering

 Retry Fifo        Resource
                   Allocation
                                 L2 Tag   L2 Data   Coherence
                      And
                                 Access   Access     Control
From Network        Conflict
                   Detection




         Requests Serviced in-
         order
L2 Pipeline Ordering

 Retry Fifo         Resource
                    Allocation
                                 L2 Tag   L2 Data   Coherence
                       And
                                 Access   Access     Control
From Network        Conflict     Conflict!
                   Detection




         Conflicts are sent to
         retry fifo
L2 Pipeline Ordering

 Retry Fifo         Resource
                    Allocation
                                 L2 Tag   L2 Data   Coherence
                       And
                                 Access   Access     Control
From Network         Conflict
                    Detection




         Network is given
         priority
L2 Pipeline Ordering

 Retry Fifo       Resource
                  Allocation
                               L2 Tag         L2 Data   Coherence
                     And
                               Access         Access     Control
From Network       Conflict
                  Detection




                   Requests are now
                   executing in a different
                   order!
L2 Pipeline Ordering

 Retry Fifo       Resource
                  Allocation
                               L2 Tag         L2 Data   Coherence
                     And
                               Access         Access     Control
From Network       Conflict
                  Detection




                   Requests are now
                   executing in a different
                   order!
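
A toy model of this reordering, assuming only the retry-FIFO behaviour shown in the slides (a conflicting request is parked while newer network requests are given priority); it is purely illustrative, no real pipeline is modelled:

#include <stdio.h>

int main(void)
{
    const char *network[] = { "StA", "StB", "StC" };  /* arrival order          */
    int conflicts[]       = { 1,     0,     0     };  /* StA hits a conflict    */
    const char *retry_fifo[3];
    int rhead = 0, rtail = 0;

    printf("service order:");
    for (int i = 0; i < 3; i++) {
        if (conflicts[i])
            retry_fifo[rtail++] = network[i];   /* park it, try again later */
        else
            printf(" %s", network[i]);          /* network is given priority */
    }
    while (rhead < rtail)
        printf(" %s", retry_fifo[rhead++]);     /* retried request runs last */
    printf("\n");                               /* prints: StB StC StA       */
    return 0;
}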
Simple Case (revisited)
int shared_flag = 0;
int shared_value = 0;
                                       Memory
void sender_thread()
{
   shared_value = 42;                 Interconnect
   shared_flag = 1;
}                                  CPU 0        CPU 1

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
Simple Case (revisited)
         Directory   Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank



                         Interconnect


            L1           L1         L1          L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 shared_flag = 1;   1                        new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            Receiver is
                                                             spinning on
                                                             “shared_flag”
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            “shared_value”
                                                             has reset value
                                                             of 0
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0         Store to shared
                                                          value writes-thru
                                                          L1
                   42 Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3        Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank
             0                                   0         Store to
                                                           “shared_flag”
                                                           writes thru L1
                 1   42 Interconnect


            L1           L1         L1          0
                                                L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 shared_flag = 1;   1                        new_value = shared_value;
Simple Case (revisited)
            3        Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank
             0                                   0
                                                           Both stores are
                                                           now sitting in
                 1   42 Interconnect                       the network


            L1           L1         L1          0
                                                L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 shared_flag = 1;   1                        new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0
                                                          Store to
                                                          “shared_flag” is
                   42 Interconnect                        first to leave the
                                                          network

            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0
                                                          1) “shared_flag”
                                                          is updated
                   42 Interconnect
                                                          2) Coherence
                                                          protocol
            L1           L1        L1          0
                                               L1         invalidates copy
                                                          in CPU3
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0


                   42 Interconnect


            L1           L1        L1          L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Receiver that is
             0
             1                                  0         polling now
                                                          misses in the
                                                          cache and sends
                   42 Interconnect                        request to L2!


            L1           L1        L1          L1       Miss!
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Response comes
             0
             1                                  0         back.

                                                          Flag is now set!
                   42 Interconnect
                                                          Time to read the
                                                          “shared_value”!
            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0

                                                          Note that the write
                   42 Interconnect                        to “shared_value”
                                                          is still sitting in the
                                                          network!
            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0


                   42 Interconnect


            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0


                   42 Interconnect


            L1           L1        L1         1
                                              L1 0

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0         Write of “42” to
                                                          “shared_value”
                                                          finally escapes
                   42 Interconnect                        the network, but
                                                          it is TOO LATE!

            L1           L1        L1         1
                                              L1 0

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Simple Case (revisited)
            3       Directory   Directory     3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                  0
                                                         Our code doesn't
                                                         always work!
                   42 Interconnect
                                                         WTF???

            L1           L1        L1         1
                                              L1 0

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value; 0
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank     The architecture needs
             0
             1                                  0       to expose ordering
                                                        properties to the
                                                        programmer, so that
                   42 Interconnect                      the programmer may
                                                        write correct code.

            L1           L1        L1         1
                                              L1 0      This is called the
                                                        “Memory Model”
          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 shared_flag = 1;   1                       new_value = shared_value;
Sequential Consistency
 Hardware GUARANTEES that all memory
operations appear in a single global order,
consistent with each thread's program order.

● Benefits
  ● Simplifies programming (our initial code

    would have worked)
● Costs

  ● Hard to implement micro-architecturally

  ● Can hurt performance

  ● Hard to verify
Weak Consistency
 Loads and stores to different addresses may
be re-ordered

● Benefits
  ● Much easier to implement and build

  ● Higher performing

  ● Easy to verify

● Costs

  ● More complicated for the programmer

  ● Requires special “ordering” instructions for

    synchronization
Instructions for Weak Memory
                 Models
●   Write Barrier
    ● Don't issue a write until all preceding writes have completed



●   Read Barrier
    ● Don't issue a read until all preceding reads have completed



●   Memory Barrier
    ● Don't issue a memory operation until all preceding memory

      operations have completed

Etc etc
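
In C, such barriers are normally reached through compiler or library primitives rather than raw instructions. A minimal sketch using C11 fences; the mapping to the write/read/full barriers above is approximate and depends on how the compiler lowers the fences on a weakly ordered machine. In real code the fences sit between the actual loads and stores they are meant to order:

#include <stdatomic.h>

void ordering_examples(void)
{
    /* Release fence: memory operations before the fence are ordered
     * before stores that follow it (the "write barrier" role). */
    atomic_thread_fence(memory_order_release);

    /* Acquire fence: loads before the fence are ordered before memory
     * operations that follow it (the "read barrier" role). */
    atomic_thread_fence(memory_order_acquire);

    /* Full barrier: all earlier memory operations are ordered before
     * all later ones. */
    atomic_thread_fence(memory_order_seq_cst);
}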
Simple Case (write barrier)
int shared_flag = 0;
int shared_value = 0;
                                       Memory
void sender_thread()
{
   shared_value = 42;                 Interconnect
   __write_barrier();
   shared_flag = 1;                CPU 0        CPU 1
}

void receiver_thread()
{
   while (shared_flag == 0) { }
   int new_value = shared_value;
   printf("%i\n", new_value);
}
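
For comparison, a sketch of the same producer/consumer pattern written with standard C11 atomics instead of the hypothetical __write_barrier(); a release store on the flag and an acquire load in the spin loop give the ordering the example needs. Compile with something like cc -std=c11 -pthread:

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static atomic_int shared_flag = 0;
static int shared_value = 0;

static void *sender_thread(void *arg)
{
    (void)arg;
    shared_value = 42;
    /* Release store: the write of 42 is visible before the flag is seen. */
    atomic_store_explicit(&shared_flag, 1, memory_order_release);
    return NULL;
}

static void *receiver_thread(void *arg)
{
    (void)arg;
    /* Acquire load: once the flag is seen, the 42 is guaranteed visible. */
    while (atomic_load_explicit(&shared_flag, memory_order_acquire) == 0) { }
    printf("%i\n", shared_value);   /* always prints 42 */
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver_thread, NULL);
    pthread_create(&s, NULL, sender_thread, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}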
Simple Case (revisited)
         Directory   Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank



                         Interconnect


            L1           L1         L1          L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 __write_barrier();                          new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            Receiver is
                                                             spinning on
                                                             “shared_flag”
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;    1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0            “shared_value”
                                                             has reset value
                                                             of 0
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  0         Store to shared
                                                          value writes-thru
                                                          L1
                   42 Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
                     3       Directory   Directory   Directory

                  L2 Bank    L2 Bank     L2 Bank     L2 Bank
                      0                                  0         write_barrier
                                                                    prevents issue of
                                                                   “shared_flag = 1”
                            42 Interconnect                        until the
                                                                   “shared_value =
                                                                   42” is complete.
                     L1           L1        L1          0
                                                        L1         This is tracked via
                                                                   acknowledgments
                   CPU 0      CPU 1       CPU 2       CPU 3

          shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                           0
          __write_barrier();                         new_value = shared_value;
          shared_flag = 1;   1
Blocked
Simple Case (revisited)
                     3       Directory   Directory   Directory

                  L2 Bank    L2 Bank     L2 Bank     L2 Bank
                      0                                  42
                                                         0         Write eventually
                                                                   leaves network
                            42 Interconnect


                     L1           L1        L1          0
                                                        L1

                   CPU 0      CPU 1       CPU 2       CPU 3

          shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                           0
          __write_barrier();                         new_value = shared_value;
          shared_flag = 1;   1
Blocked
Simple Case (revisited)
                     3       Directory   Directory   Directory

                  L2 Bank    L2 Bank     L2 Bank     L2 Bank
                      0                                  42
                                                         0         Write is
                                                                   acknowledged
                                  Interconnect


                     L1           L1        L1          0
                                                        L1

                   CPU 0      CPU 1       CPU 2       CPU 3

          shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                           0
          __write_barrier();                         new_value = shared_value;
          shared_flag = 1;   1
  Still
Blocked
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0                                  42
                                                0         Barrier is now
                                                          complete!
                         Interconnect


            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3        Directory   Directory   Directory

         L2 Bank     L2 Bank     L2 Bank     L2 Bank
             0                                  42
                                                 0         Store to
                                                           “shared_flag”
                                                           writes thru L1
                 1       Interconnect


            L1           L1         L1          0
                                                L1

          CPU 0       CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                       while (shared_flag == 0) { }
                                                                   0
 __write_barrier();                          new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0
                                                          Store to
                                                          “shared_flag”
                         Interconnect                     leaves the
                                                          network

            L1           L1        L1          0
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0
                                                          1) “shared_flag”
                                                          is updated
                         Interconnect
                                                          2) Coherence
                                                          protocol
            L1           L1        L1          0
                                               L1         invalidates copy
                                                          in CPU3
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;    1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1          L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Receiver that is
             0
             1                                 42
                                                0         polling now
                                                          misses in the
                                                          cache and sends
                         Interconnect                     request to L2!


            L1           L1        L1          L1       Miss!
          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Response comes
             0
             1                                 42
                                                0         back.

                                                          Flag is now set!
                         Interconnect
                                                          Time to read the
                                                          “shared_value”!
            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory   Directory

         L2 Bank    L2 Bank     L2 Bank     L2 Bank
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1          1
                                               L1

          CPU 0      CPU 1       CPU 2       CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Correct Code!!!
             0
             1                                 42
                                                0


                         Interconnect


            L1           L1        L1         1 0
                                              L1 42

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Simple Case (revisited)
            3       Directory   Directory      3

         L2 Bank    L2 Bank     L2 Bank     L2 Bank       Correct Code!!!
             0
             1                                 42
                                                0


                         Interconnect                     What about
                                                          reads.....

            L1           L1        L1         1 0
                                              L1 42

          CPU 0      CPU 1       CPU 2      CPU 3

 shared_value = 42; 42                      while (shared_flag == 0) { }
                                                                  0
 __write_barrier();                         new_value = shared_value;
 shared_flag = 1;   1
Weak or Strong?

●The academic community pushed hard for sequential
consistency:

“Multiprocessors Should Support Simple Memory Consistency
Models” Mark Hill, IEEE Computer, August 1998
Weak or Strong?

●The academic community pushed hard for sequential
consistency:

“Multiprocessors Should Support Simple Memory Consistency
Models” Mark Hill, IEEE Computer, August 1998

    WRONG!!!

 Most new architectures support relaxed memory models
(ARM, IA64, TILE, etc). Much easier to implement and verify.
Not a programming issue, because the complexity is hidden
behind a library, and 99.9% of programmers don't have to
worry about these issues!
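
As one example of the complexity being hidden behind a library: if the shared data is protected by an ordinary pthread mutex, the lock and unlock operations already contain the required barriers, so the programmer never issues them explicitly. A minimal sketch, with error handling omitted (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_value = 0;
static int shared_flag  = 0;

static void *sender(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* lock/unlock already contain   */
    shared_value = 42;              /* the needed ordering, so the   */
    shared_flag  = 1;               /* code adds no explicit barrier */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *receiver(void *arg)
{
    (void)arg;
    int done = 0, v = 0;
    while (!done) {
        pthread_mutex_lock(&lock);
        if (shared_flag) { v = shared_value; done = 1; }
        pthread_mutex_unlock(&lock);
    }
    printf("%i\n", v);              /* always prints 42 */
    return NULL;
}

int main(void)
{
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}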
Break Problem
You are one of P recently arrested prisoners. The warden makes the
following announcement:

"You may meet together today and plan a strategy, but after today
you will be in isolated cells and have no communication with one
another. I have set up a "switch room" which contains a light switch,
which is either on or off. The switch is not connected to anything.
Every now and then, I will select one prisoner at random to enter the
"switch room". This prisoner may throw the switch (from on to off, or
vice-versa), or may leave the switch unchanged. Nobody else will
ever enter this room. Each prisoner will visit the switch room
arbitrarily often. More precisely, for any N, eventually each of you
will visit the switch room at least N times. At any time, any of you
may declare: "we have all visited the switch room at least once." If
the claim is correct, I will set you free. If the claim is incorrect, I will
feed all of you to the sharks."

Devise a winning strategy when you know that the initial state of the
switch is off. Hint: not all prisoners need to do the same thing.
1                                                         2




                                                              Introduction to Green Computing

                                                              • What do we mean by Green Computing?

                                                              • Why Green Computing?
     TDT4260
                                                              • Measuring “greenness”

     Introduction to Green Computing
                                                              • Research into energy consumption reduction
     Asymmetric multicore processors


                                       Alexandru Iordan




3                                                         4



    What do we mean by Green                                  What do we mean by Green
    Computing?                                                Computing?

                                                              The green computing movement is a multifaceted global
                                                              effort to reduce energy consumption and to promote
                                                              sustainable development in the IT world.
                                                              [Patrick Kurp, Green computing in Communications of
                                                              the ACM, 2008]




5                                                         6




    Why Green Computing?                                      Measuring “greenness”

        • Heat dissipation                                    • Non-standard metrics
          problems                                               –   Energy (Joules)
                                                                 –   Power (Watts)
                                                                 –   Energy-per-instructions ( Joules / No. instructions )
        • High energy bills                                      –   Energy-delayN-product ( Joules * secondsN )
                                                                 –   PerformanceN / Watt ( (No. instructions / second)N / Watt )

        • Growing environmental
                                                              • Standard metrics
          impact                                                 – Data centers: Power Usage Effectiveness metric (The Green Grid
                                                                   consortium)
                                                                 – Servers: ssj_ops / Watt metric (SPEC consortium)
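
A tiny worked example of the metrics above; all numbers are invented purely to show the arithmetic (compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Illustrative numbers only. */
    double joules  = 120.0;      /* energy consumed by a run */
    double seconds = 3.0;        /* execution time           */
    double insns   = 6.0e10;     /* instructions executed    */
    double watts   = joules / seconds;

    printf("Power              : %.1f W\n",      watts);
    printf("Energy/instruction : %.2e J\n",      joules / insns);
    printf("EDP  (N=1)         : %.1f J*s\n",    joules * seconds);
    printf("ED2P (N=2)         : %.1f J*s^2\n",  joules * pow(seconds, 2));
    printf("Perf^2 / Watt      : %.2e\n",        pow(insns / seconds, 2) / watts);
    return 0;
}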




8                                                                              9




     Research into energy consumption                                               Maximizing Power Efficiency with
     reduction                                                                      Asymmetric Multicore Systems
                                                                                    Fedorova et al., Communications of the ACM, 2009

                                                                                    • Outline

                                                                                       – Asymmetric multicore processors

                                                                                       – Scheduling for parallel and serial applications

                                                                                       – Scheduling for CPU- and memory-intensive applications




10                                                                             11




     Asymmetric multicore processors                                                Efficient utilization of AMPs
     • What makes a multicore asymmetric?
        – a few powerful cores (high clock freq., complex pipelines, OoO
          execution)                                                                • Efficient mapping of threads/workloads
        – many simple cores (low clock freq., simple pipeline, low power
          requirement)                                                                 – parallel applications
                                                                                           • serial part → complex cores
     • Homogeneous ISA AMP                                                                 • scalable parallel part → simple cores
        – the same binary code can run on both types of cores
                                                                                       – microarchitectural characteristics of workloads
                                                                                           • CPU intensive applications → complex cores
     • Heterogeneous ISA AMP                                                               • memory intensive applications → simple cores
        – code compiled separately for each type of core
        – examples: IBM Cell, Intel Larrabee




12                                                                             13




     Sequential vs. parallel characteristics                                        Parallelism-aware scheduling
     • Sequential programs                                                          • Goal: improve overall system efficiency (not the
        – high degree of ILP                                                          performance of a particular application)
        – can utilize features of a complex core (super-scalar pipeline, OoO
          execution, complex branch prediction)
                                                                                    • Idea: assign sequential applications/phases to run on
     • Parallel programs                                                              the complex cores
        – high number of parallel threads/tasks (compensates for low ILP and
          masks memory delays)
                                                                                    • Does NOT provide fairness
     • Having both complex and simple cores gives AMPs
       applicability for a wider range of applications




14                                                                    15




     Challenges of PA scheduling                                           “Heterogeneity”-aware scheduling

     • Detecting serial and parallel phases
        – limited scalability of threads can yield wrong solutions         • Goal: improve overall system efficiency

     • Thread migration overhead                                           • Idea:
        – migration across memory domains is expensive                        – CPU-intensive applications/phases → complex cores
        – scheduler must be topology aware                                    – memory-intensive applications/phases → simple cores


                                                                           • Inherently unfair
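
Both scheduling ideas reduce to a per-thread (or per-phase) core-type decision. A minimal sketch of that decision, not taken from the Fedorova et al. paper; the fields and the 0.5 threshold are made up for illustration:

#include <stdio.h>

enum core_type { COMPLEX_CORE, SIMPLE_CORE };

struct phase_info {
    int    runnable_threads;    /* 1 => serial phase                    */
    double mem_stall_fraction;  /* fraction of cycles stalled on memory */
};

/* Parallelism-aware + heterogeneity-aware placement in one rule:
 * serial, CPU-intensive phases go to the fast (complex) cores,
 * scalable or memory-bound phases go to the simple cores. */
static enum core_type pick_core(const struct phase_info *p)
{
    if (p->runnable_threads == 1 && p->mem_stall_fraction < 0.5)
        return COMPLEX_CORE;
    return SIMPLE_CORE;
}

int main(void)
{
    struct phase_info serial_cpu = { 1, 0.1 };
    struct phase_info parallel   = { 8, 0.2 };
    printf("%d %d\n", pick_core(&serial_cpu), pick_core(&parallel)); /* 0 1 */
    return 0;
}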




16                                                                    17




     Challenges of HA scheduling                                           Summary
                                                                           • Green Computing focuses on improving energy-
     • Classifying threads/phases as CPU- or memory-                         efficiency and sustainable development in the IT
       bound                                                                 world
        – two approaches presented: direct measurement and modeling
                                                                           • AMPs promise higher energy-efficiency than
     • Long execution time (direct measurement approach)                     symmetric processors
       or need for offline information (modeling approach)
                                                                           • Schedulers must be designed to take advantage of
                                                                             the asymmetric hardware




18                                                                    19




     References
     • Kirk W. Cameron, The road to greener IT pastures
       in IEEE Computer, 2009

     • Dan Herrick and Mark Ritschard, Greening your
       computing technology, the near and far
       perspectives in Proceedings of the 37th ACM
       SIGUCCS, 2009

     • Luiz A. Barroso, The price of performance in ACM
       Queue, 2005




2


                                                                                      Contents


                                                                                        1 Njord Power5+ hardware

                                                                                        2 Kongull AMD Istanbul hardware


                 NTNU HPC Infrastructure                                                3 Resource Managers
                 IBM AIX Power5+, CentOS AMD Istanbul
                                                                                        4 Documentation
                 Jørn Amundsen
                 IDI/NTNU IT
                 2011-03-25







3                                                                                4


     Power5+ hardware                                                                 Cache and memory

                                                                                               • 16 x 64-bit word cache lines (32 in L3)
                                                                                               • Hardware cache line prefetch on loads
              Cache and memory
                                                                                               • Reads from memory are written into L2
              Chip layout
                                                                                               • External L3, acts as a victim cache for L2
              System level                                                                     • L2 and L3 are shared between cores
               TOC
                                                                                               • L1 is write-through
                                                                                               • Cache coherence is maintained system-wide at L2 level
               • 4K page size by default, kernel supports 64K and 16M pages




5                                                                                                                                            6


     Chip design

     [Chip diagram: two power5+ cores on one chip. Each core has
      execution units (2 LSU, 2 FXU, 2 FPU, 1 BXU, 1 CRL), 64-bit
      registers (32 GPR, 32 FPR), decode & schedule logic, a 64K
      2-way L1 I-cache and a 32K 4-way L1 D-cache. The two cores
      share a 1.92M 10-way L2 cache, backed by an external 36M
      12-way L3 cache (35.2 GB/s). A switch fabric and memory
      controller connect the chip to 16-128GB of DDR2 main
      memory (25.6 GB/s).]

     SMT

            • In a typical application, the processor core might be idle 50-80% of
              the time, waiting for memory
            • An obvious solution would be to let another thread execute while our
              thread is waiting for memory
            • This is known as hyper-threading in the Intel/AMD world, and
              Simultaneous Multithreading (SMT) with IBM
            • SMT is supported in hardware throughout the processor core
            • SMT is more efficient than hyper-threading, with less context switch
              overhead
            • Power5 and 6 support 1 thread/core or SMT with 2 threads/core,
              while the latest Power7 supports 4 threads/core
            • SMT is enabled or disabled dynamically on a node with the
              (privileged) command smtctl


SMT (2)

• SMT is beneficial if you are doing a lot of memory references and your
  application performance is memory bound
• Enabling SMT doubles the number of MPI tasks per node, from 16 to 32.
  Requires your application to be sufficiently scalable.
• SMT is only available in user space with batch processing, by adding the
  structured comment string:
     #@ requirements = ( Feature == "SMT" )


Chip module packaging

• 4 chips and 4 L3 caches are HW integrated onto an MCM
• 90.25 cm², 89 layers of metal






The system level

• On a p575 system, a node is 2 MCMs / 8 chips / 16 1.9GHz cores
• The Njord system is
    - 2 x 16-way 32 GiB login nodes
    - 4 x 16-way 16 GiB I/O nodes (used with GPFS)
    - 186 x 16-way 32 GiB compute nodes
    - 6 x 16-way 128 GiB compute nodes
• GPFS parallel file system, 33 TiB fiber disks and 62 TiB SATA disks
• Interconnect
    - IBM Federation, a multistage crossbar network providing 2 GiB/s
      bidirectional bandwidth and 5 µs latency system-wide MPI performance


GPFS

• An important feature of an HPC system is the capability of moving large
  amounts of data from or to memory, across nodes, and from or to permanent
  storage
• In this respect a high-quality, high-performance global file system is
  essential
• GPFS is a robust parallel FS geared at high-BW I/O, used extensively in HPC
  and in the database industry
• Disk access is ≈ 1000 times slower than memory access, hence the key factors
  for performance are
    - spreading (striping) files across many disk units
    - using memory to cache files
    - hiding latencies in software






GPFS and parallel I/O (2)

• High transfer rates are achieved by distributing files in blocks, round
  robin, across a large number of disk units, up to thousands of disks
• On njord, the GPFS block size and stripe unit is 1 MB
• In addition to multiple disks servicing file I/O, multiple threads might
  read, write or update (R+W) a file simultaneously
• GPFS uses multiple I/O servers (4 dedicated nodes on njord), working in
  parallel for performance and maintaining file and file metadata consistency
• High performance comes at a cost. Although GPFS can handle directories with
  millions of files, it is usually best to use fewer and larger files, and to
  access files in larger chunks


File buffering

• The kernel does read-aheads and write-behinds of file blocks
• The kernel does heuristics on I/O to discover sequential and strided forward
  and backward reads
• The disadvantage is memory copying of all data
• Can be bypassed with DIRECT_IO, which can be useful with large (MB-sized)
  I/O, utilizing application I/O patterns

[Figure: data path from the user application through the application buffer
and the kernel's file system buffer to the disk subsystem]
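As an aside, a minimal sketch of bypassing the kernel file buffer
(Linux-flavoured, using O_DIRECT; the DIRECT_IO mentioned above is the
AIX/GPFS counterpart, and the file name and sizes below are made up):

#define _GNU_SOURCE 1      /* for O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
    const size_t chunk = 1 << 20;   /* 1 MB, matching the GPFS stripe unit */
    void *buf = 0;

    /* O_DIRECT requires a suitably aligned user buffer (here: 4 KiB). */
    if (posix_memalign(&buf, 4096, chunk) != 0)
        return 1;

    int fd = open("bigfile.dat", O_RDONLY | O_DIRECT);  /* bypass the page cache */
    if (fd < 0) { free(buf); return 1; }

    ssize_t n = read(fd, buf, chunk);   /* data goes straight into the user buffer */
    (void)n;

    close(fd);
    free(buf);
    return 0;
}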





AMD Istanbul hardware

• Cache and memory
• System level
• TOC


Cache and memory

• 6 x 128 KiB L1 cache
• 6 x 512 KiB L2 cache
• 1 x 6 MiB L3 cache
• 24 or 48 GiB DDR3 RAM






The system level

• A node is 2 chips / 12 2.4GHz cores
• The Kongull system is
    - 1 x 12-way 24 GiB login nodes
    - 4 x 12-way 24 GiB I/O nodes (used with GPFS)
    - 52 x 12-way 24 GiB compute nodes
    - 44 x 12-way 48 GiB compute nodes
• Nodes compute-0-0 – compute-0-39 and compute-1-0 – compute-1-11 are 24 GiB
  @ 800 MHz, while compute-1-12 – compute-1-15 and compute-2-0 – compute-2-39
  are 48 GiB @ 667 MHz bus frequency
• GPFS parallel file system, 73 TiB
• Interconnect
    - A fat tree implemented with HP ProCurve switches, 1 Gb/s from node to
      rack switch, then 10 Gb/s from the rack switch to the top-level switch.
      Bandwidth and latency are left as a programming exercise.


Resource Managers

• Resource Managers
• Njord classes
• Kongull queues
• TOC



Resource Managers

• Need efficient (and fair) utilization of the large pool of resources
• This is the domain of queueing (batch) systems or resource managers
• Administers the execution of (computational) jobs and provides resource
  accounting across users and accounts
• Includes distribution of parallel (OpenMP/MPI) threads/processes across
  physical cores and gang scheduling of parallel execution
• Jobs are Unix shell scripts with batch system keywords embedded within
  structured comments
• Both Njord and Kongull employ a series of queues (classes) administering
  various sets of possibly overlapping nodes with possibly different priorities
• IBM LoadLeveler on Njord, Torque (a development from OpenPBS) on Kongull


Njord job class overview

class      min-max   max nodes   max         description
           nodes     / job       runtime
forecast   1-180     180         unlimited   top priority class dedicated to forecast jobs
bigmem     1-6       4           7 days      high priority 115GB memory class
large      4-180     128         21 days     high priority class for jobs of 64 processors or more
normal     1-52      42          21 days     default class
express    1-186     4           1 hour      high priority class for debugging and test runs
small      1/2       1/2         14 days     low priority class for serial or small SMP jobs
optimist   1-186     48          unlimited   checkpoint-restart jobs



Njord job class overview (2)

• Forecast is the highest priority queue; it suspends everything else
• Beware: node memory (except bigmem) is split in 2, to guarantee available
  memory for forecast jobs
• A C-R job runs at the very lowest priority; any other job will terminate and
  requeue an optimist queue job if there are not enough available nodes
• Optimist class jobs need an internal checkpoint-restart mechanism
• AIX LoadLeveler imposes node job memory limits, e.g. jobs oversubscribing
  available node memory are aborted with an email


LoadLeveler sample jobscript

# @ job_name = hybrid_job
# @ account_no = ntnuXXX
# @ job_type = parallel
# @ node = 3
# @ tasks_per_node = 8
# @ class = normal
# @ ConsumableCpus(2) ConsumableMemory(1664mb)
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ queue

export OMP_NUM_THREADS=2
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER/test
if [ ! -d $w ]; then mkdir -p $w; fi
cd $w
$HOME/a.out
llq -w $LOADL_STEP_ID
exit 0






LoadLeveler sample C-R email (1/2)

Date: Mon, 21 Mar 2011 18:31:37 +0100
From: loadl@hpc.ntnu.no
To: joern@hpc.ntnu.no
Subject: z2rank_s_5

From: LoadLeveler

LoadLeveler Job Step: f05n02io.791345.0
Executable: /home/ntnu/joern/run/z2rank/logs/skipped/z2rank_s_5.job
Executable arguments:
State for machine: f14n06
LoadL_starter: The program, z2rank_s_5.job, exited normally and returned
an exit code of 0.

State for machine: f09n06
State for machine: f13n04
State for machine: f14n04
State for machine: f08n06
State for machine: f12n06
State for machine: f15n07
State for machine: f18n04


LoadLeveler sample C-R email (2/2)

This job step was dispatched to run 18 time(s).
This job step was rejected by Starter 0 time(s).
Submitted at: Mon Mar 21 10:02:56 2011
Started at: Mon Mar 21 18:16:59 2011
Exited at: Mon Mar 21 18:31:37 2011
             Real Time:   0 08:28:41
    Job Step User Time:  16 06:34:29
  Job Step System Time:   0 00:21:15
   Total Job Step Time:  16 06:55:44
     Starter User Time:   0 00:00:19
   Starter System Time:   0 00:00:09
    Total Starter Time:   0 00:00:28





Kongull job queue overview

class      min-max   max nodes   max        description
           nodes     / job       runtime
default    1-52      52          35 days    default queue except IPT, SFI IO and Sintef Petroleum
express    1-96      96          1 hour     high priority queue for debugging and test runs
bigmem     1-44      44          7 days     default queue for IPT, SFI IO and Sintef Petroleum
optimist   1-96      48          28 days    checkpoint-restart jobs

• Oversubscribing node physical memory crashes the node
• This might happen if you do not specify the following in your job script:
     #PBS -lnodes=1:ppn=12
• If all nodes are not reserved, the batch system will attempt to share nodes by default


Documentation

• Njord User Guide: http://docs.notur.no/ntnu/njord-ibm-power-5
• Notur load stats: http://www.notur.no/hardware/status/
• Kongull support wiki: http://hpc-support.idi.ntnu.no/
• Kongull load stats: http://kongull.hpc.ntnu.no/ganglia/


TDT4260 Computer Architecture
                              Mini-Project Guidelines

                                      Alexandru Ciprian Iordan
                                      iordan@idi.ntnu.no

                                            January 10, 2011


1 Introduction

The Mini-Project accounts for 20% of the final grade in TDT4260 Computer Architecture. Your task is
to develop and evaluate a prefetcher using the M5 simulator. M5 is currently one of the most popular
simulators for computer architecture research and has a rich feature set. Consequently, it is a very
complex piece of software. To make your task easier, we have created a simple interface to the memory
system that you can use to develop your prefetcher. Furthermore, you can evaluate your prefetchers by
submitting your code via a web interface. This web interface runs your code on the Kongull cluster with
the default simulator setup. It is also possible to experiment with other parameters, but then you will have
to run the simulator yourself. The web interface, the modified M5 simulator and more documentation
can be found at http://dm-ark.idi.ntnu.no/.
The Mini-Project is carried out in groups of 2 to 4 students. In some cases we will allow students to
work alone. You will be graded based on both a written paper and a short oral presentation.
Make sure you clearly cite the source of information, data and figures. Failure to do so is regarded as
cheating and is handled according to NTNU guidelines. If you have any questions, send an e-mail to
teaching assistant Alexandru Ciprian Iordan (iordan@idi.ntnu.no).


1.1     Mini-Project Goals

The Mini-Project has the following goals:
      • Many computer architecture topics are best analyzed by experiments and/or detailed studies. The
        Mini-Project should provide training in such exercises.
      • Writing about a topic often increases the understanding of it. Consequently, we require that the
        result of the Mini-Project is a scientific paper.


2 Practical Guidelines

2.1     Time Schedule and Deadlines

The Mini-Project schedule is shown in Table 1. If these deadlines collide with deadlines in other subjects,
we suggest that you consider handing in the Mini-Project earlier than the deadline. If you miss the final
deadline, this will reduce the maximum score you can be awarded.



Deadline                           Description
 Friday 21. January                 List of group members delivered to Alexandru Ciprian Ior-
                                    dan (iordan@idi.ntnu.no) by e-mail
 Friday 4. March                    Short status report and an outline of the final report delivered to
                                    Alexandru Ciprian Iordan (iordan@idi.ntnu.no) by e-mail
 Friday 8. April 12:00 (noon)       Final paper deadline. Deliver the paper through It’s Learning. De-
                                    tailed report layout requirements can be found in section 2.2.
 Week 15 (11. - 15. April)          Compulsory 10 minute oral presentations

                                       Table 1: Mini-Project Deadlines


2.2       Paper Layout

The paper must follow the IEEE Transactions style guidelines available here:
http://www.ieee.org/publications_standards/publications/authors/authors_
journals.html#sect2
Both Latex and Word templates are available, but we recommend that you use Latex. The paper must
use a maximum of 8 pages. Failure to comply with these requirements will reduce the maximum score
you can be awarded.
In addition, we will deduct points if:
      • The paper does not have a proper scientific structure. All reports must contain the following sec-
        tions: Abstract, Introduction, Related Work or Background, Prefetcher Description, Methodology,
        Results, Discussion and Conclusion. You may rename the “Prefetcher Description” section to a
        more descriptive title. Acknowledgements and Author biographies are optional.
      • Citations are not used correctly. If you use a figure that somebody else has made, a citation must appear in
        the figure text.
      • NTNU has acquired an automated system that checks for plagiarism. We may run this system on
        your papers so make sure you write all text yourself.


2.3       Evaluation

The Mini-Project accounts for 20% of the total grade in TDT4260 Computer Architecture. Within the
Mini-Project, the report counts 80% and the oral presentation 20%.
The report grade will be based on the following criteria:
      •   Language and use of figures
      •   Clarity of the problem statement
      •   Overall document structure
      •   Depth of understanding for the field of computer architecture
      •   Depth of understanding of the investigated problem
The oral presentation grade will be based on the following criteria:
      •   Presentation structure
      •   Quality and clarity of the slides
      •   Presentation style
      •   If you use more than the provided time, you will lose points.



M5 simulator system
TDT4260 Computer Architecture
    User documentation




   Last modified: November 23, 2010
Contents

1 Introduction
  1.1 Overview
  1.2 Chapter outlines

2 Installing and running M5
  2.1 Download
  2.2 Installation
      2.2.1 Linux
      2.2.2 VirtualBox disk image
  2.3 Build
  2.4 Run
      2.4.1 CPU2000 benchmark tests
      2.4.2 Running M5 with custom test programs
  2.5 Submitting the prefetcher for benchmarking

3 The prefetcher interface
  3.1 Memory model
  3.2 Interface specification
  3.3 Using the interface
      3.3.1 Example prefetcher

4 Statistics

5 Debugging the prefetcher
  5.1 m5.debug and trace flags
  5.2 GDB
  5.3 Valgrind
Chapter 1

Introduction

You are now going to write your own hardware prefetcher, using a modified
version of M5, an open-source hardware simulator system. This modified
version presents a simplified interface to M5’s cache, allowing you to con-
centrate on a specific part of the memory hierarchy: a prefetcher for the
second level (L2) cache.


1.1    Overview

This documentation covers the following:

   • Installing and running the simulator

   • Machine model and memory hierarchy

   • Prefetcher interface specification

   • Using the interface

   • Testing and debugging the prefetcher on your local machine

   • Submitting the prefetcher for benchmarking

   • Statistics


1.2    Chapter outlines

The first chapter gives a short introduction, and contains an outline of the
documentation.



The second chapter starts with the basics: how to install the M5 simulator.
There are two possible ways to install and use it. The first is as a stand-
alone VirtualBox disk-image, which requires the installation of VirtualBox.
This is the best option for those who use Windows as their operating system
of choice. For Linux enthusiasts, there is also the option of downloading a
tarball, and installing a few required software packages.
The chapter then continues to walk you through the necessary steps to
get M5 up and running: building from source, running with command-line
options that enable prefetching, running local benchmarks, compiling and
running custom test-programs, and finally, how to submit your prefetcher
for testing on a computing cluster.
The third chapter gives an overview of the simulated system, and de-
scribes its memory model. There is also a detailed specification of the
prefetcher interface, and tips on how to use it when writing your own
prefetcher. It includes a very simple example prefetcher with extensive com-
ments.
The fourth chapter contains definitions of the statistics used to quantita-
tively measure prefetchers.
The fifth chapter gives details on how to debug prefetchers using advanced
tools such as GDB and Valgrind, and how to use trace-flags to get detailed
debug printouts.




Chapter 2

Installing and running M5

2.1     Download

Download the modified M5 simulator from the PfJudgeβ website.


2.2     Installation

2.2.1   Linux

Software requirements (specific Debian/Ubuntu packages mentioned in paren-
theses):

   • g++ >= 3.4.6

   • Python and libpython >= 2.4 (python and python-dev)

   • Scons > 0.98.1 (scons)

   • SWIG >= 1.3.31 (swig)

   • zlib (zlib1g-dev)

   • m4 (m4)

To install all required packages in one go, issue instructions to apt-get:
sudo apt-get install g++ python-dev scons swig zlib1g-dev m4
The simulator framework comes packaged as a gzipped tarball. Start the ad-
venture by unpacking with tar xvzf framework.tar.gz. This will create
a directory named framework.



2.2.2    VirtualBox disk image

If you do not have convenient access to a Linux machine, you can download
a virtual machine with M5 preconfigured. You can run the virtual machine
with VirtualBox, which can be downloaded from http://www.virtualbox.org.
The virtual machine is available as a zip archive from the PfJudgeβ web-
site. After unpacking the archive, you can import the virtual machine into
VirtualBox by selecting “Import Appliance” in the file menu and opening
“Prefetcher framework.ovf”.


2.3     Build

M5 uses the scons build system: scons -j2 ./build/ALPHA_SE/m5.opt
builds the optimized version of the M5 binaries.
-j2 specifies that the build process should build two targets in parallel. This
is a useful option to cut down on compile time if your machine has several
processors or cores.
The included build script compile.sh encapsulates the necessary build com-
mands and options.


2.4     Run

Before running M5, it is necessary to specify the architecture and parameters
for the simulated system. This is a nontrivial task in itself. Fortunately
there is an easy way: use the included example python script for running
M5 in syscall emulation mode, m5/config/example/se.py. When using
a prefetcher with M5, this script needs some extra options, described in
Table 2.1.
For an overview of all possible options to se.py, do
        ./build/ALPHA_SE/m5.opt common/example/se.py --help
When combining all these options, the command line will look something
like this:
      ./build/ALPHA_SE/m5.opt common/example/se.py --detailed
--caches --l2cache --l2size=1MB --prefetcher=policy=proxy
--prefetcher=on_access=True
This command will run se.py with a default program, which prints out
“Hello, world!” and exits. To run something more complicated, use the



Option                              Description
 --detailed                          Detailed timing simulation
 --caches                            Use caches
 --l2cache                           Use level two cache
 --l2size=1MB                        Level two cache size
 --prefetcher=policy=proxy           Use the C-style prefetcher interface
 --prefetcher=on_access=True         Have the cache notify the prefetcher
                                     on all accesses, both hits and misses
 --cmd                               The program (an Alpha binary) to run

              Table 2.1: Basic se.py command line options.


--cmd option to specify another program. See subsection 2.4.2 about cross-
compiling binaries for the Alpha architecture. Another possibility is to run
a benchmark program, as described in the next section.


2.4.1     CPU2000 benchmark tests

The test_prefetcher.py script can be used to evaluate the performance of
your prefetcher against the SPEC CPU2000 benchmarks. It runs a selected
suite of CPU2000 tests with your prefetcher, and compares the results to
some reference prefetchers.
The per-test statistics that M5 generates are written to
output/<testname-prefetcher>/stats.txt. The statistics most relevant
for hardware prefetching are then filtered and aggregated to a stats.txt
file in the framework base directory.
See chapter 4 for an explanation of the reported statistics.
Since programs often do some initialization and setup on startup, a sample
from the start of a program run is unlikely to be representative for the whole
program. It is therefore desirable to begin the performance tests after the
program has been running for some time. To save simulation time, M5 can
resume a program state from a previously stored checkpoint. The prefetcher
framework comes with checkpoints for the CPU2000 benchmarks taken after
10^9 instructions.
It is often useful to run a specific test to reproduce a bug. To run the
CPU2000 tests outside of test_prefetcher.py, you will need to set the
M5_CPU2000 environment variable. If this is set incorrectly, M5 will give the
error message “Unable to find workload”. To export this as a shell variable,
do


export M5_CPU2000=lib/cpu2000
Near the top of test_prefetcher.py there is a commented-out call to
dry_run(). If this is uncommented, test_prefetcher.py will print the
command line it would use to run each test. This will typically look like
this:
      m5/build/ALPHA_SE/m5.opt --remote-gdb-port=0 -re
--outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
This uses some additional command line options; these are explained in
Table 2.2.

 Option                         Description
 --bench=ammp                   Run one of the SPEC CPU2000 benchmarks.
 --checkpoint-dir=lib/cp        The directory where program checkpoints are stored.
 --at-instruction               Restore at an instruction count.
 --checkpoint-restore=n         The instruction count to restore at.
 --standard-switch              Warm up caches with a simple CPU model,
                                then switch to an advanced model to gather statistics.
 --warmup-insts=n               Number of instructions to run warmup for.
 --max-inst=n                   Exit after running this number of instructions.

            Table 2.2: Advanced se.py command line options.


2.4.2     Running M5 with custom test programs

If you wish to run your self-written test programs with M5, it is necessary to
cross-compile them for the Alpha architecture. The easiest way to achieve
this is to download the precompiled compiler-binaries provided by crosstool
from the M5 website. Install the one that fits your host machine best (32
or 64 bit version). When cross-compiling your test program, you must use
the -static option to enforce static linkage.
To run the cross-compiled Alpha binary with M5, pass it to the script with
the --cmd option. Example:
      ./build/ALPHA_SE/m5.opt configs/example/se.py --detailed
--caches --l2cache --l2size=512kB --prefetcher=policy=proxy
--prefetcher=on_access=True --cmd /path/to/testprogram


2.5     Submitting the prefetcher for benchmarking

First of all, you need a user account on the PfJudgeβ web pages. The
teaching assistant in TDT4260 Computer Architecture will create one for
you. You must also be assigned to a group to submit prefetcher code or
view earlier submissions.
Sign in with your username and password, then click “Submit prefetcher”
in the menu. Select your prefetcher file, and optionally give the submission
a name. This is the name that will be shown in the highscore list, so choose
with care. If no name is given, it defaults to the name of the uploaded file.
If you check “Email on complete”, you will receive an email when the results
are ready. This could take some time, depending on the cluster’s current
workload.
When you click “Submit”, a job will be sent to the Kongull cluster, which
then compiles your prefetcher and runs it with a subset of the CPU2000
tests. You are then shown the “View submissions” page, with a list of all
your submissions, the most recent at the top.
When the prefetcher is uploaded, the status is “Uploaded”. As soon as it is
sent to the cluster, it changes to “Compiling”. If it compiles successfully, the
status will be “Running”. If your prefetcher does not compile, status will
be “Compile error”. Check “Compilation output” found under the detailed
view.
When the results are ready, status will be “Completed”, and a score will be
given. The highest scoring prefetcher for each group is listed on the highscore
list, found under “Top prefetchers” in the menu. Click on the prefetcher
name to go to a more detailed view, with per-test output and statistics.
If the prefetcher crashes on some or all tests, status will be “Runtime error”.
To locate the failed tests, check the detailed view. You can take a look at
the output from the failed tests by clicking on the “output” link found after
each test statistic.
To allow easier exploration of different prefetcher configurations, it is possi-
ble to submit several prefetchers at once, bundled into a zipped file. Each
.cc file in the archive is submitted independently for testing on the cluster.
The submission is named after the compressed source file, possibly prefixed
with the name specified in the submission form.
There is a limit of 50 prefetchers per archive.




Chapter 3

The prefetcher interface

3.1    Memory model
The simulated architecture is loosely based on the DEC Alpha Tsunami
system, specifically the Alpha 21264 microprocessor. This is a superscalar,
out-of-order (OoO) CPU which can reorder a large number of instructions,
and do speculative execution.
The L1 cache is split into a 32kB instruction cache and a 64kB data
cache. Each cache block is 64B. The L2 cache size is 1MB, also with a cache
block size of 64B. The L2 prefetcher is notified on every access to the L2
cache, both hits and misses. There is no prefetching for the L1 cache.
The memory bus runs at 400MHz, is 64 bits wide, and has a latency of 30ns.


3.2    Interface specification
The interface the prefetcher will use is defined in a header file located at
prefetcher/interface.hh. To use the prefetcher interface, you should
include interface.hh by putting the line #include "interface.hh" at
the top of your source file.

 #define                  Value      Description
 BLOCK_SIZE                  64      Size of cache blocks (cache lines) in bytes
 MAX_QUEUE_SIZE             100      Maximum number of pending prefetch requests
 MAX_PHYS_MEM_SIZE       2^28 − 1    The largest possible physical memory address

                       Table 3.1: Interface #defines.

NOTE: All interface functions that take an address as a parameter block-
align the address before issuing requests to the cache.

Function                                     Description
void prefetch_init(void)                     Called before any memory access to let the
                                             prefetcher initialize its data structures
void prefetch_access(AccessStat stat)        Notifies the prefetcher about a cache access
void prefetch_complete(Addr addr)            Notifies the prefetcher about a prefetch load
                                             that has just completed

            Table 3.2: Functions called by the simulator.




Function                                Description
void issue_prefetch(Addr addr)          Called by the prefetcher to initiate a prefetch
int get_prefetch_bit(Addr addr)         Is the prefetch bit set for addr?
int set_prefetch_bit(Addr addr)         Set the prefetch bit for addr
int clear_prefetch_bit(Addr addr)       Clear the prefetch bit for addr
int in_cache(Addr addr)                 Is addr currently in the L2 cache?
int in_mshr_queue(Addr addr)            Is there a prefetch request for addr in
                                        the MSHR (miss status holding register) queue?
int current_queue_size(void)            Returns the number of queued prefetch requests
void DPRINTF(trace, format, ...)        Macro to print debug information.
                                        trace is a trace flag (HWPrefetch),
                                        and format is a printf format string.

    Table 3.3: Functions callable from the user-defined prefetcher.




AccessStat member    Description
Addr pc              The address of the instruction that caused the access
                     (Program Counter)
Addr mem_addr        The memory address that was requested
Tick time            The simulator time cycle when the request was sent
int miss             Whether this demand access was a cache hit or miss

                  Table 3.4: AccessStat members.



The prefetcher must implement the three functions prefetch_init,
prefetch_access and prefetch_complete. The implementation may be
empty.
The function prefetch_init(void) is called at the start of the simulation
to allow the prefetcher to initialize any data structures it will need.
When the L2 cache is accessed by the CPU (through the L1 cache), the func-
tion void prefetch_access(AccessStat stat) is called with an argument
(AccessStat stat) that gives various information about the access.
When the prefetcher decides to issue a prefetch request, it should call
issue_prefetch(Addr addr), which queues up a prefetch request for the
block containing addr.
When a cache block that was requested by issue_prefetch arrives from
memory, prefetch_complete is called with the address of the completed
request as parameter.
Prefetches issued by issue_prefetch(Addr addr) go into a prefetch request
queue. The cache will issue requests from the queue when it is not fetching
data for the CPU. This queue has a fixed size (available as MAX_QUEUE_SIZE),
and when it gets full, the oldest entry is evicted. If you want to check the
current size of this queue, use the function current_queue_size(void).
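To illustrate (a sketch only, not part of the framework), these functions can
be combined to avoid queueing redundant requests; the helper name
maybe_issue_prefetch is made up for this example:

#include "interface.hh"

/*
 * Hypothetical helper: queue a prefetch for pf_addr only if the request
 * queue has room and the block is neither cached nor already in flight.
 */
static void maybe_issue_prefetch(Addr pf_addr)
{
    if (current_queue_size() < MAX_QUEUE_SIZE
        && !in_cache(pf_addr)
        && !in_mshr_queue(pf_addr)) {
        issue_prefetch(pf_addr);
    }
}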


3.3    Using the interface

Start by studying interface.hh. This is the only M5-specific header file
you need to include in your source file. You might want to include standard
header files for things like printing debug information and memory alloca-
tion. Have a look at the supplied example prefetcher (a very simple
sequential prefetcher) to see what it does.
If your prefetcher needs to initialize something, prefetch_init is the place
to do so. If not, just leave the implementation empty.
You will need to implement the prefetch_access function, which the cache
calls when accessed by the CPU. This function takes an argument,
AccessStat stat, which supplies information from the cache: the address
of the executing instruction that accessed the cache, what memory address
was accessed, the cycle tick number, and whether the access was a cache
miss. The block size is available as BLOCK_SIZE. Note that you probably
will not need all of this information for a specific prefetching algorithm.
If your algorithm decides to issue a prefetch request, it must call the
issue_prefetch function with the address to prefetch from as argument.
The cache block containing this address is then added to the prefetch request
queue. This queue has a fixed limit of MAX_QUEUE_SIZE pending prefetch re-
quests. Unless your prefetcher is using a high degree of prefetching, the
number of outstanding prefetches will stay well below this limit.
Every time the cache has loaded a block requested by the prefetcher,
prefetch_complete is called with the address of the loaded block.
Other functionality available through the interface are the functions for get-
ting, setting and clearing the prefetch bit. Each cache block has one such
tag bit. You are free to use this bit as you see fit in your algorithms. Note
that this bit is not automatically set when a block has been prefetched; it
has to be set manually by calling set_prefetch_bit. set_prefetch_bit on
an address that is not in cache has no effect, and get_prefetch_bit on an
address that is not in cache will always return false.
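As an illustration of the tag bit (a sketch under these assumptions, not part
of the official framework), a classic use is tagged sequential prefetching:
mark blocks when they arrive through a prefetch, and trigger the next prefetch
on the first demand hit to a marked block. It only uses the interface
functions described above:

#include "interface.hh"

void prefetch_init(void)
{
    /* Nothing to set up for this sketch. */
}

void prefetch_access(AccessStat stat)
{
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /* Trigger on a demand miss, or on the first hit to a block that was
     * itself brought in by a prefetch (its tag bit is still set). */
    if (stat.miss || get_prefetch_bit(stat.mem_addr)) {
        if (!stat.miss) {
            clear_prefetch_bit(stat.mem_addr);
        }
        if (!in_cache(pf_addr)) {
            issue_prefetch(pf_addr);
        }
    }
}

void prefetch_complete(Addr addr)
{
    /* The block is in cache at this point, so tagging it here takes effect. */
    set_prefetch_bit(addr);
}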
When you are ready to write code for your prefetching algorithm of choice,
put it in prefetcher/prefetcher.cc. When you have several prefetchers,
you may want to make prefetcher.cc a symlink.
The prefetcher is statically compiled into M5. After prefetcher.cc has
been changed, recompile with ./compile.sh. No options needed.




3.3.1   Example prefetcher

/*
 * A sample prefetcher which does sequential one-block lookahead.
 * This means that the prefetcher fetches the next block _after_ the one that
 * was just accessed. It also ignores requests to blocks already in the cache.
 */

#include "interface.hh"


void prefetch_init(void)
{
    /* Called before any calls to prefetch_access. */
    /* This is the place to initialize data structures. */

    DPRINTF(HWPrefetch, "Initialized sequential-on-access prefetcher\n");
}

void prefetch_access(AccessStat stat)
{
    /* pf_addr is now an address within the _next_ cache block */
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /*
     * Issue a prefetch request if a demand miss occurred,
     * and the block is not already in cache.
     */
    if (stat.miss && !in_cache(pf_addr)) {
        issue_prefetch(pf_addr);
    }
}

void prefetch_complete(Addr addr) {
    /*
     * Called when a block requested by the prefetcher has been loaded.
     */
}




Chapter 4

Statistics

This chapter gives an overview of the statistics by which your prefetcher is
measured and ranked.

IPC instructions per cycle. Since we are using a superscalar architecture,
    IPC rates > 1 are possible.

Speedup Speedup is a commonly used proxy for overall performance when
    running benchmark test suites.

        speedup = execution time (no prefetcher) / execution time (with prefetcher)
                = IPC (with prefetcher) / IPC (no prefetcher)
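    For example (made-up numbers): if a prefetcher raises the IPC of a
    benchmark from 0.80 to 0.92, the speedup is 0.92/0.80 = 1.15.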

Good prefetch The prefetched block is referenced by the application be-
    fore it is replaced.

Bad prefetch The prefetched block is replaced without being referenced.

Accuracy Accuracy measures the fraction of the issued prefetches that were
    useful.

        acc = good prefetches / total prefetches
Coverage How many of the potential candidates for prefetches were actu-
    ally identified by the prefetcher?
        cov = good prefetches / cache misses without prefetching

Identified Number of prefetches generated and queued by the prefetcher.




Issued Number of prefetches issued by the cache controller. This can
     be significantly less than the number of identified prefetches, due to
     duplicate prefetches already found in the prefetch queue, duplicate
     prefetches found in the MSHR queue, and prefetches dropped due to
     a full prefetch queue.

Misses Total number of L2 cache misses.

Degree of prefetching Number of blocks fetched from memory in a single
    prefetch request.

Harmonic mean A kind of average used to aggregate each benchmark
    speedup score into a final average speedup.

        Havg = n / (1/x1 + 1/x2 + ... + 1/xn) = n / Σi (1/xi)
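    For example (made-up numbers): speedups of 1.0 and 2.0 on two benchmarks
    give Havg = 2/(1/1.0 + 1/2.0) ≈ 1.33, noticeably lower than the arithmetic
    mean of 1.5.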




Chapter 5

Debugging the prefetcher

5.1    m5.debug and trace flags

When debugging M5 it is best to use binaries built with debugging support
(m5.debug), instead of the standard build (m5.opt). So let us start by
recompiling M5 to be better suited to debugging:
      scons -j2 ./build/ALPHA_SE/m5.debug.
To see in detail what's going on inside M5, one can enable trace
flags, which selectively enable output from specific parts of M5. The most
useful flag when debugging a prefetcher is HWPrefetch. Pass the option
--trace-flags=HWPrefetch to M5:
      ./build/ALPHA_SE/m5.debug --trace-flags=HWPrefetch [...]
Warning: this can produce a lot of output! It might be better to redirect
stdout to file when running with --trace-flags enabled.


5.2    GDB

The GNU Project Debugger gdb can be used to inspect the state of the
simulator while running, and to investigate the cause of a crash. Pass GDB
the executable you want to debug when starting it.
      gdb --args m5/build/ALPHA_SE/m5.debug --remote-gdb-port=0
-re --outdir=output/ammp-user m5/configs/example/se.py
--checkpoint-dir=lib/cp --checkpoint-restore=1000000000
--at-instruction --caches --l2cache --standard-switch
--warmup-insts=10000000 --max-inst=10000000 --l2size=1MB
--bench=ammp --prefetcher=on_access=true:policy=proxy
You can then use the run command to start the executable.


Some useful GDB commands:
 run <args>        Restart the executable with the given command line arguments.
 run               Restart the executable with the same arguments as last time.
 where             Show stack trace.
 up                Move up stack trace.
 down              Move down stack frame.
 print <expr>      Print the value of an expression.
 help              Get help for commands.
 quit              Exit GDB.
GDB has many other useful features; for more information you can consult
the GDB User Manual at http://sourceware.org/gdb/current/onlinedocs/
gdb/.


5.3      Valgrind

Valgrind is a very useful tool for memory debugging and memory leak detec-
tion. If your prefetcher causes M5 to crash or behave strangely, it is useful
to run it under Valgrind and see if it reports any potential problems.
By default, M5 uses a custom memory allocator instead of malloc. This will
not work with Valgrind, since it replaces malloc with its own custom mem-
ory allocator. Fortunately, M5 can be recompiled with NO_FAST_ALLOC=True
to use normal malloc:
        scons NO_FAST_ALLOC=True ./m5/build/ALPHA_SE/m5.debug
To avoid spurious warnings by Valgrind, it can be fed a file with warning
suppressions. To run M5 under Valgrind, use
      valgrind --suppressions=lib/valgrind.suppressions
./m5/build/ALPHA_SE/m5.debug [...]
Note that everything runs much slower under Valgrind.






Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)

Course responsible: Professor Lasse Natvig
Quality assurance of the exam: PhD Jon Olav Hauglid
Contact person during exam: Magnus Jahre

Deadline for examination results: 23rd of June 2009.



                        EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
                        Tuesday 2nd of June 2009
                        Time: 0900 - 1300

Supporting materials: No written or handwritten examination support materials are permitted. A
specified, simple calculator is permitted.

Answering in short sentences makes it easier to cover all exercises within the duration of the exam. The
numbers in parentheses indicate the maximum score for each exercise. We recommend that you start
by reading through all the sub-questions before answering each exercise.

The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.

Exercise 1) Instruction level parallelism (Max 10 points)
a) (Max 5 points) What is the difference between (true) data dependencies and name
   dependencies? Which of the two presents the most serious problem? Explain why such
   dependencies will not always result in a data hazard.

    Solution sketch:
    True data dependency: One instruction reads what an earlier instruction has written (data flows
    between them) (RAW).
    Name dependency: Two instructions use the same register or memory location, but there is no
    flow of data between them. One instruction writes what an earlier instruction has read (WAR) or
    written (WAW) (no data flow).
    True data dependencies are the most serious problem, as name dependencies can be removed by
    register renaming. Also, many pipelines are designed so that name dependencies will not cause a
    hazard.
    A dependency between two instructions only results in a data hazard if the instructions are close
    enough together in the pipeline that overlapping their execution would change the order of
    accesses; whether a dependency becomes a hazard is therefore a property of the pipeline.
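
    As a small illustration (not part of the original solution sketch), the same kinds of
    dependences written with C++ variables standing in for registers:

        int b = 1, c = 2, e = 3, f = 4;
        int a, d;

        void dependence_example()
        {
            a = b + c;   // I1: writes a
            d = a + e;   // I2: reads a              -> true (RAW) dependence on I1
            e = f + 1;   // I3: writes e, read by I2 -> anti (WAR) dependence with I2
            a = f + 2;   // I4: writes a again       -> output (WAW) dependence with I1
        }

    Whether any of these actually cause a hazard depends on how closely the corresponding
    instructions overlap in the pipeline.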

b) (Max 5 points) Explain why loop unrolling can improve performance. Are there any potential
   downsides to using loop unrolling?

    Solution sketch:
    Loop unrolling can improve performance by reducing the loop overhead (e.g. loop overhead
    instructions executed once every 4th element rather than for each element). It also makes it
    possible for scheduling techniques to further improve the instruction order, since instructions
    from different elements (iterations) can now be interchanged. Downsides include increased code
    size, which may lead to more instruction cache misses, and an increased number of registers used.
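
    A brief sketch of the transformation in C++, mirroring the lecture example x[i] = x[i] + s
    (the function names are illustrative):

        // Original: loop overhead (decrement, test, branch) is paid for every element.
        void add_s(double *x, double s, int n)
        {
            for (int i = n; i > 0; i = i - 1)   // indexes x[n] down to x[1], as in the lectures
                x[i] = x[i] + s;
        }

        // Unrolled by 4: overhead paid once per 4 elements, and the four independent
        // statements give the scheduler more freedom. Assumes n is divisible by 4;
        // otherwise a clean-up loop is needed.
        void add_s_unrolled(double *x, double s, int n)
        {
            for (int i = n; i > 0; i = i - 4) {
                x[i]     = x[i]     + s;
                x[i - 1] = x[i - 1] + s;
                x[i - 2] = x[i - 2] + s;
                x[i - 3] = x[i - 3] + s;
            }
        }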



Exercise 2) Multithreading (Max 15 points)
a) (Max 5 points) What are the differences between fine-grained and coarse-grained
   multithreading?

    Solution sketch:
    Fine-grained: Switch between threads after each instruction. Coarse-grained: Switch on costly
    stalls (cache miss).

b) (Max 5 points) Can techniques for instruction level parallelism (ILP) and thread level parallelism
   (TLP) be used simultaneously? Why/why not?

    Solution sketch:
    ILP and TLP can be used simultaneously. TLP looks at parallelism between different threads,
    while ILP looks at parallelism inside a single instruction stream/thread.

c) (Max 5 points) Assume that you are asked to redesign a processor from single threaded to
   simultaneous multithreading (SMT). How would that change the requirements for the caches?
   (I.e., what would you look at to ensure that the caches would not degrade performance when
   moving to SMT)

    Solution sketch:
    Several threads executing at once will lead to increased cache traffic and more cache conflicts.
    Techniques that could help: Increased cache size, more cache ports/banks, higher associativity,
    non-blocking caches.

Exercise 3) Multiprocessors (Max 15 points)
a) (Max 5 points) Give a short example illustrating the cache coherence problem for
   multiprocessors.

    Solution sketch:
    See Figure 4.3 on page 206 of the textbook. (A reads X, B reads X, A stores X; B now has an
    inconsistent, stale value for X.)

b) (Max 5 points) Why does bus snooping scale badly with number of processors? Discuss how
   cache block size could influence the choice between write invalidate and write update.

    Solution sketch:
    Bus snooping relies on a common bus where information is broadcast. As the number of devices
    increases, this shared medium becomes a bottleneck.
    Invalidates are done at cache block level, while updates are done on individual words. False
    sharing coherence misses only appear when using write invalidate with block sizes larger than
    one word. So as the cache block size increases, the number of false sharing coherence misses
    increases, making write update increasingly appealing.
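
    As an illustration of false sharing (not part of the exam text), two threads that update
    different words which happen to lie in the same cache block; under write invalidate the block
    keeps bouncing between the two caches:

        #include <thread>

        struct Counters {
            long a;   // updated only by thread 1
            long b;   // updated only by thread 2, but shares a cache block with a
        };

        Counters counters;

        void worker_a() { for (int i = 0; i < 1000000; i++) counters.a++; }
        void worker_b() { for (int i = 0; i < 1000000; i++) counters.b++; }

        int main()
        {
            std::thread t1(worker_a), t2(worker_b);
            t1.join();
            t2.join();
            return 0;
        }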

c) (Max 5 points) What makes the architecture of UltraSPARC T1 (“Niagara”) different from most
   other processor architectures?

   Solution sketch:
   High focus on TLP, low focus on ILP. Poor single thread performance, but great multithread
   performance. Thread switch on any stall. Short pipeline, in-order, no branch prediction.


Exercise 4) Memory, vector processors and networks (Max 15 points)
a) (Max 5 points) Briefly describe 5 different optimizations of cache performance.

   Solution sketch:
    (1 point per optimization) 6 techniques are listed on page 291 in the textbook, 11 more in section
    5.2 on page 293.

b) (Max 5 points) What makes vector processors fast at executing a vector operation?

   Solution sketch:
    A vector operation can be executed with a single instruction, reducing code size and improving
    instruction cache utilization. Further, the single instruction has none of the loop overhead and
    control dependencies that a scalar processor would have. Hazard checks can also be done per
    vector rather than per element. A vector processor also contains deep pipelines especially
    designed for vector operations.

c) (Max 5 points) Discuss how the number of devices to be connected influences the choice of
   topology.

   Solution sketch:
   This is a classic example of performance vs. cost. Different topologies scale differently with
   respect to performance or cost as the number of devices grows. Crossbar scales performance
   well, but cost badly. Ring or bus scale performance badly, but cost well.


Exercise 5) Multicore architectures and programming (Max 25 points)
a) (Max 6 points) Explain briefly the research method called design space exploration (DSE). When
   doing DSE, explain how a cache sensitive application can be made processor bound, and how it
   can be made bandwidth bound.

   Solution sketch:
    (Lecture 10, slide 4) DSE is to try out different points in an n-dimensional space of possible
    designs, where n is the number of main design parameters, such as the number of cores, core type
    (in-order vs. out-of-order, etc.), cache size, etc. A cache sensitive application can be made
    processor bound by increasing the cache size, and it can be made bandwidth bound by decreasing it.

b) (Max 5 points) In connection with GPU-programming (shader programming), David Blythe uses
   the concept ”computational coherence”. Explain it briefly.

    Solution sketch: See lecture 10, slide 36 and, if needed, the paper.

c) (Max 8 points) Give an overview of the architecture of the Cell processor.

    Solution sketch:
    Not all details of the figure are expected, only the main elements.




    * One main processor (Power architecture, called PPE = Power Processing Element) – this acts as
    a host (master) processor. (Power architecture, 64-bit, in-order two-issue superscalar, SMT
    (simultaneous multithreading). Has a vector media extension (VMX) (Kahle figure 2).)
    * 8 identical SIMD processors (called SPE = Synergistic Processing Element); each of these
    consists of a processing element (SPU, Synergistic Processor Unit) and local storage (LS, 256 KB
    SRAM --- not a cache). On-chip memory controller + bus interface. (Can operate on integers in
    different formats: 8, 16 and 32 bit, and floating point numbers in 32 and 64 bit (64-bit floats in a
    later version).)
    * The interconnect is a ring bus (Element Interconnect Bus, EIB) that connects the PPE and the 8
    SPEs. Two unidirectional busses in each direction. Worst-case latency is half the ring distance;
    it can support up to three simultaneous transfers.
    * Highly programmable DMA controller.

d) (Max 6 points) The Cell design team made several design decisions that were motivated by a wish
   to make it easier to develop programs with predictable (more deterministic) processing time
   (performance). Describe two of these.

    Solution sketch:
    1) They discarded the out-of-order execution that is common in Power processors and developed
    a simpler in-order processor.
    2) The local store memory (LS) in the SPE processing elements does not use HW cache-coherency
    snooping protocols, which avoids the indeterminate nature of cache misses. The programmer
    handles memory in a more explicit way.
    3) The large number of registers (128) may also help make the processing more deterministic
    with respect to execution time.
    4) Extensive timers and counters (probably performance counters) that may be used by the
    SW/programmer to monitor/adjust/control performance.



                                   …---oooOOOooo---…




Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)

Contact person for questions regarding exam exercises:
Name: Lasse Natvig
Phone: 906 44 580

                         EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
                         Monday 26th of May 2008
                         Time: 0900 – 1300

                         Solution sketches in blue text
Supporting materials: No handwritten or printed materials are allowed; a simple, specified calculator is allowed.

Answering in short sentences makes it easier to cover all exercises within the duration of the exam. The numbers in
parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all
the sub-questions before answering each exercise.

The exam counts for 80% of the total evaluation in the course. Maximum score is therefore 80 points.

Exercise 1) Parallel Architecture (Max 25 points)

a) (Max 5 points) The feature size of integrated circuits is now often 65 nanometres or smaller, and it is still
decreasing. Explain briefly how the number of transistors on a chip and the wire delay change with shrinking
feature size.
The number of transistors can be 4 times larger when the feature size is halved. However, the wire delay does not
improve (it scales poorly). (Page 17 in the textbook gives more details, but here we only ask for the main trends.)

b) (Max 5 points) In a cache coherent multiprocessor, the concepts migration and replication of shared data items
are central. Explain both concepts briefly and also how they influence the latency of access to shared data and the
bandwidth demand on the shared memory.
Migration means that data are moved to a place closer to the requesting/accessing unit. Replication simply means
storing several copies. Having a local copy in general means faster access and a reduced bandwidth demand on the
shared memory, and it is harmless to have several copies of read-only data. (Textbook page 207)

c) (Max 5 points) Explain briefly how a write buffer can be used in cache systems to increase performance.
Explain also what “write merging” is in this context.
The main purpose of the write buffer is to temporarily store data that are evicted from the cache so that new data
can reuse the cache space as fast as possible, i.e. to avoid waiting for the latency of the memory one level further
away from the processor. If several writes go to the same cache block (address), these writes can be combined
(write merging), resulting in reduced traffic towards the next memory level. (Textbook page 300)
 ((Also slides 11-6-3)). // Grading: 3 points for write-buffer understanding and 2 for write merging.

d) (Max 5 points) Sketch a figure that shows how a hypercube with 16 nodes is built by combining two smaller
hypercubes. Compare the hypercube topology with the 2-dimensional mesh topology with respect to connectivity
and node cost (number of links/ports per node).
(Figure E-14 c) A mesh has a fixed degree of connectivity and in general becomes slower when the number of
nodes is increased, since the average number of hops needed to reach another node increases. For a
hypercube it is the other way around: the connectivity increases for larger networks, so the communication time
does not increase much, but the node cost also increases. When going to a larger network, increasing the
dimension, every node must be extended with a new port, and this is a drawback when it comes to building
computers using such networks.

e) (Max 5 points) When messages are sent between nodes in a multiprocessor two possible strategies are source
routing and distributed routing. Explain the difference between these two.
For source routing, the entire routing path is precomputed by the source (possibly by table lookup) and placed in
the packet header. This usually consists of the output port or ports supplied for each switch along the
predetermined path from the source to the destination, which can be stripped off by the routing control
mechanism at each switch. An additional bit field can be included in the header to signify whether adaptive
routing is allowed (i.e., that any one of the supplied output ports can be used).
For distributed routing, the routing information usually consists of the destination address. This is used by the
routing control mechanism in each switch along the path to determine the next output port, either by computing it
using a finite-state machine or by looking it up in a local routing table (i.e., forwarding table). (Textbook page E-
48)

Exercise 2) Parallel processing (Max 15 points)

a) (Max 5 points) Explain briefly the main difference between a VLIW processor and a dynamically scheduled
superscalar processor. Include the role of the compiler in your explanation.
Parallel execution of several operations is scheduled (analysed and planned) at compile time and assembled into
very long/broad instructions for VLIW. (Such work done at compile time is often called static.) In a dynamically
scheduled superscalar processor, dependency and resource analysis are done at run time (dynamically) to find
opportunities to execute operations in parallel. (Textbook page 114 -> and VLIW paper)

b) (Max 5 points) What function has the vector mask register in a vector processor?
If you want to update just some subset of the elements in a vector register, i.e. to implement
IF A[i] != 0 THEN A[i] = A[i] – B[i] for (i=0..n) in a simple way, this can be done by setting the vector mask
register to 1 only for the elements with A[i] != 0. In this way, the vector instruction A = A - B can be performed
without testing every element explicitly.

c) (Max 5 points) Explain briefly the principle of vector chaining in vector processors.
The execution of instructions using several/different functional and memory pipelines can be chained together
directly or by using vector registers. The chaining forms one longer pipeline. (This is the technique of forwarding
(used in processors, as in Tomasulo's algorithm) extended to vector registers.) (Textbook F-23)
((Slides lecture 9, slide 20)) – should be checked


Exercise 3) Multicore processors (Max 20 points)

a) (Max 5 points) In the paper Chip Multithreading: Opportunities and Challenges, by Spracklen & Abraham is
the concept Chip Multithreaded processor (CMT) described. The authors describe three generations of CMT
processors. Describe each of these briefly. Make simple drawings if you like.
1st generation: typically 2 cores per chip, every core is a traditional processor core, no shared resources except
the off-chip bandwidth. 2nd generation: shared L2 cache, but still traditional processor cores. 3rd generation: as the
2nd generation, but the cores are now custom-made for use in a CMP and might also use simultaneous
multithreading (SMT). (This description is somewhat biased and colored by the background of the authors (at Sun
Microsystems), who were involved in the design of Niagara 1 and 2 (T1).)
// Fig. 1 in the paper, and slides // Was a sub-exercise in May 2007.

b) (Max 5 points) Outline the main architecture in SUN’s T1 (Niagara) multicore processor. Describe the
placement of L1 and L2 cache, as well as how the L1 caches are kept coherent.
Fig 4.24 on page 250 in the textbook shows 8 cores, each with its own L1 cache (described in the text), 4
L2 cache banks, each having a channel to external memory, 1 FPU unit, and a crossbar as interconnect. Coherence
is maintained by a directory associated with each L2 cache bank, which knows which L1 caches have a copy of data
in the L2 cache.
// Textbook pages 249-250, also the lecture

c) (Max 6 points) In the paper Exploring the Design Space of Future CMP’s the authors perform a design space
exploration where several main architectural parameters are varied assuming a fixed total chip area of 400mm2.
Outline the approach by explaining the following figure:




Technology independent area models – found empirically, – core area and cache area measured in cache byte
equivalents (CBE). Study the relative costs in area versus the associated performance gains --- maximize
performance per unit area for future technology generations. With smaller feature sizes, the available area for
cache banks and processing cores increases. Table 3 displays die area in terms of the cache-byte-equivalents
(CBE), and PIN and POUT columns show how many of each type of processor with 32KB separate L1 instruction
and data caches could be implemented on the chip if no L2 cache area were required. (PIN is a simple in-order-
execution processor, POUT is a larger out-of-order exec processor). And, for reference, Lambda-squared where
lambda is equal to one half of the feature size. The primary goal of this paper is to determine the best balance
between per-processor cache area, area consumed by different processor organizations, and the number of cores
on a single die.
Solution note: New exercise / medium difficulty / slides 1-6 and 2-3

d) (Max 4 points) Explain the authors' argument in the paper Exploring the Design Space of Future CMP's
that we may in the future have chips with useless area that performs no other function than as a placeholder
for pin area.
As applications become bandwidth bound, and global wire delays increase, an interesting scenario may arise. It is
likely that monolithic caches cannot be grown past a certain point in 50 or 35nm technologies, since the wire
delays will make them too slow. It is also likely that, given a ceiling on cache size, off-chip bandwidth will limit
the number of cores. Thus, there may be useless area on the chip which cannot be used for cache or processing
logic, and which performs no function other than as a placeholder for pin area. That area may be useful to use for
compression engines, or intelligent controllers to manage the caches and memory channels.
 (From lecture 8, slide 6 on page 4)

Exercise 4) Research prototypes (Max 20 points)

a) (Max 5 points) Sketch a figure of the main system structure of the Manchester Dataflow Machine (MDM).
Include the following units: Matching unit, Token Queue, IO switch, Instruction store, Overflow unit and
Processing unit. Show also how these are connected.
See figure 5 in the paper, and the slides. The Overflow Unit is coupled to the Matching Unit, in parallel.




[Figure: sketch of the MDM ring structure – Input/Output connect to the IO Switch, which feeds the Token Queue;
tokens flow from the Token Queue to the Matching Unit (with the Overflow Unit attached), on to the Instruction
Store, then to the Processing Unit (P0...P19), and back to the Switch.]

b) (Max 5 points) What was the function of the overflow unit in MDM and explain very briefly how it was
implemented.
If a token does not find its matching operand in the Matching Unit (MU), and there is no space in the MU to
store it (while waiting for the other operand), the operand is stored in the overflow store. This is a separate and
much slower subsystem with much larger storage capacity. It is composed of a separate overflow bus, memory and
a microcoded processor, in other words a SW solution. See also figure 7 in the paper.

c) (Max 5 points) In the paper The Stanford FLASH Multiprocessor by Kuskin et al., the FLASH computer is
described. FLASH is an abbreviation for FLexible Architecture for SHared memory. What kind of flexibility was
the main goal for the project?
Flexibility in the programming paradigm: the choice between distributed shared memory (DSM), i.e. cache coherent
shared memory, and message passing, but also other alternative ways of communication between the nodes could
be explored.

d) (Max 5 points) Outline the main architecture of a node in a FLASH system. What was the most central design
choice to achieve this flexibility?
Fig. 2.1 explains much of this.




Interconnection of PEs in a mesh. The most central design choice was the MAGIC unit, a specially designed
node controller. All memory accesses go through it, and it can for example implement a cache-coherence
protocol. Every node is identical. The whole computer has one single address space, but the memory is physically
distributed.
                                               ---oooOOOooo---

tdt4260

  • 1.
    TDT 4260 –lecture 1 – 2011 Course goal • Course introduction • To get a general and deep understanding of the – course goals organization of modern computers and the – staff motivation for different computer architectures. Give – contents a base for understanding of research themes within – evaluation the field. – web, ITSL • High level • Textbook • Mostly HW and low-level SW – Computer Architecture, A Quantitative Approach, Fourth • HW/SW interplay Edition • Parallelism • by John Hennessy & David Patterson (HP90 - 96 – 03) - 06 • Principles, not details • Today: Introduction (Chapter 1) – Partly covered  inspire to learn more 1 Lasse Natvig 2 Lasse Natvig Contents TDT-4260 / DT8803 • Recommended background • Computer architecture fundamentals, trends, measuring – Course TDT4160 Computer Fundamentals, or performance, quantitative principles. Instruction set equivalent. architectures and the role of compilers. Instruction-level • http://www.idi.ntnu.no/emner/tdt4260/ parallelism, thread-level parallelism, VLIW. – And Its Learning • Memory hierarchy design, cache. Multiprocessors, shared • Friday 1215-1400 memory architectures, vector processors, NTNU/Notur – And/or some Thursdays 1015-1200 supercomputers, supercomputers distributed shared memory memory, – 12 lectures planned synchronization, multithreading. – some exceptions may occur • Interconnection networks, topologies • Evaluation • Multicores,homogeneous and heterogeneous, principles and – Obligatory exercise (counts 20%). Written product examples exam counts 80%. Final grade (A to F) given at end of semester. If there is a re-sit • Green computing (introduction) examination, the examination form may • Miniproject - prefetching change from written to oral. 3 Lasse Natvig 4 Lasse Natvig Lecture plan Subject to change EMECS, new European Master's Date  and lecturer  Topic Course in Embedded Computing Systems 1:  14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge 2:  21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2 3: 28 Jan (IB) ILP, Chapter 2; TLP, Chapter 3 4: 4 Feb (LN) Multiprocessors, Chapter 4  5: 11 Feb MG(?)) Prefetching + Energy Micro guest lecture 6: 18 Feb (LN) Multiprocessors continued  7: 25 Feb (IB) Piranha CMP + Interconnection networks  8: 4 Mar (IB) Memory and cache, cache coherence  (Chap. 5) 9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill Marty Amdahl  multicore ... Fedorova ... assymetric multicore ... 10: 18 Mar (IB) Memory consistency (4.6) + more on memory 11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers   (2) Green  computing 12: 1 Apr (IB/LN) Wrap up lecture, remaining stuff 13: 8 Apr  Slack – no lecture planned  5 Lasse Natvig 6 Lasse Natvig 1
  • 2.
    Preliminary reading list, subject to change!!! People involved • Chap.1: Fundamentals, sections 1.1 - 1.12 (pages 2-54) • Chap.2: ILP, sections 2.1 - 2.2 and parts of 2.3 (pages 66-81), section 2.7 (pages 114-118), parts of section 2.9 (pages 121-127, stop at speculation), Lasse Natvig section 2.11 - 2.12 (pages 138 - 141). (Sections 2.4 - 2.6 are covered by similar material in our computer design course) Course responsible, lecturer • Chap.3: Limits on ILP, section 3.1 and parts of section 3.2 (pages 154 -159), lasse@idi.ntnu.no section 3.5 - 3.8 (pages 172-185). • Chap.4: Multiprocessors and TLP, sections 4.1 - 4.5, 4.8 - 4.10 • Chap.5: Memory hierachy, section 5.1 - 5.3 (pages 288 - 315). Ian Bratt • App A: section A 1 (Expected to be repetition from other courses) A.1 Lecturer (Also t Til (Al at Tilera.com) ) • Appendix E, interconnection networks, pages E2-E14, E20-E25, E29-E37 ianbra@idi.ntnu.no and E45-E51. • App. F: Vector processors, sections F1 - F4 and F8 (pages F-2 - F-32, F- 44 - F-45) Alexandru Iordan • Data prefetch mechanisms (ACM Computing Survey) Teaching assistant (Also PhD-student) • Piranha, (To be announced) iordan@idi.ntnu.no • Multicores (New bookchapter) (To be announced) • (App. D; embedded systems?)  see our new course TDT4258 Mikrokontroller systemdesign http://www.idi.ntnu.no/people/ 7 Lasse Natvig 8 Lasse Natvig research.idi.ntnu.no/multicore Prefetching ---pfjudge Some few highlights: - Green computing, 2xPhD + master students - Multicore memory systems, 3 x PhD theses - Multicore programming and parallel computing - Cooperation with industry 9 Lasse Natvig 10 Lasse Natvig ”Computational computer architecture” Experiment Infrastructure • Computational science and engineering (CSE) • Stallo compute cluster – Computational X, X = comp.arch. – 60 Teraflop/s peak • Simulates new multicore architectures – 5632 processing cores – Last level, shared cache fairness (PhD-student M. Jahre) – 12 TB total memory – Bandwidth aware prefetching (PhD-student M. Grannæs) – 128 TB centralized disk • Complex cycle-accurate simulators – Weighs 16 tons – 80 000 lines C++ 20 000 lines python C++, – Open source, Linux-based • Multi-core research • Design space exploration (DSE) – About 60 CPU years allocated per – one dimension for each arch. parameter year to our projects – DSE sample point = specific multicore configuration – Typical research paper uses 5 to – performance of a selected set of configurations evaluated by 12 CPU years for simulation simulating the execution of a set of workloads (extensive, detailed design space exploration) 11 Lasse Natvig 12 Lasse Natvig 2
  • 3.
    The End ofMoore’s law Motivational background for single-core microprocessors • Why multicores – in all market segments from mobile phones to supercomputers • The ”end” of Moores law • The power wall • The memory wall • The bandwith problem • ILP limitations • The complexity wall But Moore’s law still holds for FPGA, memory and multicore processors 13 Lasse Natvig 14 Lasse Natvig Energy & Heat Problems The Memory Wall 1000 • Large power “Moore’s Law” consumption 100 CPU 60%/year – Costly Performance P-M gap grows 50% / year – Heat problems 10 – Restricted battery DRAM operation time 9%/year 9%/ 1 • Google ”Open House Trondheim 1980 1990 2000 2006” • The Processor Memory Gap – ”Performance/Watt is the only flat • Consequence: deeper memory hierachies trend line” – P – Registers – L1 cache – L2 cache – L3 cache – Memory - - - – Complicates understanding of performance • cache usage has an increasing influence on performance 15 Lasse Natvig 16 Lasse Natvig The I/O pin or Bandwidth problem The limitations of ILP (Instruction Level Parallelism) • # I/O signaling pins in Applications – limited by physical tecnology 30 3   – speeds have not 2.5  25 increased at the same Fraction of total cycles (%) rate as processor clock 20 2 rates  dup Speed 1.5 • Projections 15 – from ITRS (International 10 1  Technology Roadmap for Semiconductors) 5 0.5 0 0 [Huh, Burger and Keckler 2001] 0 1 2 3 4 5 6+ 0 5 10 15 Number of instructions issued Instructions issued per cycle 17 Lasse Natvig 18 Lasse Natvig 3
  • 4.
    Reduced Increase inClock Frequency Solution: Multicore architectures (also called Chip Multi-processors - CMP) • More power-efficient – Two cores with clock frequency f/2 can potentially achieve the same speed as one at frequency f with 50% reduction in total energy consumption [Olukotun & Hammond 2005] • Exploits Thread Level Parallelism (TLP) – in addition to ILP – requires multiprogramming or parallel programming • Opens new possibilities for architectural innovations 19 Lasse Natvig 20 Lasse Natvig Why heterogeneous multicores? CPU – GPU – convergence • Specialized HW is (Performance – Programmability) Cell BE processor faster than general HW Processors: Larrabee, Fermi, … – Math co-processor Languages: CUDA, OpenCL, … – GPU, DSP, etc… • Benefits of customization – Similar to ASIC vs. general purpose programmable HW • Amdahl’s law – Parallel speedup limited by serial fraction •  1 super-core 21 Lasse Natvig 22 Lasse Natvig Parallel processing – conflicting Multicore programming challenges goals • Instability, diversity, conflicting goals … what to do? Performance • What kind of parallel programming? The P6-model: Parallel Processing – Homogeneous vs. heterogeneous challenges: Performance, Portability, – DSL vs. general languages Programmability and Power efficiency – Memory locality Portability • What to teach? – Teaching should be founded on active research Programmability Powerefficiency • Two layers of programmers y p g – The Landscape of Parallel Computing Research: A View from • Examples; Berkeley [Asan+06] – Performance tuning may reduce portability • Krste Asanovic presentation at ACACES Summerschool 2007 • Eg. Datastructures adapted to cache block size – 1) Programmability layer (Productivity layer) (80 - 90%) • ”Joe the programmer” – New languages for higher programmability may reduce performance and increase power consumption – 2) Performance layer (Efficiency layer) (10 - 20%) • Both layers involved in HPC • Programmability an issue also at the performance-layer 23 Lasse Natvig 24 Lasse Natvig 4
  • 5.
    Parallel Computing Laboratory,U.C. Berkeley, (Slide adapted from Dave Patterson ) Classes of computers Easy to write correct programs that run efficiently on manycore • Servers – storage servers Personal Image Hearing, Parallel – compute servers (supercomputers) Speech Health Retrieval Music Browser – web servers Design Patterns/Motifs – high availability Composition & Coordination Language (C&CL) – scalability – throughput oriented (response time of less importance) ormance C&CL Compiler/Interpreter • Desktop (price 3000 NOK – 50 000 NOK) – the largest market g Diagnosing Power/Perfo Parallel P ll l Libraries Parallel Frameworks – price/performance focus – latency oriented (response time) • Embedded systems Efficiency Languages Sketching – the fastest growing market (”everywhere”) Autotuners – TDT 4258 Microcontroller system design Legacy Communication & Synch. – ATMEL, Nordic Semic., ARM, EM, ++ Schedulers Code Primitives Efficiency Language Compilers OS Libraries & Services Legacy OS Hypervisor Multicore/GPGPU RAMP Manycore 25 Lasse Natvig 26 Lasse Natvig 25 Borgar  FXI Technologies Falanx (Mali) ARM ”An idependent compute platform to gather the Norway fragmented mobile space and thus help accelerate the prolifitation of content and applications eco- systems (I.e build an ARM based SoC, put it ,p in a memory card, connect it to the web- and voila, you got iPhone for the masses ).” • http://www.fxitech.com/ – ”Headquartered in Trondheim • But also an office in Silicon Valley …” 27 Lasse Natvig 28 Lasse Natvig Trends Comp. Arch. is an Integrated Approach • For technology, costs, use • What really matters is the functioning of the • Help predicting the future complete system • Product development time – hardware, runtime system, compiler, operating system, and – 2-3 years application –  design for the next technology – In networking, this is called the “End to End argument” • Computer architecture is not just about – Why should an architecture live longer than a product? transistors(not at all), individual instructions, or particular implementations – E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions 29 Lasse Natvig 30 Lasse Natvig 5
  • 6.
    Computer Architecture is Designand Analysis TDT4260 Course Focus Architecture is an iterative process: Understanding the design techniques, machine • Searching the space of possible designs Design • At all levels of computer systems structures, technology factors, evaluation Analysis methods that will determine the form of computers in 21st Century Technology Parallelism Programming Creativity C ti it Languages Applications Interface Design Cost / Computer Architecture: (ISA) Performance • Organization Analysis • Hardware/Software Boundary Compilers Good Ideas Operating Measurement & Systems Evaluation History Mediocre Ideas Bad Ideas 31 Lasse Natvig 32 Lasse Natvig Moore’s Law: 2X transistors / Holistic approach “year” e.g., to programmability Parallel & concurrent programming Operating System & system software Multicore, interconnect, memory • “Cramming More Components onto Integrated Circuits” – Gordon Moore, Electronics, 1965 • # of transistors / cost-effective integrated circuit double every N months (12 ≤ N ≤ 24) 33 Lasse Natvig 34 Lasse Natvig Tracking Technology Latency Lags Bandwidth (last ~20 years) Performance Trends 10000 • 4 critical implementation technologies: CPU high, • Performance Milestones Processor – Disks, Memory low • Processor: ‘286, ‘386, ‘486, Pentium, – Memory, (“Memory Pentium Pro, Pentium 4 (21x,2250x) Wall”) 1000 – Network, • Ethernet: 10Mb, 100Mb, 1000Mb, Network – Processors 10000 Mb/s (16x,1000x) Relative Memory • Compare for Bandwidth vs. Latency BW 100 Disk • Memory Module: 16bit plain DRAM, Page Mode DRAM 32b 64b SDRAM, P M d DRAM, 32b, 64b, SDRAM improvements in performance over time Improve ment DDR SDRAM (4x,120x) • Bandwidth: number of events per unit time • Disk : 3600, 5400, 7200, 10000, 15000 – E.g., M bits/second over network, M bytes / second from 10 RPM (8x, 143x) disk (Processor latency = typical # of pipeline-stages * time • Latency: elapsed time for a single event (Latency improvement = Bandwidth improvement) pr. clock-cycle) – E.g., one-way network delay in microseconds, 1 average disk access time in milliseconds 1 10 100 Relative Latency Improvement 35 Lasse Natvig 36 Lasse Natvig 6
  • 7.
    COST and COTS Speedup Superlinear speedup ? • Cost • General definition: Performance (p processors) – to produce one unit Speedup (p processors) = Performance (1 processor) – include (development cost / # sold units) – benefit of large volume • COTS • For a fixed problem size (input data set), – commodity off the shelf dit ff th h lf performance = 1/time – Speedup Time (1 processor) fixed problem (p processors) = Time (p processors) • Note: use best sequential algorithm in the uni-processor solution, not the parallel algorithm with p = 1 37 Lasse Natvig 38 Lasse Natvig Amdahl’s Law (1967) (fixed problem size) Gustafson’s “law” (1987) (scaled problem size, fixed execution time) • “If a fraction s of a (uniprocessor) • Total execution time on computation is inherently parallel computer with n serial, the speedup is at processors is fixed most 1/s” – serial fraction s’ • Total work in computation – parallel fraction p’ – serial fraction s – s’ + p’ = 1 (100%) – parallel fraction p p • S (n) Time’(1)/Time’(n) S’(n) = Time (1)/Time (n) – s + p = 1 (100%) = (s’ + p’n)/(s’ + p’) • S(n) = Time(1) / Time(n) = s’ + p’n = s’ + (1-s’)n = (s + p) / [s +(p/n)] = n +(1-n)s’ • Reevaluating Amdahl's law, = 1 / [s + (1-s) / n] John L. Gustafson, CACM May 1988, pp 532-533. ”Not a new = n / [1 + (n - 1)s] law, but Amdahl’s law with changed assumptions” • ”pessimistic and famous” 39 Lasse Natvig 40 Lasse Natvig How the serial fraction limits speedup • Amdahl’s law • Work hard to reduce the serial part of the application – remember IO – think different (than traditionally  = serial fraction or sequentially) 41 Lasse Natvig 7
  • 8.
    1 TDT4260 Computer architecture Mini-project PhD candidate Alexandru Ciprian Iordan Institutt for datateknikk og informasjonsvitenskap
  • 9.
    2 What is it…? How much…? • The mini-project is the exercise part of TDT4260 course • This year the students will need to develop and evaluate a PREFETCHER • The mini-project accounts for 20 % of the final grade in TDT4260 • 80 % for report • 20 % for oral presentation
  • 10.
    3 What will you work with… • Modified version of M5 (for development and evaluation) • Computing time on Kongull cluster (for benchmarking) • More at: http://dm-ark.idi.ntnu.no/
  • 11.
    4 M5 • Initially developed by the University of Michigan • Enjoys a large community of users and developers • Flexible object-oriented architecture • Has support for 3 ISA: ALPHA, SPARC and MIPS
  • 12.
    5 Team work… • You need to work in groups of 2-4 students • Grade is based on written paper AND oral presentation (chose you best speaker)
  • 13.
    6 Time Schedule and Deadlines More on It’s learning
  • 14.
    7 Web page presentation
  • 15.
    Contents • Instruction level parallelism Chap 2 • Pipelining (repetition) App A TDT 4260 ▫ Basic 5-step pipeline • Dependencies and hazards Chap 2.1 App A.1, Chap 2 ▫ Data, name, control, structural Instruction Level Parallelism • Compiler techniques for ILP Chap 2.2 • (Static prediction Chap 2.3) ▫ Read this on your own • Project introduction Pipelining Instruction level parallelism (ILP) (1/3) • A program is sequence of instructions typically written to be executed one after the other • Poor usage of CPU resources! (Why?) • Better: Execute instructions in parallel ▫ 1: Pipeline Partial overlap of instruction execution ▫ 2: Multiple issue Total overlap of instruction execution • Today: Pipelining Pipelining (2/3) Pipelining (3/3) • Multiple different stages executed in parallel • Good Utilization: All stages are ALWAYS in use ▫ Laundry in 4 different stages ▫ Washing, drying, folding, ... ▫ Wash / Dry / Fold / Store ▫ Great usage of resources! • Assumptions: • Common technique, used everywhere ▫ Task can be split into stages ▫ Manufacturing, CPUs, etc ▫ Storage of temporary data • Ideal: time_stage = time_instruction / stages ▫ But stages are not perfectly balanced ▫ Stages synchronized ▫ But transfer between stages takes time ▫ Next operation known before last finished? ▫ But pipeline may have to be emptied ▫ ...
  • 16.
    Example: MIPS64 (2/2) Example:MIPS64 (1/2) Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 • RISC • Pipeline I ALU • Load/store ▫ IF: Instruction fetch n Ifetch Reg DMem Reg s • Few instruction formats ▫ ID: Instruction decode / t register fetch r. ALU • Fixed instruction length ▫ EX: Execute / effective Ifetch Reg DMem Reg • 64-bit address (EA) O r ALU ▫ DADD = 64 bits ADD ▫ MEM: Memory access d Ifetch Reg DMem Reg ▫ LD = 64 bits L(oad) ▫ WB: Write back (reg) e r • 32 registers (R0 = 0) ALU Ifetch Reg DMem Reg • EA = offset(Register) Big Picture: Big Picture (continued): • What are some real world examples of • Computer Architecture is the study of design pipelining? tradeoffs!!!! • Why do we pipeline? • There is no “philosophy of architecture” and no • Does pipelining increase or decrease instruction “perfect architecture”. This is engineering, not throughput? science. • Does pipelining increase or decrease instruction • What are the costs of pipelining? latency? • For what types of devices is pipelining not a good choice? Improve speedup? Dependencies and hazards • Why not perfect speedup? • Dependencies ▫ Sequential programs ▫ Parallel instructions can be executed in parallel ▫ Dependent instructions are not parallel ▫ One instruction dependent on another I1: DADD R1, R2, R3 ▫ Not enough CPU resources I2: DSUB R4, R1, R5 • What can be done? ▫ Property of the instructions ▫ Forwarding (HW) • Hazards ▫ Situation where a dependency causes an instruction to ▫ Scheduling (SW / HW) give a wrong result ▫ Prediction (SW / HW) ▫ Property of the pipeline • Both hardware (dynamic) and compiler (static) ▫ Not all dependencies give hazards can help Dependencies must be close enough in the instruction stream to cause a hazard
  • 17.
    Dependencies Hazards • (True) data dependencies • Data hazards ▫ One instruction reads what an earlier has written ▫ Overlap will give different result from sequential • Name dependencies ▫ RAW / WAW / WAR ▫ Two instructions use the same register / mem loc • Control hazards ▫ But no flow of data between them ▫ Branches ▫ Two types: Anti and output dependencies ▫ Ex: Started executing the wrong instruction • Control dependencies • Structural hazards ▫ Instructions dependent on the result of a branch ▫ Pipeline does not support this combination of instr. • Again: Independent of pipeline implementation ▫ Ex: Register with one port, two stages want to read Data dependency Hazard? Figure A.6, Page A-16 Data Hazards (1/3) • Read After Write (RAW) I InstrJ tries to read operand before InstrI writes ALU Reg add r1,r2,r3 Ifetch Reg DMem n it s ALU t sub r4,r1,r3 Ifetch Reg DMem Reg I: add r1,r2,r3 r. J: sub r4,r1,r3 ALU Ifetch Reg DMem Reg O and r6,r1,r7 r • Caused by a true data dependency d • This hazard results from an actual need for ALU Ifetch Reg DMem Reg e or r8,r1,r9 r communication. ALU Ifetch Reg DMem Reg xor r10,r1,r11 Data Hazards (2/3) Data Hazards (3/3) • Write After Write (WAW) • Write After Read (WAR) InstrJ writes operand before InstrI writes it. InstrJ writes operand before InstrI reads it I: sub r1,r4,r3 I: sub r4,r1,r3 J: add r1,r2,r3 J: add r1,r2,r3 • Caused by an output dependency • Caused by an anti dependency This results from reuse of the name “r1” • Can’t happen in MIPS 5 stage pipeline because: ▫ All instructions take 5 stages, and • Can’t happen in MIPS 5 stage pipeline because: ▫ Writes are always in stage 5 ▫ All instructions take 5 stages, and • WAR and WAW can occur in more ▫ Reads are always in stage 2, and ▫ Writes are always in stage 5 complicated pipes
  • 18.
    Forwarding Can all data hazards be solved via Figure A.7, Page A-18 forwarding??? IF ID/RF EX MEM WB IF ID/RF EX MEM WB I I ALU ALU Reg Reg add r1,r2,r3 Ifetch Reg DMem Ld r1,r2 Ifetch Reg DMem n n s s ALU ALU t sub r4,r1,r3 Ifetch Reg DMem Reg t add r4,r1,r3 Ifetch Reg DMem Reg r. r. ALU ALU Ifetch Reg DMem Reg Ifetch Reg DMem Reg O and r6,r1,r7 O and r6,r1,r7 r r d d ALU ALU Ifetch Reg DMem Reg Ifetch Reg DMem Reg e or r8,r1,r9 e or r8,r1,r9 r r ALU ALU Ifetch Reg DMem Reg Ifetch Reg DMem Reg xor r10,r1,r11 xor r10,r1,r11 Structural Hazards (Memory Port) Hazards, Bubbles (Similar to Figure A.5, Page A-15) Figure A.4, Page A-14 Time (clock cycles) Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 ALU I Load Ifetch Reg DMem Reg ALU I Load Ifetch Reg DMem Reg n n s ALU Reg s t Instr 1 Ifetch Reg DMem ALU Reg t Instr 1 Ifetch Reg DMem r. r. ALU Ifetch Reg DMem Reg Ld r1, r2 ALU Ifetch Reg DMem Reg Instr 2 O O r r Stall Bubble Bubble Bubble Bubble Bubble d ALU Ifetch Reg DMem Reg d Instr 3 e e ALU r Add r1, r1, r1 Ifetch Reg DMem Reg ALU r Instr 4 Ifetch Reg DMem Reg How do you “bubble” the pipe? How can we avoid this hazard? Control hazards (1/2) Control hazards (2/2) • Sequential execution is predictable, • What can be done? (conditional) branches are not ▫ Always stop (previous slide) • May have fetched instructions that should not be executed Also called freeze or flushing of the pipeline • Simple solution (figure): Stall the pipeline (bubble) ▫ Assume no branch (=assume sequential) ▫ Performance loss depends on number of branches in the program Must not change state before branch instr. is complete and pipeline implementation ▫ Branch penaltyC ▫ Assume branch Only smart if the target address is ready early ▫ Delayed branch Execute a different instruction while branch is evaluated Static techniques (fixed rule or compiler) Possibly wrong instruction Correct instruction
  • 19.
    Example •Assume branch conditionals are evaluated in the EX Dynamic scheduling stage, and determine the fetch address for the following cycle. • So far: Static scheduling • If we always stall, how many cycles are bubbled? ▫ Instructions executed in program order • Assume branch not taken, how many bubbles for an ▫ Any reordering is done by the compiler incorrect assumption? • Is stalling on every branch ok? • Dynamic scheduling • What optimizations could be done to improve stall ▫ CPU reorders to get a more optimal order penalty? Fewer hazards, fewer stalls, ... ▫ Must preserve order of operations where reordering could change the result ▫ Covered by TDT 4255 Hardware design Example Compiler techniques for ILP Source code: Notice: for (i = 1000; i >0; i=i-1) • Lots of dependencies • For a given pipeline and superscalarity • No dependencies between iterations x[i] = x[i] + s; ▫ How can these be best utilized? • High loop overhead ▫ As few stalls from hazards as possible Loop unrolling • Dynamic scheduling MIPS: ▫ Tomasulo’s algorithm etc. (TDT4255) Loop: L.D F0,0(R1) ; F0 = x[i] ▫ Makes the CPU much more complicated ADD.D F4,F0,F2 ; F2 = s • What can be done by the compiler? S.D F4,0(R1) ; Store x[i] + s ▫ Has ”ages” to spend, but less knowledge DADDUI R1,R1,#-8 ; x[i] is 8 bytes ▫ Static scheduling, but what else? BNE R1,R2,Loop ; R1 = R2? Loop: L.D F0,0(R1) Static scheduling Loop unrolling ADD.D F4,F0,F2 S.D F4,0(R1) Loop: L.D F0,0(R1) Loop: L.D F0,0(R1) Loop: L.D F0,0(R1) L.D F6,-8(R1) stopp DADDUI R1,R1,#-8 ADD.D F4,F0,F2 ADD.D F8,F6,F2 ADD.D F4,F0,F2 ADD.D F4,F0,F2 S.D F4,0(R1) S.D F8,-8(R1) stopp stopp DADDUI R1,R1,#-8 L.D F10,-16(R1) stopp stopp BNE R1,R2,Loop ADD.D F12,F10,F2 S.D F4,0(R1) S.D F4,8(R1) S.D F12,-16(R1) DADDUI R1,R1,#-8 BNE R1,R2,Loop L.D F14,-24(R1) stopp • Reduced loop overhead ADD.D F16,F14,F2 BNE R1,R2,Loop • Requires number of iterations S.D F16,-24(R1) divisible by n (here n=4) DADDUI R1,R1,#-32 • Register renaming BNE R1,R2,Loop • Offsets have changed Result: From 9 cycles per iteration to 7 • Stalls not shown (Delays from table in figure 2.2)
  • 20.
    Loop: L.D F0,0(R1) Loop: L.D F0,0(R1) ADD.D F4,F0,F2 L.D F6,-8(R1) S.D F4,0(R1) L.D F10,-16(R1) Loop unrolling: Summary L.D F6,-8(R1) L.D F14,-24(R1) ADD.D F8,F6,F2 ADD.D F4,F0,F2 • Original code 9 cycles per element S.D F8,-8(R1) ADD.D F8,F6,F2 • Scheduling 7 cycles per element L.D F10,-16(R1) ADD.D F12,F10,F2 • Loop unrolling 6,75 cycles per element ADD.D F12,F10,F2 ADD.D F16,F14,F2 ▫ Unrolled 4 iterations S.D F12,-16(R1) S.D F4,0(R1) L.D F14,-24(R1) S.D F8,-8(R1) • Combination 3,5 cycles per element ADD.D F16,F14,F2 DADDUI R1,R1,#-32 ▫ Avoids stalls entirely S.D F16,-24(R1) S.D F12,-16(R1) DADDUI R1,R1,#-32 S.D F16,-24(R1) BNE R1,R2,Loop Compiler reduced execution time by 61% BNE R1,R2,Loop Avoids stall after: L.D(1), ADD.D(2), DADDUI(1) Loop unrolling in practice • Do not usually know upper bound of loop • Suppose it is n, and we would like to unroll the loop to make k copies of the body • Instead of a single unrolled loop, we generate a pair of consecutive loops: ▫ 1st executes (n mod k) times and has a body that is the original loop ▫ 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times • For large values of n, most of the execution time will be spent in the unrolled loop
  • 21.
    Review • Name real-world examples of pipelining • Does pipelining lower instruction latency? • What is the advantage of pipelining? • What are some disadvantages of pipelining? TDT 4260 • What can a compiler do to avoid processor Chap 2, Chap 3 stalls? Instruction Level Parallelism (cont) • What are the three types of data dependences? • What are the three types of pipeline hazards? Getting CPI below 1 Contents • CPI ≥ 1 if issue only 1 instruction every clock cycle • Multiple-issue processors come in 3 flavors: • Very Large Instruction Word Chap 2.7 1. Statically-scheduled superscalar processors ▫ IA-64 and EPIC • In-order execution • Instruction fetching Chap 2.9 • Varying number of instructions issued (compiler) 2. Dynamically-scheduled superscalar processors • Limits to ILP Chap 3.1/2 • Out-of-order execution • Multi-threading Chap 3.5 • Varying number of instructions issued (CPU) 3. VLIW (very long instruction word) processors • In-order execution • Fixed number of instructions issued VLIW: Very Large Instruction Word (2/2) VLIW: Very Large Instruction Word (1/2) • Assume 2 load/store, 2 fp, 1 int/branch ▫ VLIW with 0-5 operations. • Each VLIW has explicit coding for multiple ▫ Why 0? operations ▫ Several instructions combined into packets • Important to avoid empty instruction slots ▫ Possibly with parallelism indicated ▫ Loop unrolling ▫ Local scheduling • Tradeoff instruction space for simple decoding ▫ Global scheduling ▫ Room for many operations Scheduling across branches ▫ Independent operations => execute in parallel ▫ E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 • Difficult to find all dependencies in advance branch ▫ Solution1: Block on memory accesses ▫ Solution2: CPU detects some dependencies
  • 22.
    Loop: L.D F0,0(R1) Recall: L.D F6,-8(R1) Loop Unrolling in VLIW Unrolled Loop L.D L.D F10,-16(R1) F14,-24(R1) Memory Memory FP FP Int. op/ Clock reference 1 reference 2 operation 1 op. 2 branch that minimizes ADD.D F4,F0,F2 L.D F0,0(R1) L.D F6,-8(R1) 1 ADD.D F8,F6,F2 L.D F10,-16(R1) L.D F14,-24(R1) 2 stalls for Scalar ADD.D F12,F10,F2 L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3 L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4 ADD.D F16,F14,F2 Source code: ADD.D F20,F18,F2 ADD.D F24,F22,F2 5 S.D F4,0(R1) S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6 for (i = 1000; i >0; i=i-1) S.D -16(R1),F12 S.D -24(R1),F16 7 S.D F8,-8(R1) x[i] = x[i] + s; S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#48 8 DADDUI R1,R1,#-32 S.D -0(R1),F28 BNEZ R1,LOOP 9 S.D F12,-16(R1) Register mapping: Unrolled 7 iterations to avoid delays S.D F16,-24(R1) 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X) s F2 BNE R1,R2,Loop Average: 2.5 ops per clock, 50% efficiency i R1 Note: Need more registers in VLIW (15 vs. 6 in SS) Problems with 1st Generation VLIW VLIW Tradeoffs • Increase in code size • Advantages ▫ Loop unrolling ▫ “Simpler” hardware because the HW does not have to ▫ Partially empty VLIW identify independent instructions. • Operated in lock-step; no hazard detection HW • Disadvantages ▫ A stall in any functional unit pipeline causes entire processor to ▫ Relies on smart compiler stall, since all functional units must be kept synchronized ▫ Code incompatibility between generations ▫ Compiler might predict function units, but caches hard to predict ▫ There are limits to what the compiler can do (can’t move ▫ Moder VLIWs are “interlocked” (identify dependences between loads above branches, can’t move loads above stores) bundles and stall). • Binary code compatibility • Common uses ▫ Strict VLIW => different numbers of functional units and unit ▫ Embedded market where hardware simplicity is latencies require different versions of the code important, applications exhibit plenty of ILP, and binary compatibility is a non-issue. IA-64 and EPIC Instruction bundle (VLIW) • 64 bit instruction set architecture ▫ Not a CPU, but an architecture ▫ Itanium and Itanium 2 are CPUs based on IA-64 • Made by Intel and Hewlett-Packard (itanium 2 and 3 designed in Colorado) • Uses EPIC: Explicitly Parallel Instruction Computing • Departure from the x86 architecture • Meant to achieve out-of-order performance with in- order HW + compiler-smarts ▫ Stop bits to help with code density ▫ Support for control speculation (moving loads above branches) ▫ Support for data speculation (moving loads above stores) Details in Appendix G.6
  • 23.
    Functional units andtemplate • Functional units: Code example (1/2) ▫ I (Integer), M (Integer + Memory), F (FP), B (Branch), L + X (64 bit operands + special inst.) • Template field: ▫ Maps instruction to functional unit ▫ Indicates stops: Limitations to ILP Control Speculation Code example 2/2 • Can the compiler schedule an independent load above a branch? Bne R1, R2, TARGET Ld R3, R4(0) • What are the problems? • EPIC provides speculative loads Ld.s R3, R4(0) Bne R1, R2, TARGET Check R4(0) Data Speculation EPIC Conclusions • Goal of EPIC was to maintain advantages of VLIW, but • Can the compiler schedule an independent load achieve performance of out-of-order. above a store? • Results: St R5, R6(0) ▫ Complicated bundling rules saves some space, but Ld R3, R4(0) makes the hardware more complicated • What are the problems? ▫ Add special hardware and instructions for scheduling • EPIC provides “advanced loads” and an ALAT loads above stores and branches (new complicated (Advanced Load Address Table) hardware) Ld.a R3, R4(0) creates entry in ALAT ▫ Add special hardware to remove branch penalties St R5, R6(0) looks up ALAT, if match, jump to (predication) fixup code ▫ End result is a machine as complicated as an out-of- order, but now also requiring a super-sophisticated compiler.
  • 24.
    Branch Target Buffer(BTB) Instruction fetching • Predicts next instruction • Want to issue >1 instruction every cycle address, sends it out before • This means fetching >1 instruction decoding ▫ E.g. 4-8 instructions fetched every cycle instruction • PC of branch sent • Several problems to BTB ▫ Bandwidth / Latency • When match is ▫ Determining which instructions found, Predicted PC is returned Jumps • If branch Branches predicted taken, • Integrated instruction fetch unit instruction fetch continues at Predicted PC Branch Target Buffer (BTB) Possible Optimizations???? Return Address Predictor • Small buffer of 70% go • Predicts next return Misprediction frequency instruction addresses acts 60% m88ksim address, sends it as a stack cc1 out before 50% • Caches most compress decoding instruction recent return 40% xlisp • PC of branch sent addresses ijpeg 30% to BTB • Call ⇒ Push a perl • When match is return address 20% vortex found, Predicted on stack PC is returned 10% • Return ⇒ Pop • If branch an address off 0% predicted taken, stack & predict instruction fetch 0 1 2 4 8 16 as new PC continues at Return address buffer entries Predicted PC Chapter 3 Integrated Instruction Fetch Units Limits to ILP • Recent designs have implemented the fetch • Advances in compiler technology + significantly stage as a separate, autonomous unit new and different hardware techniques may be ▫ Multiple-issue in one simple pipeline stage is too able to overcome limitations assumed in studies complex • However, unlikely such advances when coupled • An integrated fetch unit provides: with realistic hardware will overcome these ▫ Branch prediction limits in near future ▫ Instruction prefetch • How much ILP is available using existing ▫ Instruction memory access and buffering mechanisms with increasing HW budgets?
  • 25.
    Ideal HW Model Upper Limit to ILP: Ideal Machine (Figure 3.1) 1. Register renaming – infinite virtual registers all register WAW & WAR hazards are avoided Instructions Per Clock 160 FP: 75 - 150 150.1 2. Branch prediction – perfect; no mispredictions 140 3. Jump prediction – all jumps perfectly predicted Integer: 18 - 60 118.7 120 2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available 100 75.2 4. Memory-address alias analysis – addresses known & 80 62.6 a load can be moved before a store provided addresses 60 54.8 not equal 40 1&4 eliminates all but RAW 20 17.9 5. perfect caches; 1 cycle latency for all instructions; 0 unlimited instructions issued/clock cycle gcc espresso li fpppp doducd tomcatv Programs More Realistic HW: Window Impact Figure 3.2 FP: 9 - 150 Instruction window 160 150 • Ideal HW need to know entire code 140 119 • Obviously not practical Instructions Per Clock 120 Integer: 8 - 63 ▫ Register dependencies scales quadratically 100 IPC • Window: The set of instructions examined for 80 75 63 simultaneous execution 60 55 61 49 59 60 45 • How does the size of the window affect IPC? 40 36 41 35 34 ▫ Too small window => Can’t see whole loops 15 13 18 1512 9 14 16 15 14 20 10 8 10 8 11 9 ▫ Too large window => Hard to implement 0 gcc espresso li fpppp doduc tomcatv Infinite 2048 512 128 32 Multi-threaded execution Thread Level Parallelism (TLP) • ILP exploits implicit parallel operations within • Multi-threading: multiple threads share the a loop or straight-line code segment functional units of 1 processor via overlapping ▫ Must duplicate independent state of each thread e.g., a • TLP explicitly represented by the use of separate copy of register file, PC and page table multiple threads of execution that are ▫ Memory shared through virtual memory mechanisms inherently parallel ▫ HW for fast thread switch; much faster than full • Use multiple instruction streams to improve: process switch ≈ 100s to 1000s of clocks 1. Throughput of computers that run many programs • When switch? 2. Execution time of a single application implemented ▫ Alternate instruction per thread (fine grain) as a multi-threaded program (parallel program) ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
  • 26.
Fine-Grained Multithreading
• Switches between threads on each instruction
  ▫ Multiple threads interleaved
• Usually in round-robin fashion, skipping stalled threads
• CPU must be able to switch threads every clock
• Hides both short and long stalls
  ▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads
  ▫ A thread ready to execute without stalls will be delayed by instructions from other threads

Coarse-Grained Multithreading
• Switch threads only on costly stalls (e.g. L2 cache miss)
• Advantages
  ▫ No need for very fast thread-switching
  ▫ Doesn't slow down a thread, since it switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  ▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
  ▫ The new thread must fill the pipeline before instructions can complete
• ⇒ Better for reducing the penalty of high-cost stalls, where pipeline refill << stall time
• Used on Sun's Niagara

Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a program
• Can a high-ILP processor also exploit TLP?
  ▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• ⇒ Simultaneous Multi-threading (SMT)
  ▫ Intel: Hyper-Threading
[Figure: issue slots per cycle for one thread vs. two threads on an 8-unit machine; M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  ▫ Large set of virtual registers (virtual = not all visible at ISA level; register renaming)
  ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
  ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; threads 1-5 and idle slots]
  • 27.
Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
  ▫ How to reduce the impact on single-thread performance?
  ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
  ▫ Instruction issue – more candidate instructions need to be considered
  ▫ Instruction completion – choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
  • 28.
TDT 4260 – lecture 4 – 2011
• Contents
  – Computer architecture introduction
    • Trends
    • Moore's law
    • Amdahl's law
    • Gustafson's law
  – Why multiprocessor? Chap 4.1
    • Taxonomy
    • Memory architecture
    • Communication
  – Cache coherence Chap 4.2
    • The problem
    • Snooping protocols

Updated lecture plan pr. 4/2
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG) Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN) Multiprocessors continued
7: 24 Feb (IB) Memory and cache, cache coherence (Chap. 5)
8: 4 Mar (IB) Piranha CMP + Interconnection networks
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore, Fedorova asymmetric multicore
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) Wrap-up lecture, remaining stuff
13: 8 Apr Slack – no lecture planned

Trends
• For technology, costs, use
• Help predicting the future
• Product development time
  – 2-3 years ⇒ design for the next technology
  – Why should an architecture live longer than a product?

Comp. Arch. is an Integrated Approach
• What really matters is the functioning of the complete system
  – hardware, runtime system, compiler, operating system, and application
  – In networking, this is called the "End to End argument"
• Computer architecture is not just about transistors (not at all), individual instructions, or particular implementations
  – E.g., the original RISC projects replaced complex instructions with a compiler + simple instructions

Computer Architecture is Design and Analysis
• Architecture is an iterative process between design and analysis
  – Searching the huge space of possible designs
  – At all levels of computer systems
  – Creativity plus cost/performance analysis
  – Good ideas, mediocre ideas and bad ideas are sorted by measurement & evaluation

TDT4260 Course Focus
• Understanding the design techniques, machine structures, technology factors and evaluation methods that will determine the form of computers in the 21st century
• Technology, parallelism, programming languages, applications, interface design (ISA), compilers, operating systems, measurement & evaluation, history
• Computer architecture: organization, hardware/software boundary
  • 29.
Holistic approach
• NTNU principle: teaching based on research
  – Example: PhD project of Alexandru Iordan – energy-aware task pool implementation
  – E.g., programmability combined with performance; TBP (Wool, TBB)
• Layers: parallel & concurrent programming – operating system & system software – multicore, interconnect, memory
• Multicore memory systems (Dybdahl PhD, Grannæs PhD, Jahre PhD, M5-sim, pfJudge)

Moore's Law: 2X transistors / "year"
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• # of transistors on a cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)

Tracking Technology Performance Trends
• 4 critical implementation technologies:
  – Disks
  – Memory (CPU high, memory low: the "Memory Wall")
  – Network
  – Processors
• Compare improvements in bandwidth vs. latency over time
• Bandwidth: number of events per unit time
  – E.g., Mbits/second over a network, Mbytes/second from disk
• Latency: elapsed time for a single event
  – E.g., one-way network delay in microseconds, average disk access time in milliseconds

Latency Lags Bandwidth (last ~20 years)
• Performance milestones
  – Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x latency, 2250x bandwidth)
  – Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  – Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  – Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
• (Processor latency = typical # of pipeline stages × time per clock cycle)
[Figure: relative bandwidth improvement vs. relative latency improvement for processor, network, memory and disk; the line "latency improvement = bandwidth improvement" is shown for reference]

COST and COTS
• Cost
  – to produce one unit
  – include (development cost / # sold units)
  – benefit of large volume
• COTS
  – commodity off the shelf
  – much better performance/price per component
  – strong influence on the selection of components for building supercomputers for more than 20 years

Speedup
• General definition:
  Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time:
  Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors)
• Note: use the best sequential algorithm in the uniprocessor solution, not the parallel algorithm with p = 1
• Superlinear speedup?
  • 30.
Amdahl's Law (1967) (fixed problem size)
• "If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s"
• Total work in the computation
  – serial fraction s
  – parallel fraction p
  – s + p = 1 (100%)
• S(n) = Time(1) / Time(n)
       = (s + p) / (s + p/n)
       = 1 / (s + (1 - s)/n)
       = n / (1 + (n - 1)s)
• "pessimistic and famous"

Gustafson's "law" (1987) (scaled problem size, fixed execution time)
• Total execution time on a computer with n processors is fixed
  – serial fraction s'
  – parallel fraction p'
  – s' + p' = 1 (100%)
• S'(n) = Time'(1) / Time'(n)
        = (s' + p'n) / (s' + p')
        = s' + p'n = s' + (1 - s')n
        = n + (1 - n)s'
• "Reevaluating Amdahl's law", John L. Gustafson, CACM May 1988, pp 532-533. "Not a new law, but Amdahl's law with changed assumptions"

How the serial fraction limits speedup
[Figure: speedup vs. number of processors for different values of the serial fraction]
• Work hard to reduce the serial part of the application
  – remember I/O
  – think different (than traditionally or sequentially)

Single/ILP → Multi/TLP
• Uniprocessor trends
  – Getting too complex
  – Speed of light
  – Diminishing returns from ILP
  – Amdahl's law
• Multiprocessor
  – Focus in the textbook: 4-32 CPUs
  – Increased performance through parallelism
  – Multichip
  – Multicore ((single) chip multiprocessors – CMP)
  – Cost effective
• The right balance of ILP and TLP is unclear today
  – Desktop vs. server?

Other Factors → Multiprocessors
• Growth in data-intensive applications
  – Databases, file servers, multimedia, ...
• Growing interest in servers and server performance
• Increasing desktop performance is less important
  – Outside of graphics
• Improved understanding of how to use multiprocessors effectively
  – Especially in servers, where there is significant natural TLP
• Advantage of leveraging design investment by replication
  – Rather than unique design
• Power/cooling issues → multicore

Multiprocessor – Taxonomy
• Flynn's taxonomy (1966, 1972)
  – Taxonomy = classification
  – Widely used, but perhaps a bit coarse
• Single Instruction Single Data (SISD)
  – Common uniprocessor
• Single Instruction Multiple Data (SIMD)
  – "= Data Level Parallelism (DLP)"
• Multiple Instruction Single Data (MISD)
  – Not implemented?
  – Pipeline / stream processing / GPU?
• Multiple Instruction Multiple Data (MIMD)
  – Used today
  – "= Thread Level Parallelism (TLP)"
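As a quick illustration (not from the slides), the two speedup formulas can be evaluated side by side; the serial fraction and processor counts below are arbitrary example values.

#include <stdio.h>

/* Amdahl (fixed problem size): S(n) = 1 / (s + (1 - s)/n) */
double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

/* Gustafson (scaled problem size): S'(n) = n + (1 - n)*s' */
double gustafson_speedup(double s_prime, int n)
{
    return n + (1 - n) * s_prime;
}

int main(void)
{
    double s = 0.05;                 /* 5% serial code, illustrative value */
    int n[] = {4, 16, 64, 256};
    for (int i = 0; i < 4; i++)
        printf("n=%3d  Amdahl=%7.2f  Gustafson=%7.2f\n",
               n[i], amdahl_speedup(s, n[i]), gustafson_speedup(s, n[i]));
    return 0;
}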
  • 31.
Flynn's taxonomy (1/2)
Single/Multiple Instruction/Data Stream
[Figure: block diagrams of a SISD uniprocessor, a SIMD machine with distributed memory, and a MIMD machine with shared memory]

Flynn's taxonomy (2/2), MISD
[Figure: MISD as a software pipeline]

Advantages of MIMD
• Flexibility
  – High single-user performance, multiple programs, multiple threads
  – High multiple-user performance
  – Combination
• Built using commercial off-the-shelf (COTS) components
  – 2 x uniprocessor = multi-CPU
  – 2 x uniprocessor core on a single chip = multicore

MIMD: Memory architecture
• Centralized memory: processors (with caches) share one memory over an interconnection network
• Distributed memory: each node has its own memory, and the processor-memory nodes are connected by an interconnection network
[Figure: centralized memory vs. distributed memory organizations]
  • 32.
MP (MIMD), cluster of SMPs
• Combination of centralized and distributed memory: SMP nodes (processors with caches, memory, I/O and a node interconnection network) connected by a cluster interconnection network
• Conceptual model vs. implementation
• Like an early version of the Kongull cluster
[Figure: conceptual model and implementation of a cluster of SMPs]

Distributed memory
1. Shared address space
   • Logically shared, physically distributed
   • Distributed Shared Memory (DSM)
   • NUMA architecture
2. Separate address spaces
   • Every P-M module is a separate computer
   • Multicomputer
   • Clusters
   • Not a focus in this course

Communication models
• Shared memory
  – Centralized or Distributed Shared Memory
  – Communication using LOAD/STORE
  – Coordinated using traditional OS methods
    • Semaphores, monitors, etc.
    • Busy-wait more acceptable than for a uniprocessor
• Message passing
  – Using send (put) and receive (get)
    • Asynchronous / synchronous
  – Libraries, standards
    • ..., PVM, MPI, ...

Limits to parallelism
• We need separate processes and threads!
  – Can't split one thread among CPUs/cores
• Parallel algorithms needed
  – Separate field
  – Some problems are inherently serial
    • P-complete problems
    • Part of parallel complexity theory
  – See minicourse TDT6 - Heterogeneous and green computing
    • http://www.idi.ntnu.no/emner/tdt4260/tdt6
• Amdahl's law
  – The serial fraction of the code limits speedup
  – Example: speedup = 80 with 100 processors requires that at most 0.25% of the time is spent on serial code

SMP: Cache Coherence Problem
[Figure: P1, P2 and P3 with private caches and a shared memory holding u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u again, (5) P2 reads u]
• Processors see different values for u after event 3
• The old (stale) value is read in event 4 (hit)
• Event 5 (miss) reads
  – the correct value (if write-through caches)
  – the old value (if write-back caches)
• Unacceptable to programs, and frequent!

Enforcing coherence
• Separate caches make multiple copies frequent
  – Migration
    • Data moves from shared memory to the local cache
    • Speeds up access, reduces memory bandwidth requirements
  – Replication
    • Several local copies when an item is read by several processors
    • Speeds up access, reduces memory contention
• Need coherence protocols to track shared data
  – Directory based
    • Status kept in a shared location (Chap. 4.4)
  – (Bus) snooping
    • Each cache maintains local status
    • All caches monitor a broadcast medium
    • Write invalidate / write update
  • 33.
Snooping: Write invalidate
• Several reads or one write: no change
• Writes require exclusive access
• Writes to shared data: all other cache copies are invalidated
  – The invalidate command and address are broadcast
  – All caches listen (snoop) and invalidate their copy if necessary
• Read miss:
  – Write-through: memory is always up to date
  – Write-back: caches listen, and any exclusive copy is put on the bus

Snooping: Write update
• Also called write broadcast
• Must know which cache blocks are shared
• Usually write-through
  – Write to shared data: broadcast; all caches listen and update their copy (if any)
  – Read miss: main memory is up to date

Snooping: Invalidate vs. Update
• Repeated writes to the same address (no reads) require several updates, but only one invalidate
• Invalidates are done at cache-block level, while updates are done on individual words
• The delay from a word is written until it can be read is shorter for updates
• Invalidate is most common
  – Less bus traffic
  – Less memory traffic
  – Bus and memory bandwidth are the typical bottleneck

An Example Snoopy Protocol
• Invalidation protocol, write-back cache
• Each cache block is in one state
  – Shared: clean in all caches and up-to-date in memory; the block can be read
  – Exclusive: one cache has the only copy; it is writeable and dirty
  – Invalid: the block contains no data

Snooping: Invalidation protocol (1/6)-(2/6)
[Figure: one processor reads x; the read miss goes on the interconnection network, x is fetched from main memory, and the processor caches x in the shared state]
  • 34.
Snooping: Invalidation protocol (3/6)-(6/6)
[Figure: (3/6) another processor reads x and misses; (4/6) both caches now hold x in the shared state; (5/6) one processor writes x = 1 and an invalidate is broadcast on the interconnection network; (6/6) the other copies are invalidated and the writing processor holds x in the exclusive state]
  • 35.
    Prefetching Marius Grannæs Feb 11th, 2011 www.ntnu.no M. Grannæs, Prefetching
  • 36.
    2 About Me • PhD from NTNU in Computer Architecture in 2010 • “Reducing Memory Latency by Improving Resource Utilization” • Supervised by Lasse Natvig • Now working for Energy Micro • Working on energy profiling, caching and prefetching • Software development www.ntnu.no M. Grannæs, Prefetching
  • 37.
    3 About Energy Micro • Fabless semiconductor company • Founded in 2007 by ex-chipcon founders • 50 employees • Offices around the world • Designing the world most energy friendly microcontrollers • Today: EFM32 Gecko • Next friday: EFM32 Tiny Gecko (cache) • May(ish): EFM32 Giant Gecko (cache + prefetching) • Ambition: 1% marketshare... www.ntnu.no M. Grannæs, Prefetching
  • 38.
    3 About Energy Micro • Fabless semiconductor company • Founded in 2007 by ex-chipcon founders • 50 employees • Offices around the world • Designing the world most energy friendly microcontrollers • Today: EFM32 Gecko • Next friday: EFM32 Tiny Gecko (cache) • May(ish): EFM32 Giant Gecko (cache + prefetching) • Ambition: 1% marketshare... • of a $30 bn market. www.ntnu.no M. Grannæs, Prefetching
  • 39.
4 What is Prefetching?
Prefetching is a technique for predicting future memory accesses and fetching the data into the cache before it is referenced.
  • 40.
5 The Memory Wall
[Figure: CPU performance vs. memory performance, 1980-2010, log scale; the gap between the two grows steadily.]
W. Wulf and S. McKee, "Hitting the Memory Wall: Implications of the Obvious"
  • 41.
    6 A Useful Analogy • An Intel Core i7 can execute 147600 Million Instructions per second. • ⇒ A carpenter can hammer one nail per second. www.ntnu.no M. Grannæs, Prefetching
  • 42.
    6 A Useful Analogy • An Intel Core i7 can execute 147600 Million Instructions per second. • ⇒ A carpenter can hammer one nail per second. • DDR3-1600 RAM can perform 65 Million transfers per second. www.ntnu.no M. Grannæs, Prefetching
  • 43.
    6 A Useful Analogy • An Intel Core i7 can execute 147600 Million Instructions per second. • ⇒ A carpenter can hammer one nail per second. • DDR3-1600 RAM can perform 65 Million transfers per second. • ⇒ The carpenter must wait 38 minutes per nail. www.ntnu.no M. Grannæs, Prefetching
  • 44.
    7 Solution www.ntnu.no M. Grannæs, Prefetching
  • 45.
    7 Solution Solution outline: 1 You bring an entire box of nails. 2 Keep the box close to the carpenter www.ntnu.no M. Grannæs, Prefetching
  • 46.
    8 Analysis: Carpenting How long (on average) does it take to get one nail? www.ntnu.no M. Grannæs, Prefetching
  • 47.
    8 Analysis: Carpenting How long (on average) does it take to get one nail? Nail latency LNail = LBox + pBox is empty · (LShop + LTraffic ) LNail Time to get one nail. LBox Time to check and fetch one nail from the box. pBox is empty Probabilty that the box you have is empty. LShop Time to go to the shop (38 minutes). LTraffic Time lost due to traffic. www.ntnu.no M. Grannæs, Prefetching
  • 48.
    9 Solution: (For computers) • Faster, but smaller memory closer to the processor. • Temporal locality • If you needed X in the past, you are probably going to need X in the near future. • Spatial locality • If you need X , you probably need X + 1 www.ntnu.no M. Grannæs, Prefetching
  • 49.
    9 Solution: (For computers) • Faster, but smaller memory closer to the processor. • Temporal locality • If you needed X in the past, you are probably going to need X in the near future. • Spatial locality • If you need X , you probably need X + 1 ⇒ If you need X, put it in the cache, along with everything else close to it (cache line) www.ntnu.no M. Grannæs, Prefetching
  • 50.
10 Analysis: Caches

System latency
L_System = L_Cache + p_Miss · (L_Main Memory + L_Congestion)

L_System: total system latency
L_Cache: latency of the cache
p_Miss: probability of a cache miss
L_Main Memory: main memory latency
L_Congestion: latency due to main memory congestion
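A small numeric sketch of the model above; the cycle counts are made-up, illustrative values, not measurements from the lecture.

#include <stdio.h>

/* L_system = L_cache + p_miss * (L_main_memory + L_congestion) */
int main(void)
{
    double l_cache       = 3.0;    /* cycles to hit in the cache             */
    double l_main_memory = 200.0;  /* cycles for a main memory access        */
    double l_congestion  = 50.0;   /* extra cycles lost to memory contention */
    double miss_rates[]  = {0.01, 0.05, 0.10};

    for (int i = 0; i < 3; i++) {
        double l_system = l_cache + miss_rates[i] * (l_main_memory + l_congestion);
        printf("p_miss = %.2f  ->  L_system = %.1f cycles\n", miss_rates[i], l_system);
    }
    return 0;
}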
  • 51.
11 DRAM in perspective
• "Incredibly slow" DRAM has a response time of 15.37 ns.
• The speed of light is 3 · 10^8 m/s.
• The physical distance from the processor to the DRAM chips is typically 20 cm.

    (2 · 20 · 10^-2 m) / (3 · 10^8 m/s) ≈ 1.3 ns    (1)

• Just over one order of magnitude!
• Intel Core i7 – 147600 million instructions per second.
• Ultimate laptop – 5 · 10^50 operations per second per kg.
  Lloyd, Seth, "Ultimate physical limits to computation"
  • 54.
  • 55.
12 When does caching not work?
The four Cs:
• Cold/Compulsory: the data has not been referenced before.
• Capacity: the data has been referenced before, but has been thrown out because of the limited size of the cache.
• Conflict: the data has been thrown out of a set-associative cache because it would not fit in the set.
• Coherence: another processor (in a multi-processor/multicore environment) has invalidated the cache line.
We can buy our way out of Capacity and Conflict misses, but not Cold or Coherence misses!
  • 56.
13 Cache Sizes
[Figure: on-chip cache size (kB, log scale) vs. year, 1985-2010, for processors from the 80486 and Pentium through Pentium 4, Core 2 and Core i7.]
  • 57.
    14 Core i7 (Lynnfield) - 2009 www.ntnu.no M. Grannæs, Prefetching
  • 58.
    15 Pentium M - 2003 www.ntnu.no M. Grannæs, Prefetching
  • 59.
    16 Prefetching Prefetching increases the performance of caches by predicting what data is needed and fetching that data into the cache before it is referenced. Need to know: • What to prefetch? • When to prefetch? • Where to put the data? • How do we prefetch? (Mechanism) www.ntnu.no M. Grannæs, Prefetching
  • 60.
    17 Prefetching Terminology Good Prefetch A prefetch is classified as Good if the prefetched block is referenced by the application before it is replaced. www.ntnu.no M. Grannæs, Prefetching
  • 61.
    17 Prefetching Terminology Good Prefetch A prefetch is classified as Good if the prefetched block is referenced by the application before it is replaced. Bad Prefetch A prefetch is classified as Bad if the prefetched block is not referenced by the application before it is replaced. www.ntnu.no M. Grannæs, Prefetching
  • 62.
18 Accuracy
The accuracy of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

    Accuracy = G / (G + B)
  • 63.
19 Coverage
If a conventional cache has M misses without using any prefetch algorithm, the coverage of a given prefetch algorithm that yields G good prefetches and B bad prefetches is calculated as:

    Coverage = G / M
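A tiny helper (not from the slides) that evaluates both metrics; the counts G = 9990, B = 10, M = 10000 anticipate the software prefetching example later in the talk.

#include <stdio.h>

double accuracy(long good, long bad)    { return (double)good / (good + bad); }
double coverage(long good, long misses) { return (double)good / misses; }

int main(void)
{
    long G = 9990, B = 10, M = 10000;   /* illustrative counts */
    printf("accuracy = %.3f, coverage = %.3f\n", accuracy(G, B), coverage(G, M));
    return 0;
}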
  • 64.
    20 Prefetching System Latency Lsystem = Lcache + pmiss · (Lmain memory + Lcongestion ) www.ntnu.no M. Grannæs, Prefetching
  • 65.
    20 Prefetching System Latency Lsystem = Lcache + pmiss · (Lmain memory + Lcongestion ) • If a prefetch is good: • pmiss is lowered • ⇒ Lsystem decreases www.ntnu.no M. Grannæs, Prefetching
  • 66.
    20 Prefetching System Latency Lsystem = Lcache + pmiss · (Lmain memory + Lcongestion ) • If a prefetch is good: • pmiss is lowered • ⇒ Lsystem decreases • If a prefetch is bad: • pmiss becomes higher because useful data might be replaced • Lcongestion becomes higher because of useless traffic • ⇒ Lsystem increases www.ntnu.no M. Grannæs, Prefetching
  • 67.
    21 Prefetching Techniques Types of prefetching: • Software • Special instructions. • Most modern high performance processors have them. • Very flexible. • Can be good at pointer chasing. • Requires compiler or programmer effort. • Processor executes prefetches instead of computation. • Static (performed at compile-time). • Hardware • Hybrid www.ntnu.no M. Grannæs, Prefetching
  • 68.
    21 Prefetching Techniques Types of prefetching: • Software • Hardware • Dedicated hardware analyzes memory references. • Most modern high performance processors have them. • Fixed functionality. • Requires no effort by the programmer or compiler. • Off-loads prefetching to hardware. • Dynamic (performed at run-time) • Hybrid www.ntnu.no M. Grannæs, Prefetching
  • 69.
    21 Prefetching Techniques Types of prefetching: • Software • Hardware • Hybrid • Dedicated hardware unit. • Hardware unit programmed by software. • Some effort required by the programmer or compiler. www.ntnu.no M. Grannæs, Prefetching
  • 70.
22 Software Prefetching

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

    MOV  r1, #0           ; acc
    MOV  r0, #0           ; i
Label:
    LOAD r2, r0(#data)    ; cache miss! (400 cycles!)
    ADD  r1, r2           ; acc += data[i]
    INC  r0               ; i++
    CMP  r0, #10000       ; i < 10000
    BL   Label            ; branch if less
  • 72.
23 Software Prefetching II

for (i = 0; i < 10000; i++) {
    acc += data[i];
}

Simple optimization using __builtin_prefetch():

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Why add 10 (and not 1)?
Prefetch distance – memory latency >> computation latency.
  • 74.
24 Software Prefetching III

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    acc += data[i];
}

Note:
• data[0] → data[9] will not be prefetched.
• data[10000] → data[10009] will be prefetched, but not used.

    Accuracy = G / (G + B) = 9990 / 10000 = 0.999 = 99.9%
    Coverage = G / M = 9990 / 10000 = 0.999 = 99.9%
  • 76.
25 Complex Software

for (i = 0; i < 10000; i++) {
    __builtin_prefetch(&data[i + 10]);
    if (someFunction(i) == True) {
        acc += data[i];
    }
}

Does prefetching pay off in this case?
• How many times is someFunction(i) true?
• How much memory bus traffic is generated by someFunction(i)?
• Does power matter?
We have to profile the program to know!
  • 78.
26 Dynamic Data Structures I

typedef struct node {
    int          data;
    struct node *next;
} node_t;

while ((node = node->next) != NULL) {
    acc += node->data;
}
  • 79.
27 Dynamic Data Structures II

typedef struct node {
    int          data;
    struct node *next;
    struct node *jump;
} node_t;

while ((node = node->next) != NULL) {
    __builtin_prefetch(node->jump);
    acc += node->data;
}
  • 80.
    28 Hardware Prefetching Software prefetching: • Need programmer effort to implement • Prefetch instructions is not computing • Compile-time • Very flexible www.ntnu.no M. Grannæs, Prefetching
  • 81.
    28 Hardware Prefetching Software prefetching: • Need programmer effort to implement • Prefetch instructions is not computing • Compile-time • Very flexible Hardware prefetching: • No programmer effort • Does not displace compute instructions • Run-time • Not flexible www.ntnu.no M. Grannæs, Prefetching
  • 82.
29 Sequential Prefetching
The simplest prefetcher, but surprisingly effective due to spatial locality.

Sequential Prefetching
Miss on address X ⇒ fetch X+n, X+n+1, ..., X+n+j
  n: prefetch distance
  j: prefetch degree
Collectively known as prefetch aggressiveness.
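A minimal sketch of the rule above; issue_prefetch() is an assumed hook into the cache, and the distance/degree values are illustrative.

/* Sequential prefetching: on a miss to block X, prefetch the j blocks
 * starting n blocks ahead of it. */
#define PREFETCH_DISTANCE 1   /* n */
#define PREFETCH_DEGREE   4   /* j */

extern void issue_prefetch(unsigned long block_addr);

void on_cache_miss(unsigned long block_addr)
{
    for (int k = 0; k < PREFETCH_DEGREE; k++)
        issue_prefetch(block_addr + PREFETCH_DISTANCE + k);
}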
  • 83.
30 Sequential Prefetching II
[Figure: speedup of sequential prefetching over no prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 84.
31 Reference Prediction Tables
Tien-Fu Chen and Jean-Loup Baer (1995)
• Builds upon sequential prefetching and stride-directed prefetching.
• Observation: non-unit strides in many applications
  – 2, 4, 6, 8, 10 (stride 2)
• Observation: each load instruction has a distinct access pattern
Reference Prediction Tables (RPT):
• Table indexed by the load instruction
• Simple state machine
• Stores a single delta of history.
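A compact sketch of the RPT idea as described above (allocate on the first miss, train on the first observed delta, prefetch once the same delta is seen twice). Table size, state names and issue_prefetch() are illustrative assumptions.

#include <stdint.h>

enum rpt_state { RPT_INITIAL, RPT_TRAINING, RPT_PREFETCH };

typedef struct {
    uint64_t pc, last_addr;
    int64_t  delta;
    enum rpt_state state;
} rpt_entry_t;

#define RPT_ENTRIES 64
static rpt_entry_t rpt[RPT_ENTRIES];

extern void issue_prefetch(uint64_t addr);

void rpt_miss(uint64_t pc, uint64_t addr)
{
    rpt_entry_t *e = &rpt[pc % RPT_ENTRIES];
    if (e->pc != pc) {                        /* new load: allocate entry */
        e->pc = pc; e->last_addr = addr;
        e->delta = 0; e->state = RPT_INITIAL;
        return;
    }
    int64_t d = (int64_t)(addr - e->last_addr);
    if (e->state != RPT_INITIAL && d == e->delta) {
        e->state = RPT_PREFETCH;              /* stride confirmed twice */
        issue_prefetch(addr + d);
    } else {
        e->delta = d;                         /* learn the new stride */
        e->state = RPT_TRAINING;
    }
    e->last_addr = addr;
}

Run on the walkthrough below (misses at 1, 3, 5 from the load at PC 100), the entry moves Initial → Training → Prefetch and issues a prefetch of address 7.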
  • 85.
32-36 Reference Prediction Tables
[Figures: RPT walkthrough. The load at PC 100 misses on addresses 1, 3 and 5. The first miss allocates an entry (last addr 1, no delta, state Initial); the second miss records delta 2 and moves the entry to Training; the third miss sees delta 2 again, moves the entry to Prefetch, and address 5 + 2 = 7 is prefetched.]
  • 90.
37 Reference Prediction Tables
[Figure: speedup of RPT vs. sequential prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 91.
    38 Global History Buffer K. Nesbit, A. Dhodapkar and J.Smith (2004) • Observation: Predicting more complex patterns require more history • Observation: A lot of history in the RPT is very old Program Counter/Delta Correlation (PC/DC) • Store all misses in a FIFO called Global History Buffer (GHB) • Linked list of all misses from one load instruction • Traversing linked list gives a history for that load www.ntnu.no M. Grannæs, Prefetching
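A sketch of the two GHB structures described above; sizes are illustrative, and the stale-pointer bookkeeping needed once the FIFO wraps around is omitted.

#include <stdint.h>

#define GHB_SIZE   256
#define INDEX_SIZE 64

typedef struct {
    uint64_t addr;
    int      prev;       /* index of the previous miss from the same PC, or -1 */
} ghb_entry_t;

static ghb_entry_t ghb[GHB_SIZE];
static int ghb_head = 0;                     /* FIFO insertion point */
static struct { uint64_t pc; int head; } index_table[INDEX_SIZE];

/* Record a miss and link it into the per-PC list. */
void ghb_insert(uint64_t pc, uint64_t addr)
{
    int slot = ghb_head;
    ghb_head = (ghb_head + 1) % GHB_SIZE;    /* the oldest entry is overwritten */

    int i = pc % INDEX_SIZE;
    ghb[slot].addr = addr;
    ghb[slot].prev = (index_table[i].pc == pc) ? index_table[i].head : -1;
    index_table[i].pc = pc;
    index_table[i].head = slot;
}

/* Walk the linked list to recover up to max deltas for this load. */
int ghb_deltas(uint64_t pc, int64_t *deltas, int max)
{
    int i = pc % INDEX_SIZE, n = 0;
    if (index_table[i].pc != pc) return 0;
    for (int e = index_table[i].head; ghb[e].prev >= 0 && n < max; e = ghb[e].prev)
        deltas[n++] = (int64_t)(ghb[e].addr - ghb[ghb[e].prev].addr);
    return n;                                /* deltas, most recent first */
}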
  • 92.
39-46 Global History Buffer
[Figures: GHB walkthrough. The index table entry for the load at PC 100 points to that load's most recent miss in the global history buffer; each GHB entry holds a miss address and a pointer to the previous miss from the same load. Following the pointer chain for PC 100 through the recorded addresses yields the delta buffer (2, 2).]
  • 100.
    47 Delta Correlation • In the previous example, the delta buffer only contained two values (2,2). • Thus it is easy to guess that the next delta is also 2. • We can then prefetch: Current address + Delta = 5 + 2 = 7 www.ntnu.no M. Grannæs, Prefetching
  • 101.
    47 Delta Correlation • In the previous example, the delta buffer only contained two values (2,2). • Thus it is easy to guess that the next delta is also 2. • We can then prefetch: Current address + Delta = 5 + 2 = 7 What if the pattern is repeating, but not regular? 1, 2, 3, 4, 5, 1, 2, 3, 4, 5 www.ntnu.no M. Grannæs, Prefetching
  • 102.
48-56 Delta Correlation
[Figures: delta correlation walkthrough. The miss stream 10, 11, 13, 16, 17, 19, 22 gives the delta history 1, 2, 3, 1, 2, 3. The most recent delta pair (2, 3) is located earlier in the history, and the deltas that followed it there (1, 2) are replayed from the current address, predicting prefetches of 22 + 1 = 23 and 23 + 2 = 25.]
  • 111.
57 PC/DC
[Figure: speedup of PC/DC vs. RPT and sequential prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 112.
    58 Data Prefetching Championships • Organized by JILP • Held in conjunction with HPCA’09 • Branch prediction championships • Everyone uses the same API (six function calls) • Same set of benchmarks • Third party evaluates performance • 20+ prefetchers submitted http://www.jilp.org/dpc/ www.ntnu.no M. Grannæs, Prefetching
  • 113.
    59 Delta Correlating Prediction Tables • Our submission to DPC-1 • Observation: GHB pointer chasing is expensive. • Observation: History doesn’t really get old. • Observation: History would reach a steady state. • Observation: Deltas are typically small, while the address space is large. • Table indexed by the PC of the load • Each entry holds the history of the load in the form of deltas. • Delta Correlation www.ntnu.no M. Grannæs, Prefetching
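A compact sketch of the DCPT idea: a per-load delta history searched with delta correlation. Field names, table size and issue_prefetch() are assumptions; the real DCPT-P submission adds partial matching and L1 hoisting on top of this.

#include <stdint.h>

#define DCPT_DELTAS  6
#define DCPT_ENTRIES 64

typedef struct {
    uint64_t pc, last_addr, last_prefetch;
    int64_t  delta[DCPT_DELTAS];    /* delta[0] = oldest, delta[N-1] = newest */
} dcpt_entry_t;

static dcpt_entry_t dcpt[DCPT_ENTRIES];
extern void issue_prefetch(uint64_t addr);

void dcpt_miss(uint64_t pc, uint64_t addr)
{
    dcpt_entry_t *e = &dcpt[pc % DCPT_ENTRIES];
    if (e->pc != pc) {                        /* allocate on first use */
        *e = (dcpt_entry_t){ .pc = pc, .last_addr = addr };
        return;
    }
    /* Shift in the new delta. */
    for (int i = 0; i < DCPT_DELTAS - 1; i++)
        e->delta[i] = e->delta[i + 1];
    e->delta[DCPT_DELTAS - 1] = (int64_t)(addr - e->last_addr);
    e->last_addr = addr;

    /* Delta correlation: find the latest earlier occurrence of the newest
     * delta pair and replay the deltas that followed it. */
    int64_t d1 = e->delta[DCPT_DELTAS - 2], d2 = e->delta[DCPT_DELTAS - 1];
    for (int i = DCPT_DELTAS - 3; i >= 1; i--) {
        if (e->delta[i - 1] == d1 && e->delta[i] == d2) {
            uint64_t next = addr;
            for (int k = i + 1; k <= DCPT_DELTAS - 1; k++) {
                if (e->delta[k] == 0) break;          /* unfilled slot */
                next += e->delta[k];
                if (next > e->last_prefetch) {        /* avoid re-issuing */
                    issue_prefetch(next);
                    e->last_prefetch = next;
                }
            }
            return;
        }
    }
}

With the miss stream from the delta correlation walkthrough (10, 11, 13, 16, 17, 19, 22 from the load at PC 100), the entry detects the repeating 1, 2, 3 delta pattern and prefetches ahead along it.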
  • 114.
60-68 Delta Correlating Prefetch Tables
[Figures: DCPT walkthrough. The table entry for the load at PC 100 holds the PC, the last miss address, the last prefetch issued, a small circular buffer of deltas and a pointer into it. Misses at 10, 11, 13, 16, 17, 19, 22 fill the delta buffer with 1, 2, 3, 1, 2, 3, after which delta correlation on the buffer predicts the next addresses to prefetch.]
  • 123.
69 Delta Correlating Prefetch Tables
[Figure: speedup of DCPT vs. PC/DC, RPT and sequential prefetching on libquantum, milc, leslie3d, GemsFDTD, lbm and sphinx3.]
  • 124.
    70 DPC-1 Results 1 Access Map Pattern Matching 2 Global History Buffer - Local Delta Buffer 3 Prefetching based on a Differential Finite Context Machine 4 Delta Correlating Prediction Tables www.ntnu.no M. Grannæs, Prefetching
  • 125.
    70 DPC-1 Results 1 Access Map Pattern Matching 2 Global History Buffer - Local Delta Buffer 3 Prefetching based on a Differential Finite Context Machine 4 Delta Correlating Prediction Tables What did the winning entries do differently? • AMPM - Massive reordering to expose more patterns. • GHB-LDB and PDFCM - Prefetch into the L1. www.ntnu.no M. Grannæs, Prefetching
  • 126.
    71 Access Map Pattern Matching • Winning entry by Ishii et al. • Divides memory into hot zones • Each zone is tracked by using a 2 bit vector • Examines each zone for constant strides • Ignores temporal information Lesson Because of reordering, modern processors/compilers can reorder loads, thus the temporal information might be off. www.ntnu.no M. Grannæs, Prefetching
  • 127.
    72 Global History Buffer - Local Delta Buffer • Second place by Dimitrov et al. • Somewhat similar to DCPT • Improves PC/DC prefetching by including global correlation • Most common stride • Prefetches directly into the L1 Lesson Prefetch into L1 gives that extra performance boost Most common stride www.ntnu.no M. Grannæs, Prefetching
  • 128.
    73 Prefetching based on a Differential Finite Context Machine • Third place by Ramos et al. • Table with the most recent history for each load. • A hash of the history is computed and used to look up into a table containing the predicted stride • Repeat process to increase prefetching degree/distance • Separate prefetcher for L1 Lesson Feedback to adjust prefetching degree/prefetching distance Prefetch into the L1 www.ntnu.no M. Grannæs, Prefetching
  • 129.
    74 Improving DCPT Partial Matching Technique for handling reordering, common strides, etc L1 Hoisting Technique for handling L1 prefetching www.ntnu.no M. Grannæs, Prefetching
  • 130.
75 Partial Matching
• AMPM ignores all temporal information
• Reordering the delta history is very expensive
  – Reorder 5 accesses: 5! = 120 possibilities
• Solution: reduce spatial resolution by ignoring low bits

Example delta stream:
    8, 9, 10, 8, 10, 9  ⇒ (ignore the lower 2 bits) ⇒  8, 8, 8, 8, 8, 8
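One possible way to implement partial matching is to mask the low bits before comparing deltas; the mask width here is illustrative.

/* Compare deltas with the low bits masked off, so slightly reordered
 * accesses (deltas 8, 9, 10, ...) still match a common stride of 8. */
#define PARTIAL_MASK (~0x3L)    /* ignore the lower 2 bits */

static inline int deltas_match(long a, long b)
{
    return (a & PARTIAL_MASK) == (b & PARTIAL_MASK);
}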
  • 133.
76 L1 Hoisting
• All three top entries had mechanisms for prefetching into the L1
• Problem: pollution
• Solution: use the same highly accurate mechanism to prefetch into the L1.
• In the steady state, only the last predicted delta will be used.
• All other deltas have been prefetched and are either in the L2 or on their way.
• Hoist the first delta from the L2 to the L1 to increase performance.
  • 134.
77 L1 Hoisting II
Example delta stream: 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3
Steady state:
• Prefetch the last delta into the L2
• Hoist the first delta into the L1
  • 137.
78 DCPT-P
[Figure: speedup of DCPT-P vs. AMPM, GHB-LDB, PDFCM, RPT and PC/DC on milc, GemsFDTD, libquantum, leslie3d, lbm and sphinx3.]
  • 138.
    79 Interaction with the memory controller • So far we’ve talked about what to prefetch (address) • When and how is equally important • Modern DRAM is complex • Modern DRAM controllers are even more complex • Bandwidth limited www.ntnu.no M. Grannæs, Prefetching
  • 139.
80 Modern DRAM
• Can have multiple independent memory controllers
• Can have multiple channels per controller
• Typically multiple banks
• Each bank contains several pages (rows) of data (typically 1k-8k)
• Each page that is accessed is put in a single page buffer
• Access time to the page buffer is much lower than a full access
  • 140.
81-85 The 3D structure of modern DRAM
[Figures: successive views of the DRAM hierarchy – channels, banks, rows/pages and the page buffer.]
  • 145.
    86 Example Suppose a processor requires data at locations X1 and X2 that are located on the same page at times T1 and T2 . There are two separate outcomes: www.ntnu.no M. Grannæs, Prefetching
  • 146.
87 Case 1: The requests occur at roughly the same time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Read 2 (T2) enters the memory controller
4. Data X1 is returned from DRAM
5. Data X2 is returned from DRAM
6. The page is closed
Although there are two separate reads, the page is only opened once.
  • 153.
88 Case 2: The requests are separated in time:
1. Read 1 (T1) enters the memory controller
2. The page is opened
3. Data X1 is returned from DRAM
4. The page is closed
5. Read 2 (T2) enters the memory controller
6. The page is opened again
7. Data X2 is returned from DRAM
8. The page is closed
The page is opened and closed twice. By prefetching X2 we can increase performance by reducing latency and increasing memory throughput.
  • 161.
89 When does prefetching pay off?
The break-even point:

    prefetching accuracy · cost of a single read = cost of prefetching

What is the cost of prefetching?
• Application dependent
• Less than the cost of a single read, because prefetches can:
  – Utilize open pages
    • Reduced latency
    • Increased throughput
  – Utilize multiple banks
    • Lower latency
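One way to read the break-even condition: prefetching pays off when the demand-read cost saved by good prefetches outweighs the cost of the prefetch traffic. A small check under assumed, made-up costs:

#include <stdio.h>

int main(void)
{
    double cost_read     = 1.0;   /* cost of a demand read (page may be closed)  */
    double cost_prefetch = 0.6;   /* cheaper: can reuse open pages and idle banks */

    for (double accuracy = 0.2; accuracy <= 1.001; accuracy += 0.2) {
        double saved = accuracy * cost_read;   /* read cost avoided per prefetch */
        printf("accuracy %.1f: %s (saved %.2f vs. prefetch cost %.2f)\n",
               accuracy, saved >= cost_prefetch ? "pays off" : "hurts",
               saved, cost_prefetch);
    }
    return 0;
}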
  • 162.
90 Performance vs. Accuracy
[Figure: prefetch accuracy (%) vs. IPC improvement (%) for sequential, scheduled region, CZone/delta correlation and reference prediction table prefetching, with the break-even threshold marked.]
  • 163.
    91 Q&A Thank you for listening! www.ntnu.no M. Grannæs, Prefetching
  • 164.
TDT 4260 – lecture 17/2
• Contents
  – Cache coherence Chap 4.2
    • Repetition
    • Snooping protocols
  – SMP performance Chap 4.3
    • Cache performance
  – Directory based cache coherence Chap 4.4
  – Synchronization Chap 4.5
  – UltraSPARC T1 (Niagara) Chap 4.8

Updated lecture plan pr. 17/2
1: 14 Jan (LN, AI) Introduction, Chapter 1 / Alex: PfJudge
2: 21 Jan (IB) Pipelining, Appendix A; ILP, Chapter 2
3: 3 Feb (IB) ILP, Chapter 2; TLP, Chapter 3
4: 4 Feb (LN) Multiprocessors, Chapter 4
5: 11 Feb (MG) Prefetching + Energy Micro guest lecture by Marius Grannæs & pizza
6: 18 Feb (LN, MJ) Multiprocessors continued // Writing a comp.arch. paper (relevant for the miniproject, by MJ)
7: 24 Feb (IB) Memory and cache, cache coherence (Chap. 5)
8: 3 Mar (IB) Piranha CMP + Interconnection networks
9: 11 Mar (LN) Multicore architectures (Wiley book chapter) + Hill/Marty Amdahl multicore, Fedorova asymmetric multicore
10: 18 Mar (IB) Memory consistency (4.6) + more on memory
11: 25 Mar (JA, AI) (1) Kongull and other NTNU and NOTUR supercomputers (2) Green computing
12: 7 Apr (IB/LN) Wrap-up lecture, remaining stuff
13: 8 Apr Slack – no lecture planned

IDI Open, a challenge for you?
• http://events.idi.ntnu.no/open11/
• 2 April, programming contest, informal, fun, pizza, coke (?), party (?), 100-150 people, mostly students, low threshold
• Teams: 3 persons, one PC, Java, C/C++?
• Problems: some simple, some tricky
• Our team "DM-gruppas beskjedne venner" is challenging you students!
  – And we will challenge some of all the ICT companies in Trondheim

Miniproject groups, updates?
Rank  Prefetcher                Group       Score
1     rpt64k4_pf                Farfetched  1.089
2     rpt_prefetcher_rpt_seq    L2Detour    1.072
3     teeest                    Group 6     1.000

SMP: Cache Coherence Problem (recap)
[Figure: P1, P2 and P3 with private caches sharing a memory with u = 5; after P3 writes u = 7, P1's cached copy is stale]
• Processors see different values for u after event 3
• The old (stale) value is read in event 4 (hit)
• Event 5 (miss) reads the correct value (write-through caches) or the old value (write-back caches)
• Unacceptable to programs, and frequent!

Enforcing coherence (recap)
• Separate caches speed up access
  – Migration: data moves from shared memory to the local cache
  – Replication: several local copies when an item is read by several processors
• Need coherence protocols to track shared data
  – (Bus) snooping
    • Each cache maintains local status
    • All caches monitor a broadcast medium
    • Write invalidate / write update
  • 165.
State Machine (1/3) – CPU requests, for each cache block
• Invalid --CPU read miss: place read miss on bus--> Shared (read only)
• Shared: CPU read hit stays in Shared
• Shared --CPU write: place write miss on bus--> Exclusive (read/write)
• Exclusive: CPU read hit and CPU write hit stay in Exclusive
• Exclusive --CPU read miss (replacement): write back block, place read miss on bus
• Exclusive --CPU write miss (replacement): write back cache block, place write miss on bus
• Invalid --CPU write: place write miss on bus--> Exclusive

State Machine (2/3) – bus requests, for each cache block
• Shared --write miss/invalidate for this block--> Invalid
• Exclusive --write miss for this block--> write back block (abort memory access), go to Invalid
• Exclusive --read miss for this block--> write back block (abort memory access), go to Shared

State Machine (3/3)
• The two machines combined: each cache block reacts both to its own CPU's requests and to bus requests from other caches

Directory based cache coherence (1/2)
• Large MP systems, lots of CPUs
• Distributed memory preferable
  – Increases memory bandwidth
• Snooping bus with broadcast?
  – A single bus becomes a bottleneck
  – Other ways of communicating are needed
    • With these, broadcasting is hard/expensive
  – Can avoid broadcast if we know exactly which caches have a copy ⇒ Directory

Directory based cache coherence (2/2)
• The directory knows which blocks are in which cache and their state
• The directory can be partitioned and distributed
• Typical states:
  – Shared
  – Uncached
  – Modified
• Protocol based on messages
  – Invalidate and update are sent only where needed
  – Avoids broadcast, reduces traffic (Fig 4.19)

SMP performance (shared memory)
• Focus on cache performance
• 3 types of cache misses in a uniprocessor (the 3 C's)
  – Capacity (cache too small for the working set)
  – Compulsory (cold-start)
  – Conflict (placement strategy)
• Multiprocessors also give coherence misses
  – True sharing
    • Misses because of sharing of data
  – False sharing
    • Misses because of invalidates that would not have happened with cache block size = one word
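A minimal sketch of the MSI-style snooping protocol in the state machines above, for a single cache block; replacement transitions are omitted and the bus helpers are assumed hooks.

enum state { INVALID, SHARED, EXCLUSIVE };
enum cpu_event { CPU_READ, CPU_WRITE };
enum bus_event { BUS_READ_MISS, BUS_WRITE_MISS_OR_INVALIDATE };

extern void place_read_miss_on_bus(void);
extern void place_write_miss_on_bus(void);
extern void write_back_block(void);

/* Transitions driven by this block's own CPU. */
enum state on_cpu(enum state s, enum cpu_event e)
{
    if (e == CPU_READ) {
        if (s == INVALID) { place_read_miss_on_bus(); return SHARED; }
        return s;                               /* read hit in Shared/Exclusive */
    }
    /* CPU_WRITE */
    if (s != EXCLUSIVE) place_write_miss_on_bus();  /* also invalidates others */
    return EXCLUSIVE;
}

/* Transitions driven by bus requests from other caches. */
enum state on_bus(enum state s, enum bus_event e)
{
    if (s == EXCLUSIVE) write_back_block();     /* we hold the only dirty copy */
    if (e == BUS_READ_MISS)
        return (s == INVALID) ? INVALID : SHARED;   /* downgrade */
    return INVALID;                                 /* write miss / invalidate */
}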
  • 166.
Example: L3 cache size (Fig 4.11)
• AlphaServer 4100
  – 4 x Alpha @ 300 MHz
  – L1: 8 KB I + 8 KB D
  – L2: 96 KB
  – L3: off-chip, 2 MB
[Figure 4.11: normalized execution time (instruction execution, L2/L3 cache access, memory access, PAL code, idle) vs. L3 cache size of 1, 2, 4 and 8 MB]

Example: L3 cache size (Fig 4.12)
[Figure 4.12: memory cycles per instruction broken down into instruction, capacity/conflict, cold, false sharing and true sharing misses, for L3 cache sizes of 1, 2, 4 and 8 MB]

Example: Increasing parallelism (Fig 4.13)
[Figure 4.13: memory cycles per instruction (same breakdown) for processor counts 1, 2, 4, 6 and 8]

Example: Increased block size (Fig 4.14)
[Figure 4.14: misses per 1,000 instructions (same breakdown) for block sizes of 32, 64, 128 and 256 bytes]
  • 167.
How to Write a Computer Architecture Paper
TDT4260 Computer Architecture, 18 February 2011, Magnus Jahre

2nd Branch Prediction Championship
• International competition similar to our prefetching exercise system
• Task: implement your best possible branch predictor and write a paper about it
• Submission deadline: 15 April 2011
• More info: http://www.jilp.org/jwac-2/

How does pfJudge work?
• Each submitted file is one Kongull job
  – Contains 12 M5 instances, since there are 12 CPUs per node
  – Each M5 instance runs a different SPEC 2000 benchmark
• The Kongull job is added to the job queue
  – Status "Running" can mean running or queued, be patient
  – Running a job can take a long time depending on load
  – Kongull is usually able to empty the queue during the night
• We can give you a regular user account on Kongull
  – Remember that Kongull is a shared resource!
  – Always calculate the expected CPU-hour demand of your experiment before submitting

Storage Estimation
• We impose a storage limit of 8 KB on your prefetchers
  – This limit is not checked by the exercise system
• This is realistic: hardware components are usually designed with an area budget in mind
• Estimating storage is simple
  – Table-based prefetcher: add up the bits used in each entry and multiply by the number of entries

Research Workflow
• Evaluate the solution on a compute cluster ... receive PhD (get a real job)

HOW TO USE A SIMULATOR
  • 168.
Why simulate?
• Model of a system
  – Model the interesting parts with high accuracy
  – Model the rest of the system with sufficient accuracy
• "All models are wrong but some are useful" (G. Box, 1979)
• The model does not necessarily have a one-to-one correspondence with the actual hardware
  – Try to model behavior
  – Simplify your code wherever possible

Know your model
• You need to figure out which system is being modeled!
• Pfsys is a help to get started, but to draw conclusions from your work you need to understand what you are modeling

Find Your Story
• A good computer architecture paper tells a story
  – All good stories have a bad guy: the problem
  – All good stories have a hero: the scheme
• Writing a good paper is all about finding and identifying your story
• Note that this story has to be told within the strict structure of a scientific article

HOW TO WRITE A PAPER

Paper Format
• You will be pressed for space
• Try to say things as precisely as possible
  – Your first write-up can be as much as 3x the page limit, and it is still easy (possible) to get it under the limit
• Think about your plots/figures
  – A good plot/figure gives a lot of information
  – Is this figure the best way of conveying this idea?
  – Is this plot the best way of visualizing this data?
  – Plots/figures need to be area efficient (but readable!)

Typical Paper Outline
• Abstract
• Introduction
• Background/Related Work
• The Scheme (substitute with a descriptive title)
• Methodology
• Results
• Discussion
• Conclusion (with optional further work)
  • 169.
Abstract
• An experienced reader should be able to understand exactly what you have done from only reading the abstract
  – This is different from a summary
• Should be short; the limit varies from 150 to 200 words maximum
• Should include a description of the problem, the solution and the main results
• Typically the last thing you write

Introduction
• Introduces the larger research area that the paper is a part of
• Introduces the problem at hand
• Explains the scheme
• Level of abstraction: "20 000 feet"

Related Work
• Reference the work that other researchers have done that is related to your scheme
• Should be complete (i.e. contain all relevant work)
  – Remember: you define the scope of your work
• Can be split into two sections: Background and Related Work
  – Background is an informative introduction to the field (often section 2)
  – Related work is a very dense section that includes all relevant references (often section n-1)

The Scheme
• Explain your scheme in detail
  – Choose an informative title
• Trick: add an informative figure that helps explain your scheme
• If your scheme is complex, an informative example may be in order

Methodology
• Explains your experimental setup
• Should answer the following questions:
  – Which simulator did you use?
  – How have you extended the simulator?
  – Which parameters did you use for your simulations? (aim: reproducibility)
  – Which benchmarks did you use?
  – Why did you choose these benchmarks?
• If you are unsure about a parameter, run a simulation to check its impact

Results
• Show that your scheme works
• Compare to other schemes that do the same thing
  – Hopefully you are better, but you need to compare anyway
• Trick: "Oracle Scheme"
  – Uses "perfect" information to create an upper bound on the performance of a class of schemes
  – Prefetching: the best case is that all L2 accesses are hits
• Important: should be realistic
• Sensitivity analysis
  – Check the impact of model assumptions on your scheme
  • 170.
Discussion
• Only include this if you need it
• Can be used if:
  – You have weaknesses in your model that you have not accounted for
  – You tested improvements to your scheme that did not give good enough results to be included in "The Scheme" section

Conclusion
• Repeat the main results of your work
• Remember that the abstract, introduction and conclusion are usually read before the rest of the paper
• Can include Further Work:
  – Things you thought about doing that you did not have time to do

Thank You — visit our website: http://research.idi.ntnu.no/multicore/
  • 171.
TDT 4260 — Chap 5: TLP & Memory Hierarchy

Review on ILP
• What is ILP?
• Let the compiler find the ILP
  ▫ Advantages? Disadvantages?
• Let the HW find the ILP
  ▫ Advantages? Disadvantages?

Contents
• Multi-threading (Chap 3.5)
• Memory hierarchy (Chap 5.1)
  ▫ 6 basic cache optimizations
• 11 advanced cache optimizations (Chap 5.2)

Multi-threaded execution
• Multi-threading: multiple threads share the functional units of 1 processor via overlapping
  ▫ Must duplicate the independent state of each thread, e.g. a separate copy of register file, PC and page table
  ▫ Memory shared through virtual memory mechanisms
  ▫ HW for fast thread switch; much faster than a full process switch (≈ 100s to 1000s of clocks)
• When to switch?
  ▫ Alternate instruction per thread (fine grain)
  ▫ When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

Fine-Grained Multithreading
• Switches between threads on each instruction
  ▫ Multiple threads interleaved
• Usually round-robin fashion, skipping stalled threads
• The CPU must be able to switch threads every clock
• Hides both short and long stalls
  ▫ Other threads are executed when one thread stalls
• But slows down execution of individual threads
  ▫ A thread ready to execute without stalls will be delayed by instructions from other threads
• Used on Sun's Niagara

Coarse-Grained Multithreading
• Switch threads only on costly stalls (L2 cache miss)
• Advantages
  ▫ No need for very fast thread-switching
  ▫ Doesn't slow down a thread, since it switches only when the thread encounters a costly stall
• Disadvantage: hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
  ▫ Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
  ▫ The new thread must fill the pipeline before instructions can complete
• => Better for reducing the penalty of high-cost stalls, where pipeline refill << stall time
  • 172.
Do both ILP and TLP?
• TLP and ILP exploit two different kinds of parallel structure in a system
• Can a high-ILP processor also exploit TLP?
  ▫ Functional units are often idle because of stalls or dependences in the code
• Can TLP be a source of independent instructions that might reduce processor stalls?
• Can TLP be used to employ functional units that would otherwise lie idle with insufficient ILP?
• => Simultaneous Multi-threading (SMT)
  ▫ Intel: Hyper-Threading
[Figure: issue-slot usage per cycle for one thread vs. two threads on an 8-unit machine; M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

Simultaneous Multi-threading (SMT)
• A dynamically scheduled processor already has many HW mechanisms to support multi-threading
  ▫ Large set of virtual registers (virtual = not all visible at ISA level), register renaming
  ▫ Dynamic scheduling
• Just add a per-thread renaming table and keep separate PCs
  ▫ Independent commitment can be supported by logically keeping a separate reorder buffer for each thread

Multi-threaded categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading; threads 1-5 plus idle slots]

Design Challenges in SMT
• SMT makes sense only with a fine-grained implementation
  ▫ How to reduce the impact on single-thread performance?
  ▫ Give priority to one or a few preferred threads
• Large register file needed to hold multiple contexts
• Not affecting clock cycle time, especially in
  ▫ Instruction issue - more candidate instructions need to be considered
  ▫ Instruction completion - choosing which instructions to commit may be challenging
• Ensuring that cache and TLB conflicts generated by SMT do not degrade performance
  • 173.
Why memory hierarchy? (fig 5.2)
[Fig 5.2: processor vs. memory performance, 1980-2010, log scale; the processor-memory performance gap keeps growing]

Why memory hierarchy?
• Principle of Locality
  ▫ Spatial Locality: addresses near each other are likely referenced close together in time
  ▫ Temporal Locality: the same address is likely to be reused in the near future
• Idea: store recently used elements in fast memories close to the processor
  ▫ Managed by software or hardware?

Memory hierarchy
• We want large, fast and cheap at the same time
[Figure: processor (control + datapath) backed by successively larger memory levels; speed from fastest to slowest, capacity from smallest to largest, cost from most expensive to cheapest]

Cache block placement
• Block 12 placed in a cache with 8 cache lines:
  ▫ Fully associative: block 12 can go anywhere
  ▫ Direct mapped: block 12 can go only into block 4 (12 mod 8)
  ▫ Set associative: block 12 can go anywhere in set 0 (12 mod 4)

Cache performance
• Average access time = Hit time + Miss rate * Miss penalty (illustrated in the sketch after this slide)
• Miss rate alone is not an accurate measure
• Cache performance is important for CPU performance
• More important with higher clock rate
• Cache design can also affect instructions that don't access memory!
  ▫ Example: a set associative L1 cache on the critical path requires extra logic which will increase the clock cycle time
• Trade-off: additional hits vs. cycle time reduction

6 Basic Cache Optimizations
• Reducing hit time
  1. Giving reads priority over writes: writes in the write buffer can be handled after a newer read if not causing dependency problems
  2. Avoiding address translation during cache indexing: e.g. use the virtual memory page offset to index the cache
• Reducing miss penalty
  3. Multilevel caches: both small and fast (L1) and large (& slower) (L2)
• Reducing miss rate
  4. Larger block size (compulsory misses)
  5. Larger cache size (capacity misses)
  6. Higher associativity (conflict misses)
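A small illustration of the average-access-time formula and the (block address mod lines/sets) placement rule above, using made-up cache parameters rather than lecture data:

    #include <stdio.h>

    int main(void)
    {
        /* Average access time = Hit time + Miss rate * Miss penalty */
        double hit_time     = 1.0;    /* cycles (example)  */
        double miss_rate    = 0.05;   /* 5% of accesses    */
        double miss_penalty = 100.0;  /* cycles            */
        printf("AMAT = %.2f cycles\n", hit_time + miss_rate * miss_penalty);

        /* Block placement for block 12 in an 8-line cache */
        int block = 12, lines = 8, sets = 4;
        printf("Direct mapped: line %d (12 mod 8)\n", block % lines);
        printf("4 sets (set associative): set %d (12 mod 4)\n", block % sets);
        return 0;
    }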
  • 174.
1: Giving Reads Priority over Writes
• Caches typically use a write buffer
  ▫ CPU writes to cache and write buffer
  ▫ Cache controller transfers from buffer to RAM
  ▫ Write buffer usually a FIFO with N elements
  ▫ Works well as long as the buffer does not fill faster than it can be emptied
• Optimization
  ▫ Handle read misses before write buffer writes
  ▫ Must check for conflicts with the write buffer first

Virtual memory
• Processes use a large virtual memory
• Virtual addresses are dynamically mapped to physical addresses using HW & SW
• Page, page frame, page fault, translation lookaside buffer (TLB) etc.
[Figure: virtual address spaces of two processes mapped via page translation onto physical DRAM]

2: Avoiding Address Translation during Cache Indexing
• Virtual cache: use virtual addresses in caches
  ▫ Saves time on translation VA -> PA
  ▫ Disadvantages
    - Must flush the cache on a process switch (can be avoided by including the PID in the tag)
    - Alias problem: the OS and a process can have two VAs pointing to the same PA
• Compromise: "virtually indexed, physically tagged"
  ▫ Use the page offset to index the cache; it is the same for VA and PA
  ▫ At the same time as data is read from the cache, VA -> PA translation is done for the tag
  ▫ Tag comparison using PA
  ▫ But: page size restricts cache size

3: Multilevel Caches (1/2)
• Make the cache faster to keep up with the CPU, or larger to reduce misses?
• Why not both?
  ▫ Multilevel caches: small and fast L1, large (and cheaper) L2
• L1 cache speed affects the CPU clock rate
• L2 cache speed affects only the L1 miss penalty
  ▫ Can use more complex mapping for L2
  ▫ L2 can be large

3: Multilevel Caches (2/2)
• Average access time = L1 Hit time + L1 Miss rate * (L2 Hit time + L2 Miss rate * L2 Miss penalty)
• Local miss rate
  ▫ #cache misses / #cache accesses
• Global miss rate
  ▫ #cache misses / #CPU memory accesses
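A minimal sketch of the two-level average access time and the local vs. global miss-rate distinction, with illustrative numbers (not lecture data):

    #include <stdio.h>

    int main(void)
    {
        double l1_hit = 1.0,  l1_miss_rate  = 0.05;   /* per L1 access        */
        double l2_hit = 10.0, l2_local_miss = 0.40;   /* per L2 access        */
        double l2_penalty = 200.0;                    /* DRAM access, cycles  */

        double amat = l1_hit + l1_miss_rate * (l2_hit + l2_local_miss * l2_penalty);
        printf("AMAT = %.2f cycles\n", amat);

        /* Local miss rate: misses / accesses seen by that cache.
           Global miss rate: misses / all CPU memory accesses.     */
        double l2_global_miss = l1_miss_rate * l2_local_miss;
        printf("L2 local miss rate  = %.2f\n", l2_local_miss);
        printf("L2 global miss rate = %.3f\n", l2_global_miss);
        return 0;
    }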
  • 175.
4: Larger Block size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 1K-256K; compulsory misses fall while conflict/capacity misses eventually rise]
• Trade-off; 32 and 64 byte blocks are common

5: Larger Cache size
• Simple method
• Square-root rule (quadrupling the size of the cache will halve the miss rate)
• Disadvantages
  ▫ Longer hit time
  ▫ Higher cost
• Most used for L2/L3 caches

6: Higher Associativity
• Lower miss rate
• Disadvantages
  ▫ Can increase hit time
  ▫ Higher cost
• 8-way has similar performance to fully associative

11 Advanced Cache Optimizations
• Reducing hit time: 1. Small and simple caches, 2. Way prediction, 3. Trace caches
• Increasing cache bandwidth: 4. Pipelined caches, 5. Non-blocking caches, 6. Multibanked caches
• Reducing miss penalty: 7. Critical word first, 8. Merging write buffers
• Reducing miss rate: 9. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism: 10. Hardware prefetching, 11. Compiler prefetching

1: Small and simple caches
• Comparing the address to tag memory takes time
• ⇒ A small cache can help hit time
  ▫ E.g., L1 caches are the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  ▫ Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
• Simple ⇒ direct mapping
  ▫ Can overlap tag check with data transmission since there is no choice
• Access time estimate for 90 nm using the CACTI model 4.0
  ▫ Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Figure: CACTI access time (ns) vs. cache size 16 KB - 1 MB for 1-, 2-, 4- and 8-way caches]

2: Way prediction
• Extra bits are kept in the cache to predict which way (block) in a set the next access will hit
  ▫ Can retrieve the tag early for comparison
  ▫ Achieves a fast hit even with just one comparator
  ▫ Several cycles are needed to check the other blocks on misses
  • 176.
3: Trace caches
• Increasingly hard to feed modern superscalar processors with enough instructions
• Trace cache
  ▫ Stores dynamic instruction sequences rather than "bytes of data"
  ▫ An instruction sequence may include branches
    - Branch prediction is integrated with the cache
  ▫ Complex and relatively little used
  ▫ Used in Pentium 4: the trace cache stores up to 12K micro-ops decoded from x86 instructions (also saves decode time)

4: Pipelined caches
• Pipeline technology applied to cache lookups
  ▫ Several lookups in processing at once
  ▫ Results in a faster cycle time
  ▫ Examples: Pentium (1 cycle), Pentium-III (2 cycles), P4 (4 cycles)
  ▫ L1: increases the number of pipeline stages needed to execute an instruction
  ▫ L2/L3: increases throughput
    - Nearly for free, since the hit latency is on the order of 10-20 processor cycles and caches are easy to pipeline

5: Non-blocking caches (1/2)
• A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
• "Hit under miss" reduces the effective miss penalty by working during a miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  ▫ Requires that the lower-level memory can service multiple concurrent misses
  ▫ Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  ▫ Pentium Pro allows 4 outstanding memory misses

5: Non-Blocking Cache Implementation
• The cache can handle as many concurrent misses as there are MSHRs
• The cache must block when all valid bits (V) are set
• Very common
• MHA = Miss Handling Architecture, MSHR = Miss information/Status Holding Register, DMHA = Dynamic Miss Handling Architecture
[Figure: non-blocking cache performance]

6: Multibanked caches
• Divide the cache into independent banks that can support simultaneous accesses
  ▫ E.g., the T1 ("Niagara") L2 has 4 banks
• Banking works best when accesses naturally spread themselves across banks ⇒ the mapping of addresses to banks affects the behavior of the memory system
• A simple mapping that works well is "sequential interleaving"
  ▫ Spread block addresses sequentially across banks
  ▫ E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; ...
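A tiny sketch of the sequential-interleaving mapping described above (the block size and bank count are example values):

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64   /* bytes per cache block (example)  */
    #define NUM_BANKS   4   /* as in the T1 L2 example          */

    /* Sequential interleaving: consecutive block addresses go to
       consecutive banks (block address modulo number of banks).  */
    static int bank_of(uint64_t addr)
    {
        uint64_t block_addr = addr / BLOCK_SIZE;
        return (int)(block_addr % NUM_BANKS);
    }

    int main(void)
    {
        for (uint64_t addr = 0; addr < 8 * BLOCK_SIZE; addr += BLOCK_SIZE)
            printf("block address %llu -> bank %d\n",
                   (unsigned long long)(addr / BLOCK_SIZE), bank_of(addr));
        return 0;
    }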
  • 177.
7: Critical word first
• Don't wait for the full block before restarting the CPU
• Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
• Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
  ▫ Long blocks are more popular today ⇒ Critical Word First is widely used

8: Merging write buffers
• The write buffer allows the processor to continue while waiting to write to memory
  ▫ If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry
  ▫ If so, the new data are combined with that entry
• Multiword writes are more efficient to memory
• The Sun T1 (Niagara) processor, among many others, uses write merging

9: Compiler optimizations
• Instruction order can often be changed without affecting correctness
  ▫ May reduce conflict misses
  ▫ Profiling may help the compiler
• The compiler generates instructions grouped in basic blocks
  ▫ If the start of a basic block is aligned to a cache block, misses will be reduced
  ▫ Important for larger cache block sizes
• Data is even easier to move
  ▫ Lots of different compiler optimizations

10: Hardware prefetching
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Instruction prefetching
  ▫ Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  ▫ The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
• Data prefetching
  ▫ Pentium 4 can prefetch data into the L2 cache from up to 8 streams
  ▫ Prefetching is invoked on 2 successive L2 cache misses to a page
[Figure: performance improvement from prefetching, 1.16x to 1.97x, across SPECint2000 and SPECfp2000 benchmarks]

11: Compiler prefetching (a loop sketch follows below)
• Data prefetch
  ▫ Load data into a register (HP PA-RISC loads)
  ▫ Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9)
  ▫ Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing prefetch instructions takes time
  ▫ Is the cost of prefetch issues < the savings in reduced misses?

Cache Coherency
• Consider the following case. I have two processors that are sharing address X.
• Both cores read address X
• Address X is brought from memory into the caches of both processors
• Now, one of the processors writes to address X and changes the value.
• What happens? How does the other processor get notified that address X has changed?
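As an illustration of compiler/software prefetching (not code from the lecture), a loop can issue prefetches a few iterations ahead. This sketch uses GCC/Clang's __builtin_prefetch; the prefetch distance of 16 is an arbitrary example:

    /* Software prefetching sketch: sum an array while prefetching ahead.
       __builtin_prefetch is a compiler hint and cannot cause a fault.    */
    #include <stddef.h>

    double sum_with_prefetch(const double *a, size_t n)
    {
        const size_t dist = 16;            /* prefetch distance (example) */
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], 0 /* read */, 1 /* low locality */);
            sum += a[i];
        }
        return sum;
    }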
  • 178.
    Two types ofcache coherence schemes • Snooping ▫ Broadcast writes, so all copies in all caches will be properly invalidated or updated. • Directory ▫ In a structure, keep track of which cores are caching each address. ▫ When a write occurs, query the directory and properly handle any other cached copies.
  • 179.
    Contents • Introduction App E.1 • Two devices App E.2 TDT 4260 • Multiple devices App E.3 • Topology App E.4 Appendix E Interconnection Networks • Routing, arbitration, switching App E.5 Conceptual overview Motivation • Basic network technology assumed known • Motivation ▫ Increased importance System-to-system connections Intra system connections ▫ Increased demands Bandwidth, latency, reliability, ... ▫ Vital part of system design E.2: Connecting two devices Types of networks Number of devices and distance • OCN – On-chip network ▫ Functional units, register files, caches, … ▫ Also known as: Network on Chip (NoC) Destination • SAN – System/storage area network implicit ▫ Multiprocessor and multicomputer, storage • LAN – Local area network • WAN – Wide area network • Trend: Switches replace buses
  • 180.
Software to Send and Receive
• SW Send steps
  1: Application copies data to OS buffer
  2: OS calculates checksum, starts timer
  3: OS sends data to network interface HW and says start
• SW Receive steps
  3: OS copies data from network interface HW to OS buffer
  2: OS calculates checksum; if it matches, send ACK; if not, delete message (sender resends when timer expires)
  1: If OK, OS copies data to user address space and signals application to continue
• Sequence of steps for SW: protocol

Network media
• Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect (telephone)
• Coaxial cable: plastic covering, braided outer conductor, insulator, copper core; used by cable companies: high BW, good noise immunity
• Fiber optics: 3 parts are cable, light source (LED or laser diode) and light detector (photodiode); total internal reflection; multimode lets light disperse, single mode uses a single wavelength (laser)

Basic Network Structure and Functions: Media and Form Factor
[Figure: media type vs. distance - metal layers for OCNs (~0.01 m), printed circuit boards and InfiniBand/Myrinet connectors for SANs (~1-10 m), Cat5E twisted pair and Ethernet connectors for LANs (~100 m), coaxial cables and fiber optics for WANs (>1,000 m)]

Packet latency
• Total Latency = Sender Overhead + Time of Flight + Message Size / Bandwidth + Receiver Overhead
  ▫ Sender overhead (processor busy)
  ▫ Transmission time (size/bandwidth)
  ▫ Time of flight
  ▫ Receiver overhead (processor busy)
  ▫ Transport latency spans time of flight and transmission time
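A minimal sketch of the total-latency formula above, with illustrative numbers (not from the lecture):

    #include <stdio.h>

    int main(void)
    {
        /* Total latency = sender overhead + time of flight
                         + message size / bandwidth + receiver overhead */
        double sender_ovh = 1.0e-6;     /* 1 us                         */
        double recv_ovh   = 1.5e-6;     /* 1.5 us                       */
        double flight     = 0.5e-6;     /* 0.5 us (short link)          */
        double msg_bytes  = 1500.0;     /* one Ethernet-size packet     */
        double bandwidth  = 125.0e6;    /* 1 Gbit/s = 125 MB/s          */

        double total = sender_ovh + flight + msg_bytes / bandwidth + recv_ovh;
        printf("Total latency = %.2f us\n", total * 1e6);
        return 0;
    }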
  • 181.
Connecting multiple devices (3/3)
• Switched media
  ▫ "Point-to-point" connections
  ▫ Routing for each packet
  ▫ Arbitration for each connection
• Comparison
  ▫ Much higher aggregate BW in a switched network than in a shared media network
  ▫ Shared media is cheaper
  ▫ Distributed arbitration is simpler for switched

E.4: Interconnection Topologies
• One switch or bus can connect a limited number of devices
  ▫ Complexity, cost, technology, ...
• Interconnected switches are needed for larger networks
• Topology: connection structure
  ▫ What paths are possible for packets?
  ▫ All pairs of devices must have path(s) available
• A network is partitioned by a set of links if their removal disconnects the graph
  ▫ Bisection bandwidth
  ▫ Important for performance

Crossbar
• Common topology for connecting CPUs and I/O units
• Also used for interconnecting CPUs
• Fast and expensive (O(N^2))
• Non-blocking

Omega network
• Example of a multistage network built from 2x2 switches (straight, crossover, upper broadcast, lower broadcast)
• Usually log2 n stages for n inputs - O(N log N)
• Can block
[Figure: 8x8 omega network, sources 000-111 to destinations 000-111]

Linear Arrays and Rings
• Distributed switched networks
• Node = switch + 1-n end nodes
• Linear array = 1D grid
• 2D grid
• Torus has wrap-around connections
• CRAY with 3D torus
• Fixed number of connections per node (i.e. fixed degree)
[Figure: node = processor, memory, cache, controller and network interface attached to a switch, plus external I/O]

Trees
• Diameter and average distance are logarithmic
  ▫ k-ary tree, height d = logk N
  ▫ address = d-vector of radix k coordinates describing the path down from the root
• Bisection bandwidth = 1 near the root
  • 182.
E.5: Routing, Arbitration, Switching
• Routing
  ▫ Which of the possible paths are allowable for packets?
  ▫ The set of operations needed to compute a valid path
  ▫ Executed at source, intermediate, or even at destination nodes
• Arbitration
  ▫ When are paths available for packets?
  ▫ Resolves packets requesting the same resources at the same time
  ▫ For every arbitration, there is a winner and possibly many losers
    - Losers are buffered (lossless) or dropped on overflow (lossy)
• Switching
  ▫ How are paths allocated to packets?
  ▫ The winning packet (from arbitration) proceeds towards its destination
  ▫ Paths can be established one fragment at a time or in their entirety

Routing
• Shared media
  ▫ Broadcast to everyone
• Switched media needs real routing. Options:
  ▫ Source-based routing: the message specifies the path to the destination (changes of direction)
  ▫ Virtual circuit: a circuit is established from source to destination, the message picks the circuit to follow
  ▫ Destination-based routing: the message specifies the destination, the switch must pick the path
    - Deterministic: always follow the same path
    - Adaptive: pick different paths to avoid congestion, failures
    - Randomized routing: pick between several good paths to balance network load

Routing mechanism (see the sketch after this slide)
• Need to select an output port for each input packet
  ▫ And fast ...
• Simple arithmetic in regular topologies
  ▫ Ex: ∆x, ∆y routing in a grid (first ∆x then ∆y)
    - west (-x) if ∆x < 0; east (+x) if ∆x > 0; south (-y) if ∆x = 0, ∆y < 0; north (+y) if ∆x = 0, ∆y > 0
• Unidirectional links are sufficient for a torus (+x, +y)
• Dimension-order routing
  ▫ Reduce the relative address of each dimension in order (avoids deadlock)

Deadlock
• How can it arise?
  ▫ Necessary conditions: shared resources, incrementally allocated, non-preemptible
• How do you handle it?
  ▫ Constrain how channel resources are allocated (deadlock avoidance)
  ▫ Add a mechanism that detects likely deadlocks and fixes them (deadlock recovery)
[Figure: 4x4 grid of routers TRC(0,0)-TRC(3,3) with a cyclic dependency marked]

Arbitration (1/2)
• Several simultaneous requests to shared network resources
• Ideal: maximize usage of network resources
• But:
  ▫ Fairness needed
  ▫ Problem: starvation
• Figure: two-phase arbitration
  ▫ Request, Grant
  ▫ Poor usage

Arbitration (2/2)
• Three phases
• Multiple resource requests
• Better usage, but increased latency
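A small sketch of dimension-order (XY) routing as described above: route fully in x first, then in y. The port names are made up for illustration:

    #include <stdio.h>

    /* Hypothetical output ports of a mesh router. */
    typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

    /* Dimension-order (XY) routing: correct delta-x first, then delta-y. */
    static port_t route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
    {
        int dx = dst_x - cur_x;
        int dy = dst_y - cur_y;
        if (dx > 0) return EAST;
        if (dx < 0) return WEST;
        if (dy > 0) return NORTH;
        if (dy < 0) return SOUTH;
        return LOCAL;                   /* arrived at destination */
    }

    int main(void)
    {
        /* Route one hop at a time from (0,0) to (2,3) in a grid. */
        static const char *names[] = { "EAST", "WEST", "NORTH", "SOUTH", "LOCAL" };
        int x = 0, y = 0;
        while (!(x == 2 && y == 3)) {
            port_t p = route_xy(x, y, 2, 3);
            printf("(%d,%d) -> %s\n", x, y, names[p]);
            if (p == EAST) x++; else if (p == WEST) x--;
            else if (p == NORTH) y++; else if (p == SOUTH) y--;
        }
        return 0;
    }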
  • 183.
Switching
• Allocating paths for packets
• Two techniques:
  ▫ Circuit switching (connection oriented)
    - Communication channel allocated before the first packet
    - Packet headers don't need routing info
    - Wastes bandwidth
  ▫ Packet switching (connectionless)
    - Each packet handled independently
    - Can't guarantee response time
    - Two types - next slide

Store & Forward vs Cut-Through Routing
[Figure: store & forward buffers the whole packet at every hop; cut-through forwards the header as soon as it is decoded, so latency grows much more slowly with hop count]
• Cut-through (on blocking)
  ▫ Virtual cut-through (spools the rest of the packet into a buffer)
  ▫ Wormhole (buffers only a few flits, leaves the tail along the route)
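To make the difference concrete, a rough latency model (my own simplification, not lecture numbers): store-and-forward pays the full packet serialization at every hop, while cut-through pays it once plus a small per-hop header delay:

    #include <stdio.h>

    int main(void)
    {
        double packet_bytes = 1024, header_bytes = 8;
        double bw = 1.0e9;          /* 1 GB/s link bandwidth (example) */
        int    hops = 4;

        /* Store & forward: whole packet serialized at each hop. */
        double sf = hops * (packet_bytes / bw);

        /* Cut-through: header delay per hop, packet serialized once. */
        double ct = hops * (header_bytes / bw) + packet_bytes / bw;

        printf("store & forward: %.2f us, cut-through: %.2f us\n",
               sf * 1e6, ct * 1e6);
        return 0;
    }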
  • 184.
Piranha: Designing a Scalable CMP-based System for Commercial Workloads
Luiz André Barroso, Western Research Laboratory
April 27, 2001, Asilomar Microcomputer Workshop
  • 185.
What is Piranha?
• A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever increasing processor complexity and system design/verification cycles
  • 186.
Importance of Commercial Applications
[Pie chart: Worldwide Server Customer Spending (IDC 1999) - infrastructure 29%, business processing 22%, decision support 14%, software development 14%, collaborative 12%, other 6%, scientific & engineering 3%]
• Total server market size in 1999: ~$55-60B
  – technical applications: less than $6B
  – commercial applications: ~$40B
  • 187.
Price Structure of Servers
• IBM eServer 680 (220K tpmC; $43/tpmC)
  – 24 CPUs
  – 96 GB DRAM, 18 TB Disk
  – $9M price tag
• Compaq ProLiant ML370 (32K tpmC; $12/tpmC)
  – 4 CPUs
  – 8 GB DRAM, 2 TB Disk
  – $240K price tag
[Bar chart: normalized breakdown of HW cost into base, CPU, DRAM and I/O for the two systems]

Price per component:
  System                 | $/CPU   | $/MB DRAM | $/GB Disk
  IBM eServer 680        | $65,417 | $9        | $359
  Compaq ProLiant ML570  | $6,048  | $4        | $64

– Storage prices dominate (50%-70% in customer installations)
– Software maintenance/management costs are even higher (up to $100M)
– Price of expensive CPUs/memory system is amortized
  • 188.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary
  • 189.
    Studies of CommercialWorkloads l Collaboration with Kourosh Gharachorloo (Compaq WRL) – ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion) – ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh) – ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve) – HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese) – ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)
  • 190.
    Studies of CommercialWorkloads: summary l Memory system is the main bottleneck – astronomically high CPI – dominated by memory stall times – instruction stalls as important as data stalls – fast/large L2 caches are critical l Very poor Instruction Level Parallelism (ILP) – frequent hard-to-predict branches – large L1 miss ratios – Ld-Ld dependencies – disappointing gains from wide-issue out-of-order techniques!
  • 191.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary
  • 192.
Increasing Complexity of Processor Designs
• Pushing the limits of instruction-level parallelism
  – multiple instruction issue
  – speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size

  Processor (SGI MIPS) | Year Shipped | Transistor Count (millions) | Design Team Size | Design Time (months) | Verification Team Size (% of total)
  R2000                | 1985         | 0.10                        | 20               | 15                   | 15%
  R4000                | 1991         | 1.40                        | 55               | 24                   | 20%
  R10000               | 1996         | 6.80                        | >100             | 36                   | >35%
  (courtesy: John Hennessy, IEEE Computer, 32(8))

• Yielding diminishing returns in performance
  • 193.
Exploiting Higher Levels of Integration
• Alpha 21364: a single chip integrating a 1 GHz 21264 CPU core, 64 KB I$ and 64 KB D$, 1.5 MB L2$, memory controller, coherence engine and network interface
• Glueless multiprocessing: 21364 chips connect directly to each other, to memory and to I/O
• Lower latency, higher bandwidth
• Incrementally scalable
• Reuse of an existing CPU core addresses complexity issues
  • 194.
Exploiting Parallelism in Commercial Apps
• Simultaneous Multithreading (SMT): several threads share one wide core cycle by cycle (example: Alpha 21464)
• Chip Multiprocessing (CMP): multiple CPU cores with private L1 I$/D$ share the L2 cache, coherence network and memory controllers (example: IBM Power4)
• SMT is superior in single-thread performance
• CMP addresses complexity by using simpler cores
  • 195.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha – Architecture – Performance l Design Methodology l Summary
  • 196.
    Piranha Project l Explorechip multiprocessing for scalable servers l Focus on parallel commercial workloads l Small team, modest investment, short design time l Address complexity by using: – simple processor cores – standard ASIC methodology Give up on ILP, embrace TLP
  • 197.
    Piranha Team Members Research NonStop Hardware Development – Luiz André Barroso (WRL) ASIC Design Center – Kourosh Gharachorloo (WRL) – Tom Heynemann – David Lowell (WRL) – Dan Joyce – Harland Maxwell – Joel McCormack (WRL) – Harold Miller – Mosur Ravishankar (WRL) – Sanjay Singh – Rob Stets (WRL) – Scott Smith – Yuan Yu (SRC) – Jeff Sprouse – … several contractors Former Contributors Robert McNamara Brian Robinson Basem Nayfeh Barton Sano Andreas Nowatzyk Daniel Scales Joan Pendleton Ben Verghese Shaz Qadeer
  • 198.
Piranha Processing Node (single chip)
• Alpha core: 1-issue, in-order, 500 MHz
• L1 caches: I & D, 64 KB, 2-way
• Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
• L2 cache: shared, 1 MB, 8-way
• Memory Controller (MC): RDRAM, 12.8 GB/sec
• Protocol Engines (HE & RE): µprogrammed, 1K µinstr., even/odd interleaving
• System interconnect: 4-port crossbar router, topology independent, 32 GB/sec total bandwidth
[Figure: 8 CPU cores with L1 I$/D$ and L2 banks around the intra-chip switch, plus memory controllers, protocol engines (HE/RE) and router - all on a single chip]
  • 199.
    Piranha I/O Node Router CPU 2 Links @ HE 8GB/s I$ D$D$ PCI-X FB ICS FB RE L2$ MEM-CTL l I/O node is a full-fledged member of system interconnect – CPU indistinguishable from Processing Node CPUs – participates in global coherence protocol
  • 200.
    Example Configuration P P P P- I/O P- I/O P P P l Arbitrary topologies l Match ratio of Processing to I/O nodes to application requirements
  • 201.
    L2 Cache andIntra-Node Coherence l No inclusion between L1s and L2 cache – total L1 capacity equals L2 capacity – L2 misses go directly to L1 – L2 filled by L1 replacements l L2 keeps track of all lines in the chip – sends Invalidates, Forwards – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization – cooperates with Protocol Engines to enforce system-wide coherence
  • 202.
Inter-Node Coherence Protocol
• 'Stealing' ECC bits for the memory directory
  – computing ECC over wider words (8x(64+8) -> 4x(128+9+7) -> 2x(256+10+22) -> 1x(512+11+53)) frees up to 53 bits per line for directory state
• Directory entry: 2b state + 40b sharing info
• Dual representation: limited pointer + coarse vector
• "Cruise Missile" Invalidations (CMI)
  – limit fan-out/fan-in serialization compared with a coarse vector
• Several new protocol optimizations
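A speculative sketch (my reading of the slide, not Piranha's actual encoding) of a directory entry with the dual limited-pointer / coarse-vector representation:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical entry: 2 state bits + 40 sharing bits, stored either as
       up to two exact node pointers or as a coarse bit vector.             */
    struct dir_entry {
        uint8_t state;
        uint8_t num_ptrs;          /* sharers stored as exact pointers (0-2) */
        bool    coarse_mode;       /* true once the pointers overflow        */
        union {
            uint32_t ptr[2];       /* exact sharer node IDs                  */
            uint64_t vector;       /* coarse vector, 1 bit per node group    */
        } sharers;
    };

    /* Record a new sharer; fall back to the coarse vector on overflow.
       nodes_per_bit (group size) is an assumption for the sketch.           */
    void add_sharer(struct dir_entry *e, uint32_t node, uint32_t nodes_per_bit)
    {
        if (!e->coarse_mode && e->num_ptrs < 2) {
            e->sharers.ptr[e->num_ptrs++] = node;        /* still exact      */
            return;
        }
        if (!e->coarse_mode) {                           /* switch to coarse */
            uint64_t v = 0;
            for (int i = 0; i < e->num_ptrs; i++)
                v |= 1ull << (e->sharers.ptr[i] / nodes_per_bit);
            e->coarse_mode = true;
            e->sharers.vector = v;
        }
        e->sharers.vector |= 1ull << (node / nodes_per_bit);
    }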
  • 203.
  • 204.
Single-Chip Piranha Performance
[Bar chart: normalized execution time (CPU, L2 hit, L2 miss) for P1 (500 MHz, 1-issue), INO (1 GHz, 1-issue), OOO (1 GHz, 4-issue) and P8 (500 MHz, 1-issue, 8 cores); OLTP: 350, 191, 100, 34 - DSS: 233, 145, 100, 44]
• Piranha's performance margin: 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses => better utilizes the memory system
  • 205.
Single-Chip Performance (Cont.)
[Left chart: speedup vs. number of cores (500 MHz, 1-issue), near-linear up to 8 cores. Right chart: normalized breakdown of L1 misses (L2 hit, L2 forward, L2 miss) for P1, P2, P4 and P8]
• Near-linear scalability
  – low memory latencies
  – effectiveness of the highly associative L2 and non-inclusive caching
  • 206.
Potential of a Full-Custom Piranha
[Bar chart: normalized execution time (CPU, L2 hit, L2 miss) for OOO (1 GHz, 4-issue), P8 (500 MHz, 1-issue) and P8F (full custom, 1.25 GHz, 1-issue); OLTP: 100, 34, 19 - DSS: 100, 43, 20]
• 5x margin over OOO for OLTP and DSS
• A full-custom design benefits substantially from the boost in core speed
  • 207.
    Outline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary
  • 208.
    Managing Complexity inthe Architecture l Use of many simpler logic modules – shorter design – easier verification – only short wires* – faster synthesis – simpler chip-level layout l Simplify intra-chip communication – all traffic goes through ICS (no backdoors) l Use of microprogrammed protocol engines l Adoption of large VM pages l Implement sub-set of Alpha ISA – no VAX floating point, no multimedia instructions, etc.
  • 209.
    Methodology Challenges l Isolated sub-module testing – need to create robust bus functional models (BFM) – sub-modules’ behavior highly inter-dependent – not feasible with a small team l System-level (integrated) testing – much easier to create tests – only one BFM at the processor interface – simpler to assert correct operation – Verilog simulation is too slow for comprehensive testing
  • 210.
    Our Approach: l Design in stylized C++ (synthesizable RTL level) – use mostly system-level, semi-random testing – simulations in C++ (faster & cheaper than Verilog) § simulation speed ~1000 clocks/second – employ directed tests to fill test coverage gaps l Automatic C++ to Verilog translation – single design database – reduce translation errors – faster turnaround of design changes – risk: untested methodology l Using industry-standard synthesis tools l IBM ASIC process (Cu11)
  • 211.
    Piranha Methodology: Overview C++ RTL C++ RTL Models: Cycle Models accurate and “synthesizeable” PS1: Fast (C++) Logic CLevel Simulator Verilog Verilog Models: Machine cxx cxx translated from C++ models Models Physical Design: leverages industry standard Verilog-based tools PS1 PS1V Physical PS1V: Can “co-simulate” C++ Design and Verilog module versions and check correspondence cxx: C++ compiler CLevel: C++-to-Verilog Translator
  • 212.
Summary
• CMP architectures are inevitable in the near future
• Piranha investigates an extreme point in CMP design
  – many simple cores
• Piranha has a large architectural advantage over complex single-core designs (> 3x) for database applications
• The Piranha methodology enables faster design turnaround
• Key to Piranha is application focus:
  – One-size-fits-all solutions may soon be infeasible
  • 213.
    Reference l Papers on commercial workload performance & Piranha research.compaq.com/wrl/projects/Database
  • 215.
TDT 4260 – lecture 11/3 - 2011
• Miniproject status, update, presentation
• Synchronization, Textbook Chap 4.5
  – And a short note on BSP (with excellent timing ...)
• Short presentation of NUTS, NTNU Test Satellite System http://nuts.iet.ntnu.no/
• UltraSPARC T1 (Niagara), Chap 4.8
• And more on multicores

Miniproject – after the first deadline
Categories of submitted projects:
  1. Implementing an existing prefetcher: sequential prefetcher, sequential prefetcher (tagged or adaptive), RPT
  2. Comparison of 2 or more existing prefetchers: RPT and DCPT
  3. Improving on an existing prefetcher: improving the sequential prefetcher, improving DCPT

Miniproject – after the first deadline
• Feedback
  – RPT and DCPT are popular choices; the report should properly motivate each group's choice of prefetcher (the motivation should not be: "The code was easily available")
  – Several groups work on similar methods => "find your story"
  – Too much focus on getting the highest result in the PfJudge ranking; as stated in section 2.3 of the guidelines, the miniproject will be evaluated based on the following criteria:
    • good use of language
    • clarity of the problem statement
    • overall document structure
    • depth of understanding for the field of prefetching
    • quality of presentation

Miniproject presentations
• Friday 15/4 at 1415-1700 (max)
• OK for all?
  – No ... we are working on finding a time schedule that is OK for all

IDI Open, a challenge for you?

Synchronization
• Important concept
  – Synchronize access to shared resources
  – Order events from cooperating processes correctly
• Smaller MP systems
  – Implemented by uninterrupted instruction(s) atomically accessing a value
  – Requires special hardware support
  – Simplifies construction of OS / parallel apps
• Larger MP systems => Appendix H (not in course)
  • 216.
Atomic exchange (swap)
• Swaps the value in a register for the value in memory
  – Mem = 0 means not locked, Mem = 1 means locked
• How does this work?
  – Register <= 1                 ; processor wants to lock
  – Exchange(Register, Mem)
  – If Register = 0 => success
    • Mem was 0 => was unlocked; Mem is now 1 => now locked
  – If Register = 1 => fail
    • Mem was 1 => was locked; Mem is now 1 => still locked
• The exchange must be atomic!

Implementing atomic exchange (1/2)
• One alternative: Load Linked (LL) and Store Conditional (SC)
  – Used in sequence
  – If the memory location accessed by LL changes, SC fails
  – If there is a context switch between LL and SC, SC fails
  – Implemented using a special link register
    • Contains the address used in LL
    • Reset if the matching cache block is invalidated or if we get an interrupt
    • SC checks if the link register contains the same address. If so, we have atomic execution of LL & SC

Implementing atomic exchange (2/2)
• Example code EXCH (R4, 0(R1)):
    try:   MOV  R3, R4       ; mov exchange value
           LL   R2, 0(R1)    ; load linked
           SC   R3, 0(R1)    ; store conditional
           BEQZ R3, try      ; branch if SC failed
           MOV  R4, R2       ; put load value in R4
• This can now be used to implement e.g. spin locks (a C11 version is sketched after this slide)
           DADDUI R2, R0, #1 ; R0 always = 0
    lockit: EXCH R2, 0(R1)   ; atomic exchange
           BNEZ R2, lockit   ; already locked?

Barrier sync. in BSP
• The BSP model
  – Leslie G. Valiant, A bridging model for parallel computation, [CACM 1990]
  – Computations organised in supersteps
  – Algorithms adapt to the compute platform, represented through 4 parameters
  – Helps the combination of portability & performance
  – http://www.seas.harvard.edu/news-events/press-releases/valiant_turing

Multicore
• Important and early example: UltraSPARC T1
  – In all market segments from mobile phones to supercomputers

Why multicores?
• Motivation (see lecture 1)
  – End of Moore's law for single-core
  – The power wall
  – The memory wall
  – The bandwidth problem
  – ILP limitations
  – The complexity wall
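The same spin lock idea in portable C11 atomics (a sketch; on LL/SC ISAs the compiler lowers atomic_exchange to a loop much like the assembly above):

    #include <stdatomic.h>

    /* 0 = unlocked, 1 = locked, as in the slide. */
    static atomic_int lock = 0;

    void acquire(void)
    {
        /* Atomically swap in 1; if the old value was 1 the lock was
           already held, so spin and retry.                           */
        while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1)
            ;  /* spin */
    }

    void release(void)
    {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }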
  • 217.
Chip Multithreading: Opportunities and challenges
• Paper by Spracklen & Abraham, HPCA-11 (2005) [SA05]
• CMT processors = Chip Multi-Threaded processors
• A spectrum of processor architectures
  – Uni-processors with SMT (one core)
  – (pure) Chip Multiprocessors (CMP) (one thread per core)
  – Combination of SMT and CMP (they call it CMT)
• Best suited to server workloads (with high TLP)

Off-chip Bandwidth
• A bottleneck
• Bandwidth is increasing, but so is latency [Patt04]
• Need more than 100 in-flight requests to fully utilize the available bandwidth

Sharing processor resources
• SMT
  – Hardware strand: "HW for storing the state of a thread of execution"
  – Several strands can share resources within the core, such as execution resources
    • This improves utilization of processor resources
    • Reduces the application's sensitivity to off-chip misses
  – Switching between threads can be very efficient
• (pure) CMP
  – Multiple cores can share chip resources such as the memory controller, off-chip bandwidth and L2 cache
  – No sharing of HW resources between strands within a core
• Combination (CMT)

1st generation CMT
• 2 cores per chip
• Cores derived from earlier uniprocessor designs
• Cores do not share any resources, except off-chip data paths
• Examples: Sun's Gemini, Sun's UltraSPARC IV (Jaguar), AMD dual-core Opteron, Intel dual-core Itanium (Montecito), Intel dual-core Xeon (Paxville, server)

2nd generation CMT
• 2 or more cores per chip
• Cores still derived from earlier uniprocessor designs
• Cores now share the L2 cache
  – Speeds inter-core communication
  – Advantageous as most commercial applications have significant instruction footprints
• Examples: Sun's UltraSPARC IV+, IBM's Power 4/5
  • 218.
3rd generation CMT
• CMT processors are best designed from the ground up, optimized for a CMT design point
  – Lower power consumption
• Multiple cores per chip
• Examples:
  – Sun's Niagara (T1)
    • 8 cores, each is 4-way SMT
    • Each core single-issue, short pipeline
    • Shared 3 MB L2 cache
  – IBM's Power-5
    • 2 cores, each 2-way SMT

Multicore generations (?)

CMT/Multicore design space
• Number of cores
  – Multiple simple or a few complex?
    • Recent paper by Hill & Marty ... see http://www.youtube.com/watch?v=KfgWmQpzD74
  – Heterogeneous cores
    • Serial fraction of a parallel application - remember Amdahl's law
    • One powerful core for single-threaded applications
• Resource sharing
  – L2 cache! (and L3) (terminology: LL = Last Level cache)
  – Floating point units
  – New, more expensive resources (amortized over multiple cores)
    • Shadow tags, more advanced cache techniques, HW accelerators, cryptographic or OS functions (e.g. memcopy), XML parsing, compression
    • Your innovation !!!

CMT/Multicore challenges
• Multiple threads (strands) share resources
  – Maximize overall performance
  – Good resource utilization
  – Avoid "starvation" (units without work to do)
  – Cores must be "good neighbours"
  – Fairness, research by Magnus Jahre
    • See http://research.idi.ntnu.no/multicore/pub
• Prefetching
  – Aggressive prefetching is OK in a single-thread system since the entire system is idle on a miss
  – CMT/Multicore requires more careful prefetching
    • A prefetch operation may take resources used by other threads
  – See research by Marius Grannæs (same link as above)
• Speculative operations
  – OK if using idle resources (delay until the resource is idle)
  – Otherwise needs more care (just as prefetching) / seldom power efficient

UltraSPARC T1 ("Niagara")
• Target: commercial server applications
  – High thread level parallelism (TLP): large numbers of parallel client requests
  – Low instruction level parallelism (ILP): high cache miss rates, many unpredictable branches
• Power, cooling, and space are major concerns for data centers
• Metric: (Performance / Watt) / Sq. Ft.
• Approach: multicore, fine-grain multithreading, simple pipeline, small L1 caches, shared L2

T1 processor – "logical" overview
• 1.2 GHz at 72 W typical, 79 W peak power consumption
  • 219.
T1 Architecture
• Also ships with 6 or 4 processors

T1 pipeline / 4 threads
• Single issue, in-order, 6-deep pipeline: F, S, D, E, M, W
• Shared units: L1 cache, L2 cache, TLB, exec. units, pipe registers
• Separate units: PC, instruction buffer, reg file, store buffer

Miss Rates: L2 Cache Size, Block Size (fig. 4.27)
[Chart: T1 L2 miss rate (0-2.5%) for TPC-C and SPECJBB, for 1.5, 3 and 6 MB L2 with 32 B and 64 B blocks]

Miss Latency: L2 Cache Size, Block Size (fig. 4.28)
[Chart: T1 L2 miss latency (up to ~200 cycles) for TPC-C and SPECJBB, for the same cache configurations]

Average thread status (fig 4.30)

CPI Breakdown of Performance
  Benchmark  | Per-thread CPI | Per-core CPI | Effective CPI for 8 cores | Effective IPC for 8 cores
  TPC-C      | 7.20           | 1.80         | 0.23                      | 4.4
  SPECJBB    | 5.60           | 1.40         | 0.18                      | 5.7
  SPECWeb99  | 6.60           | 1.65         | 0.21                      | 4.8
  • 220.
Performance Relative to Pentium D
[Chart: performance relative to Pentium D for Power5+, Opteron and Sun T1 on SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05 and a TPC-like workload]

Not Ready Breakdown (fig 4.31)
[Chart: fraction of cycles not ready for TPC-C, SPECJBB and SPECWeb99, split into L1 I miss, L1 D miss, L2 miss, pipeline delay and other]
• Other = ?
  – TPC-C: store buffer full is the largest contributor
  – SPEC-JBB: atomic instructions are the largest contributor
  – SPECWeb99: both factors contribute

Performance/mm^2, Performance/Watt
[Chart: efficiency normalized to Pentium D for Power5+, Opteron and Sun T1, measured as SPECIntRate, SPECFPRate, SPECJBB05 and TPC-C per mm^2 and per Watt]
  • 221.
    Cache Coherency And Memory Models
  • 222.
    Review ● Doespipelining help instruction latency? ● Does pipelining help instruction throughput? ● What is Instruction Level Parallelism? ● What are the advantages of OoO machines? ● What are the disadvantages of OoO machines? ● What are the advantages of VLIW? ● What are the disadvantages of VLIW? ● What is an example of Data Spatial Locality? ● What is an example of Data Temporal Locality? ● What is an example of Instruction Spatial Locality? ● What is an example of Instruction Temporal Locality? ● What is a TLB? ● What is a packet switched network?
  • 223.
    Memory Models (MemoryConsistency) Memory Model: The system supports a given model if operations on memory follow specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable.
  • 224.
    Memory Models (MemoryConsistency) Memory Model: The system supports a given model if operations on memory follow specific rules. The data consistency model specifies a contract between programmer and system, wherein the system guarantees that if the programmer follows the rules, memory will be consistent and the results of memory operations will be predictable. Huh??????
  • 225.
  • 226.
    Simple Case ● Consider a simple two processor system Memory Interconnect CPU 0 CPU 1 ● The two processors are coherent ● Programs running in parallel may communicate via memory addresses ● Special hardware is required in order to enable communication via memory addresses. ● Shared memory addresses are the standard form of communication for parallel programming
  • 227.
    Simple Case ● CPU 0 wants to send a data word to CPU 1 Memory Interconnect CPU 0 CPU 1 ● What does the code look like ???
  • 228.
    Simple Case ● CPU 0 wants to send a data word to CPU 1 Memory Interconnect CPU 0 CPU 1 ● What does the code look like ??? ● Code on CPU0 writes a value to an address ● Code on CPU1 reads the address to get the new value
  • 229.
Simple Case

    int shared_flag = 0;
    int shared_value = 0;

    void sender_thread() {
        shared_value = 42;
        shared_flag = 1;
    }

    void receiver_thread() {
        while (shared_flag == 0) { }
        int new_value = shared_value;
        printf("%i\n", new_value);
    }

[Diagram: CPU 0 and CPU 1 connected through an interconnect to memory]
  • 230.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Memory void sender_thread() { shared_value = 42; Interconnect shared_flag = 1; } CPU 0 CPU 1 void receiver_thread() { while (shared_flag == 0) { } Int new_value = shared_value; printf(“%in”, new_value); }
  • 231.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Sender writes to Memory void sender_thread() the shared data, then sets a { shared data flag shared_value = 42; that the receiver Interconnect shared_flag = 1; is polling } CPU 0 CPU 1 void receiver_thread() { while (shared_flag == 0) { } Int new_value = shared_value; printf(“%in”, new_value); }
  • 232.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Sender writes to Memory void sender_thread() the shared data, then sets a { shared data flag shared_value = 42; that the receiver Interconnect shared_flag = 1; is polling } CPU 0 CPU 1 void receiver_thread() { while (shared_flag == 0) { } Receiver is polling on the flag. When the flag is no longer zero, the Int new_value = shared_value; receiver reads the shared_value and printf(“%in”, new_value); prints it out. }
  • 233.
    Simple Case Global variables are shared when using pthreads. This means all threads within int shared_flag = 0; this process may access these variables int shared_value = 0; Sender writes to Memory void sender_thread() the shared data, then sets a { shared data flag shared_value = 42; that the receiver Interconnect shared_flag = 1; is polling } CPU 0 CPU 1 Any Problems??? void receiver_thread() { while (shared_flag == 0) { } Receiver is polling on the flag. When the flag is no longer zero, the Int new_value = shared_value; receiver reads the shared_value and printf(“%in”, new_value); prints it out. }
  • 234.
    Simple CMP CacheCoherency Directory Directory Directory Directory ● Four core machine supporting cache coherency L2 Bank L2 Bank L2 Bank L2 Bank ●Each core has a local L1 Data and Instruction cache. Interconnect ●The L2 cache is shared amongst all cores, and physically distributed into 4 L1 L1 L1 L1 disparate banks CPU 0 CPU 0 CPU 0 CPU 0 ●The interconnect sends memory requests and responses back and forth between the caches
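A toy sketch of the bookkeeping the per-bank directory does in this four-core example (invalidate-based, write-through L1s as assumed later in these slides; all names are made up):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_CPUS 4

    /* One directory entry per L2 line: a bit per CPU whose L1 holds a copy. */
    struct dir_entry {
        uint8_t sharers;                 /* bit i set => CPU i has the line  */
    };

    static void on_read(struct dir_entry *e, int cpu)
    {
        e->sharers |= 1u << cpu;         /* remember the new sharer          */
    }

    static void on_write(struct dir_entry *e, int cpu)
    {
        /* Invalidate every other cached copy, then record the writer.       */
        for (int i = 0; i < NUM_CPUS; i++)
            if (i != cpu && (e->sharers & (1u << i)))
                printf("send invalidate to CPU %d\n", i);
        e->sharers = 1u << cpu;
    }

    int main(void)
    {
        struct dir_entry x = { 0 };
        on_read(&x, 0);                  /* CPU 0 loads X                     */
        on_read(&x, 3);                  /* CPU 3 loads X                     */
        on_write(&x, 0);                 /* CPU 0 stores X -> CPU 3 invalidated */
        return 0;
    }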
  • 235.
    The Coherency Problem Directory Directory Directory Directory L2 Bank L2 Bank L2 Bank L2 Bank Interconnect L1 L1 L1 L1 CPU 0 CPU 0 CPU 0 CPU 0 Ld R1,X
  • 236.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank Interconnect Miss! L1 L1 L1 L1 CPU 0 CPU 0 CPU 0 CPU 0 Ld R1,X
  • 237.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) L1 L1 L1 L1 CPU 0 CPU 0 CPU 0 CPU 0 Ld R1,X
  • 238.
    To The Coherency Problem Memory Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) ●If miss at home L1 L1 L1 L1 L2, read data from CPU 0 CPU 0 CPU 0 CPU 0 memory Ld R1,X
  • 239.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) ●If miss at home L1 L1 L1 L1 L2, read data from CPU 0 CPU 0 CPU 0 CPU 0 memory ●Deposit data in Ld R1,X both home L2 and Local L1
  • 240.
    The Coherency Problem Directory Directory Directory Directory ● Misses in Cache L2 Bank L2 Bank L2 Bank L2 Bank ● Goes to “home” l2 (home often determined by Interconnect hash of address) ●If miss at home L1 L1 L1 L1 L2, read data from CPU 0 CPU 0 CPU 0 CPU 0 memory ●Deposit data in Ld R1,X both home L2 and Local L1 Mem(X) is now in both the L2 and ONE L1 cache
  • 241.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X
  • 242.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address ● Miss in L1 Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Miss! Ld R1,X Ld R2,X
  • 243.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address ● Miss in L1 Interconnect ● Sends request to L2 ● Hits in L2 L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X
  • 244.
    The Coherency Problem Directory Directory Directory Directory ●CPU 3 reads the L2 Bank L2 Bank L2 Bank L2 Bank same address ● Miss in L1 Interconnect ● Sends request to L2 ● Hits in L2 L1 L1 L1 L1 ●Data is placed in L1 CPU 0 CPU 1 CPU 2 CPU 3 cache for CPU 3 Ld R1,X Ld R2,X
  • 245.
    The Coherency Problem Directory Directory Directory Directory ● CPU now STORES L2 Bank L2 Bank L2 Bank L2 Bank to address X Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X What happens?????
  • 246.
    The Coherency Problem Directory Directory Directory Directory ● CPU now STORES L2 Bank L2 Bank L2 Bank L2 Bank to address X Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X Special hardware is needed in order to either update or invalidate the data in CPU 3's cache
  • 247.
    The Coherency Problem Directory Directory Directory Directory ● For this example, we L2 Bank L2 Bank L2 Bank L2 Bank will assume a directory based invalidate protocol, with write-thru L1 Interconnect caches L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X
  • 248.
    The Coherency Problem Directory Directory Directory Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 Interconnect L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X
  • 249.
    The Coherency Problem Directory Directory 0, 3 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 CPU 0 CPU 1 CPU 2 CPU 3 Ld R1,X Ld R2,X Store R2, X
  • 250.
    The Coherency Problem Directory Directory 0, 3 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 ●The data in CPU3's CPU 0 CPU 1 CPU 2 CPU 3 cache is invalidated Ld R1,X Ld R2,X Store R2, X
  • 251.
    The Coherency Problem Directory Directory 0 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 ●The data in CPU3's CPU 0 CPU 1 CPU 2 CPU 3 cache is invalidated ●The L2 cache is updated with the new Ld R1,X Ld R2,X value Store R2, X
  • 252.
    The Coherency Problem Directory Directory 0 Directory ● Store updates the L2 Bank L2 Bank L2 Bank L2 Bank local L1 and writes- thru to the L2 ●At the L2, the Interconnect directory is inspected, showing CPU3 is sharing the line L1 L1 L1 L1 ●The data in CPU3's CPU 0 CPU 1 CPU 2 CPU 3 cache is invalidated ●The L2 cache is updated with the new Ld R1,X Ld R2,X value ● The system is now Store R2, X “coherent” ● Note that CPU3 was removed from the directory
  • 253.
Ordering
[Figure: the same four-CPU system; two stores to different addresses, Store R1,X and Store R2,Y, travel through the interconnect toward different L2 banks.]
● Our protocol relies on stores writing through to the L2 cache.
● If the stores are to different addresses, there are multiple points within the system where the stores may be reordered.
● In the slide animation, the second store ("purple") leaves the network first: the stores are written to the shared L2 out of order (purple first, then red)!
● The interconnect is not the only cause of out-of-order behaviour:
  – the processor core may issue instructions out of order (remember out-of-order machines?)
  – the L2 pipeline may also reorder requests to different addresses
L2 Pipeline Ordering
[Figure: the L2 pipeline; requests arriving from the network pass through resource allocation, L2 tag access, L2 data access, and coherence and conflict detection, with a retry FIFO for requests that cannot proceed.]
● Two memory requests arrive on the network.
● The requests are initially serviced in order.
● A conflict is detected; conflicting requests are sent to the retry FIFO.
● The network is given priority over the retry FIFO.
● The requests are now executing in a different order!
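A toy model in C (an illustration, not the real M5 or hardware pipeline) of how "conflicts go to a retry FIFO" plus "the network has priority" can reorder requests to different addresses:

#include <stdio.h>

#define QLEN 8

typedef struct { int id; int line; } Req;                 /* request id and cache line */
typedef struct { Req buf[QLEN]; int head, tail; } Fifo;   /* tiny circular FIFO        */

static int  fifo_empty(Fifo *f)       { return f->head == f->tail; }
static void fifo_push(Fifo *f, Req r) { f->buf[f->tail % QLEN] = r; f->tail++; }
static Req  fifo_pop(Fifo *f)         { Req r = f->buf[f->head % QLEN]; f->head++; return r; }

int main(void) {
    Fifo network = {0}, retry = {0};

    /* Requests 1 and 2 target the same cache line; request 3 targets another. */
    fifo_push(&network, (Req){1, 5});
    fifo_push(&network, (Req){2, 5});
    fifo_push(&network, (Req){3, 7});

    int busy_line = -1, busy_until = 0;   /* line being serviced, and until which cycle */

    for (int cycle = 0; !fifo_empty(&network) || !fifo_empty(&retry); cycle++) {
        /* The network input has priority over the retry FIFO. */
        Req r = !fifo_empty(&network) ? fifo_pop(&network) : fifo_pop(&retry);

        if (r.line == busy_line && cycle < busy_until) {
            printf("cycle %d: request %d conflicts, sent to retry FIFO\n", cycle, r.id);
            fifo_push(&retry, r);
        } else {
            printf("cycle %d: request %d (line %d) serviced\n", cycle, r.id, r.line);
            busy_line = r.line;
            busy_until = cycle + 2;       /* the line stays busy for two cycles */
        }
    }
    /* Requests complete in the order 1, 3, 2: requests 2 and 3 were reordered. */
    return 0;
}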
Simple Case (revisited)
[Figure: two CPUs connected through an interconnect to memory.]

int shared_flag = 0;
int shared_value = 0;

void sender_thread() {
    shared_value = 42;
    shared_flag = 1;
}

void receiver_thread() {
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited), step by step
[Figure: the four-CPU system; CPU 0 runs the sender (shared_value = 42; shared_flag = 1;), CPU 3 runs the receiver (while (shared_flag == 0) { } new_value = shared_value;).]
● The receiver is spinning on "shared_flag": CPU 3 has the flag cached (value 0) and is recorded as a sharer in the directory.
● "shared_value" has its reset value of 0 in the L2.
● The store to "shared_value" (42) writes through CPU 0's L1 and enters the network.
● The store to "shared_flag" (1) also writes through the L1; both stores are now sitting in the network.
● The store to "shared_flag" is the first to leave the network.
● "shared_flag" is updated in the L2, and the coherence protocol invalidates the copy in CPU 3's cache.
● The polling receiver now misses in its cache and sends a request to the L2!
● The response comes back: the flag is now set, so it is time to read "shared_value".
● Note that the write to "shared_value" is still sitting in the network!
● CPU 3 reads "shared_value" and gets the old value, 0 (the directory now records CPU 3 as a sharer of that line).
● The write of "42" to "shared_value" finally escapes the network, but it is TOO LATE!
● Our code doesn't always work! WTF???
● The architecture needs to expose ordering properties to the programmer, so that the programmer may write correct code. This is called the "Memory Model".
Sequential Consistency
Hardware GUARANTEES that all memory operations are ordered globally.
● Benefits
  ● Simplifies programming (our initial code would have worked)
● Costs
  ● Hard to implement micro-architecturally
  ● Can hurt performance
  ● Hard to verify
Weak Consistency
Loads and stores to different addresses may be re-ordered.
● Benefits
  ● Much easier to implement and build
  ● Higher performing
  ● Easy to verify
● Costs
  ● More complicated for the programmer
  ● Requires special "ordering" instructions for synchronization
Instructions for Weak Memory Models
● Write Barrier: don't issue a write until all preceding writes have completed.
● Read Barrier: don't issue a read until all preceding reads have completed.
● Memory Barrier: don't issue a memory operation until all preceding memory operations have completed.
● ...and so on.
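The slides use a generic __write_barrier() intrinsic. As a hedged illustration of how such barriers appear in portable code today, here is a sketch using standard C11 atomics and POSIX threads; the fence placement mirrors the slide example, the atomic_thread_fence, atomic load/store and pthread calls are standard, and everything else is made up for the example:

#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

int shared_value = 0;            /* plain data published by the sender       */
_Atomic int shared_flag = 0;     /* flag used to signal that data is ready   */

static void *sender(void *arg) {
    (void)arg;
    shared_value = 42;                               /* write the data        */
    atomic_thread_fence(memory_order_release);       /* acts as a write barrier */
    atomic_store_explicit(&shared_flag, 1, memory_order_relaxed);
    return NULL;
}

static void *receiver(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&shared_flag, memory_order_relaxed) == 0) { }
    atomic_thread_fence(memory_order_acquire);       /* acts as a read barrier  */
    printf("%i\n", shared_value);                    /* guaranteed to print 42  */
    return NULL;
}

int main(void) {
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}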
Simple Case (write barrier)
[Figure: two CPUs connected through an interconnect to memory.]

int shared_flag = 0;
int shared_value = 0;

void sender_thread() {
    shared_value = 42;
    __write_barrier();
    shared_flag = 1;
}

void receiver_thread() {
    while (shared_flag == 0) { }
    int new_value = shared_value;
    printf("%i\n", new_value);
}
Simple Case (revisited, with write barrier), step by step
[Figure: the four-CPU system; CPU 0 runs the sender with __write_barrier() between the two stores, CPU 3 runs the receiver.]
● The receiver is spinning on "shared_flag": CPU 3 has the flag cached (value 0) and is recorded as a sharer in the directory.
● "shared_value" has its reset value of 0 in the L2.
● The store to "shared_value" (42) writes through the L1 and enters the network.
● The write barrier prevents the issue of "shared_flag = 1" until "shared_value = 42" is complete. This is tracked via acknowledgments; the flag store is blocked.
● The write of 42 eventually leaves the network and updates the L2 (the flag store is still blocked).
● The write is acknowledged (still blocked).
● The barrier is now complete!
● The store to "shared_flag" writes through the L1 and leaves the network.
● "shared_flag" is updated in the L2, and the coherence protocol invalidates the copy in CPU 3's cache.
● The polling receiver misses in its cache and sends a request to the L2!
● The response comes back: the flag is now set, so it is time to read "shared_value".
● CPU 3 reads "shared_value" and gets 42. Correct code!!!
● What about reads...?
Weak or Strong?
● The academic community pushed hard for sequential consistency: "Multiprocessors Should Support Simple Memory Consistency Models", Mark Hill, IEEE Computer, August 1998.
● WRONG!!! Most new architectures support relaxed memory models (ARM, IA64, TILE, etc.). Much easier to implement and verify. Not a programming issue, because the complexity is hidden behind a library, and 99.9% of programmers don't have to worry about these issues!
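A sketch of what "the complexity is hidden behind a library" means in practice: a standard synchronization primitive such as a POSIX mutex and condition variable already contains the required barriers, so ordinary application code never issues fences directly (the variable names mirror the earlier slides; the pthread calls are standard):

#include <pthread.h>
#include <stdio.h>

static int shared_value = 0;
static int ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *sender(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);     /* lock/unlock imply the needed barriers */
    shared_value = 42;
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *receiver(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                 /* no busy-waiting on a plain flag */
        pthread_cond_wait(&cond, &lock);
    printf("%i\n", shared_value);  /* always prints 42 */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t s, r;
    pthread_create(&r, NULL, receiver, NULL);
    pthread_create(&s, NULL, sender, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}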
Break Problem
You are one of P recently arrested prisoners. The warden makes the following announcement: "You may meet together today and plan a strategy, but after today you will be in isolated cells and have no communication with one another. I have set up a 'switch room' which contains a light switch, which is either on or off. The switch is not connected to anything. Every now and then, I will select one prisoner at random to enter the switch room. This prisoner may throw the switch (from on to off, or vice versa), or may leave the switch unchanged. Nobody else will ever enter this room. Each prisoner will visit the switch room arbitrarily often. More precisely, for any N, eventually each of you will visit the switch room at least N times. At any time, any of you may declare: 'we have all visited the switch room at least once.' If the claim is correct, I will set you free. If the claim is incorrect, I will feed all of you to the sharks." Devise a winning strategy when you know that the initial state of the switch is off. Hint: not all prisoners need to do the same thing.
TDT4260: Introduction to Green Computing / Asymmetric multicore processors (Alexandru Iordan)
● Outline: What do we mean by Green Computing? Why Green Computing? Measuring "greenness". Research into energy consumption reduction.
● What do we mean by Green Computing?
  – "The green computing movement is a multifaceted global effort to reduce energy consumption and to promote sustainable development in the IT world." [Patrick Kurp, Green computing, Communications of the ACM, 2008]
● Why Green Computing?
  – Heat dissipation problems
  – High energy bills
  – Growing environmental impact
● Measuring "greenness"
  – Non-standard metrics: Energy (Joules), Power (Watts), Energy-per-instruction (Joules / no. of instructions), Energy-delay^N product (Joules * seconds^N), Performance^N / Watt ((no. of instructions / second)^N / Watt)
  – Standard metrics: for data centers, the Power Usage Effectiveness metric (The Green Grid consortium); for servers, the ssj_ops / Watt metric (SPEC consortium)
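The metrics above are simple ratios and products. A small illustrative C sketch of how they could be computed from measured values (all numbers and variable names are made up):

#include <math.h>
#include <stdio.h>

int main(void) {
    double energy_J = 50.0;    /* measured energy                 */
    double time_s   = 2.0;     /* measured execution time         */
    double insts    = 4.0e9;   /* retired instructions            */
    int    N        = 2;       /* exponent in ED^N and perf^N/W   */

    double power_W    = energy_J / time_s;
    double epi        = energy_J / insts;                 /* energy per instruction */
    double edn        = energy_J * pow(time_s, N);        /* energy-delay^N product */
    double perfN_watt = pow(insts / time_s, N) / power_W; /* performance^N per Watt */

    printf("power=%g W  EPI=%g J/inst  ED%d=%g  perf^%d/W=%g\n",
           power_W, epi, N, edn, N, perfN_watt);
    return 0;
}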
Research into energy consumption reduction: "Maximizing Power Efficiency with Asymmetric Multicore Systems", Fedorova et al., Communications of the ACM, 2009
● Outline: asymmetric multicore processors; scheduling for parallel and serial applications; scheduling for CPU- and memory-intensive applications.
● Asymmetric multicore processors (AMPs)
  – What makes a multicore asymmetric? A few powerful cores (high clock frequency, complex pipelines, OoO execution) and many simple cores (low clock frequency, simple pipeline, low power requirement).
  – Homogeneous-ISA AMP: the same binary code can run on both types of cores.
  – Heterogeneous-ISA AMP: code compiled separately for each type of core; examples: IBM Cell, Intel Larrabee.
● Efficient utilization of AMPs: efficient mapping of threads/workloads
  – Parallel applications: serial part to complex cores, scalable parallel part to simple cores.
  – Microarchitectural characteristics of workloads: CPU-intensive applications to complex cores, memory-intensive applications to simple cores.
● Sequential vs. parallel characteristics
  – Sequential programs: high degree of ILP; can utilize features of a complex core (super-scalar pipeline, OoO execution, complex branch prediction).
  – Parallel programs: high number of parallel threads/tasks (compensates for low ILP and masks memory delays).
  – Having both complex and simple cores gives AMPs applicability for a wider range of applications.
● Parallelism-aware (PA) scheduling
  – Goal: improve overall system efficiency (not the performance of a particular application).
  – Idea: assign sequential applications/phases to run on the complex cores.
  – Does NOT provide fairness.
● Challenges of PA scheduling
  – Detecting serial and parallel phases: limited scalability of threads can yield wrong solutions.
  – Thread migration overhead: migration across memory domains is expensive; the scheduler must be topology aware.
● "Heterogeneity"-aware (HA) scheduling
  – Goal: improve overall system efficiency.
  – Idea: CPU-intensive applications/phases to complex cores; memory-intensive applications/phases to simple cores.
  – Inherently unfair.
● Challenges of HA scheduling
  – Classifying threads/phases as CPU- or memory-bound: two approaches presented, direct measurement and modeling.
  – Long execution time (direct measurement approach) or need for offline information (modeling approach).
● Summary
  – Green Computing focuses on improving energy efficiency and sustainable development in the IT world.
  – AMPs promise higher energy efficiency than symmetric processors.
  – Schedulers must be designed to take advantage of the asymmetric hardware.
● References
  – Kirk W. Cameron, "The road to greener IT pastures", IEEE Computer, 2009
  – Dan Herrick and Mark Ritschard, "Greening your computing technology, the near and far perspectives", Proceedings of the 37th ACM SIGUCCS, 2009
  – Luiz A. Barroso, "The price of performance", ACM Queue, 2005
NTNU HPC Infrastructure: IBM AIX Power5+ (Njord), CentOS AMD Istanbul (Kongull)
Jørn Amundsen, IDI/NTNU IT, 2011-03-25

Contents
1 Njord Power5+ hardware
2 Kongull AMD Istanbul hardware
3 Resource Managers
4 Documentation

Power5+ hardware: cache and memory
● 16 x 64-bit-word cache lines (32 in L3)
● Hardware cache-line prefetch on loads
● Reads from memory are written into L2
● External L3, acts as a victim cache for L2
● L2 and L3 are shared between cores
● L1 is write-through
● Cache coherence is maintained system-wide at the L2 level
● 4K page size by default; the kernel supports 64K and 16M pages
Chip design
[Figure: Power5+ chip layout. Two cores, each with a 64K 2-way L1 I-cache, a 32K 4-way L1 D-cache, 2 LSU, 2 FXU, 2 FPU, 1 BXU, 1 CRL and 64-bit register files (32 GPR, 32 FPR); a shared 1.92M 10-way L2 cache; an external 36M 12-way L3 cache at 35.2 GB/s; switch fabric and memory controller with 25.6 GB/s to 16-128 GB DDR2 main memory.]

SMT
● In a concrete application, the processor core might be idle 50-80% of the time, waiting for memory.
● An obvious solution would be to let another thread execute while our thread is waiting for memory.
● This is known as hyper-threading in the Intel/AMD world, and Simultaneous Multithreading (SMT) with IBM.
● SMT is supported in hardware throughout the processor core.
● SMT is more efficient than hyper-threading, with less context-switch overhead.
● Power5 and Power6 support 1 thread/core or SMT with 2 threads/core, while the latest Power7 supports 4 threads/core.
● SMT is enabled or disabled dynamically on a node with the (privileged) command smtctl.

SMT (2)
● SMT is beneficial if you are doing a lot of memory references and your application performance is memory bound.
● Enabling SMT doubles the number of MPI tasks per node, from 16 to 32. Requires your application to be sufficiently scalable.
● SMT is only available in user space with batch processing, by adding the structured comment string:
  #@ requirements = ( Feature == "SMT" )

Chip module packaging
● 4 chips and 4 L3 caches are HW-integrated onto an MCM
● 90.25 cm², 89 layers of metal
The system level
● On a p575 system, a node is 2 MCMs / 8 chips / 16 1.9 GHz cores.
● The Njord system is:
  – 2 x 16-way 32 GiB login nodes
  – 4 x 16-way 16 GiB I/O nodes (used with GPFS)
  – 186 x 16-way 32 GiB compute nodes
  – 6 x 16-way 128 GiB compute nodes
● GPFS parallel file system: 33 TiB fiber disks, 62 TiB SATA disks
● Interconnect: IBM Federation, a multistage crossbar network providing 2 GiB/s bidirectional bandwidth and 5 µs system-wide MPI latency

GPFS
● An important feature of an HPC system is the capability of moving large amounts of data from or to memory, across nodes, and from or to permanent storage.
● In this respect a high-quality and high-performance global file system is essential.
● GPFS is a robust parallel FS geared at high-bandwidth I/O, used extensively in HPC and in the database industry.
● Disk access is ≈ 1000 times slower than memory access, hence key factors for performance are:
  – spreading (striping) files across many disk units
  – using memory to cache files
  – hiding latencies in software

GPFS and parallel I/O (2)
● High transfer rates are achieved by distributing files in blocks round-robin across a large number of disk units, up to thousands of disks.
● On Njord, the GPFS block size and stripe unit is 1 MB.
● In addition to multiple disks servicing file I/O, multiple threads might read, write or update (R+W) a file simultaneously.
● GPFS uses multiple I/O servers (4 dedicated nodes on Njord), working in parallel for performance and maintaining file and file-metadata consistency.
● High performance comes at a cost: although GPFS can handle directories with millions of files, it is usually best to use fewer and larger files, and to access files in larger chunks.

File buffering
● The kernel does read-aheads and write-behinds of file blocks.
● The kernel applies heuristics to I/O to discover sequential and strided forward and backward reads.
● The disadvantage is memory copying of all data.
● This can be bypassed with DIRECT_IO, which can be useful with large (MB-sized) I/O, utilizing application I/O patterns.
AMD Istanbul hardware: cache and memory
● 6 x 128 KiB L1 cache
● 6 x 512 KiB L2 cache
● 1 x 6 MiB L3 cache
● 24 or 48 GiB DDR3 RAM

The system level (Kongull)
● A node is 2 chips / 12 2.4 GHz cores.
● The Kongull system is:
  – 1 x 12-way 24 GiB login node
  – 4 x 12-way 24 GiB I/O nodes (used with GPFS)
  – 52 x 12-way 24 GiB compute nodes
  – 44 x 12-way 48 GiB compute nodes
● Nodes compute-0-0 – compute-0-39 and compute-1-0 – compute-1-11 are 24 GiB @ 800 MHz, while compute-1-12 – compute-1-15 and compute-2-0 – compute-2-39 are 48 GiB @ 667 MHz bus frequency.
● GPFS parallel file system, 73 TiB.
● Interconnect: a fat tree implemented with HP ProCurve switches, 1 Gb/s from node to rack switch, then 10 Gb/s from the rack switch to the top-level switch. Bandwidth and latency is left as a programming exercise.
Resource Managers
● We need efficient (and fair) utilization of the large pool of resources. This is the domain of queueing (batch) systems, or resource managers.
● A resource manager administers the execution of (computational) jobs and provides resource accounting across users and accounts.
● This includes distribution of parallel (OpenMP/MPI) threads/processes across physical cores and gang scheduling of parallel execution.
● Jobs are Unix shell scripts with batch-system keywords embedded within structured comments.
● Both Njord and Kongull employ a series of queues (classes) administering various sets of possibly overlapping nodes with possibly different priorities.
● IBM LoadLeveler on Njord; Torque (a development from OpenPBS) on Kongull.

Njord job class overview
class      min-max nodes   max nodes/job   max runtime   description
forecast   1-180           180             unlimited     top-priority class dedicated to forecast jobs
bigmem     1-6             4               7 days        high-priority 115 GB memory class
large      4-180           128             21 days       high-priority class for jobs of 64 processors or more
normal     1-52            42              21 days       default class
express    1-186           4               1 hour        high-priority class for debugging and test runs
small      1/2             1/2             14 days       low-priority class for serial or small SMP jobs
optimist   1-186           48              unlimited     checkpoint-restart jobs

Njord job class overview (2)
● forecast is the highest-priority queue; it suspends everything else.
● Beware: node memory (except bigmem) is split in 2, to guarantee available memory for forecast jobs.
● A C-R (checkpoint-restart) job runs at the very lowest priority; any other job will terminate and requeue an optimist-queue job if not enough nodes are available.
● optimist-class jobs need an internal checkpoint-restart mechanism.
● AIX LoadLeveler imposes node job memory limits, e.g. jobs oversubscribing available node memory are aborted with an email.

LoadLeveler sample jobscript

# @ job_name = hybrid_job
# @ account_no = ntnuXXX
# @ job_type = parallel
# @ node = 3
# @ tasks_per_node = 8
# @ class = normal
# @ ConsumableCpus(2) ConsumableMemory(1664mb)
# @ error = $(job_name).$(jobid).err
# @ output = $(job_name).$(jobid).out
# @ queue
export OMP_NUM_THREADS=2
# Create (if necessary) and move to my working directory
w=$WORKDIR/$USER/test
if [ ! -d $w ]; then mkdir -p $w; fi
cd $w
$HOME/a.out
llq -w $LOADL_STEP_ID
exit 0
LoadLeveler sample C-R email
[Excerpt of a LoadLeveler notification email for job step f05n02io.791345.0 (z2rank_s_5.job): the program exited normally with exit code 0; the job step was dispatched to run 18 time(s) and rejected by the Starter 0 time(s); submitted Mon Mar 21 10:02:56 2011, started 18:16:59, exited 18:31:37; Real Time 0 08:28:41, Total Job Step Time 16 06:55:44; the state is listed for each machine used (f14n06, f09n06, f13n04, f14n04, f08n06, f12n06, f15n07, f18n04).]

Kongull job queue overview
class      min-max nodes   max nodes/job   max runtime   description
default    1-52            52              35 days       default queue except IPT, SFI IO and Sintef Petroleum
express    1-96            96              1 hour        high-priority queue for debugging and test runs
bigmem     1-44            44              7 days        default queue for IPT, SFI IO and Sintef Petroleum
optimist   1-96            48              28 days       checkpoint-restart jobs

● Oversubscribing node physical memory crashes the node.
  – This might happen if you do not specify the following in your job script: #PBS -lnodes=1:ppn=12
● If all nodes are not reserved, the batch system will attempt to share nodes by default.

Documentation
● Njord User Guide: http://docs.notur.no/ntnu/njord-ibm-power-5
● Notur load stats: http://www.notur.no/hardware/status/
● Kongull support wiki: http://hpc-support.idi.ntnu.no/
● Kongull load stats: http://kongull.hpc.ntnu.no/ganglia/
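For Kongull (Torque/OpenPBS), a minimal jobscript sketch in the same spirit as the LoadLeveler example above; only the #PBS -lnodes=1:ppn=12 line is taken from the text, while the job name, account, walltime and program are placeholders that would need to be adapted to the local setup:

#!/bin/sh
# Minimal Torque/PBS jobscript sketch (illustrative; adapt names and limits).
#PBS -N myjob
#PBS -A ntnuXXX
#PBS -lnodes=1:ppn=12
#PBS -lwalltime=01:00:00

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR
./a.out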
TDT4260 Computer Architecture Mini-Project Guidelines
Alexandru Ciprian Iordan (iordan@idi.ntnu.no), January 10, 2011

1 Introduction
The Mini-Project accounts for 20% of the final grade in TDT4260 Computer Architecture. Your task is to develop and evaluate a prefetcher using the M5 simulator. M5 is currently one of the most popular simulators for computer architecture research and has a rich feature set. Consequently, it is a very complex piece of software. To make your task easier, we have created a simple interface to the memory system that you can use to develop your prefetcher. Furthermore, you can evaluate your prefetchers by submitting your code via a web interface. This web interface runs your code on the Kongull cluster with the default simulator setup. It is also possible to experiment with other parameters, but then you will have to run the simulator yourself. The web interface, the modified M5 simulator and more documentation can be found at http://dm-ark.idi.ntnu.no/.
The Mini-Project is carried out in groups of 2 to 4 students. In some cases we will allow students to work alone. You will be graded based on both a written paper and a short oral presentation. Make sure you clearly cite the source of information, data and figures. Failure to do so is regarded as cheating and is handled according to NTNU guidelines. If you have any questions, send an e-mail to teaching assistant Alexandru Ciprian Iordan (iordan@idi.ntnu.no).

1.1 Mini-Project Goals
The Mini-Project has the following goals:
• Many computer architecture topics are best analyzed by experiments and/or detailed studies. The Mini-Project should provide training in such exercises.
• Writing about a topic often increases the understanding of it. Consequently, we require that the result of the Mini-Project is a scientific paper.

2 Practical Guidelines

2.1 Time Schedule and Deadlines
The Mini-Project schedule is shown in Table 1. If these deadlines collide with deadlines in other subjects, we suggest that you consider handing in the Mini-Project earlier than the deadline. If you miss the final deadline, this will reduce the maximum score you can be awarded.
Table 1: Mini-Project Deadlines
Deadline                        Description
Friday 21. January              List of group members delivered to Alexandru Ciprian Iordan (iordan@idi.ntnu.no) by e-mail
Friday 4. March                 Short status report and an outline of the final report delivered to Alexandru Ciprian Iordan (iordan@idi.ntnu.no) by e-mail
Friday 8. April 12:00 (noon)    Final paper deadline. Deliver the paper through It's Learning. Detailed report layout requirements can be found in section 2.2.
Week 15 (11. - 15. April)       Compulsory 10 minute oral presentations

2.2 Paper Layout
The paper must follow the IEEE Transactions style guidelines available here:
http://www.ieee.org/publications_standards/publications/authors/authors_journals.html#sect2
Both Latex and Word templates are available, but we recommend that you use Latex. The paper must use a maximum of 8 pages. Failure to comply with these requirements will reduce the maximum score you can be awarded. In addition, we will deduct points if:
• The paper does not have a proper scientific structure. All reports must contain the following sections: Abstract, Introduction, Related Work or Background, Prefetcher Description, Methodology, Results, Discussion and Conclusion. You may rename the "Prefetcher Description" section to a more descriptive title. Acknowledgements and Author biographies are optional.
• You do not use citations correctly. If you use a figure that somebody else has made, a citation must appear in the figure text.
• NTNU has acquired an automated system that checks for plagiarism. We may run this system on your papers, so make sure you write all text yourself.

2.3 Evaluation
The Mini-Project accounts for 20% of the total grade in TDT4260 Computer Architecture. Within the Mini-Project, the report counts 80% and the oral presentation 20%. The report grade will be based on the following criteria:
• Language and use of figures
• Clarity of the problem statement
• Overall document structure
• Depth of understanding for the field of computer architecture
• Depth of understanding of the investigated problem
The oral presentation grade will be based on the following criteria:
• Presentation structure
• Quality and clarity of the slides
• Presentation style
• If you use more than the provided time, you will lose points.
M5 simulator system
TDT4260 Computer Architecture: User documentation
Last modified: November 23, 2010
Contents
1 Introduction
  1.1 Overview
  1.2 Chapter outlines
2 Installing and running M5
  2.1 Download
  2.2 Installation
    2.2.1 Linux
    2.2.2 VirtualBox disk image
  2.3 Build
  2.4 Run
    2.4.1 CPU2000 benchmark tests
    2.4.2 Running M5 with custom test programs
  2.5 Submitting the prefetcher for benchmarking
3 The prefetcher interface
  3.1 Memory model
  3.2 Interface specification
  3.3 Using the interface
    3.3.1 Example prefetcher
4 Statistics
5 Debugging the prefetcher
  5.1 m5.debug and trace flags
  5.2 GDB
  5.3 Valgrind
Chapter 1
Introduction

You are now going to write your own hardware prefetcher, using a modified version of M5, an open-source hardware simulator system. This modified version presents a simplified interface to M5's cache, allowing you to concentrate on a specific part of the memory hierarchy: a prefetcher for the second-level (L2) cache.

1.1 Overview
This documentation covers the following:
• Installing and running the simulator
• Machine model and memory hierarchy
• Prefetcher interface specification
• Using the interface
• Testing and debugging the prefetcher on your local machine
• Submitting the prefetcher for benchmarking
• Statistics

1.2 Chapter outlines
The first chapter gives a short introduction, and contains an outline of the documentation.
The second chapter starts with the basics: how to install the M5 simulator. There are two possible ways to install and use it. The first is as a stand-alone VirtualBox disk image, which requires the installation of VirtualBox. This is the best option for those who use Windows as their operating system of choice. For Linux enthusiasts, there is also the option of downloading a tarball and installing a few required software packages. The chapter then continues to walk you through the necessary steps to get M5 up and running: building from source, running with command-line options that enable prefetching, running local benchmarks, compiling and running custom test programs, and finally, how to submit your prefetcher for testing on a computing cluster.

The third chapter gives an overview of the simulated system and describes its memory model. There is also a detailed specification of the prefetcher interface, and tips on how to use it when writing your own prefetcher. It includes a very simple example prefetcher with extensive comments.

The fourth chapter contains definitions of the statistics used to quantitatively measure prefetchers.

The fifth chapter gives details on how to debug prefetchers using advanced tools such as GDB and Valgrind, and how to use trace flags to get detailed debug printouts.
Chapter 2
Installing and running M5

2.1 Download
Download the modified M5 simulator from the PfJudgeβ website.

2.2 Installation

2.2.1 Linux
Software requirements (specific Debian/Ubuntu packages mentioned in parentheses):
• g++ >= 3.4.6
• Python and libpython >= 2.4 (python and python-dev)
• SCons > 0.98.1 (scons)
• SWIG >= 1.3.31 (swig)
• zlib (zlib1g-dev)
• m4 (m4)
To install all required packages in one go, issue instructions to apt-get:
sudo apt-get install g++ python-dev scons swig zlib1g-dev m4
The simulator framework comes packaged as a gzipped tarball. Start the adventure by unpacking with tar xvzf framework.tar.gz. This will create a directory named framework.
2.2.2 VirtualBox disk image
If you do not have convenient access to a Linux machine, you can download a virtual machine with M5 preconfigured. You can run the virtual machine with VirtualBox, which can be downloaded from http://www.virtualbox.org. The virtual machine is available as a zip archive from the PfJudgeβ website. After unpacking the archive, you can import the virtual machine into VirtualBox by selecting "Import Appliance" in the file menu and opening "Prefetcher framework.ovf".

2.3 Build
M5 uses the scons build system:
scons -j2 ./build/ALPHA_SE/m5.opt
builds the optimized version of the M5 binaries. -j2 specifies that the build process should build two targets in parallel. This is a useful option to cut down on compile time if your machine has several processors or cores. The included build script compile.sh encapsulates the necessary build commands and options.

2.4 Run
Before running M5, it is necessary to specify the architecture and parameters for the simulated system. This is a nontrivial task in itself. Fortunately there is an easy way: use the included example Python script for running M5 in syscall emulation mode, m5/configs/example/se.py. When using a prefetcher with M5, this script needs some extra options, described in Table 2.1. For an overview of all possible options to se.py, do
./build/ALPHA_SE/m5.opt configs/example/se.py --help
When combining all these options, the command line will look something like this:
./build/ALPHA_SE/m5.opt configs/example/se.py --detailed --caches --l2cache --l2size=1MB --prefetcher=policy=proxy --prefetcher=on_access=True
This command will run se.py with a default program, which prints out "Hello, world!" and exits. To run something more complicated, use the --cmd option to specify another program.
Table 2.1: Basic se.py command line options
Option                        Description
--detailed                    Detailed timing simulation
--caches                      Use caches
--l2cache                     Use a level-two cache
--l2size=1MB                  Level-two cache size
--prefetcher=policy=proxy     Use the C-style prefetcher interface
--prefetcher=on_access=True   Have the cache notify the prefetcher on all accesses, both hits and misses
--cmd                         The program (an Alpha binary) to run

See subsection 2.4.2 about cross-compiling binaries for the Alpha architecture. Another possibility is to run a benchmark program, as described in the next section.

2.4.1 CPU2000 benchmark tests
The test_prefetcher.py script can be used to evaluate the performance of your prefetcher against the SPEC CPU2000 benchmarks. It runs a selected suite of CPU2000 tests with your prefetcher, and compares the results to some reference prefetchers. The per-test statistics that M5 generates are written to output/<testname-prefetcher>/stats.txt. The statistics most relevant for hardware prefetching are then filtered and aggregated to a stats.txt file in the framework base directory. See chapter 4 for an explanation of the reported statistics.
Since programs often do some initialization and setup on startup, a sample from the start of a program run is unlikely to be representative for the whole program. It is therefore desirable to begin the performance tests after the program has been running for some time. To save simulation time, M5 can resume a program state from a previously stored checkpoint. The prefetcher framework comes with checkpoints for the CPU2000 benchmarks taken after 10^9 instructions.
It is often useful to run a specific test to reproduce a bug. To run the CPU2000 tests outside of test_prefetcher.py, you will need to set the M5_CPU2000 environment variable. If this is set incorrectly, M5 will give the error message "Unable to find workload". To export this as a shell variable, do
export M5_CPU2000=lib/cpu2000

Near the top of test_prefetcher.py there is a commented-out call to dry_run(). If this is uncommented, test_prefetcher.py will print the command line it would use to run each test. This will typically look like this:

m5/build/ALPHA_SE/m5.opt --remote-gdb-port=0 -re --outdir=output/ammp-user m5/configs/example/se.py --checkpoint-dir=lib/cp --checkpoint-restore=1000000000 --at-instruction --caches --l2cache --standard-switch --warmup-insts=10000000 --max-inst=10000000 --l2size=1MB --bench=ammp --prefetcher=on_access=true:policy=proxy

This uses some additional command line options; these are explained in Table 2.2.

Table 2.2: Advanced se.py command line options
Option                     Description
--bench=ammp               Run one of the SPEC CPU2000 benchmarks.
--checkpoint-dir=lib/cp    The directory where program checkpoints are stored.
--at-instruction           Restore at an instruction count.
--checkpoint-restore=n     The instruction count to restore at.
--standard-switch          Warm up caches with a simple CPU model, then switch to an advanced model to gather statistics.
--warmup-insts=n           Number of instructions to run warmup for.
--max-inst=n               Exit after running this number of instructions.

2.4.2 Running M5 with custom test programs
If you wish to run your self-written test programs with M5, it is necessary to cross-compile them for the Alpha architecture. The easiest way to achieve this is to download the precompiled compiler binaries provided by crosstool from the M5 website. Install the one that fits your host machine best (32- or 64-bit version). When cross-compiling your test program, you must use the -static option to enforce static linkage. To run the cross-compiled Alpha binary with M5, pass it to the script with the --cmd option. Example:

./build/ALPHA_SE/m5.opt configs/example/se.py --detailed --caches --l2cache --l2size=512kB --prefetcher=policy=proxy --prefetcher=on_access=True --cmd /path/to/testprogram
2.5 Submitting the prefetcher for benchmarking
First of all, you need a user account on the PfJudgeβ web pages. The teaching assistant in TDT4260 Computer Architecture will create one for you. You must also be assigned to a group to submit prefetcher code or view earlier submissions.
Sign in with your username and password, then click "Submit prefetcher" in the menu. Select your prefetcher file, and optionally give the submission a name. This is the name that will be shown in the highscore list, so choose with care. If no name is given, it defaults to the name of the uploaded file. If you check "Email on complete", you will receive an email when the results are ready. This could take some time, depending on the cluster's current workload.
When you click "Submit", a job will be sent to the Kongull cluster, which then compiles your prefetcher and runs it with a subset of the CPU2000 tests. You are then shown the "View submissions" page, with a list of all your submissions, the most recent at the top. When the prefetcher is uploaded, the status is "Uploaded". As soon as it is sent to the cluster, it changes to "Compiling". If it compiles successfully, the status will be "Running". If your prefetcher does not compile, the status will be "Compile error"; check "Compilation output" found under the detailed view. When the results are ready, the status will be "Completed", and a score will be given.
The highest-scoring prefetcher for each group is listed on the highscore list, found under "Top prefetchers" in the menu. Click on the prefetcher name to go to a more detailed view, with per-test output and statistics. If the prefetcher crashes on some or all tests, the status will be "Runtime error". To locate the failed tests, check the detailed view. You can take a look at the output from the failed tests by clicking on the "output" link found after each test statistic.
To allow easier exploration of different prefetcher configurations, it is possible to submit several prefetchers at once, bundled into a zipped file. Each .cc file in the archive is submitted independently for testing on the cluster. The submission is named after the compressed source file, possibly prefixed with the name specified in the submission form. There is a limit of 50 prefetchers per archive.
Chapter 3
The prefetcher interface

3.1 Memory model
The simulated architecture is loosely based on the DEC Alpha Tsunami system, specifically the Alpha 21264 microprocessor. This is a superscalar, out-of-order (OoO) CPU which can reorder a large number of instructions and do speculative execution.
The L1 cache is split into a 32 kB instruction cache and a 64 kB data cache. Each cache block is 64 B. The L2 cache size is 1 MB, also with a cache block size of 64 B. The L2 prefetcher is notified on every access to the L2 cache, both hits and misses. There is no prefetching for the L1 cache. The memory bus runs at 400 MHz, is 64 bits wide, and has a latency of 30 ns.

3.2 Interface specification
The interface the prefetcher will use is defined in a header file located at prefetcher/interface.hh. To use the prefetcher interface, you should include interface.hh by putting the line #include "interface.hh" at the top of your source file.

Table 3.1: Interface #defines
#define             Value      Description
BLOCK_SIZE          64         Size of cache blocks (cache lines) in bytes
MAX_QUEUE_SIZE      100        Maximum number of pending prefetch requests
MAX_PHYS_MEM_SIZE   2^28 - 1   The largest possible physical memory address

NOTE: All interface functions that take an address as a parameter block-align the address before issuing requests to the cache.
Table 3.2: Functions called by the simulator
Function                                Description
void prefetch_init(void)                Called before any memory access to let the prefetcher initialize its data structures
void prefetch_access(AccessStat stat)   Notifies the prefetcher about a cache access
void prefetch_complete(Addr addr)       Notifies the prefetcher about a prefetch load that has just completed

Table 3.3: Functions callable from the user-defined prefetcher
Function                                Description
void issue_prefetch(Addr addr)          Called by the prefetcher to initiate a prefetch
int get_prefetch_bit(Addr addr)         Is the prefetch bit set for addr?
int set_prefetch_bit(Addr addr)         Set the prefetch bit for addr
int clear_prefetch_bit(Addr addr)       Clear the prefetch bit for addr
int in_cache(Addr addr)                 Is addr currently in the L2 cache?
int in_mshr_queue(Addr addr)            Is there a prefetch request for addr in the MSHR (miss status holding register) queue?
int current_queue_size(void)            Returns the number of queued prefetch requests
void DPRINTF(trace, format, ...)        Macro to print debug information. trace is a trace flag (HWPrefetch), and format is a printf format string.

Table 3.4: AccessStat members
Member          Description
Addr pc         The address of the instruction that caused the access (program counter)
Addr mem_addr   The memory address that was requested
Tick time       The simulator time cycle when the request was sent
int miss        Whether this demand access was a cache hit or miss
The prefetcher must implement the three functions prefetch_init, prefetch_access and prefetch_complete. The implementation may be empty.
The function prefetch_init(void) is called at the start of the simulation to allow the prefetcher to initialize any data structures it will need.
When the L2 cache is accessed by the CPU (through the L1 cache), the function void prefetch_access(AccessStat stat) is called with an argument (AccessStat stat) that gives various information about the access.
When the prefetcher decides to issue a prefetch request, it should call issue_prefetch(Addr addr), which queues up a prefetch request for the block containing addr. When a cache block that was requested by issue_prefetch arrives from memory, prefetch_complete is called with the address of the completed request as parameter.
Prefetches issued by issue_prefetch(Addr addr) go into a prefetch request queue. The cache will issue requests from the queue when it is not fetching data for the CPU. This queue has a fixed size (available as MAX_QUEUE_SIZE), and when it gets full, the oldest entry is evicted. If you want to check the current size of this queue, use the function current_queue_size(void).

3.3 Using the interface
Start by studying interface.hh. This is the only M5-specific header file you need to include in your source file. You might want to include standard header files for things like printing debug information and memory allocation. Have a look at the supplied example prefetcher (a very simple sequential prefetcher) to see what it does.
If your prefetcher needs to initialize something, prefetch_init is the place to do so. If not, just leave the implementation empty.
You will need to implement the prefetch_access function, which the cache calls when accessed by the CPU. This function takes an argument, AccessStat stat, which supplies information from the cache: the address of the executing instruction that accessed the cache, what memory address was accessed, the cycle tick number, and whether the access was a cache miss. The block size is available as BLOCK_SIZE. Note that you probably will not need all of this information for a specific prefetching algorithm.
If your algorithm decides to issue a prefetch request, it must call the issue_prefetch function with the address to prefetch from as argument. The cache block containing this address is then added to the prefetch request
queue. This queue has a fixed limit of MAX_QUEUE_SIZE pending prefetch requests. Unless your prefetcher is using a high degree of prefetching, the number of outstanding prefetches will stay well below this limit.
Every time the cache has loaded a block requested by the prefetcher, prefetch_complete is called with the address of the loaded block.
Other functionality available through the interface are the functions for getting, setting and clearing the prefetch bit. Each cache block has one such tag bit. You are free to use this bit as you see fit in your algorithms. Note that this bit is not automatically set if the block has been prefetched; it has to be set manually by calling set_prefetch_bit. set_prefetch_bit on an address that is not in cache has no effect, and get_prefetch_bit on an address that is not in cache will always return false.
When you are ready to write code for your prefetching algorithm of choice, put it in prefetcher/prefetcher.cc. When you have several prefetchers, you may want to make prefetcher.cc a symlink. The prefetcher is statically compiled into M5. After prefetcher.cc has been changed, recompile with ./compile.sh. No options needed.
3.3.1 Example prefetcher

/*
 * A sample prefetcher which does sequential one-block lookahead.
 * This means that the prefetcher fetches the next block _after_ the one that
 * was just accessed. It also ignores requests to blocks already in the cache.
 */
#include "interface.hh"

void prefetch_init(void)
{
    /* Called before any calls to prefetch_access. */
    /* This is the place to initialize data structures. */
    DPRINTF(HWPrefetch, "Initialized sequential-on-access prefetcher\n");
}

void prefetch_access(AccessStat stat)
{
    /* pf_addr is now an address within the _next_ cache block */
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /*
     * Issue a prefetch request if a demand miss occurred,
     * and the block is not already in cache.
     */
    if (stat.miss && !in_cache(pf_addr)) {
        issue_prefetch(pf_addr);
    }
}

void prefetch_complete(Addr addr)
{
    /*
     * Called when a block requested by the prefetcher has been loaded.
     */
}
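Building on the supplied example, here is a sketch of a slightly more elaborate prefetcher written against the same interface: a per-PC stride detector. The table size and the StrideEntry structure are choices made for this illustration; only the interface calls (prefetch_access, issue_prefetch, in_cache, in_mshr_queue, DPRINTF, MAX_PHYS_MEM_SIZE, the Addr type and the AccessStat fields) come from this documentation.

/*
 * Illustrative stride prefetcher (not supplied with the framework).
 * For each load/store PC we remember the last address and the last observed
 * stride; when the same non-zero stride is seen twice in a row, we prefetch
 * one stride ahead of the current access.
 */
#include "interface.hh"

#define TABLE_SIZE 256

struct StrideEntry {
    Addr pc;         /* PC that owns this entry            */
    Addr last_addr;  /* last address accessed by this PC   */
    long stride;     /* last observed stride               */
    int  confident;  /* same stride seen twice in a row?   */
};

static struct StrideEntry table[TABLE_SIZE];

void prefetch_init(void)
{
    /* Zero-initialized static storage is sufficient here. */
    DPRINTF(HWPrefetch, "Initialized stride prefetcher\n");
}

void prefetch_access(AccessStat stat)
{
    struct StrideEntry *e = &table[(stat.pc / 4) % TABLE_SIZE];

    if (e->pc == stat.pc) {
        long stride = (long)stat.mem_addr - (long)e->last_addr;
        e->confident = (stride != 0 && stride == e->stride);
        e->stride = stride;
    } else {
        /* A new PC maps to this entry: reset it. */
        e->pc = stat.pc;
        e->stride = 0;
        e->confident = 0;
    }
    e->last_addr = stat.mem_addr;

    if (e->confident) {
        Addr pf_addr = stat.mem_addr + e->stride;
        if (pf_addr <= MAX_PHYS_MEM_SIZE &&
            !in_cache(pf_addr) && !in_mshr_queue(pf_addr)) {
            issue_prefetch(pf_addr);
        }
    }
}

void prefetch_complete(Addr addr)
{
    /* Nothing to do in this sketch. */
}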
Chapter 4
Statistics

This chapter gives an overview of the statistics by which your prefetcher is measured and ranked.

IPC: instructions per cycle. Since we are using a superscalar architecture, IPC rates > 1 are possible.

Speedup: a commonly used proxy for overall performance when running benchmark test suites.

    speedup = execution_time(no prefetcher) / execution_time(with prefetcher)
            = IPC(with prefetcher) / IPC(no prefetcher)

Good prefetch: the prefetched block is referenced by the application before it is replaced.

Bad prefetch: the prefetched block is replaced without being referenced.

Accuracy: measures the fraction of useful prefetches issued by the prefetcher.

    acc = good prefetches / total prefetches

Coverage: how many of the potential candidates for prefetches were actually identified by the prefetcher?

    cov = good prefetches / cache misses without prefetching

Identified: number of prefetches generated and queued by the prefetcher.
    Issued Number ofprefetches issued by the cache controller. This can be significantly less than the number of identified prefetches, due to duplicate prefetches already found in the prefetch queue, duplicate prefetches found in the MSHR queue, and prefetches dropped due to a full prefetch queue. Misses Total number of L2 cache misses. Degree of prefetching Number of blocks fetched from memory in a single prefetch request. Harmonic mean A kind of average used to aggregate each benchmark speedup score into a final average speedup. n n Havg = 1 1 1 = n 1 x1 + x2 + ... + xn i=1 xi 15
Chapter 5  Debugging the prefetcher

5.1 m5.debug and trace flags

When debugging M5 it is best to use binaries built with debugging support (m5.debug) instead of the standard build (m5.opt). So let us start by recompiling M5 to be better suited to debugging:

    scons -j2 ./build/ALPHA_SE/m5.debug

To see in detail what is going on inside M5, one can enable trace flags, which selectively enable output from specific parts of M5. The most useful flag when debugging a prefetcher is HWPrefetch. Pass the option --trace-flags=HWPrefetch to M5:

    ./build/ALPHA_SE/m5.debug --trace-flags=HWPrefetch [...]

Warning: this can produce a lot of output! It might be better to redirect stdout to a file when running with --trace-flags enabled.

5.2 GDB

The GNU Project Debugger, gdb, can be used to inspect the state of the simulator while it is running, and to investigate the cause of a crash. Pass GDB the executable you want to debug when starting it:

    gdb --args m5/build/ALPHA_SE/m5.debug --remote-gdb-port=0 -re --outdir=output/ammp-user m5/configs/example/se.py --checkpoint-dir=lib/cp --checkpoint-restore=1000000000 --at-instruction --caches --l2cache --standard-switch --warmup-insts=10000000 --max-inst=10000000 --l2size=1MB --bench=ammp --prefetcher=on_access=true:policy=proxy

You can then use the run command to start the executable.
Some useful GDB commands:

    run <args>    Restart the executable with the given command line arguments.
    run           Restart the executable with the same arguments as last time.
    where         Show the stack trace.
    up            Move up one stack frame.
    down          Move down one stack frame.
    print <expr>  Print the value of an expression.
    help          Get help for commands.
    quit          Exit GDB.

GDB has many other useful features; for more information, consult the GDB User Manual at http://sourceware.org/gdb/current/onlinedocs/gdb/.

5.3 Valgrind

Valgrind is a very useful tool for memory debugging and memory leak detection. If your prefetcher causes M5 to crash or behave strangely, it is useful to run it under Valgrind and see if it reports any potential problems.

By default, M5 uses a custom memory allocator instead of malloc. This does not work with Valgrind, which tracks memory by replacing malloc with its own instrumented allocator. Fortunately, M5 can be recompiled with NO_FAST_ALLOC=True to use normal malloc:

    scons NO_FAST_ALLOC=True ./m5/build/ALPHA_SE/m5.debug

To avoid spurious warnings from Valgrind, it can be fed a file with warning suppressions. To run M5 under Valgrind, use

    valgrind --suppressions=lib/valgrind.suppressions ./m5/build/ALPHA_SE/m5.debug [...]

Note that everything runs much slower under Valgrind.
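In addition to these external tools, extra trace output can be added directly in the prefetcher with the DPRINTF macro already used in the example code; the messages only appear when M5 is run with --trace-flags=HWPrefetch. The snippet below is only an illustrative sketch of such instrumentation added to the sequential example's prefetch_access; the extra DPRINTF calls are not part of the handed-out code.

#include "interface.hh"

void prefetch_access(AccessStat stat)
{
    Addr pf_addr = stat.mem_addr + BLOCK_SIZE;

    /* Trace every access; only printed when HWPrefetch tracing is enabled. */
    DPRINTF(HWPrefetch, "access addr=%#x miss=%d\n", stat.mem_addr, stat.miss);

    if (stat.miss && !in_cache(pf_addr)) {
        DPRINTF(HWPrefetch, "issuing prefetch for %#x\n", pf_addr);
        issue_prefetch(pf_addr);
    }
}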
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Course responsible: Professor Lasse Natvig
Quality assurance of the exam: PhD Jon Olav Hauglid
Contact person during the exam: Magnus Jahre
Deadline for examination results: 23rd of June 2009

EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Tuesday 2nd of June 2009, Time: 0900-1300

Supporting materials: No written or handwritten examination support materials are permitted. A specified, simple calculator is permitted.

By answering in short sentences it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise. The exam counts for 80% of the total evaluation in the course. The maximum score is therefore 80 points.

Exercise 1) Instruction level parallelism (Max 10 points)

a) (Max 5 points) What is the difference between (true) data dependencies and name dependencies? Which of the two presents the most serious problem? Explain why such dependencies will not always result in a data hazard.

Solution sketch: True data dependency: one instruction reads what an earlier instruction has written (data flows) (RAW). Name dependency: two instructions use the same register or memory location, but there is no flow of data between them; one instruction writes what an earlier instruction has read (WAR) or written (WAW). True data dependencies are the most serious problem, as name dependencies can be removed by register renaming. Also, many pipelines are designed so that name dependencies will not cause a hazard. A dependency between two instructions will only result in a data hazard if the instructions are close enough together and the processor executes them out of order.

b) (Max 5 points) Explain why loop unrolling can improve performance. Are there any potential downsides to using loop unrolling?

Solution sketch: Loop unrolling can improve performance by reducing the loop overhead (e.g. the loop overhead instructions are executed once every 4th element rather than for each element). It also makes it possible for scheduling techniques to further improve the instruction order, as instructions for different elements (iterations) can now be interchanged. Downsides include increased code size, which may lead to more cache misses, and an increased number of registers used.
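As an illustration of the unrolling idea (not part of the original exam solution), the C++ loop below is unrolled by a factor of four: the loop overhead (compare, branch, index update) is paid once per four elements, and the four independent statements give the compiler's scheduler more freedom. A remainder loop handles values of n that are not a multiple of four, which is one source of the code-size increase mentioned above.

// Illustrative example: y[i] += a * x[i], unrolled by 4.
void saxpy_unrolled(float a, const float *x, float *y, int n)
{
    int i = 0;
    // Unrolled body: loop overhead paid once per four elements.
    for (; i + 3 < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    // Remainder loop for the last n % 4 elements.
    for (; i < n; i++)
        y[i] += a * x[i];
}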
Exercise 2) Multithreading (Max 15 points)

a) (Max 5 points) What are the differences between fine-grained and coarse-grained multithreading?

Solution sketch: Fine-grained: switch between threads after each instruction. Coarse-grained: switch only on costly stalls (e.g. cache misses).

b) (Max 5 points) Can techniques for instruction level parallelism (ILP) and thread level parallelism (TLP) be used simultaneously? Why/why not?

Solution sketch: ILP and TLP can be used simultaneously. TLP exploits parallelism between different threads, while ILP exploits parallelism inside a single instruction stream/thread.

c) (Max 5 points) Assume that you are asked to redesign a processor from single threaded to simultaneous multithreading (SMT). How would that change the requirements for the caches? (I.e., what would you look at to ensure that the caches would not degrade performance when moving to SMT?)

Solution sketch: Several threads executing at once will lead to increased cache traffic and more cache conflicts. Techniques that could help: increased cache size, more cache ports/banks, higher associativity, non-blocking caches.

Exercise 3) Multiprocessors (Max 15 points)

a) (Max 5 points) Give a short example illustrating the cache coherence problem for multiprocessors.

Solution sketch: See Figure 4.3 on page 206 of the textbook. (A reads X, B reads X, A stores X, B now has an inconsistent value for X.)

b) (Max 5 points) Why does bus snooping scale badly with the number of processors? Discuss how cache block size could influence the choice between write invalidate and write update.

Solution sketch: Bus snooping relies on a common bus where information is broadcast. As the number of devices increases, this common medium becomes a bottleneck. Invalidates are done at cache block level, while updates are done on individual words. False sharing coherence misses only appear when using write invalidate with block sizes larger than
one word. So as cache block size increases, the number of false sharing coherence misses will increase, thereby making write update increasingly more appealing.

c) (Max 5 points) What makes the architecture of UltraSPARC T1 ("Niagara") different from most other processor architectures?

Solution sketch: High focus on TLP, low focus on ILP. Poor single-thread performance, but great multithread performance. Thread switch on any stall. Short pipeline, in-order, no branch prediction.

Exercise 4) Memory, vector processors and networks (Max 15 points)

a) (Max 5 points) Briefly describe 5 different optimizations of cache performance.

Solution sketch: (1 point per optimization) 6 techniques listed on page 291 in the textbook, 11 more in Section 5.2 on page 293.

b) (Max 5 points) What makes vector processors fast at executing a vector operation?

Solution sketch: A vector operation can be executed with a single instruction, reducing code size and improving cache utilization. Further, the single instruction has none of the loop overhead and control dependencies that a scalar processor would have. Hazard checks can also be done per vector, rather than per element. A vector processor also contains deep pipelines especially designed for vector operations.

c) (Max 5 points) Discuss how the number of devices to be connected influences the choice of topology.

Solution sketch: This is a classic example of performance vs. cost. Different topologies scale differently with respect to performance and cost as the number of devices grows. A crossbar scales well in performance, but badly in cost. A ring or bus scales badly in performance, but well in cost.

Exercise 5) Multicore architectures and programming (Max 25 points)

a) (Max 6 points) Explain briefly the research method called design space exploration (DSE). When doing DSE, explain how a cache sensitive application can be made processor bound, and how it can be made bandwidth bound.

Solution sketch: (Lecture 10, slide 4) DSE is to try out different points in an n-dimensional space of possible designs, where n is the number of main design parameters, such as the number of cores, core types (in-order vs. out-of-order, etc.), cache size, etc. Cache sensitive applications can become processor bound by
increasing the cache size, and they can be made bandwidth bound by decreasing it.

b) (Max 5 points) In connection with GPU programming (shader programming), David Blythe uses the concept "computational coherence". Explain it briefly.

LF: See lecture 10, slide 36, and possibly the paper.

c) (Max 8 points) Give an overview of the architecture of the Cell processor.

Solution sketch: All details of this figure are not expected, only the main elements.
* One main processor (Power architecture, called PPE = Power Processing Element), which acts as a host (master) processor. (Power architecture, 64 bit, in-order two-issue superscalar, SMT (simultaneous multithreading). Has a vector media extension (VMX). (Kahle, figure 2))
* 8 identical SIMD processors (called SPE = Synergistic Processing Element); each of these consists of an SPU (Synergistic Processor Unit) and local storage (LS, 256 KB SRAM, not a cache). On-chip memory controller and bus interface. (Can operate on integers in different formats, 8, 16 and 32 bit, and on floating point numbers in 32 and 64 bit (64-bit floats in a later version).)
* The interconnect is a ring bus (Element Interconnect Bus, EIB) that connects the PPE and the 8 SPEs, with two unidirectional busses in each direction. Worst case latency is half the ring distance; it can support up to three simultaneous transfers.
* Highly programmable DMA controller.

d) (Max 6 points) The Cell design team made several design decisions that were motivated by a wish to make it easier to develop programs with predictable (more deterministic) processing time (performance). Describe two of these.

Solution sketch:
1) They discarded the common out-of-order execution in the Power processor and developed a simpler in-order processor.
2) The local store memory (LS) in the SPE processing elements does not use HW cache-coherence snooping protocols, to avoid the indeterminate nature of cache misses; the programmer handles memory in a more explicit way.
3) Also, the large number of registers (128) might help make the processing more deterministic with respect to execution time.
4) Extensive timers and counters (probably performance counters) that may be used by the SW/programmer to monitor/adjust/control performance.

…---oooOOOooo---…
Norwegian University of Science and Technology (NTNU)
DEPT. OF COMPUTER AND INFORMATION SCIENCE (IDI)
Contact person for questions regarding exam exercises: Lasse Natvig, phone 906 44 580

EXAM IN COURSE TDT4260 COMPUTER ARCHITECTURE
Monday 26th of May 2008, Time: 0900-1300
Solution sketches in blue text

Supporting materials: No handwritten or printed materials allowed; a simple specified calculator is allowed.

By answering in short sentences it is easier to cover all exercises within the duration of the exam. The numbers in parentheses indicate the maximum score for each exercise. We recommend that you start by reading through all the sub-questions before answering each exercise. The exam counts for 80% of the total evaluation in the course. The maximum score is therefore 80 points.

Exercise 1) Parallel Architecture (Max 25 points)

a) (Max 5 points) The feature size of integrated circuits is now often 65 nanometres or smaller, and it is still decreasing. Explain briefly how the number of transistors on a chip and the wire delay change with shrinking feature size.

Solution sketch: The number of transistors can be 4 times larger when the feature size is halved. However, the wire delay does not improve (it scales poorly). (The textbook, page 17, gives more details, but here we ask only for the main trends.)

b) (Max 5 points) In a cache coherent multiprocessor, the concepts migration and replication of shared data items are central. Explain both concepts briefly, and also how they influence the latency of accesses to shared data and the bandwidth demand on the shared memory.

Solution sketch: Migration means that data move to a place closer to the requesting/accessing unit. Replication simply means storing several copies. Having a local copy in general means faster access, and it is harmless to have several copies of read-only data. (Textbook page 207)

c) (Max 5 points) Explain briefly how a write buffer can be used in cache systems to increase performance. Explain also what "write merging" is in this context.

Solution sketch: The main purpose of the write buffer is to temporarily store data that are evicted from the cache so new data can reuse the cache space as fast as possible, i.e. to avoid waiting for the latency of the memory one level further away from the processor. If several writes go to the same cache block (address), these writes can be combined, resulting in reduced traffic towards the next memory level. (Textbook page 300) ((Also slides 11-6-3)) // Grading: 3 points for understanding of the write buffer and 2 for write merging.
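To make the write-merging idea concrete, here is a minimal, purely illustrative C++ sketch (not from the textbook or the exam) of a write buffer that merges a new write into an existing entry when the write falls into a cache block that already has a buffered entry; all names and sizes are invented for the example.

#include <array>
#include <cstddef>
#include <cstdint>

// Arbitrary example parameters.
constexpr uint64_t    kBlockSize = 64;   // bytes per cache block
constexpr std::size_t kEntries   = 4;    // write-buffer entries

struct WriteBufferEntry {
    bool     valid = false;
    uint64_t block_addr = 0;             // block-aligned address
    uint8_t  data[kBlockSize] = {};      // buffered bytes
    bool     dirty[kBlockSize] = {};     // which bytes are valid
};

struct WriteBuffer {
    std::array<WriteBufferEntry, kEntries> entries{};

    // Returns true if the write was buffered (merged or new entry),
    // false if the buffer is full and the write must wait for a drain.
    bool write(uint64_t addr, uint8_t byte) {
        uint64_t block  = addr & ~(kBlockSize - 1);
        uint64_t offset = addr & (kBlockSize - 1);

        // Write merging: reuse an entry that already holds this block.
        for (auto &e : entries) {
            if (e.valid && e.block_addr == block) {
                e.data[offset]  = byte;
                e.dirty[offset] = true;
                return true;
            }
        }
        // Otherwise allocate a free entry.
        for (auto &e : entries) {
            if (!e.valid) {
                e = WriteBufferEntry{};
                e.valid = true;
                e.block_addr = block;
                e.data[offset]  = byte;
                e.dirty[offset] = true;
                return true;
            }
        }
        return false;  // buffer full: stall until an entry drains to memory
    }
};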
d) (Max 5 points) Sketch a figure that shows how a hypercube with 16 nodes is built by combining two smaller hypercubes. Compare the hypercube topology with the 2-dimensional mesh topology with respect to connectivity and node cost (number of links/ports per node).

Solution sketch: (Figure E-14 c) A mesh has a fixed degree of connectivity and in general becomes slower when the number of nodes is increased, since the average number of hops needed to reach another node increases. For a hypercube it is the other way around: the connectivity increases for larger networks, so the communication time does not increase much, but the node cost also increases. When going to a larger network, increasing the dimension, every node must be extended with a new port, and this is a drawback when it comes to building computers using such networks.

e) (Max 5 points) When messages are sent between nodes in a multiprocessor, two possible strategies are source routing and distributed routing. Explain the difference between these two.

Solution sketch: For source routing, the entire routing path is precomputed by the source (possibly by table lookup) and placed in the packet header. This usually consists of the output port or ports supplied for each switch along the predetermined path from the source to the destination, which can be stripped off by the routing control mechanism at each switch. An additional bit field can be included in the header to signify whether adaptive routing is allowed (i.e., that any one of the supplied output ports can be used). For distributed routing, the routing information usually consists of the destination address. This is used by the routing control mechanism in each switch along the path to determine the next output port, either by computing it with a finite-state machine or by looking it up in a local routing table (i.e., forwarding table). (Textbook page E-48)

Exercise 2) Parallel processing (Max 15 points)

a) (Max 5 points) Explain briefly the main difference between a VLIW processor and a dynamically scheduled superscalar processor. Include the role of the compiler in your explanation.

Solution sketch: In a VLIW processor, parallel execution of several operations is scheduled (analysed and planned) at compile time and assembled into very long/wide instructions. (Such work done at compile time is often called static.) In a dynamically scheduled superscalar processor, dependency and resource analysis are done at run time (dynamically) to find opportunities to execute operations in parallel. (Textbook page 114 onwards, and the VLIW paper)

b) (Max 5 points) What function has the vector mask register in a vector processor?

Solution sketch: If you want to update just a subset of the elements in a vector register, e.g. to implement IF A[i] != 0 THEN A[i] = A[i] - B[i] for (i = 0..n) in a simple way, this can be done by setting the vector mask register to 1 only for the elements with A[i] != 0. In this way, the vector instruction A = A - B can be performed without testing every element explicitly.
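As a purely illustrative addition (not part of the exam solution), the scalar C++ code below emulates what the masked vector operation does: the first loop corresponds to a vector compare that sets the vector mask register, and the second to a masked vector subtract that updates only the elements whose mask bit is set, with no per-element branching inside the vector instruction itself.

#include <cstddef>
#include <vector>

// Scalar emulation of IF A[i] != 0 THEN A[i] = A[i] - B[i] using a mask.
void masked_subtract(std::vector<double> &A, const std::vector<double> &B)
{
    const std::size_t n = A.size();
    std::vector<bool> mask(n);

    // 1) Compare: corresponds to setting the vector mask register.
    for (std::size_t i = 0; i < n; i++)
        mask[i] = (A[i] != 0.0);

    // 2) Masked subtract: only elements with the mask bit set are updated.
    for (std::size_t i = 0; i < n; i++)
        if (mask[i])
            A[i] -= B[i];
}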
c) (Max 5 points) Explain briefly the principle of vector chaining in vector processors.

Solution sketch: The execution of instructions using several/different functional and memory pipelines can be chained together, directly or through vector registers. The chaining forms one longer pipeline. (This is the technique of forwarding, used in processors as in Tomasulo's algorithm, extended to vector registers.) (Textbook F-23) ((Lecture 9, slide 20)) // should be checked

Exercise 3) Multicore processors (Max 20 points)

a) (Max 5 points) In the paper Chip Multithreading: Opportunities and Challenges by Spracklen & Abraham, the concept chip multithreaded processor (CMT) is described. The authors describe three generations of CMT processors. Describe each of these briefly. Make simple drawings if you like.

Solution sketch: 1st generation: typically 2 cores per chip, every core is a traditional processor core, no shared resources except the off-chip bandwidth. 2nd generation: shared L2 cache, but still traditional processor cores. 3rd generation: as the 2nd generation, but the cores are now custom-made for use in a CMP, and might also use simultaneous multithreading (SMT). (This description is a bit "biased" and colored by the background of the authors (at Sun Microsystems), who were involved in the design of Niagara 1 and 2 (T1).) // Fig. 1 in the paper, and slides // Was a sub-exercise in May 2007

b) (Max 5 points) Outline the main architecture of SUN's T1 (Niagara) multicore processor. Describe the placement of the L1 and L2 caches, as well as how the L1 caches are kept coherent.

Solution sketch: Figure 4.24 on page 250 in the textbook shows 8 cores, each with its own L1 cache (described in the text), 4 L2 cache banks, each having a channel to external memory, one FPU unit, and a crossbar as interconnect. Coherence
is maintained by a catalog (directory) associated with each L2 cache, which knows which L1 caches have a copy of data in that L2 cache. // Textbook pages 249-250, also lecture

c) (Max 6 points) In the paper Exploring the Design Space of Future CMPs the authors perform a design space exploration where several main architectural parameters are varied, assuming a fixed total chip area of 400 mm2. Outline the approach by explaining the following figure.

Solution sketch: Technology-independent area models, found empirically; core area and cache area are measured in cache byte equivalents (CBE). Study the relative costs in area versus the associated performance gains, i.e. maximize performance per unit area for future technology generations. With smaller feature sizes, the available area for cache banks and processing cores increases. Table 3 displays die area in terms of cache byte equivalents (CBE), and the PIN and POUT columns show how many of each type of processor with 32 KB separate L1 instruction and data caches could be implemented on the chip if no L2 cache area were required. (PIN is a simple in-order-execution processor, POUT is a larger out-of-order execution processor.) And, for reference, lambda squared, where lambda is equal to one half of the feature size. The primary goal of the paper is to determine the best balance between per-processor cache area, area consumed by different processor organizations, and the number of cores on a single die. LF: New exercise / medium/difficult / slides 1-6 and 2-3

d) (Max 4 points) Explain the argument of the authors of the paper Exploring the Design Space of Future CMPs that we in the future may have chips with useless area that performs no other function than as a placeholder for pin area.

Solution sketch: As applications become bandwidth bound and global wire delays increase, an interesting scenario may arise. It is likely that monolithic caches cannot be grown past a certain point in 50 or 35 nm technologies, since the wire delays will make them too slow. It is also likely that, given a ceiling on cache size, off-chip bandwidth will limit the number of cores. Thus, there may be useless area on the chip which cannot be used for cache or processing logic, and which performs no function other than as a placeholder for pin area. That area may be useful for compression engines, or intelligent controllers to manage the caches and memory channels. (From lecture 8, slide 6 on page 4)

Exercise 4) Research prototypes (Max 20 points)

a) (Max 5 points) Sketch a figure of the main system structure of the Manchester Dataflow Machine (MDM). Include the following units: Matching Unit, Token Queue, I/O Switch, Instruction Store, Overflow Unit and Processing Unit. Show also how these are connected.

Solution sketch: See figure 5 in the paper, and the slides. The Overflow Unit is coupled to the Matching Unit, in parallel.
[Figure: the MDM structure, with the I/O Switch (Input/Output), Token Queue, Matching Unit, Instruction Store and Processing Unit (P0...P19) connected in a ring.]

b) (Max 5 points) What was the function of the Overflow Unit in MDM? Explain very briefly how it was implemented.

Solution sketch: If an operand does not find its corresponding operand in the Matching Unit (MU), and there is no space in the MU to store it (to wait for the other operand), the operand is stored in the overflow store. This is a separate and much slower subsystem with much larger storage capacity. It is composed of a separate overflow bus, memory and a microcoded processor, in other words a SW solution. See also figure 7 in the paper.

c) (Max 5 points) In the paper The Stanford FLASH Multiprocessor by Kuskin et al., the FLASH computer is described. FLASH is an abbreviation for FLexible Architecture for SHared memory. What kind of flexibility was the main goal of the project?

Solution sketch: Flexibility in programming paradigm: the choice between distributed shared memory (DSM), i.e. cache coherent shared memory, and message passing, but also other alternative ways of communication between the nodes could be explored.

d) (Max 5 points) Outline the main architecture of a node in a FLASH system. What was the most central design choice to achieve this flexibility?

Solution sketch: Fig. 2.1 explains much of this. The processing elements are interconnected in a mesh. The most central design choice was the MAGIC unit, a specially designed node controller. All memory accesses go through it, and it can, as an example, implement a cache-coherence protocol. Every node is identical. The whole computer has one single address space, but the memory is physically distributed.

---oooOOOooo---