Carnegie Mellon
What Will We Learn?
Tricks invented over the years
Deep Pipelining
Branch Prediction
Superscalar Processors
Out of Order Processors
Register Renaming
SIMD
Multithreading
Multiprocessors
A short history of interesting processors
Deep Pipelining
Idea: Pipelining is good, so let us pipeline the processor as
much as possible
MHz wars (until the mid-2000s): 10–20 stages became typical
Number of stages limited by:
Pipeline hazards (penalty of branch misprediction increases)
Sequencing overhead (setup and propagation delays of flip-flops)
Power (faster clock rate, more activity)
Cost (larger area)
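The sequencing-overhead limit can be made concrete with a small calculation. This is a sketch with illustrative, assumed numbers (10 ns of combinational logic, 0.5 ns of flip-flop overhead per stage), not figures from any real design:

```python
# Sketch: why sequencing overhead limits useful pipeline depth.
# Assumed (illustrative) numbers: 10 ns of total combinational logic,
# 0.5 ns of flip-flop setup + propagation overhead per stage.
LOGIC_DELAY_NS = 10.0
OVERHEAD_NS = 0.5

def cycle_time(stages: int) -> float:
    """Clock period if the logic splits evenly across `stages` stages."""
    return LOGIC_DELAY_NS / stages + OVERHEAD_NS

def speedup(stages: int) -> float:
    """Clock-rate speedup relative to the unpipelined (1-stage) design."""
    return cycle_time(1) / cycle_time(stages)

for n in (1, 5, 10, 20, 40):
    print(f"{n:2d} stages: cycle = {cycle_time(n):.2f} ns, speedup = {speedup(n):.1f}x")
```

Doubling the depth from 20 to 40 stages here raises the clock rate by only about a third, because the per-stage overhead no longer shrinks: the classic diminishing return of deep pipelining.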
Branch Prediction
Ideal pipelined processor: CPI = 1
Branch misprediction increases CPI
Static branch prediction:
Check direction of branch (forward or backward)
If backward, predict taken
Otherwise, predict not taken
Dynamic branch prediction:
Keep history of last (several hundred) branches in a branch target
buffer which holds:
Branch destination
Whether branch was taken
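The branch target buffer described above can be sketched as a small table indexed by the branch's PC. This is a minimal Python model; the class and field names are illustrative, not from any real design:

```python
# Minimal branch-target-buffer sketch: maps a branch's PC to its
# destination and whether it was taken last time.
# (Structure and names are illustrative.)
class BTB:
    def __init__(self, capacity=512):
        self.capacity = capacity          # "last several hundred branches"
        self.entries = {}                 # pc -> (target, taken)

    def predict(self, pc):
        """Return (predicted_taken, target), or (False, None) on a miss."""
        if pc in self.entries:
            target, taken = self.entries[pc]
            return taken, target
        return False, None                # untracked branch: predict not taken

    def update(self, pc, target, taken):
        """Record the branch outcome after it resolves."""
        if pc not in self.entries and len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # evict the oldest entry
        self.entries[pc] = (target, taken)

btb = BTB()
btb.update(0x400048, 0x400060, taken=True)
print(btb.predict(0x400048))   # predicted taken, with the cached destination
```

The per-entry taken bit is exactly the 1-bit predictor of the next slide; real designs replace it with a 2-bit saturating counter.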
Branch Prediction Example
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
1-Bit Branch Predictor
Remembers whether branch was taken the last time and
does the same thing
Mispredicts first and last branch of loop
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
2-Bit Branch Predictor
Only mispredicts last branch of loop
[Figure: 2-bit predictor state machine. Four states: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), strongly not taken (predict not taken). Each taken branch moves the state one step toward strongly taken; each not-taken branch moves it one step toward strongly not taken.]
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
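Both predictors can be checked with a small simulation of the loop's branch outcomes: the beq is not taken for i = 0…9 and taken on exit. A sketch, assuming the 1-bit predictor starts at "not taken" and the 2-bit counter at "strongly not taken"; the loop is run twice so each predictor enters the second run in the state the first run left behind:

```python
# Outcome stream for beq: not taken 10 times, then taken on loop exit.
# Running the loop twice lets each predictor reach steady state.
RUN = [False] * 10 + [True]
OUTCOMES = RUN * 2

def mispredicts_1bit(outcomes):
    """1-bit predictor: predict whatever the branch did last time."""
    pred, miss = False, []
    for taken in outcomes:
        miss.append(pred != taken)
        pred = taken
    return miss

def mispredicts_2bit(outcomes):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""
    state, miss = 0, []
    for taken in outcomes:
        miss.append((state >= 2) != taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss

second_run = slice(len(RUN), None)
print(sum(mispredicts_1bit(OUTCOMES)[second_run]))  # 2: first and last branch
print(sum(mispredicts_2bit(OUTCOMES)[second_run]))  # 1: only the last branch
```

The 2-bit counter survives the single taken exit branch without flipping its prediction, which is exactly why it only mispredicts the last branch of the loop.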
Superscalar
Multiple copies of the datapath: can issue multiple
instructions per cycle
Dependencies make it tricky to issue multiple instructions
at once
[Figure: two-way superscalar datapath. The instruction memory fetches two instructions per cycle, the register file has four read ports and two write ports, two ALUs execute in parallel, and the data memory has two ports.]
Here: Ideal IPC = 2
Superscalar Example
lw $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3
and $t3, $s3, $s4
or $t4, $s1, $s5
sw $s5, 80($s0)
[Figure: pipeline diagram. With no dependencies, the instructions issue in pairs: lw and add in cycle 1, sub and and in cycle 2, or and sw in cycle 3.]
Ideal IPC = 2
Actual IPC = 2 (6 instructions issued in 3 cycles)
Superscalar Example with Dependencies
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
[Figure: pipeline diagram with a stall. lw issues alone in cycle 1; add must wait two cycles for the load result $t0, so add and sub issue in cycle 3; and and or issue in cycle 4; sw waits for $t3 from or and issues in cycle 5.]
Ideal IPC = 2
Actual IPC = 1.2 (6 instructions issued in 5 cycles)
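The 5-cycle schedule can be reproduced with a small in-order dual-issue model. This is a sketch with simplified timing assumptions (full forwarding: an ALU result is usable one cycle after issue, a load result two cycles after), not a model of any particular machine:

```python
# In-order dual-issue sketch. Assumptions: full forwarding (an ALU result
# is usable the cycle after it issues, a load result two cycles after),
# up to two instructions issue per cycle, and issue stays in program order.
# Each instruction: (name, dest or None, source registers, is_load)
PROGRAM = [
    ("lw",  "$t0", ["$s0"],        True),
    ("add", "$t1", ["$t0", "$s1"], False),
    ("sub", "$t0", ["$s2", "$s3"], False),
    ("and", "$t2", ["$s4", "$t0"], False),
    ("or",  "$t3", ["$s5", "$s6"], False),
    ("sw",  None,  ["$s7", "$t3"], False),
]

def schedule(program, width=2):
    ready = {}                  # register -> first cycle its value is usable
    issue_cycles = []
    cycle, slots = 1, 0
    for name, dest, srcs, is_load in program:
        earliest = max([cycle] + [ready.get(r, 1) for r in srcs])
        if earliest > cycle or slots == width:       # stall, or issue slots full
            cycle, slots = max(earliest, cycle + (slots == width)), 0
        issue_cycles.append((name, cycle))
        slots += 1
        if dest:
            ready[dest] = cycle + (2 if is_load else 1)
    return issue_cycles

sched = schedule(PROGRAM)
print(sched)                               # lw@1, add@3, sub@3, and@4, or@4, sw@5
total = max(c for _, c in sched)
print(f"IPC = {len(PROGRAM) / total}")     # 6 / 5 = 1.2
```

Cycle 2 goes entirely unused: the load-use delay on $t0 blocks add, and in-order issue keeps everything behind add from moving up.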
Out of Order Processor
Looks ahead across multiple instructions to issue as many as
possible at once
Issues instructions out of order as long as dependencies allow
Dependencies:
RAW (read after write): one instruction writes, and later instruction
reads a register
WAR (write after read): one instruction reads, and a later instruction
writes a register (also called an antidependence)
WAW (write after write): one instruction writes, and a later instruction
writes a register (also called an output dependence)
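The three dependence types can be sketched as a small classifier over instruction pairs, each described by its destination register and source registers (the representation is illustrative):

```python
# Sketch: classify the dependence between an earlier and a later
# instruction, each described as (destination register, source registers).
def classify(earlier, later):
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = []
    if e_dest and e_dest in l_srcs:
        kinds.append("RAW")                    # true dependence
    if l_dest and l_dest in e_srcs:
        kinds.append("WAR")                    # antidependence
    if e_dest and l_dest and e_dest == l_dest:
        kinds.append("WAW")                    # output dependence
    return kinds

# lw $t0, 40($s0) then add $t1, $t0, $s1 -> RAW on $t0
print(classify(("$t0", ["$s0"]), ("$t1", ["$t0", "$s1"])))         # ['RAW']
# add $t1, $t0, $s1 then sub $t0, $s2, $s3 -> WAR on $t0
print(classify(("$t1", ["$t0", "$s1"]), ("$t0", ["$s2", "$s3"])))  # ['WAR']
```

Only RAW is a true data dependence; WAR and WAW are name conflicts that register renaming can remove.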
Out of Order Processor
Instruction-level parallelism (ILP): the number of instructions that
can be issued simultaneously
Reorder buffer: stores instructions until they are executed
Scoreboard: table that keeps track of:
Instructions waiting to issue
Available functional units
Dependencies
Out of Order Processor Example
# program
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
Out of Order Processor Example
# program
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
# execution order
lw $t0, 40($s0) #1
or $t3, $s5, $s6 #1
sw $s7, 80($t3) #2
add $t1, $t0, $s1 #3
sub $t0, $s2, $s3 #3
and $t2, $s4, $t0 #4
[Figure: out-of-order pipeline diagram. lw and or issue in cycle 1; sw issues in cycle 2 once or has produced $t3 (RAW); add waits out the two-cycle latency between the load and its use of $t0 (RAW) and issues in cycle 3, together with sub, which must not write $t0 before add reads it (WAR); and issues in cycle 4 after sub produces $t0 (RAW).]
# execution order
lw $t0, 40($s0) #1
or $t3, $s5, $s6 #1
sw $s7, 80($t3) #2
add $t1, $t0, $s1 #3
sub $t0, $s2, $s3 #3
and $t2, $s4, $t0 #4
Actual IPC = 1.5 (6 instructions issued in 4 cycles)
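That schedule can be reproduced with a small out-of-order model. A sketch, under the same simplified timing assumptions as before (ALU results usable one cycle after issue, load results two cycles after): each cycle, up to two not-yet-issued instructions whose sources are ready may issue in any order, as long as no write overtakes an earlier, still-pending access of the same register:

```python
# Out-of-order dual-issue sketch. Assumptions: ALU results usable one
# cycle after issue, load results two cycles after; up to 2 issues/cycle;
# a write may not issue before an earlier read/write of the same register
# has issued (WAR/WAW), and a read waits for its producer's result (RAW).
PROGRAM = [
    ("lw",  "$t0", ["$s0"],        True),
    ("add", "$t1", ["$t0", "$s1"], False),
    ("sub", "$t0", ["$s2", "$s3"], False),
    ("and", "$t2", ["$s4", "$t0"], False),
    ("or",  "$t3", ["$s5", "$s6"], False),
    ("sw",  None,  ["$s7", "$t3"], False),
]

def ooo_schedule(program, width=2):
    n = len(program)
    # producer[i][src] = index of the latest earlier writer of src, if any
    producer, last_writer = [], {}
    for name, dest, srcs, is_load in program:
        producer.append({s: last_writer[s] for s in srcs if s in last_writer})
        if dest:
            last_writer[dest] = len(producer) - 1
    issued = [None] * n                     # issue cycle per instruction
    cycle = 1
    while any(c is None for c in issued):
        slots = 0
        for i, (name, dest, srcs, is_load) in enumerate(program):
            if issued[i] is not None or slots == width:
                continue
            ok = True
            for s in srcs:                  # RAW: is the producer's result ready?
                p = producer[i].get(s)
                if p is not None:
                    if issued[p] is None:
                        ok = False
                    else:
                        lat = 2 if program[p][3] else 1
                        ok = ok and issued[p] + lat <= cycle
            if dest:                        # WAR/WAW vs earlier pending accesses
                for j in range(i):
                    _, d2, s2, _ = program[j]
                    if (dest in s2 or dest == d2) and issued[j] is None:
                        ok = False
            if ok:
                issued[i] = cycle
                slots += 1
        cycle += 1
    return [(program[i][0], issued[i]) for i in range(n)]

sched = ooo_schedule(PROGRAM)
print(sched)                                 # lw@1, or@1, sw@2, add@3, sub@3, and@4
total = max(c for _, c in sched)
print(f"IPC = {len(PROGRAM) / total:.2f}")   # 6 / 4 = 1.50
```

Letting or and sw slip ahead of the stalled add fills the cycles that the in-order machine wasted, raising IPC from 1.2 to 1.5.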
SIMD
Single Instruction Multiple Data (SIMD)
Single instruction acts on multiple pieces of data at once
Common application: graphics
Perform short arithmetic operations (also called packed arithmetic)
For example: add four 8-bit numbers
Must modify ALU to eliminate carries between 8-bit values
padd8 $s2, $s0, $s1
[Figure: packed arithmetic. $s0 holds four 8-bit values a3, a2, a1, a0 in bit positions 31–24, 23–16, 15–8, 7–0; $s1 likewise holds b3, b2, b1, b0. padd8 writes the four lane sums a3+b3, a2+b2, a1+b1, a0+b0 into the corresponding bytes of $s2.]
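The carry-elimination requirement can be sketched in software: add the four byte lanes independently and mask away any carry out of each lane (a bit-level model of packed addition, not any real intrinsic):

```python
# Sketch of packed 8-bit addition (padd8): add the four byte lanes of two
# 32-bit words, suppressing carries between lanes by masking each lane sum.
def padd8(a: int, b: int) -> int:
    result = 0
    for lane in range(4):                       # bytes 0..3
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift    # drop the carry out of the lane
    return result

# 0xFF + 0x01 wraps to 0x00 in its own lane instead of carrying upward:
print(hex(padd8(0x01_02_03_FF, 0x10_20_30_01)))  # 0x11223300
```

In hardware the same effect is achieved by breaking the ALU's carry chain at the lane boundaries, so one 32-bit adder performs four independent 8-bit additions.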
Advanced Architecture Techniques
Multithreading
Word processor: threads for typing, spell checking, printing
Multiprocessors
Multiple processors (cores) on a single chip
Multithreading: First Some Definitions
Process: program running on a computer
Multiple processes can run at once: e.g., surfing the Web, playing
music, writing a paper
Thread: part of a program
Each process has multiple threads: e.g., a word processor may have
threads for typing, spell checking, printing
Threads in Conventional Processor
Only one thread runs at a time
When one thread stalls (for example, waiting for memory):
Architectural state of that thread is stored
Architectural state of waiting thread is loaded into processor and it
runs
Called context switching
Appears to user like all threads running simultaneously
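Context switching can be sketched as swapping snapshots of architectural state in and out of the core. A toy model; the state here is just a PC and a register dictionary, and all names are illustrative:

```python
# Sketch: context switch as saving/restoring architectural state.
# The architectural state here is just a PC and registers (illustrative).
from dataclasses import dataclass, field

@dataclass
class ArchState:
    pc: int = 0
    regs: dict = field(default_factory=dict)

class Core:
    def __init__(self):
        self.state = ArchState()        # state of the currently running thread
        self.saved = {}                 # thread id -> stored ArchState

    def context_switch(self, from_tid, to_tid):
        self.saved[from_tid] = self.state                 # store old thread
        self.state = self.saved.pop(to_tid, ArchState())  # load waiting thread

core = Core()
core.state.pc = 0x400
core.context_switch(from_tid=1, to_tid=2)   # thread 2 starts fresh
core.state.pc = 0x800
core.context_switch(from_tid=2, to_tid=1)   # thread 1 resumes where it left off
print(hex(core.state.pc))                   # 0x400
```

The cost of the real operation is exactly this save/restore traffic, which is what hardware multithreading on the next slide avoids by keeping multiple copies of the architectural state.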
Multithreading
Multiple copies of architectural state
Multiple threads active at once:
When one thread stalls, another runs immediately (no need to
store or restore architectural state)
If one thread can’t keep all execution units busy, another thread
can use them
Does not increase instruction-level parallelism (ILP) of
single thread, but does increase throughput
Multiprocessors
Multiple processors (cores) with a method of
communication between them
Types of multiprocessing:
Symmetric multiprocessing (SMP): multiple cores with a shared
memory
Asymmetric multiprocessing: separate cores for different tasks (for
example, DSP and CPU in cell phone)
Clusters: each core has its own memory system
Sun UltraSPARC (1995)
500nm, 4 million transistors
Early 64-bit architecture
Four-issue superscalar
Thirty-two 64-bit registers
7 read / 3 write ports
Nine-stage integer pipeline
Cache
16 kByte data (direct-mapped)
16 kByte instruction (2-way)
External L2 cache
http://www.cs.cmu.edu/afs/cs/academic/class/15740-f97/public/platform/ultrasparc.pdf
DEC Alpha 21264 (1996)
350nm, 15 million transistors
Early high-frequency, high-power design
(600 MHz): 80–100 W
Architecture
Out-of-order execution
Peak IPC = 6
Seven stage pipeline
Up to 80 instructions active
All instructions 32-bit (MIPS-like)
Cache
64 kByte L1 data & 64 kByte L1 instruction
1–16 MByte external L2 cache
http://www.ralph.timmermann.org/controller/ev6/chip.gif
Intel Pentium 4 (2000)
180nm, 42 million transistors
Extreme pipelining
NetBurst architecture
20-stage instruction pipeline
P6 has 10 stages, P5 has 5 stages
2 ALUs, working at twice the clock
rate to increase IPC
12k-µop execution trace cache
(stores micro-operations)
8 kByte L1 data cache
256 kByte L2 Cache
http://www.tayloredge.com/museum/processor/2000_Pentium4.jpg
IBM Cell (2006)
90nm, 250 million transistors
Early heterogeneous multicore
Heart of Playstation 3
8 Synergistic Processing Elements
256 kByte local storage
128 × 128-bit registers
SIMD operation
(16x 8-bit, 8x 16-bit, 4x 32-bit)
1 PowerPC processor
64 kByte L1 Cache +
512 kByte L2 Cache
http://www.ps3news.com/images/img_19889.jpg
AMD Bulldozer (2011)
32nm technology, 1.2 billion transistors
Up to 4 modules
2 INT + 1 FP core each
Each INT core: 2 ALUs
Each FP core: 4 ADD + 4 MAC
3 Levels of Cache on chip
8 MByte L3
2 MByte L2 per module
64 kByte two-way
L1 instruction per module
16 kByte four-way
L1 data cache per core
http://en.wikipedia.org/wiki/File:AMD_Bulldozer_block_diagram_(8_core_CPU).PNG
http://info.nuje.de/OrochiDieWithModule.jpg
Other Resources
Hennessy & Patterson's
Computer Architecture: A Quantitative Approach
Conferences:
www.cs.wisc.edu/~arch/www/
ISCA (International Symposium on Computer Architecture)
HPCA (International Symposium on High Performance Computer
Architecture)