Carnegie Mellon
What Will We Learn?
Tricks invented over the years
Deep Pipelining
Branch Prediction
Superscalar Processors
Out of Order Processors
Register Renaming
SIMD
Multithreading
Multiprocessors
A short history of interesting processors
Deep Pipelining
Idea: Pipelining is good, so let us pipeline the processor as
much as possible
MHz wars (until the mid-2000s): 10–20 stages became typical
Number of stages limited by:
Pipeline hazards (penalty of branch misprediction increases)
Sequencing overhead (setup and propagation delays of flip-flops)
Power (faster clock rate, more activity)
Cost (larger area)
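The sequencing-overhead limit can be made concrete with a small calculation. This is a sketch with illustrative, assumed numbers (10 ns of combinational logic, 0.5 ns of flip-flop overhead per stage), not figures from any real design:

```python
# Sketch: why sequencing overhead limits useful pipeline depth.
# Assumed (illustrative) numbers: 10 ns of total combinational logic,
# 0.5 ns of flip-flop setup + propagation overhead per stage.
LOGIC_DELAY_NS = 10.0
OVERHEAD_NS = 0.5

def cycle_time(stages: int) -> float:
    """Clock period if the logic splits evenly across `stages` stages."""
    return LOGIC_DELAY_NS / stages + OVERHEAD_NS

def speedup(stages: int) -> float:
    """Clock-rate speedup relative to the unpipelined (1-stage) design."""
    return cycle_time(1) / cycle_time(stages)

for n in (1, 5, 10, 20, 40):
    print(f"{n:2d} stages: cycle = {cycle_time(n):.2f} ns, speedup = {speedup(n):.1f}x")
```

Doubling the depth from 20 to 40 stages here raises the clock rate by only about a third, because the per-stage overhead no longer shrinks: the classic diminishing return of deep pipelining.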
Branch Prediction
Ideal pipelined processor: CPI = 1
Branch misprediction increases CPI
Static branch prediction:
Check direction of branch (forward or backward)
If backward, predict taken
Otherwise, predict not taken
Dynamic branch prediction:
Keep history of last (several hundred) branches in a branch target
buffer which holds:
Branch destination
Whether branch was taken
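The branch target buffer described above can be sketched as a small table indexed by the branch's PC. This is a minimal Python model; the class and field names are illustrative, not from any real design:

```python
# Minimal branch-target-buffer sketch: maps a branch's PC to its
# destination and whether it was taken last time.
# (Structure and names are illustrative.)
class BTB:
    def __init__(self, capacity=512):
        self.capacity = capacity          # "last several hundred branches"
        self.entries = {}                 # pc -> (target, taken)

    def predict(self, pc):
        """Return (predicted_taken, target), or (False, None) on a miss."""
        if pc in self.entries:
            target, taken = self.entries[pc]
            return taken, target
        return False, None                # untracked branch: predict not taken

    def update(self, pc, target, taken):
        """Record the branch outcome after it resolves."""
        if pc not in self.entries and len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # evict the oldest entry
        self.entries[pc] = (target, taken)

btb = BTB()
btb.update(0x400048, 0x400060, taken=True)
print(btb.predict(0x400048))   # predicted taken, with the cached destination
```

The per-entry taken bit is exactly the 1-bit predictor of the next slide; real designs replace it with a 2-bit saturating counter.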
Branch Prediction Example
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
1-Bit Branch Predictor
Remembers whether branch was taken the last time and
does the same thing
Mispredicts first and last branch of loop
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
2-Bit Branch Predictor
Only mispredicts last branch of loop
[Figure: 2-bit predictor state machine. Four states: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), strongly not taken (predict not taken). Each taken branch moves the state one step toward strongly taken; each not-taken branch moves it one step toward strongly not taken.]
add $s1, $0, $0 # sum = 0
add $s0, $0, $0 # i = 0
addi $t0, $0, 10 # $t0 = 10
for:
beq $s0, $t0, done # if i == 10, branch
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i
j for
done:
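Both predictors can be checked with a small simulation of the loop's branch outcomes: the beq is not taken for i = 0…9 and taken on exit. A sketch, assuming the 1-bit predictor starts at "not taken" and the 2-bit counter at "strongly not taken"; the loop is run twice so each predictor enters the second run in the state the first run left behind:

```python
# Outcome stream for beq: not taken 10 times, then taken on loop exit.
# Running the loop twice lets each predictor reach steady state.
RUN = [False] * 10 + [True]
OUTCOMES = RUN * 2

def mispredicts_1bit(outcomes):
    """1-bit predictor: predict whatever the branch did last time."""
    pred, miss = False, []
    for taken in outcomes:
        miss.append(pred != taken)
        pred = taken
    return miss

def mispredicts_2bit(outcomes):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""
    state, miss = 0, []
    for taken in outcomes:
        miss.append((state >= 2) != taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss

second_run = slice(len(RUN), None)
print(sum(mispredicts_1bit(OUTCOMES)[second_run]))  # 2: first and last branch
print(sum(mispredicts_2bit(OUTCOMES)[second_run]))  # 1: only the last branch
```

The 2-bit counter survives the single taken exit branch without flipping its prediction, which is exactly why it only mispredicts the last branch of the loop.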
Superscalar
Multiple copies of the datapath: can issue multiple
instructions per cycle
Dependencies make it tricky to issue multiple instructions
at once
[Figure: two-way superscalar datapath. The instruction memory fetches two instructions per cycle, the register file has four read ports and two write ports, two ALUs execute in parallel, and the data memory has two ports.]
Here: Ideal IPC = 2
Superscalar Example
lw $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3
and $t3, $s3, $s4
or $t4, $s1, $s5
sw $s5, 80($s0)
[Figure: pipeline diagram. With no dependencies, the instructions issue in pairs: lw and add in cycle 1, sub and and in cycle 2, or and sw in cycle 3.]
Ideal IPC = 2
Actual IPC = 2 (6 instructions issued in 3 cycles)
Superscalar Example with Dependencies
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
[Figure: pipeline diagram with a stall. lw issues alone in cycle 1; add must wait two cycles for the load result $t0, so add and sub issue in cycle 3; and and or issue in cycle 4; sw waits for $t3 from or and issues in cycle 5.]
Ideal IPC = 2
Actual IPC = 1.2 (6 instructions issued in 5 cycles)
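The 5-cycle schedule can be reproduced with a small in-order dual-issue model. This is a sketch with simplified timing assumptions (full forwarding: an ALU result is usable one cycle after issue, a load result two cycles after), not a model of any particular machine:

```python
# In-order dual-issue sketch. Assumptions: full forwarding (an ALU result
# is usable the cycle after it issues, a load result two cycles after),
# up to two instructions issue per cycle, and issue stays in program order.
# Each instruction: (name, dest or None, source registers, is_load)
PROGRAM = [
    ("lw",  "$t0", ["$s0"],        True),
    ("add", "$t1", ["$t0", "$s1"], False),
    ("sub", "$t0", ["$s2", "$s3"], False),
    ("and", "$t2", ["$s4", "$t0"], False),
    ("or",  "$t3", ["$s5", "$s6"], False),
    ("sw",  None,  ["$s7", "$t3"], False),
]

def schedule(program, width=2):
    ready = {}                  # register -> first cycle its value is usable
    issue_cycles = []
    cycle, slots = 1, 0
    for name, dest, srcs, is_load in program:
        earliest = max([cycle] + [ready.get(r, 1) for r in srcs])
        if earliest > cycle or slots == width:       # stall, or issue slots full
            cycle, slots = max(earliest, cycle + (slots == width)), 0
        issue_cycles.append((name, cycle))
        slots += 1
        if dest:
            ready[dest] = cycle + (2 if is_load else 1)
    return issue_cycles

sched = schedule(PROGRAM)
print(sched)                               # lw@1, add@3, sub@3, and@4, or@4, sw@5
total = max(c for _, c in sched)
print(f"IPC = {len(PROGRAM) / total}")     # 6 / 5 = 1.2
```

Cycle 2 goes entirely unused: the load-use delay on $t0 blocks add, and in-order issue keeps everything behind add from moving up.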
Out of Order Processor
Looks ahead across multiple instructions to issue as many as
possible at once
Issues instructions out of order as long as dependencies allow
Dependencies:
RAW (read after write): one instruction writes, and later instruction
reads a register
WAR (write after read): one instruction reads, and a later instruction
writes a register (also called an antidependence)
WAW (write after write): one instruction writes, and a later instruction
writes a register (also called an output dependence)
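The three dependence types can be sketched as a small classifier over instruction pairs, each described by its destination register and source registers (the representation is illustrative):

```python
# Sketch: classify the dependence between an earlier and a later
# instruction, each described as (destination register, source registers).
def classify(earlier, later):
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = []
    if e_dest and e_dest in l_srcs:
        kinds.append("RAW")                    # true dependence
    if l_dest and l_dest in e_srcs:
        kinds.append("WAR")                    # antidependence
    if e_dest and l_dest and e_dest == l_dest:
        kinds.append("WAW")                    # output dependence
    return kinds

# lw $t0, 40($s0) then add $t1, $t0, $s1 -> RAW on $t0
print(classify(("$t0", ["$s0"]), ("$t1", ["$t0", "$s1"])))         # ['RAW']
# add $t1, $t0, $s1 then sub $t0, $s2, $s3 -> WAR on $t0
print(classify(("$t1", ["$t0", "$s1"]), ("$t0", ["$s2", "$s3"])))  # ['WAR']
```

Only RAW is a true data dependence; WAR and WAW are name conflicts that register renaming can remove.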
Out of Order Processor
Instruction-level parallelism (ILP): the number of instructions that
can be issued simultaneously
Reorder buffer: stores instructions until they are executed
Scoreboard: table that keeps track of:
Instructions waiting to issue
Available functional units
Dependencies
Out of Order Processor Example
# program
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
Out of Order Processor Example
# program
lw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or $t3, $s5, $s6
sw $s7, 80($t3)
# execution order
lw $t0, 40($s0) #1
or $t3, $s5, $s6 #1
sw $s7, 80($t3) #2
add $t1, $t0, $s1 #3
sub $t0, $s2, $s3 #3
and $t2, $s4, $t0 #4
[Figure: out-of-order pipeline diagram. lw and or issue in cycle 1; sw issues in cycle 2 once or has produced $t3 (RAW); add waits out the two-cycle latency between the load and its use of $t0 (RAW) and issues in cycle 3, together with sub, which must not write $t0 before add reads it (WAR); and issues in cycle 4 after sub produces $t0 (RAW).]
# execution order
lw $t0, 40($s0) #1
or $t3, $s5, $s6 #1
sw $s7, 80($t3) #2
add $t1, $t0, $s1 #3
sub $t0, $s2, $s3 #3
and $t2, $s4, $t0 #4
Actual IPC = 1.5 (6 instructions issued in 4 cycles)
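That schedule can be reproduced with a small out-of-order model. A sketch, under the same simplified timing assumptions as before (ALU results usable one cycle after issue, load results two cycles after): each cycle, up to two not-yet-issued instructions whose sources are ready may issue in any order, as long as no write overtakes an earlier, still-pending access of the same register:

```python
# Out-of-order dual-issue sketch. Assumptions: ALU results usable one
# cycle after issue, load results two cycles after; up to 2 issues/cycle;
# a write may not issue before an earlier read/write of the same register
# has issued (WAR/WAW), and a read waits for its producer's result (RAW).
PROGRAM = [
    ("lw",  "$t0", ["$s0"],        True),
    ("add", "$t1", ["$t0", "$s1"], False),
    ("sub", "$t0", ["$s2", "$s3"], False),
    ("and", "$t2", ["$s4", "$t0"], False),
    ("or",  "$t3", ["$s5", "$s6"], False),
    ("sw",  None,  ["$s7", "$t3"], False),
]

def ooo_schedule(program, width=2):
    n = len(program)
    # producer[i][src] = index of the latest earlier writer of src, if any
    producer, last_writer = [], {}
    for name, dest, srcs, is_load in program:
        producer.append({s: last_writer[s] for s in srcs if s in last_writer})
        if dest:
            last_writer[dest] = len(producer) - 1
    issued = [None] * n                     # issue cycle per instruction
    cycle = 1
    while any(c is None for c in issued):
        slots = 0
        for i, (name, dest, srcs, is_load) in enumerate(program):
            if issued[i] is not None or slots == width:
                continue
            ok = True
            for s in srcs:                  # RAW: is the producer's result ready?
                p = producer[i].get(s)
                if p is not None:
                    if issued[p] is None:
                        ok = False
                    else:
                        lat = 2 if program[p][3] else 1
                        ok = ok and issued[p] + lat <= cycle
            if dest:                        # WAR/WAW vs earlier pending accesses
                for j in range(i):
                    _, d2, s2, _ = program[j]
                    if (dest in s2 or dest == d2) and issued[j] is None:
                        ok = False
            if ok:
                issued[i] = cycle
                slots += 1
        cycle += 1
    return [(program[i][0], issued[i]) for i in range(n)]

sched = ooo_schedule(PROGRAM)
print(sched)                                 # lw@1, or@1, sw@2, add@3, sub@3, and@4
total = max(c for _, c in sched)
print(f"IPC = {len(PROGRAM) / total:.2f}")   # 6 / 4 = 1.50
```

Letting or and sw slip ahead of the stalled add fills the cycles that the in-order machine wasted, raising IPC from 1.2 to 1.5.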
SIMD
Single Instruction Multiple Data (SIMD)
Single instruction acts on multiple pieces of data at once
Common application: graphics
Perform short arithmetic operations (also called packed arithmetic)
For example: add four 8-bit numbers
Must modify ALU to eliminate carries between 8-bit values
padd8 $s2, $s0, $s1
[Figure: packed arithmetic. $s0 holds four 8-bit values a3, a2, a1, a0 in bit positions 31–24, 23–16, 15–8, 7–0; $s1 likewise holds b3, b2, b1, b0. padd8 writes the four lane sums a3+b3, a2+b2, a1+b1, a0+b0 into the corresponding bytes of $s2.]
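The carry-elimination requirement can be sketched in software: add the four byte lanes independently and mask away any carry out of each lane (a bit-level model of packed addition, not any real intrinsic):

```python
# Sketch of packed 8-bit addition (padd8): add the four byte lanes of two
# 32-bit words, suppressing carries between lanes by masking each lane sum.
def padd8(a: int, b: int) -> int:
    result = 0
    for lane in range(4):                       # bytes 0..3
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift    # drop the carry out of the lane
    return result

# 0xFF + 0x01 wraps to 0x00 in its own lane instead of carrying upward:
print(hex(padd8(0x01_02_03_FF, 0x10_20_30_01)))  # 0x11223300
```

In hardware the same effect is achieved by breaking the ALU's carry chain at the lane boundaries, so one 32-bit adder performs four independent 8-bit additions.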
Advanced Architecture Techniques
Multithreading
Word processor: threads for typing, spell checking, printing
Multiprocessors
Multiple processors (cores) on a single chip
Multithreading: First Some Definitions
Process: program running on a computer
Multiple processes can run at once: e.g., surfing the Web, playing
music, writing a paper
Thread: part of a program
Each process has multiple threads: e.g., a word processor may have
threads for typing, spell checking, printing
Threads in Conventional Processor
Only one thread runs at a time
When one thread stalls (for example, waiting for memory):
Architectural state of that thread is stored
Architectural state of waiting thread is loaded into processor and it
runs
Called context switching
Appears to user like all threads running simultaneously
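Context switching can be sketched as swapping snapshots of architectural state in and out of the core. A toy model; the state here is just a PC and a register dictionary, and all names are illustrative:

```python
# Sketch: context switch as saving/restoring architectural state.
# The architectural state here is just a PC and registers (illustrative).
from dataclasses import dataclass, field

@dataclass
class ArchState:
    pc: int = 0
    regs: dict = field(default_factory=dict)

class Core:
    def __init__(self):
        self.state = ArchState()        # state of the currently running thread
        self.saved = {}                 # thread id -> stored ArchState

    def context_switch(self, from_tid, to_tid):
        self.saved[from_tid] = self.state                 # store old thread
        self.state = self.saved.pop(to_tid, ArchState())  # load waiting thread

core = Core()
core.state.pc = 0x400
core.context_switch(from_tid=1, to_tid=2)   # thread 2 starts fresh
core.state.pc = 0x800
core.context_switch(from_tid=2, to_tid=1)   # thread 1 resumes where it left off
print(hex(core.state.pc))                   # 0x400
```

The cost of the real operation is exactly this save/restore traffic, which is what hardware multithreading on the next slide avoids by keeping multiple copies of the architectural state.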
Multithreading
Multiple copies of architectural state
Multiple threads active at once:
When one thread stalls, another runs immediately (no need to
store or restore architectural state)
If one thread can’t keep all execution units busy, another thread
can use them
Does not increase instruction-level parallelism (ILP) of
single thread, but does increase throughput
Multiprocessors
Multiple processors (cores) with a method of
communication between them
Types of multiprocessing:
Symmetric multiprocessing (SMP): multiple cores with a shared
memory
Asymmetric multiprocessing: separate cores for different tasks (for
example, DSP and CPU in cell phone)
Clusters: each core has its own memory system
Sun UltraSPARC (1995)
500nm, 4 million transistors
Early 64-bit architecture
Four-issue superscalar
Thirty-two 64-bit registers
7 read / 3 write ports
Nine-stage integer pipeline
Cache
16 kByte data (direct-mapped)
16 kByte instruction (2-way)
External L2 cache
http://www.cs.cmu.edu/afs/cs/academic/class/15740-f97/public/platform/ultrasparc.pdf
DEC Alpha 21264 (1996)
350nm, 15 million transistors
Early high-frequency, high-power design
(600 MHz): 80–100 W
Architecture
Out-of-order execution
Peak IPC = 6
Seven stage pipeline
Up to 80 instructions active
All instructions 32-bit (MIPS-like)
Cache
64 kByte L1 data & 64 kByte L1 instruction
1–16 MByte external L2 cache
http://www.ralph.timmermann.org/controller/ev6/chip.gif
Intel Pentium 4 (2000)
180nm, 42 million transistors
Extreme pipelining
NetBurst architecture
20-stage instruction pipeline
P6 has 10 stages, P5 has 5 stages
2 ALUs, working at twice the clock
rate to increase IPC
12k-µop execution trace cache
(stores micro-operations)
8 kByte L1 data cache
256 kByte L2 Cache
http://www.tayloredge.com/museum/processor/2000_Pentium4.jpg
IBM Cell (2006)
90nm, 250 million transistors
Early heterogeneous multicore
Heart of Playstation 3
8 Synergistic Processing Elements
256 kByte local storage
128 × 128-bit registers
SIMD operation
(16x 8-bit, 8x 16-bit, 4x 32-bit)
1 PowerPC processor
64 kByte L1 Cache +
512 kByte L2 Cache
http://www.ps3news.com/images/img_19889.jpg
AMD Bulldozer (2011)
32nm technology, 1.2 billion transistors
Up to 4 modules
2 INT + 1 FP core each
Each INT core: 2 ALUs
Each FP core: 4 ADD + 4 MAC
3 Levels of Cache on chip
8 MByte L3
2 MByte L2 per module
64 kByte two-way
L1 instruction per module
16 kByte four-way
L1 data cache per core
http://en.wikipedia.org/wiki/File:AMD_Bulldozer_block_diagram_(8_core_CPU).PNG
http://info.nuje.de/OrochiDieWithModule.jpg
Other Resources
Hennessy & Patterson's
Computer Architecture: A Quantitative Approach
Conferences:
www.cs.wisc.edu/~arch/www/
ISCA (International Symposium on Computer Architecture)
HPCA (International Symposium on High Performance Computer
Architecture)