2. ARM organization
• Register file –
• 2 read ports and 1 write port, plus 1 additional read port and
1 additional write port reserved for r15 (pc)
• Barrel shifter – shift or rotate one
operand for any number of bits
• ALU – performs the arithmetic and logic
functions required
• Memory address register + incrementer
• Memory data registers
• Instruction decoder and associated
control logic
[Figure: ARM datapath organization. Blocks: address register, incrementer, register bank (with PC), multiply register, barrel shifter, ALU, data in register, data out register, instruction decode & control. Buses: A bus, B bus, ALU bus; external signals A[31:0] and D[31:0].]
3. Three-stage pipeline
• Fetch
• the instruction is fetched from memory and placed in the instruction pipeline
• Decode
• the instruction is decoded and the datapath control signals prepared for the
next cycle; in this stage the instruction owns the decode logic but not the
datapath
• Execute
• the instruction owns the datapath; the register bank is read, an operand is
shifted, and the ALU result is generated and written back into a destination
register
4. ARM single-cycle instruction pipeline
[Figure: instructions 1, 2 and 3 each pass through fetch, decode and execute, with each instruction starting one cycle after the previous one along the time axis.]
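The overlap of fetch, decode and execute can be sketched as a small schedule generator. This is only an illustrative model of an ideal pipeline with no stalls; the names `pipeline_schedule` and `total_cycles` are hypothetical and not taken from the slides.

```python
# Minimal sketch of an ideal 3-stage pipeline (no stalls, single-cycle instructions).
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(n_instructions):
    """Map each cycle to the instruction occupying each stage.

    Instruction i (1-based) is assumed to be fetched in cycle i.
    """
    schedule = {}
    for instr in range(1, n_instructions + 1):
        for offset, stage in enumerate(STAGES):
            cycle = instr + offset
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule


def total_cycles(n_instructions):
    """Cycles needed to complete n instructions once the pipeline starts empty."""
    # The first instruction takes len(STAGES) cycles; each later instruction
    # completes one cycle after its predecessor.
    return len(STAGES) + n_instructions - 1


if __name__ == "__main__":
    for cycle, stages in sorted(pipeline_schedule(3).items()):
        print(cycle, stages)
```

Once the pipeline is full, one instruction completes per cycle, which is why the three instructions in the figure finish in five cycles rather than nine.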
6. Structural Hazard
A structural hazard occurs when two (or more) instructions
that are already in the pipeline need the same resource. One
solution to such a resource hazard is to increase the available
resources, for example by providing multiple ports into main
memory and multiple ALUs.
7. Data Hazard
A data hazard occurs when instructions that exhibit
a data dependency modify data in different stages
of the pipeline.
Dependent pair (AND reads R1, which ADD writes):
ADD R1,R2
AND R0,R1
Independent pair (no shared register):
ADD R1,R2
AND R4,R3
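The dependency check behind the two instruction pairs can be sketched as follows. This is a minimal illustration that assumes the two-operand form used on the slide (the first register is both read and written); `parse` and `raw_hazard` are hypothetical helper names.

```python
def parse(instr):
    """Split 'ADD R1,R2' into (opcode, destination, set of registers read)."""
    opcode, operands = instr.split(None, 1)
    regs = [r.strip().upper() for r in operands.split(",")]
    # Assumption: two-operand form, so the first register is read and written.
    return opcode.upper(), regs[0], set(regs)


def raw_hazard(first, second):
    """True if `second` reads the register that `first` writes."""
    _, dest, _ = parse(first)
    _, _, sources = parse(second)
    return dest in sources


if __name__ == "__main__":
    print(raw_hazard("ADD R1,R2", "AND R0,R1"))
    print(raw_hazard("ADD R1,R2", "AND R4,R3"))
```

The check only flags the dependency; whether it actually costs a stall depends on which pipeline stages read and write the register bank.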
10. 2-phase non-overlapping clock scheme
• Most ARMs do not operate on edge-sensitive registers
• Instead the design is based around
2-phase non-overlapping clocks, which are generated internally from
a single clock signal
• Data movement is controlled by passing the data alternately
through latches which are open during phase 1 and latches which
are open during phase 2
[Figure: 2-phase non-overlapping clock waveforms, phase 1 and phase 2, over 1 clock cycle.]
11. ARM datapath timing
• Register read
• Register read buses – dynamic, precharged during phase 2
• During phase 1 selected registers discharge the read buses
which become valid early in phase 1
• Shift operation
• second operand passes through barrel shifter
• ALU operation
• ALU has input latches which are open in phase 1,
allowing the operands to begin combining in ALU
as soon as they are valid, but they close at the end of phase 1
so that the phase 2 precharge does not get through to the ALU
• ALU processes the operands during phase 2, producing a valid output
towards the end of the phase
• the result is latched in the destination register
at the end of phase 2
12. ARM datapath timing (cont’d)
[Figure: datapath timing across phase 1 and phase 2, showing register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, and the precharge that invalidates the buses.]
Minimum Datapath Delay =
Register read time +
Shifter Delay + ALU Delay +
Register write set-up time + Phase 2 to phase 1 non-overlap time
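The sum above can be evaluated directly. The sketch below is only an illustration: the class name `DatapathDelays` is hypothetical and the delay values are made-up numbers in nanoseconds, not figures for any real ARM core.

```python
from dataclasses import dataclass


@dataclass
class DatapathDelays:
    """Component delays of the datapath, all in the same time unit."""
    register_read: float
    shifter: float
    alu: float
    register_write_setup: float
    phase_non_overlap: float  # phase 2 to phase 1 non-overlap time

    def minimum_cycle(self):
        # Minimum datapath delay is the sum of the five terms.
        return (self.register_read + self.shifter + self.alu
                + self.register_write_setup + self.phase_non_overlap)

    def max_clock_mhz(self):
        # Valid only when the delays are expressed in nanoseconds.
        return 1000.0 / self.minimum_cycle()


# Hypothetical values in nanoseconds, chosen only for illustration.
example = DatapathDelays(register_read=3, shifter=2, alu=5,
                         register_write_setup=1, phase_non_overlap=1)

if __name__ == "__main__":
    print(example.minimum_cycle(), example.max_clock_mhz())
```

Because every term adds directly to the cycle time, shortening the shifter or ALU delay raises the achievable clock rate.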
13. The ARM6 ALU organization
• It is not easy to merge the arithmetic and logic functions =>
a separate logic unit runs in parallel with the adder,
and a multiplexer selects the output
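[Figure: ARM6 ALU organization. The A operand latch and B operand latch feed XOR gates (controlled by invert A and invert B), then the adder (with C in) and the logic functions; the result mux (controlled by logic/arithmetic and function) selects the result; zero detect and the flags N, Z, C, V are produced.]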
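The organization in the figure can be approximated by a behavioural model: conditional inversion of each operand, an adder and a logic unit working in parallel, and a result mux. This is a sketch under stated assumptions, not the ARM6 circuit; the function name `alu`, its keyword arguments and the three logic functions shown are illustrative choices.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath


def alu(a, b, *, invert_a=False, invert_b=False, carry_in=0,
        arithmetic=True, logic_fn="and"):
    """Return (result, flags) for one ALU operation."""
    # XOR gates conditionally invert each latched operand.
    a = (a ^ MASK) & MASK if invert_a else a & MASK
    b = (b ^ MASK) & MASK if invert_b else b & MASK

    # Adder path.
    total = a + b + carry_in
    add_result = total & MASK
    carry_out = (total >> 32) & 1
    # Signed overflow: operands share a sign that differs from the result's.
    overflow = (((~(a ^ b)) & (a ^ add_result)) >> 31) & 1

    # Logic path runs in parallel with the adder.
    logic_result = {"and": a & b, "or": a | b, "xor": a ^ b}[logic_fn]

    # Result mux selects between the two paths.
    result = add_result if arithmetic else logic_result

    flags = {"N": (result >> 31) & 1, "Z": int(result == 0)}
    if arithmetic:
        # Assumption: C and V are reported only for adder operations here.
        flags["C"] = carry_out
        flags["V"] = overflow
    return result, flags


if __name__ == "__main__":
    # Subtraction A - B is formed as A + NOT B + 1.
    print(alu(5, 3, invert_b=True, carry_in=1))
```

Subtraction needs no separate hardware in this model: inverting B and setting the carry in gives A + NOT B + 1, which is A - B in two's complement.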
14. The cross-bar switch barrel shifter
• Shifter delay is critical since it contributes directly to the datapath
cycle time
• Cross-bar switch matrix (32 x 32)
• Principle for 4x4 matrix
[Figure: 4 x 4 cross-bar switch matrix with inputs in[0] to in[3], outputs out[0] to out[3], and diagonal control lines: no shift, right 1, right 2, right 3, left 1, left 2, left 3.]
15. The cross-bar switch barrel shifter (cont’d)
• Precharged logic is used =>
each switch is a single NMOS transistor
• Precharging sets all outputs to logic 0, so those which are not
connected to any input during switching remain at 0 giving the zero
filling required by the shift semantics
16. The cross-bar switch barrel shifter (cont’d)
• For rotate right, the right shift diagonal is enabled together with the
complementary left shift diagonal (e.g., ‘right 1’ + ‘left 3’ in the 4 x 4 matrix)
• Arithmetic shift right:
uses sign-extension rather than zero fill => separate logic is used to decode
the shift amount and discharge those outputs appropriately
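The cross-bar principle can be modelled in a few lines: outputs start at the precharged value 0, and each enabled diagonal connects inputs to outputs at a fixed offset. This is a behavioural sketch only; the function names and the signed-offset encoding of the diagonals (+k for 'left k', -k for 'right k') are assumptions made for illustration.

```python
def crossbar_shift(bits, enabled):
    """Model a cross-bar shifter.

    bits    -- list of 0/1 values, bits[i] is in[i] (bit 0 is the LSB)
    enabled -- diagonals to enable: 0 = 'no shift', +k = 'left k', -k = 'right k'
    """
    width = len(bits)
    out = [0] * width  # precharge leaves unconnected outputs at logic 0
    for shift in enabled:
        for i in range(width):
            j = i + shift  # in[i] drives out[j] along this diagonal
            if 0 <= j < width:
                out[j] = bits[i]
    return out


def logical_left(bits, k):
    return crossbar_shift(bits, [k])


def logical_right(bits, k):
    return crossbar_shift(bits, [-k])


def rotate_right(bits, k):
    # Right diagonal plus the complementary left diagonal, for 0 < k < width.
    return crossbar_shift(bits, [-k, len(bits) - k])


def to_bits(value, width=4):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    print(from_bits(rotate_right(to_bits(0b1101), 1)))
```

For a rotate, the two enabled diagonals never drive the same output, so every output is connected to exactly one input and no zero fill occurs.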
17. ARM register bank
• ARM has 31 general-purpose 32-bit registers, amounting to around 1 Kbit of data.
18. ARM2 register cell circuit
[Figure: ARM2 register cell circuit, with the write line from the ALU bus and the read A and read B lines onto the A bus and B bus.]
• The transistor circuit of the register cell used in ARM cores up to the ARM6 is
based on an asymmetric cross-coupled pair of CMOS inverters.
• This cross-coupled pair is overdriven by a strong signal from the ALU bus when
the register contents are changed.
19. ARM2 register cell circuit
• The feedback inverter is made weak in order to minimize the cell's
resistance to a new value.
• The A and B buses are precharged to Vdd during phase 2 of the clock cycle,
so the register cell need only discharge the read buses, which it
does through n-type pass transistors when the read lines are enabled.
20. ARM register bank floor plan
[Figure: ARM register bank floor plan, showing the A bus read decoders, B bus read decoders and write decoders, the register cells and PC, the Vdd and Vss rails, and the ALU bus, PC bus, INC bus, A bus and B bus.]
• Register cells are arranged in columns to form a 32-bit register, and the columns
are packed together to form the complete register bank.
21. ARM register bank floor plan
• The decoders for the read and write enable lines are packed above the
columns.
• The enable lines run vertically, whereas the data buses run horizontally
across the array of register cells.
22. ARM core datapath buses
[Figure: ARM core datapath buses. Blocks: address register, incrementer, register bank, multiplier, ALU, shifter, data in, instruction pipe, data out. Buses: A, B, W, instruction, Din, shift out, PC, Ad, inc.]
• The order of datapath buses is chosen to minimize the number of additional
buses passing over more complex functions.
Buses:
1. Register bank output buses (A, B)
2. Data/instruction input bus (Din)
3. Data write bus (W)
4. Address incrementation bus (Ad)
5. Data/address incrementation bus (Inc)
6. PC load (PC)
7. Shift out
23. ARM control logic structure
[Figure: ARM control logic structure. Blocks: decode PLA, cycle count, multiply control, load/store multiple, address control, register control, ALU control, shifter control. Inputs: instruction, coprocessor.]
The control path in simpler ARM cores has 3 structural components:
1. An instruction decoder PLA
2. Distributed secondary control associated with the main functional units
3. Decentralized control units for specific instructions
25. DATA PROCESSING INSTRUCTIONS
• A data processing instruction requires two operands, one of which is
always a register and the other is either a second register or an
immediate value.
• The second operand is passed through the barrel shifter where it is
subject to a general shift operation, then it is combined with the first
operand in the ALU using a general ALU operation. Finally, the result
from the ALU is written back into the destination register.
• The PC value in the address register is incremented and copied back
into both the address register and r15 in the register bank.
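The single-cycle activity described above can be sketched as a small register-level model. It is a simplification under stated assumptions: only a logical shift left is modelled for the second operand, only four ALU operations are included, and the names `data_processing` and `ALU_OPS` are hypothetical.

```python
MASK = 0xFFFFFFFF

# A few ALU operations, for illustration only.
ALU_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "orr": lambda a, b: a | b,
}


def data_processing(regs, address_register, op, rd, rn, operand2,
                    shift=0, immediate=False):
    """Execute one data processing instruction; return the new address register."""
    a = regs[rn]                                   # first operand: always a register
    b = operand2 if immediate else regs[operand2]  # second: register or immediate
    b = (b << shift) & MASK                        # barrel shifter (LSL only here)
    regs[rd] = ALU_OPS[op](a, b) & MASK            # ALU result written back

    # PC value is incremented and copied to both the address register and r15.
    address_register = (address_register + 4) & MASK
    regs[15] = address_register
    return address_register


if __name__ == "__main__":
    regs = [0] * 16
    regs[1], regs[2] = 10, 3
    print(hex(data_processing(regs, 0x8000, "add", 0, 1, 2, shift=2)), regs[0])
```

The case where the destination register is r15 is ignored in this sketch.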
27. DATA TRANSFER INSTRUCTIONS
• A data transfer (load or store) instruction computes a memory
address in which a register is used as the base address, to which is
added (or from which is subtracted) an offset, which again may be
another register or an immediate value.
• The address is sent to the address register, and in a second cycle the
data transfer takes place.
• The incremented PC value is stored in the register bank at the end of
the first cycle so that the address register is free to accept the data
transfer address for the second cycle.
• The value sent to the address register in a cycle is the value used for
the memory access in the following cycle.
• The address register is, in effect, a pipeline register between the
processor datapath and the external memory.
28. STORE INSTRUCTION DATAPATH ACTIVITY
Cycle 1, compute address: AR = Rn op Disp; r15 = AR + 4
Cycle 2, store data: AR = PC; mem[AR] = Rd<x:y>
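The two cycles of a store can be traced with a small model in which the address register acts as the pipeline register between the datapath and memory. This is a sketch under stated assumptions: a full word is stored rather than the Rd<x:y> byte selection, memory is a plain dictionary, and the name `store` is hypothetical.

```python
MASK = 0xFFFFFFFF


def store(regs, memory, address_register, rd, rn, disp, subtract=False):
    """Model the two cycles of a store; return the final address register."""
    # Cycle 1: compute the transfer address, save the incremented PC in r15.
    incremented_pc = (address_register + 4) & MASK
    base = regs[rn]
    transfer_address = (base - disp if subtract else base + disp) & MASK
    regs[15] = incremented_pc
    address_register = transfer_address

    # Cycle 2: memory uses the address issued in cycle 1; the address
    # register is reloaded from the PC for the next instruction fetch.
    memory[address_register] = regs[rd]  # assumption: whole word stored
    address_register = regs[15]
    return address_register


if __name__ == "__main__":
    regs = [0] * 16
    regs[0], regs[1] = 0xDEADBEEF, 0x1000
    memory = {}
    print(hex(store(regs, memory, 0x8000, rd=0, rn=1, disp=4)), memory)
```

Saving the incremented PC in r15 during the first cycle is what frees the address register to carry the data address in the second.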
29. BRANCHING INSTRUCTIONS
• Branch instructions compute the target address in the first cycle.
• A 24-bit immediate field is extracted from the instruction and then
shifted left two bit positions to give a word-aligned offset, which is
added to the PC. The result is issued as an instruction fetch address,
and while the instruction pipeline refills, the return address is copied
into the link register (r14) if required.
• The third cycle, which is required to complete the pipeline refilling, is
also used to make a small correction to the value stored in the link
register in order that it points directly at the instruction which follows
the branch.
30. FIRST TWO CYCLES OF A BRANCH INSTRUCTION
Cycle 1, compute target address: AR = PC + Disp, lsl #2
Cycle 2, save return address (if required): r14 = PC; AR = AR + 4
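The target address calculation can be sketched as below. The sketch assumes that the 24-bit field is sign-extended before the shift and that the PC value seen by the branch is the branch address plus 8 (two instructions ahead, because of the pipeline); the names `branch_target` and `corrected_link` are hypothetical.

```python
MASK = 0xFFFFFFFF


def branch_target(instruction, pc):
    """Compute the branch target from the 24-bit immediate field."""
    imm24 = instruction & 0x00FFFFFF
    if imm24 & 0x00800000:           # sign-extend the 24-bit field
        imm24 -= 1 << 24
    return (pc + (imm24 << 2)) & MASK  # shift left 2 for a word-aligned offset


def corrected_link(pc):
    """Link value after the third-cycle correction.

    Assumption: pc is the branch address + 8, so pc - 4 points at the
    instruction that follows the branch.
    """
    return (pc - 4) & MASK


if __name__ == "__main__":
    branch_address = 0x8000
    pc = branch_address + 8
    print(hex(branch_target(0xEB000002, pc)), hex(corrected_link(pc)))
```

The shift by two positions means the 24-bit field covers a range of roughly +/- 32 MB around the branch.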