ARM Processor


This presentation is about the ARM processor. It covers its architecture, its ISA and its pipeline structure.

Published in: Technology

  1. 1. The ARM Architecture
  2. 2. ARM•Introduction and processor modes•Instruction Set Architecture – I•Instruction Set Architecture- II•Pipelining in ARM
  3. 3. ARM• ARM: Advanced RISC Machines• Most widely used 32-bit RISC instruction set architecture• Its relative simplicity makes it suitable for low-power devices• Processor families: ARM7, ARM9, ARM11 and Cortex• Accounts for approximately 90% of all embedded 32-bit RISC processors• Used extensively in consumer electronics, including PDAs, mobile phones, digital media and music players, hand-held game consoles, calculators and computer peripherals such as hard drives and routers.
  4. 4. Product Code Description• M: Multiplier. The ARM processor has a hardware multiplier unit.• I: EmbeddedICE macrocell. Hardware circuit used to generate trace information; used in advanced debugging.• E: Enhanced instruction set• J: Java acceleration by Jazelle mode. Hardware circuit used for running Java byte code.• F: Vector floating point. Hardware implementation of floating-point operations.• S: Synthesizable version. The core is supplied as a soft processor core, so it can be modified.
  5. 5. Example• ARM7TDMI: an ARM7 family processor where T = Thumb instruction set, D = debug unit, M = enhanced multiplier, I = EmbeddedICE macrocell.• ARM946E-S: 1. ARM9xx core 2. Enhanced instruction set 3. Synthesizable
  6. 6. ARM• ARM has 3 instruction set states 1. 32-bit ARM instruction set 2. 16-bit Thumb instruction set 3. 8-bit Jazelle instruction set• ARM: 32-bit load/store architecture with every instruction being conditional.• Thumb: 16-bit, with only branch instructions being conditional and only half of the registers used.• Jazelle: allows Java byte code to be directly executed in the ARM architecture. Improves performance by 5x-10x.
  7. 7. ARM- Processor Modes• Seven basic operating modes exist: 1. User: Unprivileged mode under which most tasks run 2. FIQ: Entered when a high priority interrupt is raised 3. IRQ: Entered when a low priority interrupt is raised 4. Supervisor: Entered on reset and when a software Interrupt instruction is executed 5. Abort: Used to handle memory access violations 6. Undef: Used to handle undefined instructions 7. System: Privileged mode using the same registers as user mode.
  8. 8. Register Organization Summary• User mode: r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr• FIQ mode: User mode r0-r7, r15 and cpsr, plus banked r8-r12, r13 (sp), r14 (lr) and spsr• IRQ, SVC, Undef and Abort modes: User mode r0-r12, r15 and cpsr, plus banked r13 (sp), r14 (lr) and spsr (one set per mode)• Thumb state: r0-r7 are the low registers, r8-r12 the high registers• Note: System mode uses the User mode register set
  9. 9. ARM- The Registers• ARM has 37 registers all of which are 32-bits long. – 1 dedicated program counter – 1 dedicated current program status register – 5 dedicated saved program status registers – 30 general purpose registers• The current processor mode governs which of several banks is accessible. Each mode can access – a particular set of r0-r12 registers – a particular r13 (the stack pointer, sp) and r14 (the link register, lr) – the program counter, r15(pc) – the current program status register, cpsr Privileged modes (except System) can also access – a particular spsr (saved program status register)
  10. 10. Program Status Registers• Bit layout: N = bit 31, Z = bit 30, C = bit 29, V = bit 28, Q = bit 27, J = bit 24, I = bit 7, F = bit 6, T = bit 5, mode = bits [4:0]; the remaining bits are undefined. The four bytes are named the f (bits 31-24), s (bits 23-16), x (bits 15-8) and c (bits 7-0) fields.• Condition code flags – N = Negative result from ALU – Z = Zero result from ALU – C = ALU operation Carried out – V = ALU operation oVerflowed• Sticky Overflow flag - Q flag – Architecture 5TE/J only – Indicates if saturation has occurred• J bit – Architecture 5TEJ only – J = 1: Processor in Jazelle state• Interrupt Disable bits – I = 1: Disables the IRQ – F = 1: Disables the FIQ• T Bit – Architecture xT only – T = 0: Processor in ARM state – T = 1: Processor in Thumb state• Mode bits – Specify the processor mode
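
The bit positions above can be turned into a small decoder. This is a minimal Python sketch, not part of the original slides: `decode_cpsr` and `MODES` are hypothetical names, and the 5-bit mode encodings are an assumption taken from the ARM architecture manuals rather than from this presentation.

```python
# Hypothetical helper: mode encodings are assumed from the ARM architecture
# manuals; the flag bit positions follow the slide above.
MODES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "Supervisor",
    0b10111: "Abort", 0b11011: "Undef", 0b11111: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields."""
    def bit(n):
        return (cpsr >> n) & 1
    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),  # condition flags
        "Q": bit(27),                                            # sticky overflow
        "J": bit(24),                                            # Jazelle state
        "I": bit(7), "F": bit(6),                                # interrupt disables
        "T": bit(5),                                             # Thumb state
        "mode": MODES.get(cpsr & 0x1F, "unknown"),               # bits [4:0]
    }

fields = decode_cpsr(0x600000D3)
print(fields)
```

For the example value 0x600000D3, the Z and C flags are set, both interrupts are disabled, and the processor is in ARM state in Supervisor mode.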
  11. 11. Program Counter (r15)• When the processor is executing in ARM state: – All instructions are 32 bits wide – All instructions must be word aligned – Therefore the PC value is stored in bits [31:2] with bits [1:0] undefined (as instruction cannot be halfword or byte aligned).• When the processor is executing in Thumb state: – All instructions are 16 bits wide – All instructions must be halfword aligned – Therefore the PC value is stored in bits [31:1] with bit [0] undefined (as instruction cannot be byte aligned).• When the processor is executing in Jazelle state: – All instructions are 8 bits wide – Processor performs a word access to read 4 instructions at once
  12. 12. Exception Handling• When an exception occurs, the ARM: – Copies CPSR into SPSR_<mode> – Sets appropriate CPSR bits • Change to ARM state • Change to exception mode • Disable interrupts (if appropriate) – Stores the return address in LR_<mode> – Sets PC to vector address• To return, exception handler needs to: – Restore CPSR from SPSR_<mode> – Restore PC from LR_<mode> This can only be done in ARM state.• Vector Table: 0x00 Reset, 0x04 Undefined Instruction, 0x08 Software Interrupt, 0x0C Prefetch Abort, 0x10 Data Abort, 0x14 (Reserved), 0x18 IRQ, 0x1C FIQ• Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices
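
The entry and return sequence can be sketched as a toy state machine. This is only an illustration in Python under stated assumptions: the `cpu` dictionary layout and the function names are hypothetical, reset is left out, the exact return-address adjustment for each exception type is ignored, and the rule "IRQ is always disabled, FIQ only on FIQ entry" is an assumption from the ARM architecture manuals.

```python
# Vector addresses from the slide; the mode each exception enters is assumed.
VECTORS = {"undefined": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
           "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C}
EXCEPTION_MODE = {"undefined": "und", "swi": "svc", "prefetch_abort": "abt",
                  "data_abort": "abt", "irq": "irq", "fiq": "fiq"}

def enter_exception(cpu, kind, return_address):
    mode = EXCEPTION_MODE[kind]
    cpu["spsr"][mode] = dict(cpu["cpsr"])  # copy CPSR into SPSR_<mode>
    cpu["cpsr"]["T"] = 0                   # change to ARM state
    cpu["cpsr"]["mode"] = mode             # change to exception mode
    cpu["cpsr"]["I"] = 1                   # disable IRQ
    if kind == "fiq":
        cpu["cpsr"]["F"] = 1               # disable FIQ only on FIQ entry
    cpu["lr"][mode] = return_address       # store return address in LR_<mode>
    cpu["pc"] = VECTORS[kind]              # set PC to the vector address

def return_from_exception(cpu):
    mode = cpu["cpsr"]["mode"]
    cpu["pc"] = cpu["lr"][mode]            # restore PC from LR_<mode>
    cpu["cpsr"] = dict(cpu["spsr"][mode])  # restore CPSR from SPSR_<mode>

cpu = {"cpsr": {"mode": "usr", "T": 1, "I": 0, "F": 0},
       "spsr": {}, "lr": {}, "pc": 0x8000}
enter_exception(cpu, "irq", 0x8004)
print(hex(cpu["pc"]), cpu["cpsr"])
return_from_exception(cpu)
print(hex(cpu["pc"]), cpu["cpsr"])
```

The point of the sketch is the ordering: the old CPSR is saved before any bit is changed, so the handler can restore the interrupted state (including Thumb state) exactly.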
  13. 13. Development of the ARM Architecture• Versions 1, 2 and 3: early ARM architectures• Version 4: halfword and signed halfword / byte support, System mode (SA-110, SA-1110)• Version 4T: Thumb instruction set (ARM7TDMI, ARM9TDMI, ARM720T, ARM940T)• Version 5TE: improved ARM/Thumb interworking, CLZ, saturated maths, DSP multiply-accumulate instructions (ARM1020E, XScale, ARM9E-S, ARM966E-S)• Version 5TEJ: Jazelle Java bytecode execution (ARM9EJ-S, ARM926EJ-S, ARM7EJ-S, ARM1026EJ-S)• Version 6: SIMD instructions, multi-processing, V6 memory architecture (VMSA), unaligned data support (ARM1136EJ-S)
  14. 14. The ARM Instruction Set part1
  15. 15. Main features of the ARM Instruction Set• All instructions are 32 bits long.• Most instructions execute in a single cycle.• Every instruction can be conditionally executed.• A load/store architecture – Data processing instructions act only on registers • Three operand format • Combined ALU and shifter for high speed bit manipulation – Specific memory access instructions with powerful auto-indexing addressing modes.
  16. 16. Conditional Execution• Most instruction sets only allow branches to be executed conditionally, by postfixing them with the appropriate condition code field.• However, by reusing the condition evaluation hardware, ARM effectively increases the number of instructions. – All instructions contain a condition field which determines whether the CPU will execute them. – Non-executed instructions soak up 1 cycle. • Still have to complete the cycle so as to allow fetching and decoding of the following instructions.• This removes the need for many branches, which stall the pipeline (3 cycles to refill). – Allows very dense in-line code, without branches. – The time penalty of not executing several conditional instructions is frequently less than the overhead of the branch or subroutine call that would otherwise be needed.
  17. 17. The Condition Field• The condition occupies bits [31:28] of every instruction.• 0000 = EQ - Z set (equal)• 0001 = NE - Z clear (not equal)• 0010 = HS / CS - C set (unsigned higher or same)• 0011 = LO / CC - C clear (unsigned lower)• 0100 = MI - N set (negative)• 0101 = PL - N clear (positive or zero)• 0110 = VS - V set (overflow)• 0111 = VC - V clear (no overflow)• 1000 = HI - C set and Z clear (unsigned higher)• 1001 = LS - C clear or Z set (unsigned lower or same)• 1010 = GE - N set and V set, or N clear and V clear (> or =)• 1011 = LT - N set and V clear, or N clear and V set (<)• 1100 = GT - Z clear, and either N set and V set, or N clear and V clear (>)• 1101 = LE - Z set, or N set and V clear, or N clear and V set (< or =)• 1110 = AL - always• 1111 = NV - reserved.
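
The condition table maps directly onto a lookup over the N, Z, C and V flags. The following Python sketch is illustrative only: `condition_passed` and `cmp_flags` are hypothetical helper names, and `cmp_flags` models the flags a 32-bit CMP would produce (C set means "no borrow").

```python
MASK = 0xFFFFFFFF

def cmp_flags(a, b):
    """Flags (N, Z, C, V) produced by a 32-bit CMP a, b (computes a - b)."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                              # carry set means no borrow
    v = (((a ^ b) & (a ^ result)) >> 31) & 1     # signed overflow
    return n, z, c, v

def condition_passed(cond, n, z, c, v):
    """Evaluate a 4-bit condition field against the flags."""
    n, z, c, v = bool(n), bool(z), bool(c), bool(v)
    checks = {
        0b0000: z,                   # EQ
        0b0001: not z,               # NE
        0b0010: c,                   # HS / CS
        0b0011: not c,               # LO / CC
        0b0100: n,                   # MI
        0b0101: not n,               # PL
        0b0110: v,                   # VS
        0b0111: not v,               # VC
        0b1000: c and not z,         # HI
        0b1001: (not c) or z,        # LS
        0b1010: n == v,              # GE
        0b1011: n != v,              # LT
        0b1100: (not z) and n == v,  # GT
        0b1101: z or n != v,         # LE
        0b1110: True,                # AL
    }
    return checks[cond]              # 0b1111 is reserved and raises KeyError

flags = cmp_flags(0xFFFFFFFF, 1)     # -1 versus 1 signed, 4294967295 versus 1 unsigned
print(condition_passed(0b1011, *flags), condition_passed(0b1000, *flags))
```

The same flags answer both questions: LT passes because -1 < 1 as signed values, and HI passes because 0xFFFFFFFF > 1 as unsigned values.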
  18. 18. Using and updating the Condition Field• To execute an instruction conditionally, simply postfix it with the appropriate condition: – For example an add instruction takes the form: • ADD r0,r1,r2 ; r0 = r1 + r2 (ADDAL) – To execute this only if the zero flag is set: • ADDEQ r0,r1,r2 ; If zero flag set then… ; ... r0 = r1 + r2• By default, data processing operations do not affect the condition flags (apart from the comparisons where this is the only effect). To cause the condition flags to be updated, the S bit of the instruction needs to be set by postfixing the instruction (and any condition code) with an “S”. – For example to add two numbers and set the condition flags: • ADDS r0,r1,r2 ; r0 = r1 + r2 ; ... and set flags
  19. 19. Data processing Instructions• Largest family of ARM instructions, all sharing the same instruction format.• Contains: – Arithmetic operations – Comparisons (no results - just set condition codes) – Logical operations – Data movement between registers• Remember, this is a load / store architecture – These instructions only work on registers, NOT memory.• They each perform a specific operation on one or two operands. – First operand always a register - Rn – Second operand sent to the ALU via barrel shifter.
  20. 20. Data Movement• Operations are: – MOV operand2 – MVN NOT operand2 Note that these make no use of operand1 i.e operand1 is ignored.• Syntax: – <Operation>{<cond>}{S} Rd, Operand2• Examples: – MOV r0, r1 – MOVS r2, #10 – MVNEQ r1,#0
  21. 21. Arithmetic Operations• Operations are: – ADD operand1 + operand2 – ADC operand1 + operand2 + carry – SUB operand1 - operand2 – SBC operand1 - operand2 + carry -1 – RSB operand2 - operand1 – RSC operand2 - operand1 + carry - 1• Syntax: – <Operation>{<cond>}{S} Rd, Rn, Operand2• Examples – ADD r0, r1, r2 – SUBGT r3, r3, #1 – RSBLES r4, r5, #5 – SUB r4,r5,r7,LSR r2 ; Logical right shift R7 by the number in ; the bottom byte of R2, subtract result ; from R5, and put the answer into R4.
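
The six arithmetic formulas can be checked with a small model. This Python sketch is an illustration, not an emulator: `arith` and `add64` are hypothetical names, results wrap to 32 bits, and flag updates are not modelled apart from the carry passed in by the caller.

```python
MASK = 0xFFFFFFFF

def arith(op, operand1, operand2, carry=0):
    """32-bit result of an ARM arithmetic operation, per the formulas above."""
    results = {
        "ADD": operand1 + operand2,
        "ADC": operand1 + operand2 + carry,
        "SUB": operand1 - operand2,
        "SBC": operand1 - operand2 + carry - 1,
        "RSB": operand2 - operand1,
        "RSC": operand2 - operand1 + carry - 1,
    }
    return results[op] & MASK

def add64(a_lo, a_hi, b_lo, b_hi):
    """64-bit addition built from ADDS (low words) followed by ADC (high words)."""
    low_sum = a_lo + b_lo
    carry = low_sum >> 32                  # carry flag that ADDS would set
    return low_sum & MASK, arith("ADC", a_hi, b_hi, carry)

print(hex(arith("SUB", 3, 5)), arith("RSB", 3, 5), add64(0xFFFFFFFF, 0, 1, 0))
```

The `add64` helper shows why ADC and SBC exist: the carry out of the low word feeds the operation on the high word, so values wider than 32 bits can be handled in pieces.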
  22. 22. Logical Operations• Operations are: – AND operand1 AND operand2 – EOR operand1 EOR operand2 – ORR operand1 OR operand2 – BIC operand1 AND NOT operand2 [ie bit clear]• Syntax: – <Operation>{<cond>}{S} Rd, Rn, Operand2• Examples: – AND r0, r1, r2 – BICEQ r2, r3, #7 – EORS r1,r3,r0
  23. 23. Multiplication Instructions• The Basic ARM provides two multiplication instructions.• Multiply – MUL{<cond>}{S} Rd, Rm, Rs ; Rd = Rm * Rs• Multiply Accumulate - does addition for free – MLA{<cond>}{S} Rd, Rm, Rs,Rn ; Rd = (Rm * Rs) + Rn• Restrictions on use: – Rd and Rm cannot be the same register • Can be avoided by swapping Rm and Rs around. This works because multiplication is commutative. – Cannot use PC. These will be picked up by the assembler if overlooked.• Operands can be considered signed or unsigned – Up to user to interpret correctly.
  24. 24. • The multiply form of the instruction gives Rd:=Rm*Rs. Rn is ignored, and should be set to zero for compatibility with possible future upgrades to the instruction set.
  25. 25. Multiplication Implementation • The ARM makes use of Booth’s Algorithm to perform integer multiplication. • On non-M ARMs this operates on 2 bits of Rs at a time. – For each pair of bits this takes 1 cycle (plus 1 cycle to start with). – However when there are no more 1’s left in Rs, the multiplication will early-terminate. • Example: Multiply 18 and -1 : Rd = Rm * Rs – 18 = 0000 0000 0000 0000 0000 0000 0001 0010 and -1 = 1111 1111 1111 1111 1111 1111 1111 1111 – Rm = 18, Rs = -1: 17 cycles – Rm = -1, Rs = 18: 4 cycles • Note: Compiler does not use early termination criteria to decide on which order to place operands.
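
The cycle counts in the example follow from the early-termination rule. This is a simplified Python model of that rule only (one start-up cycle plus one cycle per pair of Rs bits until no 1s remain); `mul_cycles_non_m` is a hypothetical name, and the model does not perform the multiplication itself.

```python
def mul_cycles_non_m(rs):
    """Cycle count for a multiply on a non-M ARM, using the rule on the slide."""
    rs &= 0xFFFFFFFF   # treat Rs as a 32-bit pattern, so -1 becomes all ones
    cycles = 1         # 1 cycle to start with
    while rs:          # stop as soon as no 1s are left in Rs
        rs >>= 2       # consume one pair of bits per cycle
        cycles += 1
    return cycles

print(mul_cycles_non_m(-1), mul_cycles_non_m(18))
```

With Rs = -1 all 16 bit pairs must be processed, while Rs = 18 (binary 10010) is exhausted after 3 pairs, which is why the order of the operands matters.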
  26. 26. Booth’s Algorithm
  27. 27. Extended Multiply Instructions• M variants of ARM cores contain extended multiplication hardware. This provides three enhancements: – An 8 bit Booth’s Algorithm is used • Multiplication is carried out faster (maximum for standard instructions is now 5 cycles). – Early termination method improved so that now completes multiplication when all remaining bit sets contain • all zeroes (as with non-M ARMs), or • all ones. Thus the previous example would early terminate in 2 cycles in both cases. – 64 bit results can now be produced from two 32bit operands • Higher accuracy. • Pair of registers used to store result.
  28. 28. Multiply-Long and Multiply-Accumulate Long• Instructions are – MULL which gives RdHi,RdLo:=Rm*Rs – MLAL which gives RdHi,RdLo:=(Rm*Rs)+RdHi,RdLo• However, the full 64 bits of the result now matter (the lower-precision multiply instructions simply throw the top 32 bits away) – Need to specify whether operands are signed or unsigned• Therefore the syntax of the new instructions is: – UMULL{<cond>}{S} RdLo,RdHi,Rm,Rs – UMLAL{<cond>}{S} RdLo,RdHi,Rm,Rs – SMULL{<cond>}{S} RdLo,RdHi,Rm,Rs – SMLAL{<cond>}{S} RdLo,RdHi,Rm,Rs• Not generated by the compiler.• Warning: Unpredictable on non-M ARMs.
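
Why signedness matters only for the long forms can be seen by modelling the 64-bit result. This is a hedged Python sketch: the function names mirror the mnemonics but are hypothetical helpers that return the pair (RdLo, RdHi), and flag updates are not modelled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def split64(value):
    value &= MASK64
    return value & MASK32, value >> 32      # (RdLo, RdHi)

def umull(rm, rs):
    return split64((rm & MASK32) * (rs & MASK32))

def smull(rm, rs):
    return split64(to_signed32(rm) * to_signed32(rs))

def umlal(rd_lo, rd_hi, rm, rs):
    accumulator = (rd_hi << 32) | rd_lo     # existing 64-bit value in RdHi:RdLo
    return split64(accumulator + (rm & MASK32) * (rs & MASK32))

print([hex(w) for w in umull(0xFFFFFFFF, 2)], [hex(w) for w in smull(0xFFFFFFFF, 2)])
```

Both products share the low word 0xFFFFFFFE, which is all that MUL keeps; only the high word differs (1 for the unsigned product, 0xFFFFFFFF for the signed one).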
  29. 29. Operand restrictions • R15 must not be used as an operand or as a destination register. • RdHi, RdLo, and Rm must all specify different registers.
  30. 30. ISA part 2
  31. 31. Data Transfer• ARM is a load/store architecture• Involves -Load data from memory to register -Store data from register into memory• ARM has three types of load/store instructions -LDR/STR -LDM/STM -SWP
  32. 32. LDR/STR Instructions
  33. 33. Types of load/store instructions• Simple load/store has options like the following• LDR/STR: load/store a word (32 bits)• LDRB/STRB: load/store a byte• ARM v4 also supports halfwords (16 bits) and sign-extending loads: LDRH/STRH transfer a halfword without sign extension; LDRSH/LDRSB load a halfword/byte with sign extension• Condition codes can also be suffixed, e.g. LDREQB/STREQB• General syntax: <LDR|STR>{<cond>}{<size>} Rd, <address>
  34. 34. Base Register• STR r0,[r1] stores the contents of r0 at the address contained in r1• LDR r2,[r1] loads the contents at the address contained in r1 into r2• Example: r0 (source register for STR) = 0x5 and r1 (base register) = 0x200; after both instructions, memory location 0x200 holds 0x5 and r2 (destination register for LDR) = 0x5
  35. 35. Offset from the base register• ARM also supports accessing locations at an offset from the base register• The offset can be – an unsigned 12 bit immediate value (0-4095) – a register, optionally shifted• The offset can be added to (‘+’) or subtracted from (‘-’) the base register• The offset can be applied – before the transfer is made (pre-indexed); the base register is optionally updated by suffixing ‘!’ – after the transfer is made (post-indexed); the base register is always updated
  36. 36. Pre-Indexed Addressing• Example: STR r0,[r1,#12] – r0 (source register for STR) = 0x5, r1 (base register) = 0x200, offset = 12 – 0x5 is stored at address 0x20c; r1 stays 0x200• The offset can also be negative: STR r0,[r1,#-12]• To auto-update the base register: STR r0,[r1,#12]! updates r1 to 0x20c• If r2 contains 3 then this yields the same result: STR r0,[r1,r2,LSL#2]• Useful if only a particular element is to be accessed
  37. 37. Post Indexed Addressing• Example: STR r0,[r1],#12 – r0 (source register for STR) = 0x5, original r1 (base register) = 0x200, offset = 12 – 0x5 is stored at address 0x200; r1 is then updated to 0x20c• If r2 contains 3 then this yields the same result: STR r0,[r1],r2,LSL #2• Useful if traversal is required through elements
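
The difference between the two addressing modes is only where the store lands and when the base register changes. This Python sketch replays both STR examples; the register dictionary, the memory dictionary and the function names are hypothetical stand-ins, not an emulator.

```python
MASK = 0xFFFFFFFF

def str_pre_indexed(regs, memory, rd, rn, offset, writeback=False):
    """STR rd,[rn,#offset] and, with writeback=True, STR rd,[rn,#offset]!"""
    address = (regs[rn] + offset) & MASK   # offset applied before the transfer
    memory[address] = regs[rd]
    if writeback:
        regs[rn] = address                 # '!' updates the base register

def str_post_indexed(regs, memory, rd, rn, offset):
    """STR rd,[rn],#offset"""
    memory[regs[rn]] = regs[rd]            # transfer uses the unmodified base
    regs[rn] = (regs[rn] + offset) & MASK  # base register is always updated

regs, memory = {"r0": 0x5, "r1": 0x200}, {}
str_pre_indexed(regs, memory, "r0", "r1", 12)
print(memory, hex(regs["r1"]))

regs, memory = {"r0": 0x5, "r1": 0x200}, {}
str_post_indexed(regs, memory, "r0", "r1", 12)
print(memory, hex(regs["r1"]))
```

A register offset such as r2,LSL #2 with r2 = 3 would simply pass `3 << 2`, that is 12, as `offset`.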
  38. 38. For half words/signed byte access• Instructions can be used in much the same way except - the offset value is restricted to 8 bits(0-255) - the registers cannot be shifted
  39. 39. For LDRH/STRH register offset
  40. 40. For LDRH/STRH immediate offset
  41. 41. LDM/STM (Block data transfer)• Allow for transfer between 1-16 registers to or from memory• The transferred registers can be: - Any subset of the current bank of registers (default). - Any subset of the user mode bank of registers when in a privileged mode (postfix instruction with a ‘^’).
  42. 42. Instruction Format
  43. 43. Block Data Transfer• Base register determines where memory access can occur• Base register can be updated after data transfer by suffixing a ‘!’• These instructions are useful for - Saving and restoring context - moving large chunks of data to/from memory
  44. 44. Stack Example
  45. 45. Block Data Transfer• One use of stacks is to temporarily create register space for subroutines STMFD sp!,{r0-r12, lr} ; stack all registers and the return address ........ LDMFD sp!,{r0-r12, pc} ; load all the registers and return automatically• If the pop instruction also had the ‘S’ bit set (using ‘^’) then the transfer of the PC when in a privileged mode would also cause the SPSR to be copied into the CPSR (see exception handling module).
  46. 46. Direct functionality Of Block Data Transfer• When not being used for a stack operation these instructions can also be used in a generic way• The LDM/STM support a further set of instructions – STMIA / LDMIA : Increment After – STMIB / LDMIB : Increment Before – STMDA / LDMDA : Decrement After – STMDB / LDMDB : Decrement Before
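
The four suffixes differ only in which addresses are touched and where the base register ends up. This Python sketch computes both for a transfer of `reg_count` registers; the function name is hypothetical, and the rule that the lowest-numbered register always uses the lowest address is an assumption taken from the ARM architecture manuals.

```python
def block_transfer_addresses(base, reg_count, mode):
    """Return (word addresses in ascending order, base value after writeback)."""
    size = 4 * reg_count
    start = {
        "IA": base,             # Increment After: first word at the base
        "IB": base + 4,         # Increment Before: first word one above the base
        "DA": base - size + 4,  # Decrement After: last word at the base
        "DB": base - size,      # Decrement Before: last word one below the base
    }[mode]
    new_base = base + size if mode[0] == "I" else base - size
    # The lowest-numbered register in the list goes to the lowest address.
    return [start + 4 * i for i in range(reg_count)], new_base

for mode in ("IA", "IB", "DA", "DB"):
    addresses, new_base = block_transfer_addresses(0x1000, 3, mode)
    print(mode, [hex(a) for a in addresses], hex(new_base))
```

For a full descending stack, a push such as STMFD corresponds to the DB case and the matching pop LDMFD to the IA case, which is why the two leave the stack pointer where it started.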
  47. 47. Criteria for different block data transfer
  48. 48. Swap Instruction
  49. 49. Swap Instruction• The instruction is used to swap data between a register and a memory location• This instruction is atomic (cannot be interrupted)• The swap address is determined by the contents of the base register (Rn).• The processor first reads the contents of the swap address. Then it writes the contents of the source register (Rm) to the swap address, and stores the old memory contents in the destination register (Rd).• The same register may be specified as both the source and destination
  50. 50. Branch and Exchange•Used to switch between the Thumb state and the ARM state
  51. 51. Branch and Branch Link
  52. 52. Branch and Branch with Link• Branch instructions contain a signed 2’s complement 24 bit offset.• This is shifted left two bits, sign extended to 32 bits, and added to the PC.• The instruction can therefore specify a branch of +/- 32Mbytes.• The branch offset must take account of the prefetch operation, which causes the PC to be 2 words (8 bytes) ahead of the current instruction.• Branches beyond +/- 32Mbytes must use an offset or absolute destination which has been previously loaded into a register. In this case the PC should be manually saved in R14 if a Branch with Link type operation is required.
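
The offset arithmetic can be made concrete with an encoder and decoder for the 24-bit field. This is a minimal Python sketch of the calculation described above; the function names are hypothetical, and only the offset field is modelled, not the full instruction word.

```python
MASK = 0xFFFFFFFF

def encode_branch_offset(instruction_address, target_address):
    """24-bit offset field for a branch located at instruction_address."""
    byte_offset = target_address - (instruction_address + 8)  # PC is 8 bytes ahead
    if byte_offset % 4:
        raise ValueError("target must be word aligned")
    word_offset = byte_offset >> 2
    if not -(1 << 23) <= word_offset < (1 << 23):
        raise ValueError("target is outside the +/- 32 Mbyte range")
    return word_offset & 0xFFFFFF

def decode_branch_target(instruction_address, field):
    """Recover the branch target from the 24-bit offset field."""
    if field & 0x800000:      # sign extend the two's complement field
        field -= 1 << 24
    return (instruction_address + 8 + (field << 2)) & MASK

print(hex(encode_branch_offset(0x8000, 0x8000)), hex(decode_branch_target(0x8000, 0xFFFFFE)))
```

A branch to itself encodes as 0xFFFFFE, that is -2 words: the offset has to cancel the 8 bytes of prefetch before the PC lands back on the branch.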
  53. 53. Link Bit• Branch with Link (BL) writes the old PC into the link register (R14) of the current bank.• The PC value written into R14 is adjusted to allow for the prefetch, and contains the address of the instruction following the branch and link instruction.• The CPSR is not saved with the PC
  54. 54. Barrel Shifter• A barrel shifter is a digital circuit that can shift a data word by a specified number of bits in one clock cycle.• It can be implemented as a sequence of multiplexers (mux.), and in such an implementation the output of one mux is connected to the input of the next mux in a way that depends on the shift distance.• A barrel shifter is often implemented as a cascade of parallel 2×1 multiplexers.
  55. 55. Using the Barrel Shifter• There are 2 options for specifying the shift amount – held in the bottom byte of a register – given as a 5 bit unsigned integer (immediate)
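  56. 56. Shift Operations• Logical Shift Left (LSL)• Shifts left by the specified amount (multiplies by a power of two)• Example: LSL #5 multiplies by 32• Zeros are shifted in at the bottom; the last bit shifted out goes to the carry flag (CF)
  57. 57. Shift Operations• Logical Shift Right (LSR)• Shifts right without preserving the sign bit: zeros are shifted in at the top and the last bit shifted out goes to the carry flag (CF)• Arithmetic Shift Right (ASR)• Preserves the sign bit: copies of the sign bit are shifted in at the top and the last bit shifted out goes to the carry flag (CF)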
  58. 58. Rotate• Rotate Right (ROR)• Similar to a right shift, but the bits shifted out at the bottom wrap around to the top• The last bit rotated is also used as the carry flag (CF)
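
The four shift types can be compared side by side with a small model. This Python sketch assumes a shift amount between 1 and 31 and returns the 32-bit result together with the carry-out bit; the function names mirror the mnemonics but are hypothetical helpers.

```python
MASK = 0xFFFFFFFF

def lsl(value, amount):
    carry = (value >> (32 - amount)) & 1             # last bit shifted out at the top
    return (value << amount) & MASK, carry

def lsr(value, amount):
    carry = (value >> (amount - 1)) & 1              # last bit shifted out at the bottom
    return value >> amount, carry                    # zeros enter at the top

def asr(value, amount):
    carry = (value >> (amount - 1)) & 1
    result = value >> amount
    if value & 0x80000000:                           # negative: copy the sign bit in
        result |= (MASK << (32 - amount)) & MASK
    return result, carry

def ror(value, amount):
    result = ((value >> amount) | (value << (32 - amount))) & MASK
    return result, result >> 31                      # last bit rotated is now bit 31

print(lsl(1, 5), [hex(x) for x in lsr(0x80000000, 4)], [hex(x) for x in asr(0x80000000, 4)])
```

LSR and ASR give the same result for non-negative values and differ only when bit 31 is set, which is the sign-preservation point made on the slides.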
  59. 59. Comparison• The only effect of the comparisons is to – UPDATE THE CONDITION FLAGS. Thus no need to set S bit.• Operations are: – CMP operand1 - operand2, but result not written – CMN operand1 + operand2, but result not written – TST operand1 AND operand2, but result not written – TEQ operand1 EOR operand2, but result not written• Syntax: – <Operation>{<cond>} Rn, Operand2• Examples: – CMP r0, r1 – TSTEQ r2, #5
  60. 60. Pipelining• Initially implemented a 3-stage pipeline organization (up to ARM7) – Fetch – Decode – Execute
  61. 61. • 3-stage pipeline organization – Principal components • The register bank • The barrel shifter – Can shift or rotate one operand by any number of bits • The ALU • The address register and incrementer – Select and hold all memory addresses and generate sequential addresses • The data registers • The instruction decoder and associated control logic
  62. 62. • Fetch - The instruction is fetched from memory and placed in the instruction pipeline• Decode - The instruction is decoded and the datapath control signals prepared for the next cycle• Execute - The register bank is read, an operand shifted, the ALU result generated and written back into destination register
  63. 63. • At any time slice, 3 different instructions may occupy each of these stages, so the hardware in each stage has to be capable of independent operations• When the processor is executing data processing instructions , the latency = 3 cycles and the throughput = 1 instruction/cycle• Drawback: Every data transfer instruction causes a pipeline “stall”. (Single memory for data and instruction- next instruction cannot be fetched while data is being read)
  64. 64. 5-stage Pipeline Organization• Implemented in ARM9TDMI• Tprog = Ninst * CPI / fclk – Tprog: the time taken to execute a given program – Ninst: the number of ARM instructions executed in the program (compiler dependent) – CPI: average number of clock cycles per instructions => hazard causes pipeline stalls – fclk: frequency
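
A quick calculation shows how the three terms of the formula interact. The numbers in this Python sketch are made up for illustration and do not come from the presentation or from any measured ARM core; `program_time` is a hypothetical helper.

```python
def program_time(n_inst, cpi, f_clk):
    """Tprog = Ninst * CPI / fclk, in seconds when f_clk is in Hz."""
    return n_inst * cpi / f_clk

# Hypothetical program of 1,000,000 instructions.
baseline = program_time(1_000_000, 1.5, 50e6)    # CPI 1.5 at 50 MHz
faster = program_time(1_000_000, 1.25, 100e6)    # CPI 1.25 at 100 MHz
print(baseline, faster, baseline / faster)
```

Since Ninst is fixed by the compiler, the pipeline designer can only shorten Tprog by lowering CPI (fewer stalls) or raising fclk (shorter stages), which is the motivation for the 5-stage organization.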
  65. 65. • Fetch – The instruction is fetched from memory and placed in the instruction pipeline• Decode – The instruction is decoded and register operands read from the register files. There are 3 operand read ports in the register file so most ARM instructions can source all their operands in one cycle• Execute – An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU
  66. 66. • Buffer/Data – Data memory is accessed if required. Otherwise the ALU result is simply buffered for one cycle.• Write back – The result generated by the instruction are written back to the register file, including any data loaded from memory.
  67. 67. 5-stage pipeline organization• Moved the register read step from the execute stage to the decode stage• Execute stage was split into 3 stages- ALU, memory access, write back.• Result: Better balanced pipeline with minimized latencies between stages, which can run at a faster clock speed.
  68. 68. Pipeline Hazards• There are situations, called hazards, that prevent the next instruction in the instruction stream from being executed during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.• There are three classes of hazards: – Structural Hazards – Data Hazards – Control Hazards
  69. 69. Structural Hazards• When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.• If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.
  70. 70. • Ex. A machine has a single memory shared by data and instructions. As a result, when an instruction contains a data-memory reference (load), it conflicts with the instruction fetch of a later instruction (instr 3).
  71. 71. Solution• To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stall is actually implemented.
  72. 72. Solution• Another solution is to use separate instruction and data memories.• ARM has moved from the von-Neumann architecture to the Harvard architecture in ARM9. – Implemented a 5-stage pipeline and separate data and instruction memory. – Doesn’t suffer from this hazard.
  73. 73. Data Hazards• They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.• The problem with data hazards can be solved with a hardware technique called data forwarding (by making use of feedback paths).• Without forwarding, the pipeline would have to be stalled to get the results from the respective registers• Example:
  74. 74. Data Hazards• The first forwarding is for value of R1 from EXadd to EXsub.• The second forwarding is also for value of R1 from MEMadd to EXand.• This code now can be executed without stalls.• Forwarding can be generalized to include passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.
  75. 75. Control Hazards• They arise from the pipelining of branches and other instructions that change the PC.
  76. 76. Further Improvements
  77. 77. THANK YOU•Alok Sharma•Aniket Thakur•Paritosh Ramanan•Pavan A.R.