ARM Programming Using Assembly Language
MODULE 2
(Chapter 6)
Code Optimization
Gayana M N SJEC
 Optimizing code reduces system power consumption and the clock speed needed for real-time operation.
 Optimization can turn an infeasible system into a
feasible one, or an uncompetitive system into a
competitive one.
 Maximum performance can be achieved using hand-written assembly code.
 Writing assembly code gives you direct control of three
optimization tools that you cannot explicitly use by
writing C source:
Three optimization tools
■ Instruction scheduling:
Reordering the instructions in a code sequence to avoid processor stalls caused by dependencies.
■ Register allocation:
Deciding how variables should be allocated to ARM registers or stack locations for maximum performance.
■ Conditional execution:
Accessing the full range of ARM condition codes and conditional instructions.
6.1 Writing Assembly Code
 This section gives examples showing how to convert a C function to basic assembly code. Consider the C program main.c.
 Let’s see how to replace square by an assembly function that performs
the same action. Remove the C definition of square, but not the
declaration (the second line) to produce a new C file main1.c. Next add
an armasm assembler file square.s with the following contents:
 The AREA directive names the area or code section that the code
lives in.
 If you use non-alphanumeric characters in a symbol or area name,
then enclose the name in vertical bars.
 Here we define a read-only code area called .text.
 The EXPORT directive makes the symbol square available for
external linking.
 At line six we define the symbol square as a code label.
 armasm treats non-indented text as a label definition.
 When square is called, the input argument is passed in register r0, and the return value is returned in register r0.
 The multiply instruction has a restriction that the destination register must not be the same as the first argument register.
 Therefore we place the multiply result into r1 and move this to r0.
 Control is returned to the caller with:
MOV pc, lr ; return r0
 The END directive marks the end of the assembly file.
 Comments follow a semicolon.
 Building this example using command line tools:
armcc -c main1.c
armasm square.s
armlink -o main1.axf main1.o square.o
 This only works if you compile your C as ARM code, i.e., with armcc.
 To compile the C code as Thumb code:
 When calling ARM code from a C program compiled as Thumb, the only change required to the assembly code of the previous example is to change the return instruction to a BX.
 Use BX lr instead of MOV pc, lr
Building C Program as Thumb Code:
tcc -c main1.c
armasm -apcs /interwork square2.s
armlink -o main2.axf main1.o square2.o
The -apcs /interwork option tells armasm that the ARM-state code in square2.s interworks with Thumb code compiled from C.
APCS stands for ARM Procedure Call Standard.
 Calling C subroutine from assembly Code:
• Consider the following assembly file, main3.s
 The IMPORT directive is used to declare symbols that are defined in other files.
 The imported symbol Lib$$Request$$armlib makes a request that
the linker links with the standard ARM C library.
 The WEAK specifier prevents the linker from giving an error if the
symbol is not found at link time. If the symbol is not found, it will
take the value zero.
 The second imported symbol ___main is the start of the C library
initialization code.
 You only need to import these symbols if you are defining your own
main; a main defined in C code will import these automatically for you.
 Importing printf allows us to call that C library function.
 The RN directive allows us to define i as an alternate name for
register r4.
 Here, the function must preserve registers r4 to r11 and sp.
 We corrupt i(r4), and calling printf will corrupt lr. Therefore we stack these two
registers at the start of the function using an STMFD instruction.
 The LDMFD instruction pulls these registers from the stack and returns by
writing the return address to pc.
 The DCB directive defines byte data described as a string or a comma-
separated list of bytes.
 To build this example you can use the following command line script:
armasm main3.s
armlink -o main3.axf main3.o
 If the code can be called from Thumb code, then change the last instruction to the two instructions:
LDMFD sp!, {i, lr}
BX lr
 Example
 The sumof function is written in assembly and can accept any
number of arguments.
 C program file main4.c:
 sumof function in an assembly file sumof.s:
6.2 Profiling and Cycle Counting
 The optimization process requires identifying the critical routines and measuring their current performance.
A profiler:
- is a tool that measures the time or cycles taken by each subroutine.
- is used to identify the most critical routines.
A cycle counter:
- measures the number of cycles taken by a specific routine.
- can be used as a benchmark to measure a subroutine's performance before and after an optimization.
Note: The ARM simulator called the ARMulator provides profiling and
cycle counting features. The ARMulator profiler works by sampling the
program counter pc at regular intervals. The profiler identifies the
function the pc points to and updates a hit counter for each function it
encounters. Another approach is to use the trace output of a simulator
as a source for analysis.
6.3 Instruction Scheduling
 The time taken to execute instructions depends on the implementation of the pipeline.
 Here we consider the ARM9TDMI pipeline timings.
 Instructions that are conditional on the value of the ARM condition codes in the cpsr take one cycle if the condition is not met.
 If the condition is met, then the following rules apply:
1. ALU operations such as addition, subtraction, and
logical operations take one cycle. This includes a shift
by an immediate value.
If you use a register-specified shift, then add one
cycle.
If the instruction writes to the pc, then add two cycles.
2. Load instructions that load N 32-bit words of memory, such as LDR and LDM, take N cycles to issue. If the instruction loads pc, then add two cycles.
3. Load instructions that load 16-bit or 8-bit data such as
LDRB, LDRSB, LDRH, and LDRSH take one cycle to issue.
4. Branch instructions take three cycles.
5. Store instructions that store N values take N cycles.
6. Multiply instructions take a varying number of cycles depending on the value of the second operand.
ARM pipeline and dependencies:
 To understand how to schedule code efficiently on the
ARM, we need to understand the ARM pipeline and
dependencies.
 The ARM9TDMI processor performs five operations in
parallel:
1. Fetch:
 Fetches instruction from memory using pc.
 The instruction is then loaded into the core and then
processed down the core pipeline.
2. Decode:
 Decode the instruction that was fetched in the previous
cycle.
 The processor also reads the input operands from the register bank if they are not available via one of the forwarding paths.
3. Execute (ALU):
 Executes the instruction that was decoded in the previous
cycle.
 This instruction was originally fetched from address pc −
8 (ARM state) or pc − 4 (Thumb state).
 This stage calculates
- the answer for a data processing operation,
- the address for a load, store, or branch
operation.
 Some instructions take several cycles in this stage.
- For example, multiply and register-controlled shift operations take several ALU cycles.
4. LS1:
 Loads or stores the data specified by a load or store instruction.
 If the instruction is not a load or store, then this stage has no effect.
5. LS2:
 Sign-extend the data loaded by a byte or halfword load
instruction.
 If the instruction is not a load of an 8-bit byte or 16-bit
halfword item, then this stage has no effect.
How does the pipeline affect the timing of
instructions?
 If an instruction requires the result of a previous
instruction that is not available, then the processor
stalls.
 This is called a pipeline hazard or pipeline interlock.
 Example with no pipeline interlock:
 The ALU calculates r0 + r1 in one cycle.
 Therefore this result is available for the ALU to calculate
r0 + r2 in the second cycle.
 This example shows a one-cycle pipeline interlock caused by load use.
• This instruction pair takes 3 cycles.
• The ALU calculates the address r2 + 4 in the first cycle while decoding the ADD
instruction in parallel.
• However, the ADD cannot proceed on the second cycle because the load
instruction has not yet loaded the value of r1.
• The pipeline stalls for one cycle while the load instruction completes the LS1
stage. Now that r1 is ready, the processor executes the ADD in the ALU on the
third cycle.
• The processor stalls the ADD instruction for one cycle in the ALU stage of the
pipeline while the load instruction completes the LS1 stage. We’ve denoted this
stall by an italic ADD. Since the LDR instruction proceeds down the pipeline,
but the ADD instruction is stalled, a gap opens up between them. This gap is
sometimes called a pipeline bubble. We’ve marked the bubble with a dash.
 This example shows a one-cycle interlock caused by delayed load use.
 This instruction triplet takes 4 cycles.
 Although the ADD proceeds on the cycle following the load byte, the EOR
instruction cannot start on the third cycle.
 The r1 value is not ready until the load instruction completes the LS2 stage
of the pipeline.
 The processor stalls the EOR instruction for one cycle.
 Note that the ADD instruction does not affect the timing at all. The sequence
takes four cycles whether it is there or not.
 The ADD doesn’t cause any stalls since the ADD does not use r1, the result
of the load.
 This example shows why a branch instruction takes three cycles.
 The processor must flush the pipeline when jumping to a new address.
 The three executed instructions take a total of five cycles.
 The MOV instruction executes on the first cycle.
 On the second cycle, the branch instruction calculates the destination address.
This causes the core to flush the pipeline and refill it using this new pc value.
 The refill takes two cycles.
 Finally, the SUB instruction executes normally.
 The pipeline drops the two instructions following the branch when the branch
takes place.
6.3.1 Scheduling of load instructions
 Load instructions occur frequently in compiled code, accounting for approximately one-third of all instructions.
 Careful scheduling of load instructions so that pipeline stalls don't occur can improve performance.
 The compiler attempts to schedule the code as best it can, but the aliasing problems of C limit the available optimizations. The compiler cannot move a load instruction before a store instruction unless it is certain that the two pointers used do not point to the same address.
 Let’s consider an example of a memory-intensive task.
 The following function, str_tolower, copies a zero-
terminated string of characters from in to out. It converts
the string to lowercase in the process.
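The slide's listing of str_tolower is not reproduced in this transcript. A minimal C sketch that matches the description above might look like the following. The parameter order (out, in), the const qualifier, and the unsigned temporary c are assumptions made for illustration.

```c
/* Sketch only: copies the zero-terminated string `in` to `out`,
 * converting upper-case ASCII letters to lower case on the way. */
void str_tolower(char *out, const char *in)
{
    unsigned int c;

    do {
        c = (unsigned char)*in++;      /* load the next character        */
        if (c >= 'A' && c <= 'Z') {
            c += 'a' - 'A';            /* convert to lower case          */
        }
        *out++ = (char)c;              /* store, including the final 0   */
    } while (c != 0);
}
```

Every operation in the loop body depends on the character just loaded, which is why the load is the scheduling bottleneck.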
 The compiler generates the following output.
 Notice that the compiler optimizes the condition (c>='A' && c<='Z') to the check 0 <= c-'A' <= 'Z'-'A'.
 The ARM9TDMI pipeline will stall for two cycles. The compiler can't do any better since everything following the load of c depends on its value. However, there are two ways you can alter the structure of the algorithm to avoid the stall cycles by using assembly. We call these methods load scheduling by preloading and unrolling.
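The range-check rewrite mentioned above can be illustrated in C. This is a sketch with hypothetical function names; it relies on unsigned arithmetic, where c - 'A' wraps to a very large value whenever c is below 'A', so a single unsigned comparison covers both bounds.

```c
/* Two comparisons, as written in the C source. */
int is_upper_two_tests(unsigned int c)
{
    return c >= 'A' && c <= 'Z';
}

/* One unsigned comparison: if c < 'A', c - 'A' wraps around to a
 * value far larger than 'Z' - 'A', so the test fails as required. */
int is_upper_one_test(unsigned int c)
{
    return (c - 'A') <= (unsigned int)('Z' - 'A');
}
```

The single test corresponds to the SUB followed by one CMP in the compiled loop.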
Cycle count:
str_tolower
        LDRB    r2,[r1],#1      ; c = *(in++)            cycle 1
                                ; stall                  cycle 2
                                ; stall                  cycle 3
        SUB     r3,r2,#0x41     ; r3 = c - 'A'           cycle 4
        CMP     r3,#0x19        ; if (c <= 'Z'-'A')      cycle 5
        ADDLS   r2,r2,#0x20     ;   c += 'a'-'A'         cycle 6
        STRB    r2,[r0],#1      ; *(out++) = (char)c     cycle 7
        CMP     r2,#0           ; if (c!=0)              cycle 8
        BNE     str_tolower     ;   goto str_tolower     cycle 9
                                ; flush                  cycle 10
                                ; flush                  cycle 11
        MOV     pc,r14          ; return
6.3.1.1 Load scheduling by preloading
 In this method of load scheduling, we load the data required for the
loop at the end of the previous loop, rather than at the beginning of
the current loop. To get performance improvement with little increase
in code size, we don’t unroll the loop.
 The ARM architecture is particularly well suited to this type of
preloading because instructions can be executed conditionally.
 Since loop i is loading the data for loop i + 1 there is always a
problem with the first and last loops.
 For the first loop, we can preload data by inserting extra load
instructions before the loop starts.
 For the last loop it is essential that the loop does not read any data,
or it will read beyond the end of the array. This could cause a data
abort!
 With ARM, we can easily solve this problem by making the load
instruction conditional.
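The structure of a preloaded loop can be sketched in C as follows. This only illustrates the shape of the algorithm (load before the loop, then load for the next iteration at the end of the current one, skipped on the last iteration). The function name is hypothetical, and a C compiler is free to reschedule the code, so the sketch does not guarantee the assembly timing discussed here.

```c
/* Sketch of load scheduling by preloading, expressed in C. */
void str_tolower_preload(char *out, const char *in)
{
    /* Preload the data for the first iteration before the loop starts. */
    unsigned int c = (unsigned char)*in++;

    for (;;) {
        if (c - 'A' <= (unsigned int)('Z' - 'A')) {
            c += 'a' - 'A';
        }
        *out++ = (char)c;
        if (c == 0) {
            break;                     /* last iteration: no further load   */
        }
        c = (unsigned char)*in++;      /* load data for the next iteration  */
    }
}
```

The conditional exit before the final load plays the role of the conditional load instruction in assembly: nothing is read beyond the terminating zero.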
        LDRB    c, [in], #1         ; c = *(in++)
loop
        SUB     t, c, #'A'          ; t = c-'A'                 cycle 1
        CMP     t, #'Z'-'A'         ; if (t <= 'Z'-'A')         cycle 2
        ADDLS   c, c, #'a'-'A'      ;   c += 'a'-'A';           cycle 3
        STRB    c, [out], #1        ; *(out++) = (char)c;       cycle 4
        TEQ     c, #0               ; test if c==0              cycle 5
        LDRNEB  c, [in], #1         ; if (c!=0) { c=*in++;      cycle 6
        BNE     loop                ;   goto loop; }            cycle 7
                                    ; flush                     cycle 8
                                    ; flush                     cycle 9
        MOV     pc, lr              ; return
Scheduling with preload: the loop takes 9 cycles per character instead of 11, so the speed improvement over the original code is 11/9 ≈ 1.22, i.e., about 1.22 times faster than the original code.
6.3.1.2 Load scheduling by Unrolling
 This method of load scheduling works by unrolling and
then interleaving the body of the loop.
 For example, we can perform loop iterations i, i + 1, i +
2 interleaved.
 When the result of an operation from loop i is not ready,
we can perform an operation from loop i + 1 that avoids
waiting for the loop i result.
loop_next3                                                      ; clock cycle
        LDRB    ca0, [in], #1       ; ca0 = *in++;              c1
        LDRB    ca1, [in], #1       ; ca1 = *in++;              c2
        LDRB    ca2, [in], #1       ; ca2 = *in++;              c3
        SUB     t, ca0, #'A'        ; convert ca0 to lower case c4
        CMP     t, #'Z'-'A'         ;                           c5
        ADDLS   ca0, ca0, #'a'-'A'  ;                           c6
        SUB     t, ca1, #'A'        ; convert ca1 to lower case c7
        CMP     t, #'Z'-'A'         ;                           c8
        ADDLS   ca1, ca1, #'a'-'A'  ;                           c9
        SUB     t, ca2, #'A'        ; convert ca2 to lower case c10
        CMP     t, #'Z'-'A'         ;                           c11
        ADDLS   ca2, ca2, #'a'-'A'  ;                           c12
        STRB    ca0, [out], #1      ; *out++ = ca0;             c13
        TEQ     ca0, #0             ; if (ca0!=0)               c14
Three iterations take 21 cycles, so one iteration takes 7 cycles.
Therefore the unrolled code is 11/7 ≈ 1.57 times faster than the original code.
SUMMARY : Instruction Scheduling
 ARM cores have a pipeline architecture. The pipeline may
delay the results of certain instructions for several cycles. If
you use these results as source operands in a following
instruction, the processor will insert stall cycles until the
value is ready.
 Load and multiply instructions have delayed results in many implementations. (See Appendix D for the cycle timings and delay for your specific ARM processor core.)
 Two software methods are available to remove interlocks following load instructions: you can preload so that loop i loads the data for loop i + 1, or you can unroll the loop and interleave the code for loops i and i + 1.
6.4 Register Allocation
 Among 16 visible ARM registers, 14 registers are used to
hold general-purpose data.
 The remaining two registers are the
- stack pointer r13 and
- the program counter r15.
 For a function to be compliant with the ATPCS (ARM-Thumb Procedure Call Standard), it must preserve the callee values of registers r4 to r11.
 The ATPCS also specifies that the stack should be eight-byte aligned, that is, sp should hold an address that is a multiple of 8 bytes.
 Template for optimized assembly routines requiring
many registers:
r12 is also stacked just to keep the stack eight-byte aligned.
Each stacked register occupies 4 bytes. Assuming the template also stacks lr, registers r4-r11 plus lr make 9 registers, or 36 bytes, which is not a multiple of 8.
Registers r4-r12 plus lr make 10 registers, or 40 bytes, which is a multiple of 8.
r12 need not be stacked if your routine doesn't call other ATPCS routines.
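As a quick check of the alignment arithmetic, here is a small C sketch. It assumes that each stacked register takes 4 bytes, that sp is eight-byte aligned on entry, and that the template stacks lr together with the callee-saved registers (the template listing itself is not reproduced in this transcript). The helper names are hypothetical.

```c
/* Bytes pushed by a store multiple that stacks `num_regs` 32-bit registers. */
unsigned int stacked_bytes(unsigned int num_regs)
{
    return num_regs * 4u;
}

/* Non-zero if pushing that many registers keeps sp a multiple of 8,
 * assuming sp was eight-byte aligned before the push. */
int keeps_eight_byte_alignment(unsigned int num_regs)
{
    return stacked_bytes(num_regs) % 8u == 0;
}
```

With r4-r11 and lr (9 registers) the check fails; adding r12 (10 registers) makes it pass, which is the only reason r12 appears in the list.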
 For ARMv5 and above you can use the preceding
template even when being called from Thumb code.
 For ARMv4T you have to use the following template when the routine may be called from Thumb code.
6.4.1 Allocating Variables to Register Numbers
 It is best to use alternate names for the registers, rather than explicit register numbers.
 Benefits of using alternate names:
 Allows you to change the allocation of variables to register numbers easily.
 Allows you to use different register names for the same physical register number when their use doesn't overlap.
 Register names increase the clarity and readability of
optimized code.
 There are several cases where the physical number of the register
is important:
■ Argument registers.
- The ATPCS convention defines that the first four arguments to a function
are placed in registers r0 to r3.
- Further arguments are placed on the stack.
- The return value must be placed in r0.
■ Registers used in a load or store multiple.
- Load and store multiple instructions LDM and STM operate on a list of
registers in order of ascending register number.
- For example if r0 and r1 appear in the register list, then the processor will
always load or store r0 using a lower address than r1 and so on.
■ Load and store double word.
The LDRD and STRD instructions introduced in ARMv5E operate on a pair
of registers with sequential register numbers, Rd and Rd + 1.
 How do we allocate registers when writing assembly? Suppose we want
- to shift an array of N bits upwards in memory by k bits.
- For simplicity, assume that N is large and a multiple of 256.
- Also assume that 0 ≤ k < 32 and that the input and output pointers are word aligned, i.e., their addresses are multiples of 4.
- A routine of this kind is used to block copy from one bit or byte alignment to a different bit or byte alignment.
 C Function:
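The C listing from the slide is not reproduced in this transcript. A minimal sketch of such a routine, under the assumptions stated earlier (N a non-zero multiple of 32 here, 0 ≤ k < 32), could be written as below. The name shift_bits and the use of uint32_t are choices made for this sketch.

```c
#include <stdint.h>

/* Sketch: shift an N-bit array upwards in memory by k bits, one word
 * at a time. `carry` holds the bits shifted out of the previous word. */
void shift_bits(uint32_t *out, const uint32_t *in, unsigned int N, unsigned int k)
{
    uint32_t carry = 0;

    do {
        uint32_t x = *in++;
        *out++ = (x << k) | carry;
        /* A shift by 32 is undefined in C, so k == 0 is handled separately. */
        carry = (k != 0) ? (x >> (32 - k)) : 0;
        N -= 32;
    } while (N != 0);
}
```

Only a handful of values are live in this loop (x, carry, k, N and the two pointers), which is what makes the register allocation question interesting once the loop is unrolled.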
 Improving the efficiency by loop unrolling
 Register allocation
All the registers are used up. There are two remaining variables, carry and kr, but only one remaining free register, lr.
 Possible ways to deal with running out of registers:
 Reduce the number of registers required by performing
fewer operations in each loop.
In previous example we could load four words in each load
multiple rather than eight.
 Use the stack to store the least-used values to free up more
registers.
In previous example we could store the loop counter N on
the stack.
 Alter the code implementation to free up more registers.
This is the solution we consider in the next example.
 Modified Implementation
 We notice that the carry value need not stay in the
same register.
 We can start off with the carry value in y0 and then
move it to y1 when x0 is no longer required, and so
on.
 We complete the routine by allocating kr to lr and
recoding so that carry is not required.
6.4.2 Using More Than 14 Local Variables
 If you need more than 14 local 32-bit variables in a routine, then you must store some variables on the stack.
 Variables pushed onto the stack are popped in the reverse order.
 The standard procedure is to work outwards from the innermost loop of the algorithm, since the innermost loop has the greatest performance impact.
 This example shows three nested loops, each loop requiring state information inherited from the loop surrounding it.
 You may find that there are insufficient registers for the innermost loop.
 Then you need to swap inner loop variables out to the stack.
 Since assembly code is very hard to maintain and debug if you use numbers as stack address offsets, the assembler provides directives that allocate variables to the stack for you.
 ARM assembler directives MAP (alias ^) and FIELD (alias #)
 They define and allocate space for variables and arrays on the processor stack.
 The directives perform a similar function to the struct operator in C.
Example:
        STMFD   sp!, {r4-r11, lr}   ; save callee registers
        SUB     sp, sp, #length     ; create stack frame
        ; ...
        STR     r0, [sp, #a]        ; a = r0;
        LDRSH   r1, [sp, #b]        ; r1 = b;
        ADD     r2, sp, #d          ; r2 = &d[0]
        ; ...
        ADD     sp, sp, #length     ; restore the stack pointer
        LDMFD   sp!, {r4-r11, pc}   ; return
6.4.3 Making the Most of Available Registers
 It is more efficient to access values held in registers than
values held in memory.
 There are several tricks you can use to fit several variables shorter than 32 bits into a single 32-bit register, and thus reduce code size and increase performance.
 Example1:
 Suppose we want to step through an array by a programmable increment.
A common example is to step through a sound sample at various rates to
produce different pitched notes. We can express this in C code as:
sample = table[index]; index += increment;
 Commonly index and increment are small enough to be held as 16-bit
values. We can pack these two variables into a single 32-bit variable
indinc:
Note: If index and increment are 16-bit values, then putting index in the top 16 bits of indinc correctly implements 16-bit wrap-around. In other words, index = (short)(index + increment). This can be useful if you are using a circular buffer.
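The packing can be sketched in C as follows, with index in the top 16 bits and increment in the bottom 16 bits of indinc. The helper names and the 8-bit sample type are assumptions made for this sketch; in assembly each step maps onto a shifted operand rather than a function call.

```c
#include <stdint.h>

/* Pack a 16-bit index (top half) and a 16-bit increment (bottom half). */
uint32_t indinc_pack(uint16_t index, uint16_t increment)
{
    return ((uint32_t)index << 16) | increment;
}

/* Read table[index], then advance index by increment. The addition
 * wraps at 16 bits because index occupies the top half of the word. */
uint8_t indinc_step(const uint8_t *table, uint32_t *indinc)
{
    uint8_t sample = table[*indinc >> 16];   /* index is the top 16 bits     */
    *indinc += *indinc << 16;                /* adds increment to the index  */
    return sample;
}
```

Shifting indinc left by 16 discards the index and leaves the increment aligned with it, so one addition updates the index while the increment in the bottom half is unchanged.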
 Example2: When you shift by a register amount, the ARM uses bits 0 to 7
as the shift amount. The ARM ignores bits 8 to 31 of the register.
Therefore you can use bits 8 to 31 to hold a second variable distinct from
the shift amount.
 The following example combines the register-specified shift amount shift and the loop counter count to shift an array of 40 entries right by shift bits. Define a variable cntshf that combines count and shift:
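A C sketch of the combined variable is shown below, with count in bits 8 to 31 and shift in bits 0 to 7. C has no equivalent of the ARM behaviour of ignoring the upper bits of a shift register, so the sketch masks them explicitly. The function name is hypothetical, and the sketch assumes shift < 32 and count < 2^24.

```c
#include <stdint.h>

/* Shift `count` words right by `shift` bits using one combined variable. */
void shift_right_all(uint32_t *data, unsigned int count, unsigned int shift)
{
    /* count lives in bits 8..31, the shift amount in bits 0..7. */
    uint32_t cntshf = ((uint32_t)count << 8) + shift;

    while (cntshf >= (1u << 8)) {            /* count field still non-zero   */
        *data = *data >> (cntshf & 0xFFu);   /* ARM reads only bits 0..7     */
        data++;
        cntshf -= 1u << 8;                   /* decrement the count field    */
    }
}
```

Subtracting 1 << 8 changes only the count field, so the shift amount in the bottom byte stays intact for every iteration.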
 Example 3: SIMD (Single Instruction, Multiple Data)
 This example shows how to process multiple 8-bit pixels in an image using normal ADD and MUL instructions to achieve some SIMD operations.
 Suppose we want to merge two images X and Y to produce a new image
Z.
Let xn, yn, and zn denote the nth 8-bit pixel in these images,
respectively.
Let 0 ≤ a ≤ 256 be a scaling factor.
To merge the images, we set
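The merge formula on the slide did not survive extraction. Assuming the usual weighted average z_n = (a·x_n + (256 − a)·y_n)/256, which is consistent with the stated range 0 ≤ a ≤ 256, the idea of handling two pixels with ordinary arithmetic can be sketched in C. The function names and the packing (one pixel in bits 0 to 7, another in bits 16 to 23) are assumptions for this sketch.

```c
#include <stdint.h>

/* Scalar reference: blend one pair of 8-bit pixels, 0 <= a <= 256. */
uint8_t merge_pixel(uint8_t x, uint8_t y, unsigned int a)
{
    return (uint8_t)((a * x + (256u - a) * y) >> 8);
}

/* Two pixels per 32-bit word: one in bits 0..7, one in bits 16..23.
 * Each 16-bit lane holds at most 255 * 256, so lanes never overflow
 * into each other and one multiply-add handles both pixels. */
uint32_t merge_two_pixels(uint32_t xx, uint32_t yy, unsigned int a)
{
    uint32_t sum = a * xx + (256u - a) * yy;
    return (sum >> 8) & 0x00FF00FFu;
}
```

Leaving 8 spare bits above each pixel is what gives the products room to grow without disturbing the neighbouring pixel.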
Summary : Register Allocation
 ARM has 14 available registers for general-purpose use: r0 to
r12 and r14. The stack pointer r13 and program counter r15
cannot be used for general-purpose data.
 If you need more than 14 local variables, push the variables to
the stack.
 Use register names instead of physical register numbers when
writing assembly routines. This makes it easier to reallocate
registers and to maintain the code.
 Store multiple values in the same register.
 For example, you can store a loop counter and a shift in one
register.
 You can also store multiple pixels in one register.
6.5 Conditional Execution
 The processor core can conditionally execute most ARM
instructions.
 By combining conditional execution and conditional setting
of the flags, it is possible to implement simple if statements
without any need for branches.
 This improves efficiency since branches can take many
cycles and also reduces code size.
 Example 1:
 The following C code converts an unsigned integer 0 ≤ i ≤ 15 to a hexadecimal character c:
 We can write this in assembly using conditional execution rather than conditional branches:
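The C listing is not reproduced in this transcript. A sketch of the conversion described above might look like this; the function wrapper and its name are assumptions for illustration.

```c
/* Convert 0 <= i <= 15 to its hexadecimal character. */
char hex_digit(unsigned int i)
{
    char c;

    if (i < 10) {
        c = (char)(i + '0');        /* '0'..'9' */
    } else {
        c = (char)(i + 'A' - 10);   /* 'A'..'F' */
    }
    return c;
}
```

Both arms are a single addition, so in assembly one comparison followed by two oppositely conditioned adds can replace the branch.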
 Example 2:
 Counting vowels
 We can write this in assembly using conditional
execution :
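For reference, the vowel test being discussed can be sketched in C as below. The slide's own listing is not reproduced here, so the function name and the restriction to lower-case vowels are assumptions.

```c
/* Count the lower-case vowels in a zero-terminated string. */
unsigned int count_vowels(const char *s)
{
    unsigned int vowels = 0;

    for (; *s != '\0'; s++) {
        char c = *s;
        if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') {
            vowels++;
        }
    }
    return vowels;
}
```

The chain of || tests is the part that benefits from conditional execution: each later comparison only needs to run if the earlier ones failed.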
 Example 3:
 Detect if c is a letter:
 To implement this efficiently, we can use an addition or
subtraction to move each range to the form 0<= c <=limit .
Then we use unsigned comparisons to detect this range and
conditional comparisons to chain together ranges. The
following assembly implements this efficiently:
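The range technique described above can be sketched in C. Each range is moved down to start at zero by a subtraction and then tested with one unsigned comparison; the function name is hypothetical.

```c
/* Non-zero if c is an ASCII letter. Each range test is a subtraction
 * followed by a single unsigned comparison against the range width. */
int is_letter(unsigned int c)
{
    return (c - 'A') <= (unsigned int)('Z' - 'A')
        || (c - 'a') <= (unsigned int)('z' - 'a');
}
```

The second test is only needed when the first fails, which is what conditional compare instructions express in assembly.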
6.6 Looping Constructs
 ARM loops are fastest when they count down
towards zero.
 Here we see how to
- implement count-down loops efficiently in assembly and
- unroll loops for maximum performance.
 6.6.1 Decremented Counted Loops : N to 1
(inclusive)
 In a decrementing loop of N iterations, the loop counter i counts down from N to 1 inclusive.
 The loop terminates with i = 0.
 An efficient implementation is
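The assembly listing is not reproduced in this transcript. The loop shape it describes can be sketched in C: the counter is decremented and compared with zero in one step, which corresponds to a flag-setting subtract followed by a conditional branch. The function name and the summing body are placeholders, and N ≥ 1 is assumed.

```c
/* Runs the body with i = N, N-1, ..., 1 and returns the sum of those i.
 * Assumes N >= 1, since a do-while body always executes at least once. */
unsigned int sum_countdown(unsigned int N)
{
    unsigned int total = 0;
    unsigned int i = N;

    do {
        total += i;              /* loop body                         */
    } while (--i != 0);          /* decrement, then test against zero */

    return total;                /* here i == 0                       */
}
```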
 6.6.1 Decremented Counted Loops : N-1 to 0
(inclusive)
 In a decrementing loop of N iterations, the loop counter i counts down from N-1 to 0 inclusive.
 An efficient implementation is
 6.6.1 Decremented Counted Loops : N to N/3
(inclusive)
 Suppose we require N/3 loops.
 Rather than attempting to divide N by three, it is far more
efficient to subtract three from the loop counter on each
iteration.
 An efficient implementation is
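A C sketch of this idea follows; the assembly listing itself is not reproduced in this transcript. The counter starts at N − 3 and three is subtracted per iteration, so the body runs N/3 times (rounded down) without any division. The function name is hypothetical and N is taken as a signed value so that the counter can go negative.

```c
/* Returns how many times a loop body runs when the counter starts at
 * N - 3 and three is subtracted on each iteration while it is >= 0. */
unsigned int iterations_by_three(int N)
{
    unsigned int iterations = 0;

    for (int i = N - 3; i >= 0; i -= 3) {
        iterations++;            /* loop body goes here */
    }
    return iterations;
}
```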
 6.6.4 Other Counted Loops
 Negative Indexing
 This loop structure counts from −N to 0 (inclusive or
exclusive) in steps of size STEP.
 6.6.4 Other Counted Loops
 Logarithmic Indexing
 This loop structure counts down from 2^N to 1 in powers of two.
 For example, if N = 4, then it counts 16, 8, 4, 2, 1.
 6.6.4 Other Counted Loops
 Logarithmic Indexing
 The following loop structure counts down from an N-bit
mask to a one-bit mask.
 For example, if N = 4, then it counts 15, 7, 3, 1.
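Both logarithmic loop structures can be sketched in C. In each case the counter is shifted right by one bit per iteration and the loop ends when it reaches zero. The function names and the output buffer are hypothetical, and N < 32 is assumed so that the shift is well defined.

```c
/* Writes 2^N, 2^(N-1), ..., 1 to out[] and returns the number written. */
unsigned int count_powers(unsigned int N, unsigned int *out)
{
    unsigned int n = 0;

    for (unsigned int i = 1u << N; i != 0; i >>= 1) {
        out[n++] = i;
    }
    return n;
}

/* Writes an N-bit mask down to a one-bit mask and returns the number written. */
unsigned int count_masks(unsigned int N, unsigned int *out)
{
    unsigned int n = 0;

    for (unsigned int i = (1u << N) - 1u; i != 0; i >>= 1) {
        out[n++] = i;
    }
    return n;
}
```

For N = 4 the first loop yields 16, 8, 4, 2, 1 and the second yields 15, 7, 3, 1.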
Summary : Looping Constructs
 ARM requires two instructions to implement a counted
loop: a subtract that sets flags and a conditional branch.
 Unroll loops to improve loop performance. Do not overunroll because this will hurt cache performance. The amount of unrolling is limited by the number of registers available.
 Nested loops only require a single loop counter register,
which can improve efficiency by freeing up registers for
other uses.
 ARM can implement negative and logarithmic indexed
loops efficiently.
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
rknatarajan
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
Tonystark477637
 

Recently uploaded (20)

chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 

18CS44-MES-Module-2(Chapter 6).pptx

  • 1. ARM Programming Using Assembly Language MODULE 2 (Chapter 6)
  • 2. Code Optimization (Gayana M N, SJEC). Optimizing code reduces system power consumption and the clock speed needed for real-time operation. Optimization can turn an infeasible system into a feasible one, or an uncompetitive system into a competitive one. Maximum performance can be achieved using hand-written assembly code. Writing assembly code gives you direct control of three optimization tools that you cannot explicitly use by writing C source:
  • 3. Three optimization tools.
    ■ Instruction scheduling: reordering the instructions in a code sequence to avoid processor stalls caused by dependencies.
    ■ Register allocation: deciding how variables should be allocated to ARM registers or stack locations for maximum performance.
    ■ Conditional execution: accessing the full range of ARM condition codes and conditional instructions.
  • 4. 6.1 Writing Assembly Code. This section gives examples showing how to convert a C function to basic assembly code. Consider the C program main.c.
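The listing for main.c is not reproduced in this transcript. A minimal sketch of what such a program might contain is given here, assuming `square` simply multiplies its argument by itself. The helper `sum_of_squares` is hypothetical and exists only to exercise the call.

```c
#include <stdio.h>

/* Declaration of square: this is the line that stays in the C file when
   the C definition is later replaced by an assembly implementation. */
int square(int i);

/* Hypothetical caller (not from the slides): prints the squares of
   0..n-1 and returns their sum. */
int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        printf("Square of %d is %d\n", i, square(i));
        total += square(i);
    }
    return total;
}

/* C definition of square: the function the section converts to assembly. */
int square(int i)
{
    return i * i;
}
```

Because the caller only sees the declaration, the definition can later move to another object file without any change to the calling code.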
  • 5. Writing Assembly Code… Let's see how to replace square with an assembly function that performs the same action. Remove the C definition of square, but not the declaration (the second line), to produce a new C file main1.c. Next add an armasm assembler file square.s with the following contents:
  • 6. Writing Assembly Code… The AREA directive names the area or code section that the code lives in. If you use non-alphanumeric characters in a symbol or area name, enclose the name in vertical bars. Here we define a read-only code area called .text. The EXPORT directive makes the symbol square available for external linking. At line six we define the symbol square as a code label.
  • 7. Writing Assembly Code… armasm treats non-indented text as a label definition. When square is called, the input argument is passed in register r0, and the return value is returned in register r0. The multiply instruction has a restriction that the destination register must not be the same as the first argument register. Therefore we place the multiply result into r1 and move this to r0. The instruction MOV pc, lr returns control to the caller, with the result in r0. The END directive marks the end of the assembly file. Comments follow a semicolon.
  • 8. Writing Assembly Code… Building this example using command line tools:
        armcc -c main1.c
        armasm square.s
        armlink -o main1.axf main1.o square.o
    This only works if you compile your C as ARM code, i.e., with armcc. To compile the C code as Thumb code: when calling ARM code from a C program compiled as Thumb, the only change required to the assembly code of the previous example is to change the return instruction to a BX. Use BX lr instead of MOV pc, lr.
  • 9. Writing Assembly Code… Building the C program as Thumb code:
        tcc -c main1.c
        armasm -apcs /interwork square2.s
        armlink -o main2.axf main1.o square2.o
    The -apcs /interwork option tells armasm that the ARM-state code in square2.s interworks with the Thumb code compiled from C. APCS stands for ARM Procedure Call Standard.
  • 10. Writing Assembly Code… Calling a C subroutine from assembly code: consider the following assembly file, main3.s.
  • 11. Writing Assembly Code… The IMPORT directive is used to declare symbols that are defined in other files. The imported symbol Lib$$Request$$armlib makes a request that the linker links with the standard ARM C library. The WEAK specifier prevents the linker from giving an error if the symbol is not found at link time. If the symbol is not found, it will take the value zero. The second imported symbol __main is the start of the C library initialization code. You only need to import these symbols if you are defining your own main; a main defined in C code will import these automatically for you. Importing printf allows us to call that C library function. The RN directive allows us to define i as an alternate name for register r4.
  • 12. Writing Assembly Code… Here, the function must preserve registers r4 to r11 and sp. We corrupt i (r4), and calling printf will corrupt lr. Therefore we stack these two registers at the start of the function using an STMFD instruction. The LDMFD instruction pulls these registers from the stack and returns by writing the return address to pc. The DCB directive defines byte data described as a string or a comma-separated list of bytes. To build this example you can use the following command line script:
        armasm main3.s
        armlink -o main3.axf main3.o
    If the code can be called from Thumb code, change the last instruction to the two instructions:
        LDMFD sp!, {i, lr}
        BX lr
  • 13. Writing Assembly Code… Example: the sumof function is written in assembly and can accept any number of arguments. C program file main4.c:
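The main4.c and sumof.s listings are not reproduced in this transcript. As a point of reference, a plain C version of a function that accepts any number of arguments can be sketched with `<stdarg.h>`. The convention that the first argument `n` gives the number of values that follow is an assumption made for this sketch.

```c
#include <stdarg.h>

/* Reference sketch in C of a variadic sum.
   Assumption: n is the count of int values passed after it. */
int sumof(int n, ...)
{
    va_list ap;
    int sum = 0;

    va_start(ap, n);
    while (n-- > 0)
        sum += va_arg(ap, int);   /* fetch the next int argument */
    va_end(ap);

    return sum;
}
```

A call such as `sumof(3, 1, 2, 3)` adds the three values after the count; `sumof(0)` returns zero.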
  • 14. Writing Assembly Code… The sumof function in an assembly file sumof.s:
  • 15. 6.2 Profiling and Cycle Counting. The optimization process requires identifying the critical routines and measuring their current performance. A profiler is a tool that measures the time or cycles taken by each subroutine; it is used to identify the most critical routines. A cycle counter measures the number of cycles taken by a specific routine; it can be used as a benchmark to measure subroutine performance before and after an optimization. Note: the ARM simulator, called the ARMulator, provides profiling and cycle counting features. The ARMulator profiler works by sampling the program counter pc at regular intervals. The profiler identifies the function the pc points to and updates a hit counter for each function it encounters. Another approach is to use the trace output of a simulator as a source for analysis.
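The sampling idea can be sketched in C as follows. This is not the ARMulator's implementation; the `func_entry` table and `record_sample` function are hypothetical names used only to show how a sampled pc value is mapped to a per-function hit counter.

```c
#include <stddef.h>

/* Hypothetical function table entry: an address range and a hit counter. */
typedef struct {
    const char   *name;
    unsigned long start;   /* first address of the function */
    unsigned long end;     /* one past the last address of the function */
    unsigned long hits;    /* number of samples that landed in the range */
} func_entry;

/* Record one sampled pc value. Returns the index of the function that
   contains pc, or -1 if pc falls outside every known function. */
int record_sample(func_entry *table, size_t count, unsigned long pc)
{
    for (size_t i = 0; i < count; i++) {
        if (pc >= table[i].start && pc < table[i].end) {
            table[i].hits++;
            return (int)i;
        }
    }
    return -1;
}
```

After many samples, the functions with the largest hit counts are the ones where the program spends most of its time, which makes them the first candidates for optimization.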
  • 16. 6.3 Instruction Scheduling. The time taken to execute instructions depends on the implementation of the pipeline. Here we consider the ARM9TDMI pipeline timings. Instructions that are conditional on the value of the ARM condition codes in the cpsr take one cycle if the condition is not met. If the condition is met, then the following rules apply: 1. ALU operations such as addition, subtraction, and logical operations take one cycle. This includes a shift by an immediate value. If you use a register-specified shift, then add one cycle. If the instruction writes to the pc, then add two cycles.
  • 17. Instruction Scheduling… 2. Load instructions that load N 32-bit words of memory, such as LDR and LDM, take N cycles to issue. If the instruction loads pc, then add two cycles. 3. Load instructions that load 16-bit or 8-bit data, such as LDRB, LDRSB, LDRH, and LDRSH, take one cycle to issue. 4. Branch instructions take three cycles. 5. Store instructions that store N values take N cycles. 6. Multiply instructions take a varying number of cycles depending on the value of the second operand.
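The timing rules above can be collected into a small cost model. The sketch below is a simplification written for illustration: the type and function names are hypothetical, multiply instructions are left out because their timing depends on the operand value, and interlock stalls are not modeled.

```c
/* Instruction classes covered by this simplified model. */
typedef enum {
    OP_ALU,          /* add, subtract, logical operations */
    OP_LOAD_WORDS,   /* LDR / LDM loading n 32-bit words */
    OP_LOAD_NARROW,  /* LDRB, LDRSB, LDRH, LDRSH */
    OP_BRANCH,
    OP_STORE         /* stores of n values */
} op_class;

typedef struct {
    op_class cls;
    int n;             /* words transferred (OP_LOAD_WORDS, OP_STORE) */
    int reg_shift;     /* nonzero: register-specified shift (OP_ALU) */
    int writes_pc;     /* nonzero: writes or loads pc (OP_ALU, OP_LOAD_WORDS) */
    int cond_passed;   /* zero: the condition was not met */
} instr;

/* Issue cycles for one instruction under the rules listed above. */
int issue_cycles(const instr *in)
{
    if (!in->cond_passed)
        return 1;                       /* failed condition: one cycle */

    switch (in->cls) {
    case OP_ALU:
        return 1 + (in->reg_shift ? 1 : 0) + (in->writes_pc ? 2 : 0);
    case OP_LOAD_WORDS:
        return in->n + (in->writes_pc ? 2 : 0);
    case OP_LOAD_NARROW:
        return 1;
    case OP_BRANCH:
        return 3;
    case OP_STORE:
        return in->n;
    }
    return 0;
}
```

Summing `issue_cycles` over a loop body gives a lower bound on its cycle count; stalls caused by dependencies between instructions add to that figure.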
  • 18. Instruction Scheduling… ARM pipeline and dependencies: To understand how to schedule code efficiently on the ARM, we need to understand the ARM pipeline and dependencies. The ARM9TDMI processor performs five operations in parallel. 1. Fetch: fetches the instruction from memory using pc. The instruction is then loaded into the core and processed down the core pipeline. 2. Decode: decodes the instruction that was fetched in the previous cycle. The processor also reads the input operands from the register bank or if they are not available via one of the
  • 19. Instruction Scheduling… 3. Execute (ALU): executes the instruction that was decoded in the previous cycle. This instruction was originally fetched from address pc − 8 (ARM state) or pc − 4 (Thumb state). This stage calculates the answer for a data processing operation, or the address for a load, store, or branch operation. Some instructions take several cycles in this stage. For example, multiply and register-controlled shift operations take several ALU cycles.
  • 20. Instruction Scheduling… 4. LS1: loads or stores the data specified by a load or store instruction. If the instruction is not a load or store, then this stage has no effect. 5. LS2: sign-extends the data loaded by a byte or halfword load instruction. If the instruction is not a load of an 8-bit byte or 16-bit halfword item, then this stage has no effect.
  • 21. Instruction Scheduling… How does the pipeline affect the timing of instructions? If an instruction requires the result of a previous instruction that is not available, then the processor stalls. This is called a pipeline hazard or pipeline interlock. Example with no pipeline interlock: the ALU calculates r0 + r1 in one cycle. Therefore this result is available for the ALU to calculate r0 + r2 in the second cycle.
  • 22. Instruction Scheduling… This example shows a one-cycle pipeline interlock caused by load use. The instruction pair takes 3 cycles. The ALU calculates the address r2 + 4 in the first cycle while decoding the ADD instruction in parallel. However, the ADD cannot proceed on the second cycle because the load instruction has not yet loaded the value of r1. The pipeline stalls for one cycle while the load instruction completes the LS1 stage. Now that r1 is ready, the processor executes the ADD in the ALU on the third cycle. The processor stalls the ADD instruction for one cycle in the ALU stage of the pipeline while the load instruction completes the LS1 stage. We've denoted this stall by an italic ADD. Since the LDR instruction proceeds down the pipeline, but the ADD instruction is stalled, a gap opens up between them. This gap is sometimes called a pipeline bubble. We've marked the bubble with a dash.
  • 23. Instruction Scheduling… This example shows a one-cycle interlock caused by delayed load use. The instruction triplet takes 4 cycles. Although the ADD proceeds on the cycle following the load byte, the EOR instruction cannot start on the third cycle. The r1 value is not ready until the load instruction completes the LS2 stage of the pipeline. The processor stalls the EOR instruction for one cycle. Note that the ADD instruction does not affect the timing at all. The sequence takes four cycles whether it is there or not. The ADD doesn't cause any stalls since the ADD does not use r1, the result of the load.
  • 24. Instruction Scheduling… This example shows why a branch instruction takes three cycles. The processor must flush the pipeline when jumping to a new address. The three executed instructions take a total of five cycles. The MOV instruction executes on the first cycle. On the second cycle, the branch instruction calculates the destination address. This causes the core to flush the pipeline and refill it using this new pc value. The refill takes two cycles. Finally, the SUB instruction executes normally. The pipeline drops the two instructions following the branch when the branch takes place.
  • 25. 6.3.1 Scheduling of load instructions. Load instructions occur frequently in compiled code, accounting for approximately one-third of all instructions. Careful scheduling of load instructions so that pipeline stalls don't occur can improve performance. The compiler attempts to schedule the code as best it can, but the aliasing problems of C limit the available optimizations. The compiler cannot move a load instruction before a store instruction unless it is certain that the two pointers used do not point to the same address. Let's consider an example of a memory-intensive task.
  • 26. 6.3.1 Scheduling of load instructions… The following function, str_tolower, copies a zero-terminated string of characters from in to out. It converts the string to lowercase in the process.
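The C listing is not reproduced in this transcript. A sketch consistent with the description and with the comments in the compiled output shown later (`c = *(in++)`, `*(out++) = (char)c`) could read as follows; the `const` qualifier and the `unsigned char` cast are additions made here for safety.

```c
/* Copy the zero-terminated string at in to out, converting upper-case
   ASCII letters to lower case. The terminator is copied as well. */
void str_tolower(char *out, const char *in)
{
    unsigned int c;

    do {
        c = (unsigned char)*(in++);
        if (c >= 'A' && c <= 'Z')
            c = c + ('a' - 'A');
        *(out++) = (char)c;
    } while (c);
}
```

Every step in the loop body depends on the character just loaded, which is why the load latency cannot be hidden without restructuring the loop.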
  • 27. 6.3.1 Scheduling of load instructions… The compiler generates the following compiled output. Notice that the compiler optimizes the condition (c>='A' && c<='Z') to check that 0 <= c-'A' <= 'Z'-'A'. The ARM9TDMI pipeline will stall for two cycles. The compiler can't do any better since everything following the load of c depends on its value. However, there are two ways you can alter the structure of the algorithm to avoid the stall cycles by using assembly. We call these methods load scheduling by preloading and unrolling.
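The range-check rewrite mentioned above relies on unsigned arithmetic: when c is below 'A', the subtraction wraps around to a large value, so a single unsigned comparison covers both bounds. A small sketch, with hypothetical function names, shows the two forms side by side.

```c
/* Original form: two comparisons. */
int is_upper_two_compares(unsigned int c)
{
    return c >= 'A' && c <= 'Z';
}

/* Optimized form: one subtraction and one unsigned comparison.
   For c < 'A' the difference wraps to a huge unsigned value, so the
   test fails exactly when the two-comparison form fails. */
int is_upper_one_compare(unsigned int c)
{
    return (c - 'A') <= (unsigned int)('Z' - 'A');
}
```

This corresponds to the SUB followed by a single CMP in the compiled output, with the unsigned LS condition selecting the conversion.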
  • 28. 6.3.1 Scheduling of load instructions… Cycle count:
    str_tolower
            LDRB   r2,[r1],#1      ; c = *(in++)           cycle 1 (cycles 2-3: stall)
            SUB    r3,r2,#0x41     ; r3 = c - 'A'          cycle 4
            CMP    r3,#0x19        ; if (c <= 'Z'-'A')     cycle 5
            ADDLS  r2,r2,#0x20     ; c += 'a'-'A'          cycle 6
            STRB   r2,[r0],#1      ; *(out++) = (char)c    cycle 7
            CMP    r2,#0           ; if (c != 0)           cycle 8
            BNE    str_tolower     ; goto str_tolower      cycle 9 (cycles 10-11: flush)
            MOV    pc,r14          ; return
  • 29. 6.3.1.1 Load scheduling by preloading. In this method of load scheduling, we load the data required for the loop at the end of the previous loop, rather than at the beginning of the current loop. To get a performance improvement with little increase in code size, we don't unroll the loop. The ARM architecture is particularly well suited to this type of preloading because instructions can be executed conditionally. Since loop i is loading the data for loop i + 1, there is always a problem with the first and last loops. For the first loop, we can preload data by inserting extra load instructions before the loop starts. For the last loop it is essential that the loop does not read any data, or it will read beyond the end of the array. This could cause a data abort! With ARM, we can easily solve this problem by making the load instruction conditional.
  • 30. 6.3.1.1 Load scheduling by preloading…
  • 31. 6.3.1.1 Load scheduling by preloading…
            LDRB   c, [in], #1       ; c = *(in++)
    loop
            SUB    t, c, #'A'        ; t = c-'A'                cycle 1
            CMP    t, #'Z'-'A'       ; if (t <= 'Z'-'A')        cycle 2
            ADDLS  c, c, #'a'-'A'    ; c += 'a'-'A';            cycle 3
            STRB   c, [out], #1      ; *(out++) = (char)c;      cycle 4
            TEQ    c, #0             ; test if c==0             cycle 5
            LDRNEB c, [in], #1       ; if (c!=0) { c=*in++;     cycle 6
            BNE    loop              ; goto loop; }             cycle 7 (cycles 8-9: flush)
            MOV    pc, lr            ; return
    Scheduling with preload: the loop takes 9 cycles per character instead of 11, so the code is 11/9 ≈ 1.22 times faster than the original.
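The structure of the preloaded loop can also be written out in C, which makes the first-loop and last-loop handling visible. This is an illustrative sketch with a hypothetical function name; a C compiler is free to reorder it, so it shows the control flow rather than guaranteeing the schedule.

```c
/* str_tolower restructured so that each iteration loads the character
   for the NEXT iteration, and the final iteration loads nothing. */
void str_tolower_preload(char *out, const char *in)
{
    /* Preload for the first iteration, before the loop starts. */
    unsigned int c = (unsigned char)*(in++);

    for (;;) {
        unsigned int t = c - 'A';
        if (t <= (unsigned int)('Z' - 'A'))
            c += 'a' - 'A';
        *(out++) = (char)c;
        if (c == 0)
            break;                       /* last loop: no further load */
        c = (unsigned char)*(in++);      /* conditional preload for loop i + 1 */
    }
}
```

The load is skipped once the terminator has been stored, so the function never reads past the end of the input string.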
  • 32. 6.3.1.2 Load scheduling by Unrolling. This method of load scheduling works by unrolling and then interleaving the body of the loop. For example, we can perform loop iterations i, i + 1, i + 2 interleaved. When the result of an operation from loop i is not ready, we can perform an operation from loop i + 1 that avoids waiting for the loop i result.
  • 33. 6.3.1.2 Load scheduling by Unrolling…
  • 34. 6.3.1.2 Load scheduling by Unrolling…
  • 35. 6.3.1.2 Load scheduling by Unrolling…
    loop_next3                                                    Clock cycles
            LDRB   ca0, [in], #1     ; ca0 = *in++;                c1
            LDRB   ca1, [in], #1     ; ca1 = *in++;                c2
            LDRB   ca2, [in], #1     ; ca2 = *in++;                c3
            SUB    t, ca0, #'A'      ; convert ca0 to lower case   c4
            CMP    t, #'Z'-'A'                                     c5
            ADDLS  ca0, ca0, #'a'-'A'                              c6
            SUB    t, ca1, #'A'      ; convert ca1 to lower case   c7
            CMP    t, #'Z'-'A'                                     c8
            ADDLS  ca1, ca1, #'a'-'A'                              c9
            SUB    t, ca2, #'A'      ; convert ca2 to lower case   c10
            CMP    t, #'Z'-'A'                                     c11
            ADDLS  ca2, ca2, #'a'-'A'                              c12
            STRB   ca0, [out], #1    ; *out++ = ca0;               c13
            TEQ    ca0, #0           ; if (ca0!=0)                 c14
    Three iterations take 21 cycles, that is, 7 cycles per iteration. The unrolled code is therefore 11/7 ≈ 1.57 times faster than the original code.
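The listing above is cut off after the first store. To show the overall shape of the technique, here is a C sketch of a three-way unrolled loop. The function names are hypothetical, and the sketch carries the same caveat as any such unrolling: the three loads are issued before any terminator test, so it may read up to two bytes past the end of the string, and the caller must make sure those bytes are readable.

```c
/* Convert one character to lower case using a single unsigned compare. */
static unsigned int to_lower_char(unsigned int c)
{
    if (c - 'A' <= (unsigned int)('Z' - 'A'))
        c += 'a' - 'A';
    return c;
}

/* Three characters per iteration. Assumption: the input buffer has at
   least two readable bytes after the terminating zero. */
void str_tolower_unrolled(char *out, const char *in)
{
    unsigned int c0, c1, c2;

    do {
        /* Loads grouped together so none waits on the previous result. */
        c0 = (unsigned char)*in++;
        c1 = (unsigned char)*in++;
        c2 = (unsigned char)*in++;

        c0 = to_lower_char(c0);
        c1 = to_lower_char(c1);
        c2 = to_lower_char(c2);

        /* Stores stop as soon as the terminator has been written. */
        *out++ = (char)c0;
        if (c0 == 0)
            break;
        *out++ = (char)c1;
        if (c1 == 0)
            break;
        *out++ = (char)c2;
    } while (c2 != 0);
}
```

Writes never go past the terminator, so only the input side needs padding.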
  • 36. Summary: Instruction Scheduling. ARM cores have a pipeline architecture. The pipeline may delay the results of certain instructions for several cycles. If you use these results as source operands in a following instruction, the processor will insert stall cycles until the value is ready. Load and multiply instructions have delayed results in many implementations. (See Appendix D for the cycle timings and delay for your specific ARM processor core.) Two software methods are available to remove interlocks following load instructions: you can preload so that loop i loads the data for loop i + 1, or you can unroll the loop and interleave the code for loops i and i + 1.
  • 37. 6.4 Register Allocation. Among the 16 visible ARM registers, 14 are used to hold general-purpose data. The remaining two registers are the stack pointer r13 and the program counter r15. For a function to be compliant with the ATPCS (ARM-Thumb Procedure Call Standard), it must preserve the callee-saved registers r4 to r11. ATPCS also specifies that the stack should be eight-byte aligned, that is, the stack pointer must hold an address that is a multiple of 8 bytes.
  • 38. 6.4 Register Allocation… Template for optimized assembly routines requiring many registers: r12 is also stacked just to keep the stack eight-byte aligned. Each register occupies 4 bytes, so an even number of registers must be stacked for the total to be a multiple of 8. r12 need not be stacked if your routine doesn't call other ATPCS routines.
  • 39. 6.4 Register Allocation… For ARMv5 and above you can use the preceding template even when being called from Thumb code. For ARMv4T you have to use the following template.
  • 40. 6.4.1 Allocating Variables to Register Numbers. It is best to use alternate names for the registers, rather than explicit register numbers. Benefits of using alternate names: it allows you to change the allocation of variables to register numbers easily; it allows you to use different register names for the same physical register number when their use doesn't overlap; and register names increase the clarity and readability of optimized code.
  • 41. 6.4.1 Allocating Variables to Register Numbers… There are several cases where the physical number of the register is important:
    ■ Argument registers. The ATPCS convention defines that the first four arguments to a function are placed in registers r0 to r3. Further arguments are placed on the stack. The return value must be placed in r0.
    ■ Registers used in a load or store multiple. Load and store multiple instructions LDM and STM operate on a list of registers in order of ascending register number. For example, if r0 and r1 appear in the register list, then the processor will always load or store r0 using a lower address than r1, and so on.
    ■ Load and store double word. The LDRD and STRD instructions introduced in ARMv5E operate on a pair of registers with sequential register numbers, Rd and Rd + 1.
  • 42. 6.4.1 Allocating Variables to Register Numbers… How do we allocate registers when writing assembly? Suppose we want to shift an array of N bits upwards in memory by k bits. For simplicity, assume that N is large and a multiple of 256. Also assume that 0 ≤ k < 32 and that the input and output pointers are word aligned. This type of routine is useful for a block copy from one bit or byte alignment to a different bit or byte alignment.
  • 43. 6.4.1 Allocating Variables to Register Numbers… C function:
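The C listing itself is missing from the transcript. A minimal sketch of a word-at-a-time bit shift matching the description is given below. The name `shift_bits`, the parameter order, and the explicit handling of k = 0 (needed in C because a shift by 32 is undefined) are choices made for this sketch.

```c
#include <stdint.h>

/* Shift an array of n_bits bits upwards in memory by k bits.
   Assumptions: n_bits is a nonzero multiple of 32, 0 <= k < 32, and
   both pointers are word aligned. Bits shifted out of the top of one
   word are carried into the bottom of the next. */
void shift_bits(uint32_t *out, const uint32_t *in,
                unsigned int n_bits, unsigned int k)
{
    uint32_t carry = 0;

    do {
        uint32_t x = *in++;
        *out++ = (x << k) | carry;
        carry = k ? (x >> (32 - k)) : 0;   /* avoid an undefined shift by 32 */
        n_bits -= 32;
    } while (n_bits);
}
```

The loop keeps only a handful of live values (the two pointers, the bit count, k, the carry and the current word), which is what makes the register allocation question interesting once the loop is unrolled to process many words at a time.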
  • 44. 6.4.1 Allocating Variables to Register Numbers…
  • 45. 6.4.1 Allocating Variables to Register Numbers… Improving the efficiency by loop unrolling.
  • 46. 6.4.1 Allocating Variables to Register Numbers… Register allocation: all the registers have been used up. Two variables remain, carry and kr, but there is only one free register, lr.
  • 47. 6.4.1 Allocating Variables to Register Numbers… Possible ways to proceed when we run out of registers: Reduce the number of registers required by performing fewer operations in each loop. In the previous example we could load four words in each load multiple rather than eight. Use the stack to store the least-used values to free up more registers. In the previous example we could store the loop counter N on the stack. Alter the code implementation to free up more registers. This is the solution we consider in the next example.
  • 48. 6.4.1 Allocating Variables to Register Numbers… Modified implementation: we notice that the carry value need not stay in the same register. We can start off with the carry value in y0 and then move it to y1 when x0 is no longer required, and so on. We complete the routine by allocating kr to lr and recoding so that carry is not required.
  • 49. 6.4.1 Allocating Variables to Register Numbers…
  • 50. Variables. If you need more than 14 local 32-bit variables in a routine, then you must store some variables on the stack. Pop them in the reverse order. The standard procedure is to work outwards from the innermost loop of the algorithm, since the innermost loop has the greatest performance impact. This example shows three nested loops, each loop requiring state information inherited from the loop surrounding it. You may find that there are insufficient registers for the innermost loop. Then you need to swap inner loop variables out to the stack. Since assembly code is very hard to maintain and debug if you use numbers as stack address offsets, the assembler provides an
  • 51. Variables… The ARM assembler directives MAP (alias ^) and FIELD (alias #) define and allocate space for variables and arrays on the processor stack. The directives perform a similar function to the struct operator in C.
  • 52. Variables… ARM assembler directives MAP (alias ^) and FIELD (alias #), example:
            STMFD  sp!, {r4-r11, lr}   ; save callee registers
            SUB    sp, sp, #length     ; create stack frame
            ; ...
            STR    r0, [sp, #a]        ; a = r0;
            LDRSH  r1, [sp, #b]        ; r1 = b;
            ADD    r2, sp, #d          ; r2 = &d[0]
            ; ...
            ADD    sp, sp, #length     ; restore the stack pointer
            LDMFD  sp!, {r4-r11, pc}   ; return
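The comparison with a C struct can be made concrete. The sketch below lays out a frame whose members play the role of the symbols a, b, d and length used in the example. The field sizes (a 4-byte a, 2-byte b and c, a 64-byte array d) are assumptions chosen for illustration, since the MAP/FIELD definitions themselves are not shown in the transcript.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stack frame; each member corresponds to one FIELD. */
typedef struct {
    int32_t a;       /* 4 bytes, accessed with STR   */
    int16_t b;       /* 2 bytes, accessed with LDRSH */
    int16_t c;       /* 2 bytes */
    char    d[64];   /* 64-byte array */
} frame;

/* The offsets the assembler symbols would stand for. */
enum {
    OFF_A  = offsetof(frame, a),
    OFF_B  = offsetof(frame, b),
    OFF_C  = offsetof(frame, c),
    OFF_D  = offsetof(frame, d),
    LENGTH = sizeof(frame)
};
```

As with the directives, adding or reordering a member updates every offset automatically, so no hand-written numeric stack offsets appear in the code.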
  • 53. 6.4.3 Making the Most of Available Registers. It is more efficient to access values held in registers than values held in memory. There are several tricks you can use to fit several sub-32-bit variables into a single 32-bit register, and thus reduce code size and increase performance.
  • 54. 6.4.3 Making the Most of Available Registers… Example 1: Suppose we want to step through an array by a programmable increment. A common example is to step through a sound sample at various rates to produce different pitched notes. We can express this in C code as:
        sample = table[index];
        index += increment;
    Commonly index and increment are small enough to be held as 16-bit values. We can pack these two variables into a single 32-bit variable indinc. Note: if index and increment are 16-bit values, then putting index in the top 16 bits of indinc correctly implements 16-bit wrap-around. In other words, index = (short)(index + increment). This can be useful if you are using a circular buffer.
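A C sketch of the packing follows, with index in the top 16 bits and increment in the bottom 16 bits as the note describes. The function names are hypothetical. Adding `indinc << 16` to `indinc` adds increment to the index field, and any carry out of bit 31 is discarded, which gives the 16-bit wrap-around.

```c
#include <stdint.h>

/* Pack index (top 16 bits) and increment (bottom 16 bits). */
uint32_t pack_indinc(uint16_t index, uint16_t increment)
{
    return ((uint32_t)index << 16) | increment;
}

/* Advance the index field by the increment field; the increment
   field itself is unchanged. */
uint32_t step_indinc(uint32_t indinc)
{
    return indinc + (indinc << 16);
}

/* Fetch the current sample, then step to the next position. */
uint8_t next_sample(const uint8_t *table, uint32_t *indinc)
{
    uint8_t sample = table[*indinc >> 16];
    *indinc = step_indinc(*indinc);
    return sample;
}
```

Both the table lookup and the update use a shifted copy of the same variable, which on ARM fits the shifted-operand forms of the load and add instructions.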
  • 55. 6.4.3 Making the Most of Available Registers…..
    ■ Example 2: When you shift by a register amount, the ARM uses bits 0 to 7 as the shift amount and ignores bits 8 to 31 of the register. Therefore you can use bits 8 to 31 to hold a second variable distinct from the shift amount.
    ■ This example combines a register-specified shift amount, shift, with a loop counter, count, to shift an array of 40 entries right by shift bits. Define a variable cntshf that combines count and shift.
  • 56. 6.4.3 Making the Most of Available Registers…..
    ■ Example 3: SIMD (Single Instruction, Multiple Data)
    ■ This example shows how to process multiple 8-bit pixels in an image using normal ADD and MUL instructions to achieve some SIMD operations.
    ■ Suppose we want to merge two images X and Y to produce a new image Z. Let x_n, y_n, and z_n denote the nth 8-bit pixel in these images, respectively, and let 0 ≤ a ≤ 256 be a scaling factor. To merge the images, we set each z_n to a mix of x_n and y_n weighted by a.
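A sketch of the packed-pixel idea in C. The blend formula z_n = (a*x_n + (256 - a)*y_n) / 256 is an assumption, since the formula itself does not survive in the slide text; the lane layout (one pixel in bits 0 to 7, another in bits 16 to 23) and the names pack2 and blend2 are also chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Place two 8-bit pixels in the low bytes of the two 16-bit halves. */
static uint32_t pack2(uint8_t p0, uint8_t p1) {
    return (uint32_t)p0 | ((uint32_t)p1 << 16);
}

/* Blend two pixel pairs with ordinary multiplies and one add.
 * Each 16-bit lane receives a*x + (256-a)*y, which is at most 255*256
 * and therefore never carries into the neighbouring lane. */
static uint32_t blend2(uint32_t xx, uint32_t yy, uint32_t a) {
    uint32_t zz;
    assert(a <= 256u);
    zz = a * xx + (256u - a) * yy;   /* each lane now holds 256 * z      */
    return (zz >> 8) & 0x00FF00FFu;  /* divide by 256 and repack lanes   */
}
```

The same two multiplies would otherwise be needed per pixel, so packing halves the arithmetic for this step.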
  • 57. 6.4.3 Making the Most of Available Registers…..
  • 58. Summary: Register Allocation
    ■ ARM has 14 available registers for general-purpose use: r0 to r12 and r14. The stack pointer r13 and program counter r15 cannot be used for general-purpose data.
    ■ If you need more than 14 local variables, store the extra variables on the stack.
    ■ Use register names instead of physical register numbers when writing assembly routines. This makes it easier to reallocate registers and to maintain the code.
    ■ Store multiple values in the same register. For example, you can store a loop counter and a shift in one register, or multiple pixels in one register.
  • 59. 6.5 Conditional Execution
    ■ The processor core can conditionally execute most ARM instructions.
    ■ By combining conditional execution and conditional setting of the flags, it is possible to implement simple if statements without any need for branches.
    ■ This improves efficiency, since branches can take many cycles, and also reduces code size.
  • 60. 6.5 Conditional Execution…..
    ■ Example 1: This example converts an unsigned integer 0 ≤ i ≤ 15 to a hexadecimal character c.
    ■ We can write this in assembly using conditional execution rather than conditional branches.
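A minimal C sketch of the conversion in Example 1, assuming uppercase digits A to F (the slide text does not say which case is used). The function name hex_digit is hypothetical.

```c
#include <assert.h>

/* Convert 0 <= i <= 15 to a hexadecimal character (uppercase assumed). */
static char hex_digit(unsigned i) {
    char c;
    assert(i <= 15u);
    if (i < 10u) {
        c = (char)('0' + i);           /* digits 0-9  */
    } else {
        c = (char)('A' + (i - 10u));   /* letters A-F */
    }
    return c;
}
```

Both arms are a single addition selected by one comparison, which is the shape that suits one compare followed by two conditionally executed instructions instead of a branch.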
  • 61. 6.5 Conditional Execution…..
    ■ Example 2: Counting vowels.
    ■ We can write this in assembly using conditional execution.
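A C sketch of the vowel count in Example 2, assuming only the lowercase vowels a, e, i, o, u are counted. The name count_vowels is hypothetical. The comparisons are combined without an if statement to mirror the branch-free style that conditional execution makes possible.

```c
#include <assert.h>
#include <stddef.h>

/* Count lowercase vowels in a NUL-terminated string. */
static unsigned count_vowels(const char *s) {
    unsigned vowels = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        char c = *s;
        /* Each comparison yields 0 or 1; at most one can be 1 for a given c. */
        vowels += (unsigned)((c == 'a') | (c == 'e') | (c == 'i') |
                             (c == 'o') | (c == 'u'));
    }
    return vowels;
}
```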
  • 62. 6.5 Conditional Execution…..
    ■ Example 3: Detect if c is a letter.
    ■ To implement this efficiently, we can use an addition or subtraction to move each range to the form 0 <= c <= limit. Then we use unsigned comparisons to detect this range and conditional comparisons to chain together ranges.
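The range trick in Example 3 can be sketched in C. Subtracting the lower bound moves the range to start at 0, and an unsigned comparison then rejects values below the range as well, because they wrap to large numbers. The names is_letter and count_letters are hypothetical, and ASCII is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if c is in 'A'..'Z' or 'a'..'z', using one unsigned compare per range. */
static int is_letter(unsigned char c) {
    unsigned upper = (unsigned)c - 'A';   /* 0..25 only when c is 'A'..'Z' */
    unsigned lower = (unsigned)c - 'a';   /* 0..25 only when c is 'a'..'z' */
    return (upper <= (unsigned)('Z' - 'A')) || (lower <= (unsigned)('z' - 'a'));
}

/* Count the letters in a NUL-terminated string. */
static unsigned count_letters(const char *s) {
    unsigned letters = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        letters += (unsigned)is_letter((unsigned char)*s);
    }
    return letters;
}
```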
  • 63. 6.6 Looping Constructs
    ■ ARM loops are fastest when they count down towards zero.
    ■ This section shows how to implement count-down loops efficiently in assembly and how to unroll loops for maximum performance.
  • 64. 6.6 Looping Constructs…..
    ■ 6.6.1 Decremented Counted Loops: N to 1 (inclusive)
    ■ In a decrementing loop of N iterations, the loop counter i counts down from N to 1 inclusive.
    ■ The loop terminates with i = 0.
  • 65. 6.6 Looping Constructs
    ■ 6.6.1 Decremented Counted Loops: N-1 to 0 (inclusive)
    ■ In a decrementing loop of N iterations, the loop counter i counts down from N-1 to 0 inclusive.
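The two count-down forms can be sketched in C as do-while loops, where the loop test is a decrement followed by a comparison against zero. The function names and the seen array (used only to record the counter values) are hypothetical; both sketches assume N ≥ 1.

```c
#include <assert.h>
#include <stddef.h>

/* Counter takes the values N, N-1, ..., 1; the loop exits with i == 0. */
static unsigned visit_n_to_1(unsigned n, unsigned *seen) {
    unsigned i = n, k = 0;
    assert(n >= 1u && seen != NULL);
    do {
        seen[k++] = i;          /* loop body */
    } while (--i != 0u);        /* decrement, then test against zero */
    return k;                   /* number of iterations */
}

/* Counter takes the values N-1, ..., 0; the loop exits when i goes negative. */
static unsigned visit_nm1_to_0(int n, int *seen) {
    int i = n - 1;
    unsigned k = 0;
    assert(n >= 1 && seen != NULL);
    do {
        seen[k++] = i;          /* loop body */
    } while (--i >= 0);         /* decrement, then test the sign */
    return k;
}
```

The second form is convenient when the counter doubles as an array index, since it reaches 0.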
  • 66. 6.6 Looping Constructs
    ■ 6.6.1 Decremented Counted Loops: N/3 iterations
    ■ Suppose we require N/3 iterations.
    ■ Rather than attempting to divide N by three, it is far more efficient to subtract three from the loop counter on each iteration.
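A C sketch of the subtract-three loop, assuming N ≥ 1 and a signed counter. The function name is hypothetical. As written, this sketch runs the body N/3 times rounded up, so N = 10 gives four iterations.

```c
#include <assert.h>

/* Run a loop body about N/3 times without dividing: subtract 3 each pass. */
static unsigned iterations_step3(int n) {
    int i = n;
    unsigned iterations = 0;
    assert(n >= 1);
    do {
        iterations++;           /* loop body would go here */
        i -= 3;                 /* subtract three instead of dividing N */
    } while (i > 0);
    return iterations;
}
```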
  • 67. 6.6 Looping Constructs…..
    ■ 6.6.4 Other Counted Loops: Negative Indexing
    ■ This loop structure counts from −N to 0 (inclusive or exclusive) in steps of size STEP.
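A C sketch of negative indexing, using the exclusive form (the loop stops before i reaches 0). The index runs from −N upwards and is applied to a pointer just past the end of the array, so the counter and the array index are the same variable. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every step-th element of data[0..n-1] with an index that counts up to 0. */
static long sum_negative_index(const int *data, int n, int step) {
    const int *end;
    long total = 0;
    int i;
    assert(data != NULL && n >= 1 && step >= 1);
    end = data + n;                     /* one past the last element */
    for (i = -n; i < 0; i += step) {
        total += end[i];                /* end[-n] is data[0]        */
    }
    return total;
}
```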
  • 68. 6.6 Looping Constructs…..
    ■ 6.6.4 Other Counted Loops: Logarithmic Indexing
    ■ This loop structure counts down from 2^N to 1 in powers of two.
    ■ For example, if N = 4, then it counts 16, 8, 4, 2, 1.
  • 69. 6.6 Looping Constructs…..
    ■ 6.6.4 Other Counted Loops: Logarithmic Indexing
    ■ The following loop structure counts down from an N-bit mask to a one-bit mask.
    ■ For example, if N = 4, then it counts 15, 7, 3, 1.
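Both logarithmic forms can be sketched in C with a right shift as the loop step. The function names and the seen array (used only to record the counter values) are hypothetical, and N is assumed to be below 32 so the shifts are well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counter takes the values 2^N, 2^(N-1), ..., 2, 1. */
static unsigned visit_powers(unsigned n, uint32_t *seen) {
    uint32_t i;
    unsigned k = 0;
    assert(n < 32u && seen != NULL);
    i = (uint32_t)1 << n;
    do {
        seen[k++] = i;      /* loop body */
        i >>= 1;            /* halve the counter */
    } while (i != 0u);
    return k;
}

/* Counter takes the values of an N-bit mask down to a one-bit mask. */
static unsigned visit_masks(unsigned n, uint32_t *seen) {
    uint32_t i;
    unsigned k = 0;
    assert(n >= 1u && n < 32u && seen != NULL);
    i = ((uint32_t)1 << n) - 1u;
    do {
        seen[k++] = i;      /* loop body */
        i >>= 1;            /* drop the top bit of the mask */
    } while (i != 0u);
    return k;
}
```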
  • 70. Summary: Looping Constructs
    ■ ARM requires two instructions to implement a counted loop: a subtract that sets flags and a conditional branch.
    ■ Unroll loops to improve loop performance. Do not overunroll, because this will hurt cache performance. The amount of unrolling is also limited by the number of registers available.
    ■ Nested loops only require a single loop counter register, which can improve efficiency by freeing up registers for other uses.
    ■ ARM can implement negative and logarithmic indexed loops efficiently.