This document discusses embedded and real-time systems. It covers several topics:
- The CPU bus, which forms the backbone of computer hardware systems and allows communication between the CPU, memory, and I/O devices.
- Memory components like DRAM, SRAM, and flash memory that are used in embedded systems.
- Designing embedded computing platforms, including considerations like system architectures, evaluation boards, and the PC as an embedded platform.
- Platform-level performance analysis, which measures the bandwidth of the memory, the bus, and CPU fetches as data moves through the system.
3. UNIT II: EMBEDDED COMPUTING PLATFORM DESIGN
• The CPU bus
• Memory devices and systems
• Designing with computing platforms
• Consumer electronics architecture
• Platform-level performance analysis
• Components for embedded programs
• Models of programs
• Assembly, linking, and loading
• Compilation techniques
• Program-level performance analysis
• Software performance optimization
• Program-level energy and power analysis and optimization
• Analysis and optimization of program size
• Program validation and testing
4. In this topic, we concentrate on bus-based computer systems created
using microprocessors, I/O devices, and memory components.
The microprocessor is an important element of the embedded
computing system, but it cannot do its job without memories and I/O
devices.
We need to understand how to interconnect microprocessors and
devices using the CPU bus.
5. The CPU bus forms the backbone of the hardware system.
A computer system encompasses much more than the
CPU; it also includes memory and I/O devices. The bus is
the mechanism by which the CPU communicates with
memory and devices.
A bus is, at a minimum, a collection of wires, but the bus
also defines a protocol by which the CPU, memory, and
devices communicate.
One of the major roles of the bus is to provide an interface to memory and I/O devices.
6. Bus Protocols
The basic building block of most bus protocols is the four-cycle handshake, illustrated in the figure (not reproduced here).
The four cycles are described below.
1. Device 1 raises its output to signal an
enquiry, which tells device 2 that it
should get ready to listen for data.
2. When device 2 is ready to receive, it
raises its output to signal an
acknowledgment. At this point, devices 1
and 2 can transmit or receive.
3. Once the data transfer is complete,
device 2 lowers its output, signaling that
it has received the data.
4. After seeing that ack has been released,
device 1 lowers its output.
At the end of the handshake, both
handshaking signals are low, just as
they were at the start of the
handshake.
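As a rough sketch, the handshake can be modeled in C with two shared flags. The names enq and ack are chosen here for illustration; in real hardware these are dedicated signal wires driven by two concurrently running devices, not variables:

#include <stdbool.h>

/* Sketch of the four-cycle handshake. The two devices run concurrently;
   volatile models wires changed by the other side. */
static volatile bool enq = false, ack = false;

void device1_send(void) {
    enq = true;        /* cycle 1: raise enquiry, "get ready to listen"    */
    while (!ack) ;     /* wait for device 2's acknowledgment (cycle 2)     */
    /* ... data is transferred while both lines are high ...               */
    while (ack) ;      /* cycle 3: device 2 drops ack when it has the data */
    enq = false;       /* cycle 4: release enq; both lines are low again   */
}

void device2_receive(void) {
    while (!enq) ;     /* wait for the enquiry                             */
    ack = true;        /* cycle 2: signal readiness; transfer happens      */
    /* ... receive the data ...                                            */
    ack = false;       /* cycle 3: data received, lower ack                */
}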
7. A typical microprocessor bus.
The term bus is used in two ways.
The most basic use is as a set of related wires, such as address wires.
However, the term may also mean a protocol for communicating between
components.
The fundamental bus operations are reading and writing.
8. The behavior of a bus is most often specified as a timing diagram.
A timing diagram shows how the signals on a bus vary over time.
10. DMA
Standard bus transactions require the CPU to be in the middle of every read
and write transaction.
However, there are certain types of data transfers in which the CPU does not
need to be involved.
Direct memory access (DMA) is a bus operation that allows reads and writes not controlled by the CPU. A DMA transfer is controlled by a DMA controller, which requests control of the bus from the CPU.
11. Bus mastership
• By default, CPU is bus master and initiates transfers.
• DMA must become bus master to perform its work.
– CPU can’t use bus while DMA operates.
• Bus mastership protocol:
– Bus request.
– Bus grant.
• Direct memory access (DMA) performs data transfers without
executing instructions.
– CPU sets up transfer.
– DMA engine fetches, writes.
• DMA controller is a separate unit.
12. DMA operation
• CPU sets DMA registers for start address, length.
• DMA status register controls the unit.
• Once DMA is bus master, it transfers automatically.
– May run continuously until complete.
– May use every nth bus cycle.
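As a hedged sketch of the CPU-side setup: the registers below are memory-mapped, but the addresses and bit layout are invented for illustration and differ on every real DMA controller:

#include <stdint.h>

/* Hypothetical memory-mapped DMA controller registers. */
#define DMA_BASE       0x40001000u
#define DMA_START_ADDR (*(volatile uint32_t *)(DMA_BASE + 0x0))
#define DMA_LENGTH     (*(volatile uint32_t *)(DMA_BASE + 0x4))
#define DMA_STATUS     (*(volatile uint32_t *)(DMA_BASE + 0x8))
#define DMA_GO         0x1u   /* illustrative control bit */
#define DMA_DONE       0x2u   /* illustrative status bit  */

void dma_transfer(uint32_t start, uint32_t nbytes) {
    DMA_START_ADDR = start;    /* CPU sets start address              */
    DMA_LENGTH     = nbytes;   /* ... and length                      */
    DMA_STATUS     = DMA_GO;   /* status register starts the unit     */
    while (!(DMA_STATUS & DMA_DONE))
        ;                      /* CPU may do non-bus work meanwhile   */
}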
14. System bus configurations
• Multiple busses allow parallelism:
– Slow devices on one bus.
– Fast devices on separate bus.
• A bridge connects two busses.
[Figure: the CPU, memory, and a high-speed device share a high-speed bus; slow devices sit on a second bus; a bridge joins the two busses.]
15. ARM AMBA bus
• Since the ARM CPU is manufactured by many different
vendors, the bus provided off-chip can vary from chip to chip.
ARM has created a separate bus specification for single-chip
systems.
• The AMBA bus [ARM99A] supports CPUs, memories, and
peripherals integrated in a system-on-silicon.
• Two varieties:
– AHB (AMBA high-performance bus) supports pipelining, burst transfers, split transactions, and multiple bus masters.
– APB (AMBA peripherals bus) is simple, lower-speed, lower
cost.
– All devices are slaves on APB.
18. Memory components
• Several types of
memory:
– DRAM.
– SRAM.
– Flash.
• Each type of memory
comes in varying:
– Capacities.
– Widths.
19. Random-access memory
• Dynamic RAM is dense, requires refresh.
– Synchronous DRAM is dominant type.
– SDRAM uses clock to improve performance, pipeline
memory accesses.
– DDR(double data rate) SDRAMs
• Static RAM is faster, less dense, consumes more
power.
• For PCs, SIMMs (single in-line memory modules) and DIMMs are used.
Random-access memories can be both read and written. They are called
random access because, unlike magnetic disks, addresses can be read in
any order.
21. Read-only memory
• ROM may be programmed at factory.
• Flash is dominant form of field-programmable
ROM.
– Electrically erasable, must be block erased.
– Random access, but write/erase is much slower
than read.
– NOR flash is more flexible.
– NAND flash is more dense.
22. Flash memory
• Non-volatile memory.
– Flash can be programmed in-circuit.
• Random access for read.
• To write:
– Erase a block to 1.
– Write bits to 0.
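A small model of why writes can only move bits from 1 to 0 (the function names are illustrative, not a real flash API):

#include <stdint.h>

#define BLOCK_WORDS 1024
static uint32_t block[BLOCK_WORDS];   /* models one flash block */

/* Erase sets every bit in the block to 1. */
void flash_erase_block(void) {
    for (int i = 0; i < BLOCK_WORDS; i++)
        block[i] = 0xFFFFFFFFu;
}

/* Programming can only clear bits (1 -> 0), so rewriting a word
   with new 1-bits first requires erasing the whole block. */
void flash_program_word(int i, uint32_t data) {
    block[i] &= data;
}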
23. Flash writing
• Write is much slower than read.
– 1.6 ms write, 70 ns read.
• Blocks are large (approx. 1 Mb).
• Writing causes wear that eventually destroys
the device.
– Modern lifetime approx. 1 million writes.
24. Types of flash
• NOR:
– Word-accessible read.
– Erase by blocks.
• NAND:
– Read by pages (512-4K bytes).
– Erase by blocks.
• NAND is cheaper and has faster erase and sequential access times.
27. Designing with computing platforms (microprocessors)
In this topic we are going to see:
• how to create an initial working embedded system; and
• how to ensure that the system works properly,
by considering possible architectures for embedded computing systems, by studying techniques for designing the hardware components of embedded systems, and by describing the use of the PC as an embedded computing platform.
28. System architectures
• The architecture of an embedded computing
system is the blueprint for implementing that
system—it tells you what components you
need and how you put them together
• Architectures and components:
– software;
– hardware.
• Some software is very hardware-dependent.
29. Hardware platform architecture
Contains several elements:
• CPU
• bus
• memory
• I/O devices: networking, sensors, actuators,
etc.
How big and how fast must each one be?
30. Software architecture
Functional description must be broken into pieces:
• division among people
• conceptual organization
• performance
• testability
• maintenance
Mixing together different types of functionality into a single code
module leads to spaghetti code, which has poorly structured
control flow, excessive use of global data, and generally unreliable
programs.
31. Hardware and software architectures
Hardware and software are intimately related:
• software doesn’t run without hardware;
• how much hardware you need is determined
by the software requirements:
– speed;
– memory.
32. Evaluation boards
• Designed by CPU manufacturer or others.
• Includes CPU, memory, some I/O devices.
• May include prototyping section.
• CPU manufacturer often gives out evaluation board netlist---can be used as starting point for your custom board design.
33. Adding logic to a board
• Programmable logic devices (PLDs) provide
low/medium density logic.
• Field-programmable gate arrays (FPGAs)
provide more logic and multi-level logic.
• Application-specific integrated circuits (ASICs)
are manufactured for a single purpose.
34. The PC as a platform
• Advantages:
– cheap (for industry) and easy to get;
– rich and familiar software environment.
• Disadvantages:
– requires a lot of hardware resources;
– not well-adapted to real-time.
– It is larger, more power hungry, and more
expensive than a custom hardware platform
would be
35. Typical PC hardware platform
[Figure: the CPU bus carries the CPU, memory, DMA controller, timers, and interrupt controller; bus interfaces bridge the CPU bus to a high-speed bus and a low-speed bus, each with devices attached.]
■ ROM holds the boot program.
■ RAM is used for program storage.
■ PCI: standard for high-speed interfacing, at 33 or 66 MHz; PCI Express is the newer version.
■ USB (Universal Serial Bus), FireWire (IEEE 1394): relatively low-cost serial interfaces with high speed.
36. Software elements
• IBM PC uses BIOS (Basic I/O System) to
implement low-level functions:
– boot-up
– minimal device drivers.
• BIOS has become a generic term for the
lowest-level system software.
37. Example of a single-chip system: StrongARM (SA-1100)
• StrongARM system includes:
– CPU chip (3.686 MHz clock)
– system control module (32.768
kHz clock).
– Real-time clock;
– operating system timer
– general-purpose I/O;
– interrupt controller;
– power manager controller;
– reset controller.
38. Debugging embedded systems
• Challenges:
– target system may be hard to observe;
– target may be hard to control;
– may be hard to generate realistic inputs;
– setup sequence may be complex.
39. Host/target design
• Use a host system to prepare software for the target system.
[Figure: the host system connects to the target system over a serial line.]
40. Host-based tools
• Cross compiler:
– compiles code on the host for the target system (i.e., a cross-compiler is a compiler that runs on one type of machine but generates code for another).
• Cross debugger:
– displays target state, allows target system to be controlled.
41. Software debuggers
• A monitor program residing on the target
provides basic debugger functions.
• Debugger should have a minimal footprint in
memory.
• The user program must be careful not to destroy the debugger program; the debugger, in turn, should be able to recover from some damage caused by user code.
42. Breakpoints
• A breakpoint allows the user to stop
execution, examine system state, and change
state.
• Replace the breakpointed instruction with a
subroutine call to the monitor program.
43. ARM breakpoints
Uninstrumented code:
0x400 MUL r4,r6,r6
0x404 ADD r2,r2,r4
0x408 ADD r0,r0,#1
0x40c B loop
Code with breakpoint:
0x400 MUL r4,r6,r6
0x404 ADD r2,r2,r4
0x408 ADD r0,r0,#1
0x40c BL bkpoint
44. Breakpoint handler actions
• Save registers.
• Allow user to examine machine.
• Before returning, restore system state.
– Safest way to execute the instruction is to replace
it and execute in place.
– Put another breakpoint after the replaced
breakpoint to allow restoring the original
breakpoint.
45. In-circuit emulators
• A microprocessor in-circuit emulator is a
specially-instrumented microprocessor.
• Allows you to stop execution, examine CPU
state, modify registers.
46. Boundary scan
• Simplifies testing of
multiple chips on a
board.
– Registers on pins can be
configured as a scan
chain.
– Used for debuggers, in-
circuit emulators.
47. How to exercise code
• Run on host system.
• Run on target system.
• Run in instruction-level simulator.
• Run on cycle-accurate simulator.
• Run in hardware/software co-simulation
environment.
48. Debugging real-time code
• Bugs in drivers can cause non-deterministic behavior in the foreground program.
• Bugs may be timing-dependent.
CONSUMER ELECTRONICS ARCHITECTURE
50. Logic analyzers
• A logic analyzer is an array of low-grade oscilloscopes.
51. Logic analyzer architecture
The analyzer can sample many different signals simultaneously (tens to hundreds) but can display only 0, 1, or changing values for each.
[Figure: sampled signals pass through a latch into the analyzer.]
52. The logic analyzer records the values on the signals into an internal
memory and then displays the results on a display once the memory
is full or the run is aborted.
A typical logic analyzer can acquire data in either of two modes that
are typically called state and timing modes.
State and timing mode represent different ways of sampling the
values.
Timing mode uses an internal clock that is fast enough to take
several samples per clock period in a typical system.
State mode, on the other hand, uses the system’s own clock to
control sampling, so it samples each signal only once per clock cycle.
As a result, timing mode requires more memory to store a given
number of system clock cycles.
53. Logic analyzer operation
• The system’s data signals are sampled at a latch within the logic analyzer; the latch is controlled by either the system clock or the internal logic analyzer sampling clock, depending on whether the analyzer is being used in state or timing mode.
• Each sample is copied into a vector memory under the control
of a state machine.
• The latch, timing circuitry, sample memory, and controller must
be designed to run at high speed since several samples per
system clock cycle may be required in timing mode.
• After the sampling is complete, an embedded microprocessor
takes over to control the display of the data captured in the
sample memory.
• Logic analyzers typically provide a number of formats for viewing data. One format is a timing diagram format.
55. System-level performance analysis
• Performance depends on all the elements of the system:
– CPU.
– Cache.
– Bus.
– Main memory.
– I/O device.
In this section, we will develop some basic techniques for analyzing
the performance of bus-based systems.
56. System level data flows and performance
We want to move data from memory to the CPU to process it. To get the data
from memory to the CPU we must:
■ read from the memory;
■ transfer over the bus to the cache; and
■ transfer from the cache to the CPU.
The time required to transfer from the cache to the CPU is included in the instruction execution time, but the other two times are not.
The most basic measure of performance we are interested in is bandwidth.
57. Bandwidth as performance
• Bandwidth(the rate at which we can move data)
applies to several components:
– Memory.
– Bus.
– CPU fetches.
• Different parts of the system run at different clock
rates.
• Different components may have different widths
(bus, memory).
• We have to make sure that we apply the right clock
rate to each part of the performance estimate
when we convert from clock cycles to seconds.
• Bandwidth questions often come up when we are transferring large blocks of data, as in the following example.
58. Bandwidth and data transfers
Consider the bandwidth provided by only one system component, the bus, and an image of 320 x 240 pixels with each pixel composed of 3 bytes of data.
• Video frame: 320 x 240 x 3 = 230,400 bytes.
– Must be transferred in 1/30 sec.
• Transferring 1 byte/µs leads to 0.23 sec per frame (230,400 µs), i.e., more than 1/30 sec.
– Too slow; we have to increase the transfer rate by about 7 times.
• We can increase bandwidth in two ways:
– increase the clock rate of the bus (e.g., to 2 MHz); or
– increase the amount of data transferred per clock cycle (e.g., 4 bytes).
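The arithmetic, checked in a short C program:

#include <stdio.h>

int main(void) {
    long frame_bytes = 320L * 240 * 3;     /* 230,400 bytes per frame  */
    double deadline  = 1.0 / 30;           /* 0.0333 s allowed         */
    double rate      = 1e6;                /* 1 byte/us = 10^6 bytes/s */
    double t = frame_bytes / rate;         /* 0.2304 s per frame       */
    printf("%.4f s per frame vs. %.4f s allowed: %.1fx too slow\n",
           t, deadline, t / deadline);     /* about 6.9x               */
    return 0;
}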
59. Bus bandwidth
• T: # bus cycles; P: time per bus cycle.
• Total time for transfer: t = TP.
• D: data payload length; W: width of a transfer; O = O1 + O2: overhead cycles before and after the payload.
[Figure: one basic transfer = O1 overhead cycles, a data payload of width W taking D cycles, then O2 overhead cycles.]
T_basic(N) = (D + O)N/W
60. Bus burst transfer bandwidth
• T: # bus cycles; P: time per bus cycle.
• Total time for transfer: t = TP.
• D: data payload length per beat; B: number of beats (1 … B) in a burst; O = O1 + O2: overhead cycles per burst.
[Figure: one burst = B data beats of width W, followed by the overhead cycles O.]
T_burst(N) = (BD + O)N/(BW)
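A small sketch evaluating both formulas in C; the results are in bus cycles, with W the transfer width in bytes, D the data cycles per beat, O the total overhead cycles, and B the burst length:

/* Cycles to move N bytes with basic transfers: one overhead per beat. */
long t_basic(long N, long D, long O, long W) {
    return (D + O) * N / W;
}

/* Cycles to move N bytes with bursts: one overhead per B-beat burst. */
long t_burst(long N, long B, long D, long O, long W) {
    return (B * D + O) * N / (B * W);
}

For example, t_basic(612000, 1, 3, 2) gives the 1,224,000 bus cycles that appear in the performance spreadsheet a few slides later.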
62. Memory access times
• Memory component access times come from the chip data sheet.
– Page modes allow faster access for successive
transfers on same page.
• If data doesn’t fit naturally into physical
words:
– A = [(E/w)mod W]+1
63. Bus performance bottlenecks
• Transfer a 320 x 240 video frame @ 30 frames/sec = 612,000 bytes/sec.
• Is the performance bottleneck the bus or the memory?
[Figure: memory connected to the CPU by the bus.]
65. Performance spreadsheet

                  bus                  memory
clock period      1.00E-06             1.00E-08
W                 2                    0.5
D                 1                    1
O                 3                    4
B                 --                   4
N                 612000               612000
T (cycles)        T_basic = 1224000    T_mem = 2448000
t (seconds)       1.22E+00             2.45E-02
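Reading the table with T_basic applied to the bus and T_burst applied to the memory (taking B = 4 as page-mode access on the memory side, an interpretation that reproduces the table's values), a quick check in C:

#include <stdio.h>

int main(void) {
    double N = 612000;
    /* bus: 1 MHz clock, W = 2, D = 1, O = 3 */
    double bus_cycles = (1 + 3) * N / 2;            /* 1,224,000 */
    double t_bus = bus_cycles * 1.00e-6;            /* 1.22 s    */
    /* memory: 100 MHz clock, W = 0.5, D = 1, O = 4, B = 4 */
    double mem_cycles = (4*1 + 4) * N / (4 * 0.5);  /* 2,448,000 */
    double t_mem = mem_cycles * 1.00e-8;            /* 0.0245 s  */
    printf("bus: %.2f s, memory: %.4f s\n", t_bus, t_mem);
    return 0;
}

Against a 1/30 s (0.033 s) frame deadline, the memory meets the deadline but the bus does not, so the bus is the bottleneck.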
66. Parallelism
• Speed things up by running
several units at once.
• DMA provides parallelism if
CPU doesn’t need the bus:
– DMA + bus.
– CPU.
[Figure: transfer with DMA.]
68. Components for embedded programs
In this section, we study in detail the process of programming embedded processors.
The creation of embedded programs is at the heart of embedded system design.
Embedded code must not only provide rich functionality, it must also often run at a
required rate to meet system deadlines, fit into the allowed amount of memory, and
meet power consumption requirements. Designing code that simultaneously meets
multiple design constraints is a considerable challenge, but luckily there are
techniques and tools that we can use to help us through the design process.
We consider code for three structures or components that are commonly used in embedded software:
1. the state machine,
2. the circular buffer, and
3. the queue.
State machines are well suited to reactive systems such as user interfaces; circular buffers and queues are useful in digital signal processing.
69. Software State Machine
• State machine keeps internal state as a
variable, changes state based on inputs.
• Uses:
– control-dominated code;
– reactive systems.
When inputs appear intermittently rather than as periodic
samples, it is often convenient to think of the system as reacting
to those inputs. The reaction of most systems can be
characterized in terms of the input received and the current
state of the system. This leads naturally to a finite-state machine
70. State machine example (seat belt controller)
[State diagram: states idle, seated, belted, buzzer. Transitions: idle -- no seat/- --> idle; idle -- seat/timer on --> seated; seated -- no seat/- --> idle; seated -- no belt and no timer/- --> seated; seated -- no belt and timer/buzzer on --> buzzer; seated -- belt/- --> belted; belted -- no belt/timer on --> seated; belted -- no seat/- --> idle; buzzer -- belt/buzzer off --> belted; buzzer -- no seat/buzzer off --> idle.]
The controller’s job is to turn on a buzzer if a person sits in a seat and does not fasten the seat belt within a fixed amount of time.
This system has three inputs and one output.
The inputs are a sensor for the seat to know when a person has sat down, a seat belt sensor that tells when the belt is fastened, and a timer that goes off when the required time interval has elapsed.
The output is the buzzer.
71. C implementation
#define IDLE 0
#define SEATED 1
#define BELTED 2
#define BUZZER 3

switch (state) {
case IDLE:
    if (seat) { state = SEATED; timer_on = TRUE; }
    break;
case SEATED:
    if (belt) state = BELTED;
    else if (timer) state = BUZZER;
    break;
…
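The remaining cases are elided above; a possible completion, following the transitions in the state diagram (buzzer_on is an output variable introduced here for illustration; the transition into BUZZER would likewise set buzzer_on = TRUE):

case BELTED:
    if (!seat) state = IDLE;             /* person left the seat  */
    else if (!belt) {                    /* belt unfastened again */
        state = SEATED; timer_on = TRUE;
    }
    break;
case BUZZER:
    if (belt) { buzzer_on = FALSE; state = BELTED; }
    else if (!seat) { buzzer_on = FALSE; state = IDLE; }
    break;
}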
72. Circular buffer
• Commonly used in signal processing:
– new data constantly arrives;
– each datum (a single data item) has a limited lifetime.
• Use a circular buffer to hold the data stream.
– EX: FIR filter
– For each sample, the filter must emit one output
that depends on the values of the last n inputs.
The circular buffer is a data structure that lets us handle
streaming data in an efficient way.
74. Circular buffers
• Indexes locate the currently used data and the current input position.
[Figure: at time t1 the buffer holds d1–d4, with the use index at d1 and the input index at d4; at time t1+1 the new sample d5 has overwritten d1, and both indexes have advanced by one.]
75. Circular buffer implementation: FIR filter
int circ_buffer[N], circ_buffer_head = 0;
int c[N]; /* coefficients */
…
int f, ibuf, ic;
for (f = 0, ibuf = circ_buffer_head, ic = 0;
     ic < N;
     ibuf = (ibuf == N-1 ? 0 : ibuf + 1), ic++)
    f = f + c[ic] * circ_buffer[ibuf];
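Adding a new sample overwrites the oldest entry and advances the head; a minimal sketch consistent with the code above (circ_buffer_add is a name introduced here for illustration):

/* Overwrite the oldest sample and advance the head index. */
void circ_buffer_add(int new_sample) {
    circ_buffer[circ_buffer_head] = new_sample;
    circ_buffer_head = (circ_buffer_head == N-1 ? 0 : circ_buffer_head + 1);
}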
76. Queues
• Queues are also used in signal processing and event
processing.
• Queues are used whenever data may arrive and
depart at somewhat unpredictable times or when
variable amounts of data may arrive.
• A queue is often referred to as an elastic buffer,
which holds data that arrives irregularly.
• One way to build a queue is with a linked list.
– This approach allows the queue to grow to an arbitrary
size.
• Another way to design the queue is to use an array to
hold all the data.
77. Buffer-based queues (to manage interrupt-driven data)
#define Q_SIZE 32
#define Q_MAX (Q_SIZE-1)
int q[Q_SIZE], head, tail;   /* Q_SIZE slots; indexes wrap at Q_MAX */

void initialize_queue() { head = tail = 0; }

void enqueue(int val) {
    if (((tail+1) % Q_SIZE) == head) error();   /* queue full */
    q[tail] = val;
    if (tail == Q_MAX) tail = 0; else tail++;
}

int dequeue() {
    int returnval;
    if (head == tail) error();                  /* queue empty */
    returnval = q[head];
    if (head == Q_MAX) head = 0; else head++;
    return returnval;
}
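In the interrupt-driven use named in the title, the interrupt handler is the producer and the main loop is the consumer. A sketch, in which rxchar() and process() are hypothetical device-read and work functions:

extern int rxchar(void);      /* hypothetical: read the receive register */
extern void process(int c);   /* hypothetical: do work on one item       */

/* ISR: enqueue each arriving byte as it is received. */
void serial_isr(void) {
    enqueue(rxchar());
}

/* Main loop: drain the queue at its own pace. In real code the
   head/tail comparison must be protected against the ISR, e.g. by
   briefly disabling interrupts. */
void main_loop(void) {
    for (;;) {
        while (head != tail)
            process(dequeue());
    }
}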
79. Models of programs
• Source code is not a good representation for
programs:
– clumsy;
– leaves much information implicit.
• Compilers derive intermediate representations
to manipulate and optimize the program.
In this section, we develop models for programs that are more general than source code (assembly language, C, and so on).
Our fundamental model for programs is the control/data flow graph (CDFG).
80. Data flow graph
• DFG: data flow graph.
• Does not represent control.
• Models a basic block: code with one entry and one exit.
• Describes the minimal ordering requirements
on operations.
81. Single assignment form
Original basic block:
x = a + b;
y = c - d;
z = x * y;
y = b + d;
Single assignment form:
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;
82. Data flow graph
Single assignment form:
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;
[DFG: node + (a, b) -> x; node - (c, d) -> y; node * (x, y) -> z; node + (b, d) -> y1.]
83. DFGs and partial orders
The DFG defines only a partial order on the operations: a+b and c-d must both precede x*y, while b+d may execute at any time. Pairs of operations with no dependency path between them can be done in either order.
[DFG as on the previous slide.]
84. Control-data flow graph
• CDFG: represents control and data.
• Uses data flow graphs as components.
• Two types of nodes:
– decision;
– data flow.
85. Data flow node
Encapsulates a data flow graph:
Write operations in basic block form for simplicity.
x = a + b;
y = c + d;
87. CDFG example
if (cond1) bb1();
else bb2();
bb3();
switch (test1) {
case c1: bb4(); break;
case c2: bb5(); break;
case c3: bb6(); break;
}
[CDFG: decision node cond1 — T to bb1(), F to bb2(); both join at bb3(); decision node test1 then selects bb4() (c1), bb5() (c2), or bb6() (c3).]
88. for loop
for (i = 0; i < N; i++)
    loop_body();
Equivalent while loop:
i = 0;
while (i < N) {
    loop_body(); i++;
}
[CDFG: i=0; decision i<N; T -> loop_body() and back to the test; F -> loop exit.]
90. Assembly, linking, loading
• Assembly and linking are the last steps in the compilation process.
• They turn a list of instructions into an image of the program’s bits in memory.
• Loading actually puts the program in memory so that it can be executed.
[Figure: HLL files -> compiler -> assembly files; assembly files -> assembler -> object code; object code files -> linker -> executable binary -> loader.]
91. • As the figure shows, most compilers do not directly generate
machine code, but instead create the instruction-level
program in the form of human-readable assembly language.
• The assembler’s job is to translate symbolic assembly
language statements into bit-level representations of
instructions known as object code.
• A linker allows a program to be stitched together out of
several smaller pieces. The linker operates on the object files
created by the assembler and modifies the assembled code to
make the necessary links between files.
• The linker produces an executable binary file. That file may not necessarily be located in the CPU’s memory, however, unless the linker happens to create the executable directly in RAM.
• The program that brings the program into memory for
execution is called a loader.
92. Assemblers
• Major tasks:
– generate binary for symbolic instructions;
– translate labels into addresses;
– handle pseudo-ops (data, etc.).
• Generally one-to-one translation.
• Assembly labels:
ORG 100
label1 ADR r4,c
93. Pseudo-operations
• Pseudo-ops do not generate instructions:
– ORG sets program location.
– EQU generates a symbol table entry without advancing the PLC (program location counter).
– Data statements define data blocks.
94. Linking
• Combines several object modules into a single
executable module.
• Jobs:
– put modules in order;
– resolve labels across modules.
95. Dynamic linking
• Some operating systems link modules
dynamically at run time:
– shares one copy of library among all executing
programs;
– allows programs to be updated with new versions
of libraries.
96. COMPILATION TECHNIQUES
It is useful to understand how a high-level language program is
translated into instructions.
Understanding how the compiler works can help you know when
you cannot rely on the compiler.
Next, because many applications are also performance sensitive,
understanding how code is generated can help you meet your
performance goals, either by writing high-level code that gets
compiled into the instructions you want or by recognizing when you
must write your own assembly code.
Compilation combines translation and optimization.
98. Basic compilation phases
[Flow: HLL -> parsing, symbol table generation -> machine-independent optimizations -> machine-dependent optimizations -> assembly.]
The high-level language program is parsed to break it into statements and expressions. In addition, a symbol table is generated, which includes all the named objects in the program.
Simplifying arithmetic expressions is one example of a machine-independent optimization.
The machine-dependent phase performs instruction-level optimization and code generation.
99. Statement translation and
optimization
• Source code is translated into intermediate
form such as CDFG.
• CDFG is transformed/optimized.
• CDFG is translated into instructions with
optimization decisions.
• Instructions are further optimized.
100. Compiling an arithmetic expression
a*b + 5*(c-d)
[DFG for the expression, with temporary variables W, X, Y, Z: W = a*b (node *), X = c-d (node -), Y = 5*X (node *), Z = W+Y (node +).]
101. Compilation of arithmetic expressions, cont’d.
Code generated by walking the DFG:
ADR r4,a      ; get address of a
LDR r1,[r4]   ; load a
ADR r4,b      ; get address of b
LDR r2,[r4]   ; load b
MUL r3,r1,r2  ; W = a * b
ADR r4,c      ; get address of c
LDR r1,[r4]   ; load c
ADR r4,d      ; get address of d
LDR r5,[r4]   ; load d
SUB r6,r1,r5  ; X = c - d
MOV r5,#5     ; ARM MUL takes no immediate operand
MUL r7,r5,r6  ; Y = 5 * X
ADD r8,r7,r3  ; Z = W + Y
102. Similarly for Control code generation
if (a+b > 0)
x = 5;
else
x = 7;
[CDFG: decision a+b>0; T -> x=5; F -> x=7.]
104. Procedure linkage
• Need code to:
– call and return;
– pass parameters and results.
• Parameters and returns are passed on stack.
– Procedures with few parameters may use
registers.
Another major code generation problem is the creation of procedures.
105. Procedure stacks
proc1(int a) {
    proc2(5);
}
[Figure: stack frames for proc1 and proc2, with the stack growing toward proc2’s frame. SP, the stack pointer, defines the end of the current frame; FP, the frame pointer, defines the end of the last frame; the argument 5 is accessed relative to SP.]
When a new procedure is called, the SP and FP are modified to push another frame onto the stack.
106. ARM procedure linkage
• APCS (ARM Procedure Call Standard):
– r0-r3 pass parameters into procedure. Extra
parameters are put on stack frame.
– r0 holds return value.
– r4-r7 hold register values.
– r11 is frame pointer, r13 is stack pointer.
– r10 holds limiting address on stack size to check
for stack overflows.
107. Data structures
The compiler must also translate references to data structures into references to raw memories. In general, this requires address computations.
• Different types of data structures use different data layouts.
• Some offsets into a data structure can be computed at compile time; others must be computed at run time.
• The address of an array element must in general be computed at run time, since the array index may change.
• Let us first consider one-dimensional arrays, sketched below.
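For a one-dimensional array, the address of a[i] is the base address plus the scaled index; a sketch of the computation the compiler emits (assuming 4-byte ints):

/* a[i] compiles to: address = base_of_a + i * sizeof(int) */
int read_element(int *a, int i) {
    return *(a + i);    /* identical to a[i]; the offset i is scaled
                           by the 4-byte element size at run time */
}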
110. Structures
• Fields within structures sit at static offsets from the start of the structure:
struct mystruct {
    int field1;    /* offset 0, 4 bytes */
    char field2;   /* offset 4 */
};
struct mystruct a, *aptr = &a;
field2 is accessed at the address aptr + 4, i.e., *((char *)aptr + 4).
111. Using your compiler
• Understand various optimization levels (-O1, -
O2, etc.)
• Look at mixed compiler/assembler output.
• Modifying compiler output requires care:
– correctness;
– loss of hand-tweaked code.
112. Interpreters and JIT (just-in-time) compilers
Programs are not always compiled and then separately executed. In some cases, it may make sense to translate the program into instructions during execution. Two well-known techniques for on-the-fly translation are interpretation and just-in-time (JIT) compilation.
• Interpreter: translates and executes program statements on-the-fly, one statement at a time.
• The interpreter sits between the program and the machine.
• The interpreter may or may not generate an explicit piece of code to represent the statement. Because it translates only a very small piece of the program at any given time, only a small amount of memory is needed to hold intermediate representations of the program.
113. Interpreters and JIT compilers, cont’d.
• JIT compiler: compiles small sections of code into instructions during program execution.
– Eliminates some translation overhead.
– Often requires more memory.
– Best suited to Java environments.
A JIT compiler is somewhere between an interpreter and a stand-alone compiler. It produces executable code segments for pieces of the program, but compiles a section of the program (such as a function) only when it knows it will be executed. Unlike an interpreter, it saves the compiled version of the code so that the code does not have to be retranslated the next time it is executed. The JIT compiler usually generates machine code directly rather than building intermediate program representation data structures such as the CDFG.
115. Program design and analysis
• Program-level performance analysis.
• Optimizing for:
– Execution time.
– Energy/power.
– Program size.
• Program validation and testing.
116. Program-level performance analysis
• Need to understand performance in detail:
– Real-time behavior, not just typical.
– On complex platforms.
• Program performance ≠ CPU performance:
– Pipeline and cache are only windows into the program.
– We must analyze the entire program.
117. Complexities of analyzing program
performance
• The execution time of a program often varies
with the input data values because those values
select different execution paths in the program.
- For example, loops
• Cache effects.
– The cache’s behavior depends in part on the data
values input to the program.
• Instruction-level performance variations:
– Pipeline interlocks.
– Fetch times.
118. How to measure program performance
• Simulate execution of the CPU (Simulator).
– Makes CPU state visible.
– Be careful: some microprocessor performance simulators are not 100% accurate, and simulation of I/O-intensive code may be difficult.
– Also measures execution time of program
• Measure on real CPU using timer.
– A timer connected to the microprocessor bus can be
used to measure performance of executing sections of
code.
– Requires modifying the program to control the timer.
• Measure on real CPU using logic analyzer.
– By measuring the start and stop times of a code segment
– Requires events visible on the pins.
119. Program performance metrics
• Average-case execution time.
– Typically used in application programming.
• Worst-case execution time.
– A component in deadline satisfaction.
• Best-case execution time.
– Task-level interactions can cause best-case program
behavior to result in worst-case system behavior.
120. Elements of program performance
• Basic program execution time formula:
– execution time = program path + instruction timing
• Solving these problems independently helps simplify
analysis.
– Easier to separate on simpler CPUs.
• Accurate performance analysis requires:
– Assembly/binary code.
– Execution platform.
121. Data-dependent paths in an if statement
if (a || b) { /* T1 */
if ( c ) /* T2 */
x = r*s+t; /* A1 */
else y=r+s; /* A2 */
z = r+s+u; /* A3 */
}
else {
if ( c ) /* T3 */
y = r-t; /* A4 */
}
a b c | path
0 0 0 | T1=F, T3=F: no assignments
0 0 1 | T1=F, T3=T: A4
0 1 0 | T1=T, T2=F: A2, A3
0 1 1 | T1=T, T2=T: A1, A3
1 0 0 | T1=T, T2=F: A2, A3
1 0 1 | T1=T, T2=T: A1, A3
1 1 0 | T1=T, T2=F: A2, A3
1 1 1 | T1=T, T2=T: A1, A3
122. Paths in a loop
for (i=0, f=0; i<N; i++)
f = f + c[i] * x[i];
[CDFG: i=0, f=0; decision i<N; Y -> f = f + c[i]*x[i], i = i+1, back to the test; N -> loop exit.]
123. Instruction timing
Once we know the execution path of the program, we have to measure the execution time of the instructions executed along that path. However, even ignoring cache effects, simply summing nominal instruction times is simplistic for the reasons summarized below.
• Not all instructions take the same amount of time.
– Multi-cycle instructions (even on RISC machines with fixed-length instructions).
– Fetches.
• Execution times of instructions are not independent (many CPUs use register bypassing to speed up instruction sequences when the result of one instruction is used in the next instruction).
– Pipeline interlocks.
– Cache effects.
• Execution times may vary with operand value.
– This is clearly true of floating-point instructions, in which a different number of iterations may be required to calculate the result.
– Some multi-cycle integer operations.
124. Measurement-driven performance
analysis
• Not so easy as it sounds:
– Must actually have access to the CPU.
– Must know data inputs that give worst/best case
performance.
– Must make state visible.
• Still an important method for performance
analysis.
125. Feeding the program
• Need to know the desired input values.
• May need to write software scaffolding to
generate the input values.
• Software scaffolding may also need to
examine outputs to generate feedback-driven
inputs.
126. Trace-driven measurement
• Trace-driven:
– Instrument the program.
– Save information about the path.
• Requires modifying the program.
• Trace files are large.
• Widely used for cache analysis.
127. Physical measurement
• In-circuit emulator allows tracing.
– Affects execution timing.
• Logic analyzer can measure behavior at pins.
– Address bus can be analyzed to look for events.
– Code can be modified to make events visible.
• Particularly important for real-world input streams.
128. CPU simulation
• Some simulators are less accurate.
• Cycle-accurate simulator provides accurate
clock-cycle timing.
– Simulator models CPU internals.
– Simulator writer must know how CPU works.
129. SimpleScalar FIR filter simulation
int x[N] = {8, 17, … };
int c[N] = {1, 2, … };
main() {
    int i, k, f = 0;
    for (k = 0; k < COUNT; k++)
        for (i = 0; i < N; i++)
            f += c[i] * x[i];
}

N      | total sim cycles | sim cycles per filter execution
100    | 25,854           | 259
1,000  | 155,759          | 156
10,000 | 1,451,840        | 145
130. Performance optimization motivation
• Embedded systems must often meet
deadlines.
– Faster may not be fast enough.
• Need to be able to analyze execution time.
– Worst-case, not typical.
• Need techniques for reliably improving
execution time.
131. Programs and performance analysis
• Best results come from analyzing optimized
instructions, not high-level language code:
– non-obvious translations of HLL statements into
instructions;
– code may move;
– cache effects are hard to predict.
133. Loop optimizations
• Loops are important targets for optimization
because programs with loops tend to spend a
lot of time executing those loops.
• There are three important techniques in
optimizing loops:
– code motion,
– induction variable elimination, and
– Strength reduction (x*2 -> x<<1).
134. Code motion
Code motion lets us move unnecessary computations out of a loop. In this loop, the bound N*M is recomputed on every iteration’s test:
for (i = 0; i < N*M; i++)
    z[i] = a[i] + b[i];
We can avoid N*M - 1 unnecessary executions of this computation by moving it before the loop, as shown below.
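The transformed loop (the referenced figure is not reproduced here) computes the bound once:

/* After code motion: N*M is evaluated once, not on every loop test. */
int nm = N * M;
for (i = 0; i < nm; i++)
    z[i] = a[i] + b[i];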
135. Induction variable elimination
• Induction variable: A nested loop is a good
example of the use of induction variables.
• Consider loop:
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        z[i][j] = b[i][j];
• Rather than recompute i*M+j for each array in
each iteration, share induction variable between
arrays, increment at end of loop body.
An induction variable is a variable whose value is derived from the loop iteration
variable’s value. The compiler often introduces induction variables to help
it implement the loop.
136. The compiler uses induction variables to help it address the arrays.
Let us rewrite the loop in C using induction variables and pointers, as sketched below; there, zptr and bptr are pointers to the heads of the z and b arrays and zbinduct is the shared induction variable.
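The rewritten loop appears on a slide image that is not reproduced here; a sketch consistent with the description would be:

/* zbinduct is the shared induction variable; zptr and bptr point to
   the heads of the z and b arrays. The per-element i*M+j address
   computation is gone. */
int zbinduct = 0;
int *zptr = &z[0][0], *bptr = &b[0][0];
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        *(zptr + zbinduct) = *(bptr + zbinduct);
        zbinduct++;        /* incremented once per iteration */
    }
}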
137. Strength reduction
• Strength reduction helps us reduce the cost of a loop iteration.
• Consider the following assignment:
y = x * 2;
– In integer arithmetic, we can use a left shift (y = x << 1) rather than a multiplication by 2 (as long as we properly keep track of overflows).
– If the shift is faster than the multiply, we probably want to perform the substitution.
138. Performance optimization hints
• Use registers efficiently.
• Use page mode memory accesses.
• Analyze cache behavior:
– instruction conflicts can be handled by rewriting code, rescheduling;
– conflicting scalar data can easily be moved;
– conflicting array data can be moved, padded.
140. Energy/power optimization
• Energy: ability to do work.
– Most important in battery-powered systems.
• Power: energy per unit time.
– Important even in wall-plug systems---power
becomes heat.
141. Opportunities for saving power
■ We may be able to replace the algorithms with
others that do things in clever ways that consume
less power.
■ Memory accesses are a major component of
power consumption in many applications. By
optimizing memory accesses we may be able to
significantly reduce power.
■ We may be able to turn off parts of the system—
such as subsystems of the CPU, chips in the
system, and so on—when we do not need them in
order to save power.
142. Measuring energy consumption
• Execute a small loop and measure the current:
The measurement setup executes the code under test over and over in a loop. By measuring the current flowing into the CPU, we are measuring the power consumption of the complete loop, including both the body and the loop overhead. By separately measuring the power consumption of a loop with no body, we can subtract out that overhead and obtain the consumption of the loop body alone.
143. Sources of energy consumption
• Relative energy per operation (Catthoor et al):
– memory transfer: 33
– external I/O: 10
– SRAM write: 9
– SRAM read: 4.4
– multiply: 3.6
– add: 1
144. Cache behavior is important
• Energy consumption has a sweet spot as
cache size changes:
– cache too small: program thrashes, burning
energy on external memory accesses;
– cache too large: cache itself burns too much
power.
145. Optimizing for energy
• Use registers efficiently.
• Identify and eliminate cache conflicts.
• Moderate loop unrolling eliminates some loop
overhead instructions.
• Eliminate pipeline stalls.
• Inlining procedures may help: reduces linkage,
but may increase cache thrashing.
146. Efficient loops
• General rules (an example follows the list):
– Don’t use function calls.
– Keep loop body small to enable local repeat (only
forward branches).
– Use unsigned integer for loop counter.
– Use <= to test loop counter.
– Make use of compiler---global optimization,
software pipelining.
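An illustrative loop that follows these rules; buf and N are assumed from context, and the code assumes N >= 1 (with an unsigned counter, N - 1 would wrap around if N were 0):

unsigned int i;
int sum = 0;
for (i = 0; i <= N - 1; i++)   /* <= test on an unsigned counter */
    sum += buf[i];             /* small body, no function calls  */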
147. Program validation and testing
• But does it work?
• Concentrate here on functional verification.
• Major testing strategies:
– Black box doesn’t look at the source code.
– Clear box (white box) does look at the source
code.
148. Clear-box testing
• Examine the source code to determine whether it
works:
– Can you actually exercise a path?
– Do you get the value you expect along a path?
• Testing procedure:
– Controllability: Provide program with inputs.
– Execute.
– Observability: examine outputs.
149. How much testing is enough?
• Exhaustive testing is impractical.
• One important measure of test quality: bugs escaping into the field.
• Good organizations can test software to give very low
field bug report rates.
• Error injection measures test quality:
– Add known bugs.
– Run your tests.
– Determine % injected bugs that are caught.