EC6703
EMBEDDED AND REAL
TIME SYSTEMS
The CPU Bus
Memory devices and systems
Designing with computing platforms
Consumer electronics architecture
Platform-level performance analysis
Components for embedded programs
Models of programs- Assembly, linking and loading –
compilation techniques
Program level performance analysis
Software performance optimization
Program level energy and power analysis and
optimization
Analysis and optimization of program size- Program
validation and testing.
UNIT II EMBEDDED COMPUTING PLATFORM
DESIGN
In this topic, we concentrate on bus-based computer systems created
using microprocessors, I/O devices, and memory components.
The microprocessor is an important element of the embedded
computing system, but it cannot do its job without memories and I/O
devices.
We need to understand how to interconnect microprocessors and
devices using the CPU bus.
The CPU bus forms the backbone of the hardware system.
A computer system encompasses much more than the
CPU; it also includes memory and I/O devices. The bus is
the mechanism by which the CPU communicates with
memory and devices.
A bus is, at a minimum, a collection of wires, but the bus
also defines a protocol by which the CPU, memory, and
devices communicate.
One of the major roles of the bus is to provide an interface to memory and I/O devices.
CPU BUS
Bus Protocols
The basic building block of most bus protocols is the four-cycle handshake,
illustrated in Figure
The four cycles are described below.
1. Device 1 raises its output to signal an
enquiry, which tells device 2 that it
should get ready to listen for data.
2. When device 2 is ready to receive, it
raises its output to signal an
acknowledgment. At this point, devices 1
and 2 can transmit or receive.
3. Once the data transfer is complete,
device 2 lowers its output, signaling that
it has received the data.
4. After seeing that ack has been released,
device 1 lowers its output.
At the end of the handshake, both
handshaking signals are low, just as
they were at the start of the
handshake.
A typical microprocessor bus.
The term bus is used in two ways.
The most basic use is as a set of related wires, such as address wires.
However, the term may also mean a protocol for communicating between
components.
The fundamental bus operations are reading and writing.
The behavior of a bus is most often specified as a timing diagram.
A timing diagram shows how the signals on a bus vary over time.
Bus multiplexing
[Figure: a multiplexed bus — the CPU and a device share one set of lines for both address and data, selected by the address-enable and data-enable signals.]
DMA
Standard bus transactions require the CPU to be in the middle of every read
and write transaction.
However, there are certain types of data transfers in which the CPU does not
need to be involved.
Direct memory access (DMA) is a bus operation that allows reads and writes
not controlled by the CPU. A DMA transfer is controlled by a DMA controller,
which requests control of the bus from the CPU
Bus mastership
• By default, CPU is bus master and initiates transfers.
• DMA must become bus master to perform its work.
– CPU can’t use bus while DMA operates.
• Bus mastership protocol:
– Bus request.
– Bus grant.
• Direct memory access (DMA) performs data transfers without
executing instructions.
– CPU sets up transfer.
– DMA engine fetches, writes.
• DMA controller is a separate unit.
DMA operation
• CPU sets DMA registers for start address, length.
• DMA status register controls the unit.
• Once DMA is bus master, it transfers automatically.
– May run continuously until complete.
– May use every nth bus cycle.
Bus transfer sequence diagram
System bus configurations
• Multiple busses allow
parallelism:
– Slow devices on one bus.
– Fast devices on separate
bus.
• A bridge connects two
busses.
[Figure: CPU, memory, and a high-speed device share one bus; a bridge connects it to a second bus carrying the slow devices.]
ARM AMBA bus
• Since the ARM CPU is manufactured by many different
vendors, the bus provided off-chip can vary from chip to chip.
ARM has created a separate bus specification for single-chip
systems.
• The AMBA bus [ARM99A] supports CPUs, memories, and
peripherals integrated in a system-on-silicon.
• Two varieties:
– AHB (AMBA high-performance bus) supports pipelining,
burst transfers, split transactions, and multiple bus
masters.
– APB (AMBA peripherals bus) is simple, lower-speed, lower
cost.
– All devices are slaves on APB.
Memory components
• Several types of
memory:
– DRAM.
– SRAM.
– Flash.
• Each type of memory
comes in varying:
– Capacities.
– Widths.
Random-access memory
• Dynamic RAM is dense, requires refresh.
– Synchronous DRAM is dominant type.
– SDRAM uses clock to improve performance, pipeline
memory accesses.
– DDR(double data rate) SDRAMs
• Static RAM is faster, less dense, consumes more
power.
• For PCs, SIMMs (single in-line memory modules) and DIMMs (dual in-line memory modules) are used.
Random-access memories can be both read and written. They are called random access because, unlike magnetic disks, locations can be accessed in any order.
SDRAM operation
Read-only memory
• ROM may be programmed at factory.
• Flash is dominant form of field-programmable
ROM.
– Electrically erasable, must be block erased.
– Random access, but write/erase is much slower
than read.
– NOR flash is more flexible.
– NAND flash is more dense.
Flash memory
• Non-volatile memory.
– Flash can be programmed in-circuit.
• Random access for read.
• To write:
– Erase a block to 1.
– Write bits to 0.
Flash writing
• Write is much slower than read.
– 1.6 ms write, 70 ns read.
• Blocks are large (approx. 1 Mb).
• Writing causes wear that eventually destroys
the device.
– Modern lifetime approx. 1 million writes.
Types of flash
• NOR:
– Word-accessible read.
– Erase by blocks.
• NAND:
– Read by pages (512-4K bytes).
– Erase by blocks.
• NAND is cheaper and has faster erase and sequential access times.
I/O DEVICES
Timers and Counters
ADC / DAC
Key boards / Pads
LEDs, Displays, Touchscreen
Designing with computing platforms (Microprocessors)
In this topic, we are going to see:
How to create an initial working embedded system;
How to ensure that the system works properly;
by considering possible architectures for embedded computing systems,
by studying techniques for designing the hardware components of embedded systems, and
by describing the use of the PC as an embedded computing platform.
System architectures
• The architecture of an embedded computing
system is the blueprint for implementing that
system—it tells you what components you
need and how you put them together
• Architectures and components:
– software;
– hardware.
• Some software is very hardware-dependent.
Hardware platform architecture
Contains several elements:
• CPU
• bus
• memory
• I/O devices: networking, sensors, actuators,
etc.
How big and how fast must each one be?
Software architecture
Functional description must be broken into pieces:
• division among people
• conceptual organization
• performance
• testability
• maintenance
Mixing together different types of functionality into a single code
module leads to spaghetti code, which has poorly structured
control flow, excessive use of global data, and generally unreliable
programs.
Hardware and software architectures
Hardware and software are intimately related:
• software doesn’t run without hardware;
• how much hardware you need is determined
by the software requirements:
– speed;
– memory.
Evaluation boards
• Designed by CPU manufacturer or others.
• Includes CPU, memory, some I/O devices.
• May include prototyping section.
• CPU manufacturer often gives out evaluation board netlist---can be used as starting point for your custom board design.
Adding logic to a board
• Programmable logic devices (PLDs) provide
low/medium density logic.
• Field-programmable gate arrays (FPGAs)
provide more logic and multi-level logic.
• Application-specific integrated circuits (ASICs)
are manufactured for a single purpose.
The PC as a platform
• Advantages:
– cheap (for industrial use) and easy to get;
– rich and familiar software environment.
• Disadvantages:
– requires a lot of hardware resources;
– not well-adapted to real-time.
– It is larger, more power hungry, and more
expensive than a custom hardware platform
would be
Typical PC hardware platform
[Figure: the CPU sits on the CPU bus along with memory, a DMA controller, timers, and an interrupt controller; bus interfaces bridge the CPU bus to a high-speed bus and a low-speed bus, each with attached devices.]
■ ROM holds the boot program.
■ RAM is used for program storage.
■ PCI: standard for high-speed interfacing
33 or 66 MHz.
PCI Express.
■ USB (Universal Serial Bus), FireWire (IEEE 1394): relatively low-cost serial interfaces with high speed.
Software elements
• IBM PC uses BIOS (Basic I/O System) to
implement low-level functions:
– boot-up
– minimal device drivers.
• BIOS has become a generic term for the
lowest-level system software.
Example for Single Chip System:
StrongARM(SA-1100)
• StrongARM system includes:
– CPU chip (3.686 MHz clock)
– system control module (32.768
kHz clock).
– Real-time clock;
– operating system timer
– general-purpose I/O;
– interrupt controller;
– power manager controller;
– reset controller.
Debugging embedded systems
• Challenges:
– target system may be hard to observe;
– target may be hard to control;
– may be hard to generate realistic inputs;
– setup sequence may be complex.
Host/target design
• Use a host system to prepare software for
target system:
[Figure: host system connected to the target system by a serial line.]
Host-based tools
• Cross compiler:
– compiles code on host for target system (i.e., a cross-compiler is a compiler that runs on one type of machine but generates code for another).
• Cross debugger:
– displays target state, allows target system to be
controlled.
Software debuggers
• A monitor program residing on the target
provides basic debugger functions.
• Debugger should have a minimal footprint in
memory.
• The user program must be careful not to destroy the debugger program, but the debugger should be able to recover from some damage caused by user code.
Breakpoints
• A breakpoint allows the user to stop
execution, examine system state, and change
state.
• Replace the breakpointed instruction with a
subroutine call to the monitor program.
ARM breakpoints
0x400 MUL r4,r6,r6
0x404 ADD r2,r2,r4
0x408 ADD r0,r0,#1
0x40c B loop
uninstrumented code
0x400 MUL r4,r6,r6
0x404 ADD r2,r2,r4
0x408 ADD r0,r0,#1
0x40c BL bkpoint
code with breakpoint
Breakpoint handler actions
• Save registers.
• Allow user to examine machine.
• Before returning, restore system state.
– Safest way to execute the instruction is to replace
it and execute in place.
– Put another breakpoint after the replaced
breakpoint to allow restoring the original
breakpoint.
In-circuit emulators
• A microprocessor in-circuit emulator is a
specially-instrumented microprocessor.
• Allows you to stop execution, examine CPU
state, modify registers.
Boundary scan
• Simplifies testing of
multiple chips on a
board.
– Registers on pins can be
configured as a scan
chain.
– Used for debuggers, in-circuit emulators.
How to exercise code
• Run on host system.
• Run on target system.
• Run in instruction-level simulator.
• Run on cycle-accurate simulator.
• Run in hardware/software co-simulation
environment.
Debugging real-time code
• Bugs in drivers can cause non-deterministic
behavior in the foreground problem.
• Bugs may be timing-dependent.
Logic analyzers
• A logic analyzer is an array of low-grade
oscilloscopes:
CONSUMER ELECTRONICS ARCHITECTURE
Logic analyzer architecture
The analyzer can sample many different signals simultaneously
(tens to hundreds) but can display only 0, 1, or changing values
for each
The logic analyzer records the values on the signals into an internal
memory and then displays the results on a display once the memory
is full or the run is aborted.
A typical logic analyzer can acquire data in either of two modes that
are typically called state and timing modes.
State and timing mode represent different ways of sampling the
values.
Timing mode uses an internal clock that is fast enough to take
several samples per clock period in a typical system.
State mode, on the other hand, uses the system’s own clock to
control sampling, so it samples each signal only once per clock cycle.
As a result, timing mode requires more memory to store a given
number of system clock cycles.
Logic analyzer architecture
• The system’s data signals are sampled at a latch within the logic
analyzer; the latch is controlled by either the system clock or the
internal logic analyzer sampling clock, depending on whether
the analyzer is being used in state or timing mode.
• Each sample is copied into a vector memory under the control
of a state machine.
• The latch, timing circuitry, sample memory, and controller must
be designed to run at high speed since several samples per
system clock cycle may be required in timing mode.
• After the sampling is complete, an embedded microprocessor
takes over to control the display of the data captured in the
sample memory.
• Logic analyzers typically provide a number of formats for
viewing data. One format is a timing diagram format
Logic analyzer - Operation
PLATFORM-LEVEL PERFORMANCE
ANALYSIS
System-level performance analysis
• Performance depends on all the elements of the system:
– CPU.
– Cache.
– Bus.
– Main memory.
– I/O device.
In this section, we will develop some basic techniques for analyzing
the performance of bus-based systems.
System level data flows and performance
We want to move data from memory to the CPU to process it. To get the data
from memory to the CPU we must:
■ read from the memory;
■ transfer over the bus to the cache; and
■ transfer from the cache to the CPU.
The time required to transfer
from the cache to the CPU is
included in the instruction
execution time, but the other two
times are not.
The most basic measure of performance we are interested in is bandwidth.
Bandwidth as performance
• Bandwidth(the rate at which we can move data)
applies to several components:
– Memory.
– Bus.
– CPU fetches.
• Different parts of the system run at different clock
rates.
• Different components may have different widths
(bus, memory).
• We have to make sure that we apply the right clock
rate to each part of the performance estimate
when we convert from clock cycles to seconds.
• Bandwidth questions often come up when we are
transferring large blocks of data.
Example
Bandwidth and data transfers
• Video frame: 320 x 240 x 3 = 230,400 bytes.
– Transfer in 1/30 sec.
• Transferring 1 byte/μsec leads to 0.23 sec per frame (230,400 μsec), i.e., more than 1/30 sec.
– Too slow.
– i.e., we have to increase the transfer rate by about 7 times.
• We can increase bandwidth in two ways:
– we can increase the clock rate of the bus (e.g., to 2 MHz),
– or we can increase the amount of data transferred per clock cycle (e.g., 4 bytes).
Considering the bandwidth provided by only one system component, the bus.
Consider an image of 320 X 240 pixels, with each pixel composed of 3 bytes of
data.
Bus bandwidth
• T: # bus cycles.
• P: time/bus cycle.
• Total time for transfer:
t = TP.
• D: data payload length.
• O1 + O2 = overhead O.
[Figure: a basic bus transfer of width W — overhead O1, then the data payload D, then overhead O2; total overhead O = O1 + O2.]
Tbasic(N) = (D + O)N/W
Bus burst transfer bandwidth
• T: # bus cycles.
• P: time/bus cycle.
• Total time for transfer:
– t = TP.
• D: data payload length.
• O1 + O2 = overhead O.
[Figure: a burst transfer — B data beats of width W share a single overhead O.]
Tburst(N) = (BD + O)N/(BW)
Memory aspect ratios
[Figure: the same total capacity can be organized in different aspect ratios, e.g., 64M x 1, 16M x 4, or 8M x 8.]
Memory access times
• Memory component access times come from the chip data sheet.
– Page modes allow faster access for successive
transfers on same page.
• If data doesn’t fit naturally into physical
words:
– A = [(E/w)mod W]+1
Bus performance bottlenecks
• Transfer 320 x 240
video frame @ 30
frames/sec = 612,000
bytes/sec.
• Is performance
bottleneck bus or
memory?
[Figure: memory connected to the CPU by the bus.]
Bus performance bottlenecks, cont’d.
• Bus: assume 1 MHz bus, D=1, O=3:
– Tbasic = (1+3)612,000/2 = 1,224,000 cycles = 1.224
sec.
• Memory: try burst mode B=4, width w=0.5.
– Tmem = (4*1+4)612,000/(4*0.5) = 2,448,000 cycles = 0.02448 sec (at the 10-ns memory clock).
Performance spreadsheet

               bus         memory
clock period   1.00E-06    1.00E-08
W              2           0.5
D              1           1
O              3           4
B              -           4
N              612000      612000
T (basic/mem)  1224000     2448000
t              1.22E+00    2.45E-02
Parallelism
• Speed things up by running
several units at once.
• DMA provides parallelism if
CPU doesn’t need the bus:
– DMA + bus.
– CPU.
Transfer with DMA
Components for embedded programs
In this section, we study in detail the process of programming embedded processors.
The creation of embedded programs is at the heart of embedded system design.
Embedded code must not only provide rich functionality, it must also often run at a
required rate to meet system deadlines, fit into the allowed amount of memory, and
meet power consumption requirements. Designing code that simultaneously meets
multiple design constraints is a considerable challenge, but luckily there are
techniques and tools that we can use to help us through the design process.
we consider code for three structures or components that are commonly
used in embedded software:
1. The state machine,
2. The circular buffer,
3. The queue.
 State machines are well suited to reactive systems such as user interfaces;
 Circular buffers and Queues are useful in digital signal processing.
Software State Machine
• State machine keeps internal state as a
variable, changes state based on inputs.
• Uses:
– control-dominated code;
– reactive systems.
When inputs appear intermittently rather than as periodic
samples, it is often convenient to think of the system as reacting
to those inputs. The reaction of most systems can be
characterized in terms of the input received and the current
state of the system. This leads naturally to a finite-state machine
State machine example
(Seat belt controller)
[State transition diagram: states idle, seated, belted, and buzzer; transitions are labeled input/output, e.g., seat/timer on, belt/-, no belt/timer on, belt/buzzer off, no seat/buzzer off, no seat/-.]
The controller’s job is to turn on a buzzer if a person sits in a seat and does not
fasten the seat belt within a fixed amount of time.
This system has three inputs and one output.
The inputs are a sensor for the seat to know when a person has sat down, a seat
belt sensor that tells when the belt is fastened, and a timer that goes off when
the required time interval has elapsed.
The output is the buzzer.
C implementation
#define IDLE 0
#define SEATED 1
#define BELTED 2
#define BUZZER 3
switch (state) {
case IDLE: if (seat) { state = SEATED; timer_on = TRUE; }
break;
case SEATED: if (belt) state = BELTED;
else if (timer) state = BUZZER;
break;
…
Circular buffer
• Commonly used in signal processing:
– new data constantly arrives;
– each datum (an individual measured value) has a limited lifetime.
• Use a circular buffer to hold the data stream.
– EX: FIR filter
– For each sample, the filter must emit one output
that depends on the values of the last n inputs.
The circular buffer is a data structure that lets us handle
streaming data in an efficient way.
Circular buffer
[Figure: a data stream x1, x2, x3, x4, x5, x6, ... arriving over time; a four-entry circular buffer first holds x1..x4, then x5, x6, x7 overwrite the oldest entries in turn.]
Circular buffers
• Indexes locate currently used data, current
input data:
[Figure: at time t1 the buffer holds d1, d2, d3, d4, with pointers marking the entry in use and the input slot; at time t1+1, input d5 has overwritten d1 and both pointers have advanced.]
Circular buffer implementation: FIR
filter
int circ_buffer[N], circ_buffer_head = 0;
int c[N]; /* coefficients */
…
int f, ibuf, ic;
for (f = 0, ibuf = circ_buffer_head, ic = 0;
     ic < N;
     ibuf = (ibuf == N-1 ? 0 : ibuf + 1), ic++)
   f = f + c[ic]*circ_buffer[ibuf];
Queues
• Queues are also used in signal processing and event
processing.
• Queues are used whenever data may arrive and
depart at somewhat unpredictable times or when
variable amounts of data may arrive.
• A queue is often referred to as an elastic buffer,
which holds data that arrives irregularly.
• One way to build a queue is with a linked list.
– This approach allows the queue to grow to an arbitrary
size.
• Another way to design the queue is to use an array to
hold all the data.
Buffer-based queues (to manage interrupt-driven data)
#define Q_SIZE 32
#define Q_MAX (Q_SIZE-1)
int q[Q_SIZE], head, tail;
void initialize_queue() { head = tail = 0; }
void enqueue(int val) {
   if (((tail+1) % Q_SIZE) == head) error();
   q[tail] = val;
   if (tail == Q_MAX) tail = 0; else tail++;
}
int dequeue() {
   int returnval;
   if (head == tail) error();
   returnval = q[head];
   if (head == Q_MAX) head = 0; else head++;
   return returnval;
}
Models of programs
• Source code is not a good representation for
programs:
– clumsy;
– leaves much information implicit.
• Compilers derive intermediate representations
to manipulate and optimize the program.
In this section, we develop models for programs that are more general than source code (assembly language, C, and so on).
Our fundamental model for programs is the
control/data flow graph (CDFG).
Data flow graph
• DFG: data flow graph.
• Does not represent control.
• Models a basic block: code with a single entry and a single exit (no branches into or out of the middle).
• Describes the minimal ordering requirements
on operations.
Single assignment form
x = a + b;
y = c - d;
z = x * y;
y = b + d;
original basic block
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;
single assignment form
Data flow graph
x = a + b;
y = c - d;
z = x * y;
y1 = b + d;
single assignment form
[DFG: a and b feed + producing x; c and d feed - producing y; x and y feed * producing z; b and d feed a second + producing y1.]
DFGs and partial orders
The DFG imposes only a partial order: a+b and c-d must both precede x*y, while b+d can be computed at any time. Pairs of operations (e.g., a+b and c-d) can be done in either order.
[DFG repeated from the previous slide.]
Control-data flow graph
• CDFG: represents control and data.
• Uses data flow graphs as components.
• Two types of nodes:
– decision;
– data flow.
Data flow node
Encapsulates a data flow graph:
Write operations in basic block form for simplicity.
x = a + b;
y = c + d;
Control
[Figure: a decision node tests cond and branches to its T or F successor; the equivalent multiway form tests a value and branches to one of several successors v1..v4.]
CDFG example
if (cond1) bb1();
else bb2();
bb3();
switch (test1) {
case c1: bb4(); break;
case c2: bb5(); break;
case c3: bb6(); break;
}
[CDFG: a decision node on cond1 branches to bb1() (T) or bb2() (F), both flowing into bb3(); a decision node on test1 branches to bb4(), bb5(), or bb6() for cases c1, c2, c3.]
for loop
for (i=0; i<N; i++)
loop_body();
for loop
i=0;
while (i<N) {
loop_body(); i++; }
equivalent
[CDFG: i=0; a decision node tests i<N — on T, loop_body() executes and control returns to the test; on F, control exits the loop.]
Assembly, linking, and loading
• Assembly and linking are the last steps in the compilation process.
• they turn a list of instructions into an image of the program’s bits in
memory.
• Loading actually puts the program in memory so that it can be
executed.
[Figure: HLL source files pass through the compiler to produce assembly; assembly files pass through the assembler to produce object code; the linker combines object code into an executable binary, which the loader places in memory.]
Object Code
• As the figure shows, most compilers do not directly generate
machine code, but instead create the instruction-level
program in the form of human-readable assembly language.
• The assembler’s job is to translate symbolic assembly
language statements into bit-level representations of
instructions known as object code.
• A linker allows a program to be stitched together out of
several smaller pieces. The linker operates on the object files
created by the assembler and modifies the assembled code to
make the necessary links between files.
• The linker, which produces an executable binary file.
• That file may not necessarily be located in the CPU’s memory,
however, unless the linker happens to create the executable
directly in RAM.
• The program that brings the program into memory for
execution is called a loader.
Assembly, linking, and loading
Assemblers
• Major tasks:
– generate binary for symbolic instructions;
– translate labels into addresses;
– handle pseudo-ops (data, etc.).
• Generally one-to-one translation.
• Assembly labels:
ORG 100
label1 ADR r4,c
Pseudo-operations
• Pseudo-ops do not generate instructions:
– ORG sets program location.
– EQU generates symbol table entry without
advancing PLC.
– Data statements define data blocks.
Linking
• Combines several object modules into a single
executable module.
• Jobs:
– put modules in order;
– resolve labels across modules.
Dynamic linking
• Some operating systems link modules
dynamically at run time:
– shares one copy of library among all executing
programs;
– allows programs to be updated with new versions
of libraries.
COMPILATION TECHNIQUES
It is useful to understand how a high-level language program is
translated into instructions.
Understanding how the compiler works can help you know when
you cannot rely on the compiler.
Next, because many applications are also performance sensitive,
understanding how code is generated can help you meet your
performance goals, either by writing high-level code that gets
compiled into the instructions you want or by recognizing when you
must write your own assembly code.
Compilation combines translation and optimization.
Compilation
• Compiler determines quality of code:
– use of CPU resources;
– memory access scheduling;
– code size.
Basic compilation phases
HLL
parsing, symbol table
machine-independent
optimizations
machine-dependent
Optimizations
assembly
The high-level language program is
parsed to break it into statements
and expressions.
In addition, a symbol table is
generated, which includes all the
named objects in the program.
Simplifying arithmetic expressions is one example of a machine-independent optimization.
Instruction-level optimization and code generation.
Statement translation and
optimization
• Source code is translated into intermediate
form such as CDFG.
• CDFG is transformed/optimized.
• CDFG is translated into instructions with
optimization decisions.
• Instructions are further optimized.
Compiling an arithmetic expression
a*b + 5*(c-d)
[DFG: a and b feed a * node; c and d feed a - node; the - result and the constant 5 feed a second * node; the two products feed the final + node. W, X, Y, Z are temporary variables holding the intermediate results, and the numbering 1-4 gives one feasible evaluation order.]
Compilation of arithmetic expressions, cont'd.
ADR r4,a
LDR r1,[r4]     ; load a
ADR r4,b
LDR r2,[r4]     ; load b
MUL r3,r1,r2    ; a*b
ADR r4,c
LDR r1,[r4]     ; load c
ADR r4,d
LDR r5,[r4]     ; load d
SUB r6,r1,r5    ; c-d
MOV r0,#5       ; MUL needs a register operand
MUL r7,r6,r0    ; 5*(c-d)
ADD r8,r7,r3    ; final result
Similarly for Control code generation
if (a+b > 0)
x = 5;
else
x = 7;
[CDFG: a decision node a+b>0 — the T branch executes x=5, the F branch executes x=7; the blocks are numbered 1-3.]
Control code generation, cont'd.
ADR r5,a
LDR r1,[r5]
ADR r5,b
LDR r2,[r5]
ADDS r3,r1,r2   ; compute a+b and set the condition flags
BLE label3      ; if a+b <= 0, take the else branch
MOV r3,#5       ; true branch: x = 5
ADR r5,x
STR r3,[r5]
B stmtent
label3 MOV r3,#7 ; false branch: x = 7
ADR r5,x
STR r3,[r5]
stmtent ...
Procedure linkage
• Need code to:
– call and return;
– pass parameters and results.
• Parameters and returns are passed on stack.
– Procedures with few parameters may use
registers.
Another major code generation problem is the creation of
procedures
Procedure stacks
proc1(int a) {
   proc2(5);
}
[Figure: stack growth — proc1's frame, then proc2's frame pushed on top; the argument 5 is accessed relative to SP.]
SP: stack pointer (defines the end of the current frame)
FP: frame pointer (defines the end of the last frame)
When a new procedure is called, the SP and FP are modified to push another frame onto the stack.
ARM procedure linkage
• APCS (ARM Procedure Call Standard):
– r0-r3 pass parameters into procedure. Extra
parameters are put on stack frame.
– r0 holds return value.
– r4-r7 hold register variables.
– r11 is frame pointer, r13 is stack pointer.
– r10 holds limiting address on stack size to check
for stack overflows.
Data structures
• Different types of data structures use different
data layouts.
• Some offsets into data structure can be
computed at compile time, others must be
computed at run time.
• An array element must in general be
computed at run time, since the array index
may change.
• Let us first consider one-dimensional arrays:
The compiler must also translate references to data structures into
references to raw memories. In general, this requires address
computations.
One-dimensional arrays
• C array name points to the 0th element:
[Figure: a points to consecutive elements a[0], a[1], a[2], ...]
a[1] = *(a + 1)
Two-dimensional arrays
• Row-major layout (C's convention): rows are stored consecutively — a[0,0], a[0,1], ..., a[1,0], a[1,1], ...
• For an N x M array (M columns): a[i][j] = a[i*M + j].
Structures
• Fields within structures are static offsets:
struct mystruct {
   int field1;    /* 4 bytes */
   char field2;
};
struct mystruct a, *aptr = &a;
[Figure: aptr points at field1; field2 sits at byte offset 4, accessed as *((char *)aptr + 4).]
Using your compiler
• Understand various optimization levels (-O1, -O2, etc.).
• Look at mixed compiler/assembler output.
• Modifying compiler output requires care:
– correctness;
– loss of hand-tweaked code.
Interpreters and JIT (just-in-time) compilers
• Interpreter: translates and executes program statements on-
the-fly. An interpreter translates program statements one at
a time.
• The interpreter sits between the program and the machine.
• The interpreter may or may not generate an explicit piece of
code to represent the statement. Because the interpreter
translates only a very small piece of the program at any given
time,
• A small amount of memory is used to hold intermediate
representations of the program.
Programs are not always compiled and then separately executed. In
some cases, it may make sense to translate the program into
instructions during execution.
Two well-known techniques for on-the-fly translation are
interpretation and just-in-time (JIT ) compilation.
• JIT compiler: compiles small sections of code into instructions during program execution.
– Eliminates some translation overhead.
– Often requires more memory.
– Best suited for Java environments.
Interpreters and JIT (just-in-time) compilers
A JIT compiler is somewhere between an interpreter and a stand-alone compiler. A JIT compiler produces executable code segments
for pieces of the program. However, it compiles a section of the
program (such as a function) only when it knows it will be executed.
Unlike an interpreter, it saves the compiled version of the code so
that the code does not have to be retranslated the next time it is
executed.
The JIT compiler usually generates machine code directly rather
than building intermediate program representation data structures
such as the CDFG.
Program design and analysis
• Program-level performance analysis.
• Optimizing for:
– Execution time.
– Energy/power.
– Program size.
• Program validation and testing.
Program-level performance analysis
• Need to understand performance in detail:
– Real-time behavior, not just typical.
– On complex platforms.
• Program performance ≠ CPU performance:
– Pipeline, cache are windows into program.
– We must analyze the entire program.
Complexities of analyzing program
performance
• The execution time of a program often varies
with the input data values because those values
select different execution paths in the program.
- For example, loops
• Cache effects.
– The cache’s behavior depends in part on the data
values input to the program.
• Instruction-level performance variations:
– Pipeline interlocks.
– Fetch times.
How to measure program performance
• Simulate execution of the CPU (Simulator).
– Makes CPU state visible.
– Be careful for some microprocessor performance
simulators are not 100% accurate, and simulation of I/O-
intensive code may be difficult.
– Also measures execution time of program
• Measure on real CPU using timer.
– A timer connected to the microprocessor bus can be
used to measure performance of executing sections of
code.
– Requires modifying the program to control the timer.
• Measure on real CPU using logic analyzer.
– By measuring the start and stop times of a code segment
– Requires events visible on the pins.
Program performance metrics
• Average-case execution time.
– Typically used in application programming.
• Worst-case execution time.
– A component in deadline satisfaction.
• Best-case execution time.
– Task-level interactions can cause best-case program
behavior to result in worst-case system behavior.
Elements of program performance
• Basic program execution time formula:
– execution time = program path + instruction timing
• Solving these problems independently helps simplify
analysis.
– Easier to separate on simpler CPUs.
• Accurate performance analysis requires:
– Assembly/binary code.
– Execution platform.
Data-dependent paths in an if statement
if (a || b) { /* T1 */
if ( c ) /* T2 */
x = r*s+t; /* A1 */
else y=r+s; /* A2 */
z = r+s+u; /* A3 */
}
else {
if ( c ) /* T3 */
y = r-t; /* A4 */
}
a b c path
0 0 0 T1=F, T3=F: no assignments
0 0 1 T1=F, T3=T: A4
0 1 0 T1=T, T2=F: A2, A3
0 1 1 T1=T, T2=T: A1, A3
1 0 0 T1=T, T2=F: A2, A3
1 0 1 T1=T, T2=T: A1, A3
1 1 0 T1=T, T2=F: A2, A3
1 1 1 T1=T, T2=T: A1, A3
Paths in a loop
for (i=0, f=0; i<N; i++)
f = f + c[i] * x[i];
[CDFG: i=0, f=0; test i<N — on Y, execute f = f + c[i]*x[i] and i = i + 1, then return to the test; on N, exit the loop.]
Instruction timing
• Not all instructions take the same amount of time.
– Multi-cycle instructions: even RISC CPUs with fixed-
length instructions have operations that take several cycles.
– Fetches.
• Execution times of instructions are not independent.
(many CPUs use register bypassing to speed up instruction sequences
when the result of one instruction is used in the next instruction.)
– Pipeline interlocks.
– Cache effects.
• Execution times may vary with operand value.
– This is clearly true of floating-point instructions in which a different
number of iterations may be required to calculate the result
– Some multi-cycle integer operations.
Once we know the execution path of the program, we have to measure the execution time of
the instructions executed along that path.
However, even ignoring cache effects, this technique is simplistic for the reasons summarized
above.
Measurement-driven performance
analysis
• Not as easy as it sounds:
– Must actually have access to the CPU.
– Must know data inputs that give worst/best case
performance.
– Must make state visible.
• Still an important method for performance
analysis.
Feeding the program
• Need to know the desired input values.
• May need to write software scaffolding to
generate the input values.
• Software scaffolding may also need to
examine outputs to generate feedback-driven
inputs.
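A scaffolding sketch follows; the make_inputs helper name is hypothetical. It generates reproducible pseudo-random input vectors so the program under test can be exercised over many data sets, and re-run with the same data when a problem is found:

```c
#include <stdlib.h>

/* Software scaffolding sketch: fill a buffer with pseudo-random
   input values.  Seeding makes the test data reproducible, which
   matters when trying to recreate a worst-case measurement. */
void make_inputs(int *buf, int n, unsigned seed) {
    int i;
    srand(seed);
    for (i = 0; i < n; i++)
        buf[i] = rand() % 256;   /* byte-range sample values */
}
```

Feedback-driven scaffolding would additionally inspect the program's outputs and steer later seeds or value ranges toward the interesting cases.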
Trace-driven measurement
• Trace-driven:
– Instrument the program.
– Save information about the path.
• Requires modifying the program.
• Trace files are large.
• Widely used for cache analysis.
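A minimal sketch of such instrumentation, with hypothetical TRACE_BLOCK and traced_abs names: each basic block is assigned an ID, and entering a block records that ID. Real trace tools write to (large) trace files rather than a small in-memory buffer:

```c
/* Trace instrumentation sketch: record the ID of every basic block
   entered, so the execution path can be reconstructed afterward. */
#define MAX_TRACE 1024
static int trace_buf[MAX_TRACE];
static int trace_len = 0;

#define TRACE_BLOCK(id) \
    do { if (trace_len < MAX_TRACE) trace_buf[trace_len++] = (id); } while (0)

/* An instrumented function: every basic block logs itself. */
int traced_abs(int v) {
    TRACE_BLOCK(0);              /* entry block */
    if (v < 0) {
        TRACE_BLOCK(1);          /* negative branch */
        v = -v;
    }
    TRACE_BLOCK(2);              /* exit block */
    return v;
}
```

The recorded block sequence is exactly the path information a cache analyzer or path profiler consumes, and it illustrates why trace files grow quickly: every executed block adds an entry.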
Physical measurement
• In-circuit emulator allows tracing.
– Affects execution timing.
• Logic analyzer can measure behavior at pins.
– Address bus can be analyzed to look for events.
– Code can be modified to make events visible.
• Particularly important for real-world input streams.
CPU simulation
• Some simulators model only instruction-level behavior,
not timing, and are therefore less accurate.
• Cycle-accurate simulator provides accurate
clock-cycle timing.
– Simulator models CPU internals.
– Simulator writer must know how CPU works.
SimpleScalar FIR filter simulation
int x[N] = {8, 17, … };
int c[N] = {1, 2, … };

main() {
    int i, k, f = 0;    /* f must be initialized before use */
    for (k = 0; k < COUNT; k++)
        for (i = 0; i < N; i++)
            f += c[i] * x[i];
}
N        total sim cycles    sim cycles per filter execution
100            25854                      259
1,000         155759                      156
10,000       1451840                      145
Performance optimization motivation
• Embedded systems must often meet
deadlines.
– Faster may not be fast enough.
• Need to be able to analyze execution time.
– Worst-case, not typical.
• Need techniques for reliably improving
execution time.
Programs and performance analysis
• Best results come from analyzing optimized
instructions, not high-level language code:
– non-obvious translations of HLL statements into
instructions;
– code may move;
– cache effects are hard to predict.
Software performance optimization
Loop optimizations
• Loops are important targets for optimization
because programs with loops tend to spend a
lot of time executing those loops.
• There are three important techniques in
optimizing loops:
– code motion,
– induction variable elimination, and
– strength reduction (e.g., x*2 -> x<<1).
Code motion
for (i=0; i<N*M; i++)
z[i] = a[i] + b[i];
Code motion lets us move unnecessary code out of a loop.
The loop bound N*M is a loop-invariant computation, so we
can avoid N*M - 1 unnecessary evaluations of it by computing
it once before the loop, as shown in Figure 2.
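A hedged sketch of the transformed loop follows; the hoisted variable name limit is ours, not from the text:

```c
#define N 4
#define M 3

/* Code motion: in the original loop the bound N*M is conceptually
   recomputed on every test.  Hoisting it into a variable evaluates
   the loop-invariant expression only once. */
void add_arrays(const int *a, const int *b, int *z) {
    int i;
    int limit = N * M;           /* hoisted out of the loop */
    for (i = 0; i < limit; i++)
        z[i] = a[i] + b[i];
}
```

Many compilers perform this hoisting automatically, but doing it explicitly documents the intent and guarantees the saving even at low optimization levels.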
Induction variable elimination
• Induction variable: A nested loop is a good
example of the use of induction variables.
• Consider loop:
for (i=0; i<N; i++)
for (j=0; j<M; j++)
z[i][j] = b[i][j];
• Rather than recompute i*M+j for each array in
each iteration, share induction variable between
arrays, increment at end of loop body.
An induction variable is a variable whose value is derived from the loop iteration
variable’s value. The compiler often introduces induction variables to help
it implement the loop.
The compiler uses induction variables to help it address the
arrays.
Let us rewrite the loop in C using induction variables and pointers.
In the rewritten loop, zptr and bptr are pointers to the heads of the z and b arrays,
and zbinduct is the shared induction variable.
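The rewritten loop described above might look like the following sketch; the array dimensions are illustrative:

```c
#define N 3
#define M 4

/* Induction variable elimination: instead of recomputing i*M + j
   for each array access, a single shared induction variable
   zbinduct is incremented once per iteration and used to address
   both arrays through pointers. */
void copy_matrix(int z[N][M], int b[N][M]) {
    int i, j;
    int zbinduct = 0;            /* shared induction variable */
    int *zptr = &z[0][0];
    int *bptr = &b[0][0];
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            *(zptr + zbinduct) = *(bptr + zbinduct);
            zbinduct++;
        }
    }
}
```

The saving is one multiply and one add per element, replaced by a single increment.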
Strength reduction
• Strength reduction helps us reduce the cost of a
loop iteration.
• Consider the following assignment:
y = x * 2;
– In integer arithmetic, we can use a left shift rather
than a multiplication by 2 (as long as we properly
keep track of overflows).
– If the shift is faster than the multiply, we probably
want to perform the substitution.
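A sketch of the substitution; it uses an unsigned operand so the shift is well defined (for signed values, overflow and sign handling need the care noted above):

```c
/* Strength reduction: replace a multiply by a power of two with a
   cheaper shift.  x << 1 produces the same result as x * 2 for
   unsigned values that do not overflow. */
unsigned times_two(unsigned x) {
    return x << 1;
}
```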
Performance optimization hints
• Use registers efficiently.
• Use page mode memory accesses.
• Analyze cache behavior:
– instruction conflicts can be handled by rewriting
code or rescheduling;
– conflicting scalar data can easily be moved;
– conflicting array data can be moved, padded.
PROGRAM-LEVEL ENERGY AND
POWER ANALYSIS
AND OPTIMIZATION
Energy/power optimization
• Energy: ability to do work.
– Most important in battery-powered systems.
• Power: energy per unit time.
– Important even in wall-plug systems---power
becomes heat.
Opportunities for saving power
■ We may be able to replace the algorithms with
others that do things in clever ways that consume
less power.
■ Memory accesses are a major component of
power consumption in many applications. By
optimizing memory accesses we may be able to
significantly reduce power.
■ We may be able to turn off parts of the system—
such as subsystems of the CPU, chips in the
system, and so on—when we do not need them in
order to save power.
Measuring energy consumption
• Execute a small loop, measure current:
The figure shows a setup that executes
the code under test over and over in
a loop. By measuring
the current flowing into
the CPU, we measure
the power consumption of
the complete loop,
including both the body
and the loop overhead.
By separately measuring
the power consumption of
a loop with no body, we can subtract
the overhead and determine the energy
consumed by the loop body alone.
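The subtraction step can be expressed as a simple calculation (the function name is ours):

```c
/* Isolate the energy of the loop body: measure the full loop and an
   empty loop under the same conditions, subtract, and divide by the
   number of iterations executed during the measurement. */
double body_energy_per_iteration(double e_full_loop,
                                 double e_empty_loop,
                                 long iterations) {
    return (e_full_loop - e_empty_loop) / (double)iterations;
}
```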
Sources of energy consumption
• Relative energy per operation (Catthoor et al):
– memory transfer: 33
– external I/O: 10
– SRAM write: 9
– SRAM read: 4.4
– multiply: 3.6
– add: 1
Cache behavior is important
• Energy consumption has a sweet spot as
cache size changes:
– cache too small: program thrashes, burning
energy on external memory accesses;
– cache too large: cache itself burns too much
power.
Optimizing for energy
• Use registers efficiently.
• Identify and eliminate cache conflicts.
• Moderate loop unrolling eliminates some loop
overhead instructions.
• Eliminate pipeline stalls.
• Inlining procedures may help: reduces linkage,
but may increase cache thrashing.
Efficient loops
• General rules:
– Don’t use function calls.
– Keep loop body small to enable local repeat (only
forward branches).
– Use unsigned integer for loop counter.
– Use <= to test loop counter.
– Make use of compiler---global optimization,
software pipelining.
Program validation and testing
• But does it work?
• Concentrate here on functional verification.
• Major testing strategies:
– Black box doesn’t look at the source code.
– Clear box (white box) does look at the source
code.
Clear-box testing
• Examine the source code to determine whether it
works:
– Can you actually exercise a path?
– Do you get the value you expect along a path?
• Testing procedure:
– Controllability: Provide program with inputs.
– Execute.
– Observability: examine outputs.
How much testing is enough?
• Exhaustive testing is impractical.
• One important measure of test quality---bugs
escaping into field.
• Good organizations can test software to give very low
field bug report rates.
• Error injection measures test quality:
– Add known bugs.
– Run your tests.
– Determine % injected bugs that are caught.
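The catch rate from error injection is a simple percentage; a sketch with an assumed function name:

```c
/* Error injection as a test-quality metric: seed a known number of
   bugs, run the test suite, and report the percentage caught. */
double injected_bug_catch_rate(int injected, int caught) {
    if (injected <= 0)
        return 0.0;              /* no injected bugs: rate undefined */
    return 100.0 * (double)caught / (double)injected;
}
```

A low catch rate on injected bugs suggests the suite would also miss a similar fraction of real bugs.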
EC6703 Embedded and Real-Time Systems Design

  • 1.
  • 3. The CPU Bus Memory devices and systems Designing with computing platforms Consumer electronics architecture Platform-level performance analysis Components for embedded programs Models of programs- Assembly, linking and loading – compilation techniques Program level performance analysis Software performance optimization Program level energy and power analysis and optimization Analysis and optimization of program size- Program validation and testing. UNIT II EMBEDDED COMPUTING PLATFORM DESIGN
  • 4. In this topic, we concentrate on bus-based computer systems created using microprocessors, I/O devices, and memory components. The microprocessor is an important element of the embedded computing system, but it cannot do its job without memories and I/O devices. We need to understand how to interconnect microprocessors and devices using the CPU bus.
  • 5. The CPU bus, which forms the backbone of the hardware system. A computer system encompasses much more than the CPU; it also includes memory and I/O devices. The bus is the mechanism by which the CPU communicates with memory and devices. A bus is, at a minimum, a collection of wires, but the bus also defines a protocol by which the CPU, memory, and devices communicate. One of the major roles of the bus is to provide an interface to memory , I/O devices. CPU BUS
  • 6. Bus Protocols The basic building block of most bus protocols is the four-cycle handshake, illustrated in Figure The four cycles are described below. 1. Device 1 raises its output to signal an enquiry, which tells device 2 that it should get ready to listen for data. 2. When device 2 is ready to receive, it raises its output to signal an acknowledgment. At this point, devices 1 and 2 can transmit or receive. 3. Once the data transfer is complete, device 2 lowers its output, signaling that it has received the data. 4. After seeing that ack has been released, device 1 lowers its output. At the end of the handshake, both handshaking signals are low, just as they were at the start of the handshake.
  • 7. A typical microprocessor bus. The term bus is used in two ways. The most basic use is as a set of related wires, such as address wires. However, the term may also mean a protocol for communicating between components. The fundamental bus operations are reading and writing.
  • 8. The behavior of a bus is most often specified as a timing diagram. A timing diagram shows how the signals on a bus vary over time.
  • 10. DMA Standard bus transactions require the CPU to be in the middle of every read and write transaction. However, there are certain types of data transfers in which the CPU does not need to be involved. Direct memory access (DMA) is a bus operation that allows reads and writes not controlled by the CPU. A DMA transfer is controlled by a DMA controller, which requests control of the bus from the CPU
  • 11. Bus mastership • By default, CPU is bus master and initiates transfers. • DMA must become bus master to perform its work. – CPU can’t use bus while DMA operates. • Bus mastership protocol: – Bus request. – Bus grant. • Direct memory access (DMA) performs data transfers without executing instructions. – CPU sets up transfer. – DMA engine fetches, writes. • DMA controller is a separate unit.
  • 12. DMA operation • CPU sets DMA registers for start address, length. • DMA status register controls the unit. • Once DMA is bus master, it transfers automatically. – May run continuously until complete. – May use every nth bus cycle.
  • 14. System bus configurations • Multiple busses allow parallelism: – Slow devices on one bus. – Fast devices on separate bus. • A bridge connects two busses. CPU slow device memory high-speed device bridge slow device
  • 15. ARM AMBA bus • Since the ARM CPU is manufactured by many different vendors, the bus provided off-chip can vary from chip to chip. ARM has created a separate bus specification for single-chip systems. • The AMBA bus [ARM99A] supports CPUs, memories, and peripherals integrated in a system-on-silicon. • Two varieties: – AHB (AMBA high-performance bus) supports pipelining, burst transfers, split transactions, and multiple bus masters.. – APB (AMBA peripherals bus) is simple, lower-speed, lower cost. – All devices are slaves on APB.
  • 16.
  • 17.
  • 18. Memory components • Several types of memory: – DRAM. – SRAM. – Flash. • Each type of memory comes in varying: – Capacities. – Widths.
  • 19. Random-access memory • Dynamic RAM is dense, requires refresh. – Synchronous DRAM is dominant type. – SDRAM uses clock to improve performance, pipeline memory accesses. – DDR(double data rate) SDRAMs • Static RAM is faster, less dense, consumes more power. • For PCs, SIMMs(single in-line memory modules), DIMMs are used. Random-access memories can be both read and written. They are called random access because, unlike magnetic disks, addresses can be read in any order.
  • 21. Read-only memory • ROM may be programmed at factory. • Flash is dominant form of field-programmable ROM. – Electrically erasable, must be block erased. – Random access, but write/erase is much slower than read. – NOR flash is more flexible. – NAND flash is more dense.
  • 22. Flash memory • Non-volatile memory. – Flash can be programmed in-circuit. • Random access for read. • To write: – Erase a block to 1. – Write bits to 0.
  • 23. Flash writing • Write is much slower than read. – 1.6 ms write, 70 ns read. • Blocks are large (approx. 1 Mb). • Writing causes wear that eventually destroys the device. – Modern lifetime approx. 1 million writes.
  • 24. Types of flash • NOR: – Word-accessible read. – Erase by blocks. • NAND: – Read by pages (512-4K bytes). – Erase by blocks. • NAND is cheaper, has faster erase, sequential access times.
  • 25. I/O DEVICES Timers and Counters ADC / DAC Key boards / Pads LEDs, Displays, Touchscreen
  • 26.
  • 27. Designing with computing platforms (Microprocessors) In this topic , we are going to see., How to create an initial working embedded system How to ensure that the system works properly. by considering possible architectures for embedded computing systems. by studying techniques for designing the hardware components of embedded systems. To describes the use of the PC as an embedded computing platform.
  • 28. System architectures • The architecture of an embedded computing system is the blueprint for implementing that system—it tells you what components you need and how you put them together • Architectures and components: – software; – hardware. • Some software is very hardware-dependent.
  • 29. Hardware platform architecture Contains several elements: • CPU • bus • memory • I/O devices: networking, sensors, actuators, etc. How big/fast much each one be?
  • 30. Software architecture Functional description must be broken into pieces: • division among people • conceptual organization • performance • testability • maintenance Mixing together different types of functionality into a single code module leads to spaghetti code, which has poorly structured control flow, excessive use of global data, and generally unreliable programs.
  • 31. Hardware and software architectures Hardware and software are intimately related: • software doesn’t run without hardware; • how much hardware you need is determined by the software requirements: – speed; – memory.
  • 32. • Designed by CPU manufacturer or others. • Includes CPU, memory, some I/O devices. • May include prototyping section. • CPU manufacturer often gives out evaluation board netlist---can be used as starting point for your custom board design. Evaluation boards
  • 33. Adding logic to a board • Programmable logic devices (PLDs) provide low/medium density logic. • Field-programmable gate arrays (FPGAs) provide more logic and multi-level logic. • Application-specific integrated circuits (ASICs) are manufactured for a single purpose.
  • 34. The PC as a platform • Advantages: – Cheap (for Industries) and easy to get; – rich and familiar software environment. • Disadvantages: – requires a lot of hardware resources; – not well-adapted to real-time. – It is larger, more power hungry, and more expensive than a custom hardware platform would be
  • 35. Typical PC hardware platform CPU CPU bus memory DMA controller timers bus interface bus interface high-speed bus low-speed bus device device intr ctrl ■ ROM holds the boot program. ■ RAM is used for program storage. ■ PCI: standard for high-speed interfacing 33 or 66 MHz. PCI Express. ■ USB (Universal Serial Bus), Fire wire (IEEE 1394): relatively low-cost serial interface with high speed.
  • 36. Software elements • IBM PC uses BIOS (Basic I/O System) to implement low-level functions: – boot-up – minimal device drivers. • BIOS has become a generic term for the lowest-level system software.
  • 37. Example for Single Chip System: StrongARM(SA-1100) • StrongARM system includes: – CPU chip (3.686 MHz clock) – system control module (32.768 kHz clock). – Real-time clock; – operating system timer – general-purpose I/O; – interrupt controller; – power manager controller; – reset controller.
  • 38. Debugging embedded systems • Challenges: – target system may be hard to observe; – target may be hard to control; – may be hard to generate realistic inputs; – setup sequence may be complex.
  • 39. Host/target design • Use a host system to prepare software for target system: target system host system serial line
  • 40. Host-based tools • Cross compiler: – compiles code on host for target system.(i.e., A cross- compiler is a compiler that runs on one type of machine but generates code for another.) • Cross debugger: – displays target state, allows target system to be controlled.
  • 41. Software debuggers • A monitor program residing on the target provides basic debugger functions. • Debugger should have a minimal footprint in memory. • User program must be careful not to destroy the debugger program, but the debugger should be able to recover from some damage caused by user code.
  • 42. Breakpoints • A breakpoint allows the user to stop execution, examine system state, and change state. • Replace the breakpointed instruction with a subroutine call to the monitor program.
  • 43. ARM breakpoints 0x400 MUL r4,r6,r6 0x404 ADD r2,r2,r4 0x408 ADD r0,r0,#1 0x40c B loop uninstrumented code 0x400 MUL r4,r6,r6 0x404 ADD r2,r2,r4 0x408 ADD r0,r0,#1 0x40c BL bkpoint code with breakpoint
  • 44. Breakpoint handler actions • Save registers. • Allow user to examine machine. • Before returning, restore system state. – Safest way to execute the instruction is to replace it and execute in place. – Put another breakpoint after the replaced breakpoint to allow restoring the original breakpoint.
  • 45. In-circuit emulators • A microprocessor in-circuit emulator is a specially-instrumented microprocessor. • Allows you to stop execution, examine CPU state, modify registers.
  • 46. Boundary scan • Simplifies testing of multiple chips on a board. – Registers on pins can be configured as a scan chain. – Used for debuggers, in- circuit emulators.
  • 47. How to exercise code • Run on host system. • Run on target system. • Run in instruction-level simulator. • Run on cycle-accurate simulator. • Run in hardware/software co-simulation environment.
  • 48. Debugging real-time code • Bugs in drivers can cause non-deterministic behavior in the foreground problem. • Bugs may be timing-dependent.
  • 49. CONSUMER ELECTRONICS ARCHITECTURE
  • 50. Logic analyzers • A logic analyzer is an array of low-grade oscilloscopes:
  • 51. Logic analyzer architecture • The analyzer can sample many different signals simultaneously (tens to hundreds) but can display only 0, 1, or changing values for each signal.
  • 52. The logic analyzer records the values on the signals into an internal memory and then displays the results on a display once the memory is full or the run is aborted. A typical logic analyzer can acquire data in either of two modes that are typically called state and timing modes. State and timing mode represent different ways of sampling the values. Timing mode uses an internal clock that is fast enough to take several samples per clock period in a typical system. State mode, on the other hand, uses the system’s own clock to control sampling, so it samples each signal only once per clock cycle. As a result, timing mode requires more memory to store a given number of system clock cycles. Logic analyzer architecture
  • 53. • The system’s data signals are sampled at a latch within the logic analyzer; the latch is controlled by either the system clock or the internal logic analyzer sampling clock, depending on whether the analyzer is being used in state or timing mode. • Each sample is copied into a vector memory under the control of a state machine. • The latch, timing circuitry, sample memory, and controller must be designed to run at high speed since several samples per system clock cycle may be required in timing mode. • After the sampling is complete, an embedded microprocessor takes over to control the display of the data captured in the sample memory. • Logic analyzers typically provide a number of formats for viewing data. One format is a timing diagram format Logic analyzer - Operation
  • 55. System-level performance analysis • Performance depends on all the elements of the system: – CPU. – Cache. – Bus. – Main memory. – I/O device. In this section, we will develop some basic techniques for analyzing the performance of bus-based systems.
  • 56. System level data flows and performance We want to move data from memory to the CPU to process it. To get the data from memory to the CPU we must: ■ read from the memory; ■ transfer over the bus to the cache; and ■ transfer from the cache to the CPU. The time required to transfer from the cache to the CPU is included in the instruction execution time, but the other two times are not. The most basic measure of performance we are interested in is bandwidth →
  • 57. Bandwidth as performance • Bandwidth(the rate at which we can move data) applies to several components: – Memory. – Bus. – CPU fetches. • Different parts of the system run at different clock rates. • Different components may have different widths (bus, memory). • We have to make sure that we apply the right clock rate to each part of the performance estimate when we convert from clock cycles to seconds. • Bandwidth questions often come up when we are transferring large blocks of data. Example
  • 58. Bandwidth and data transfers • Consider the bandwidth provided by only one system component, the bus, for an image of 320 x 240 pixels with each pixel composed of 3 bytes of data. • Video frame: 320 x 240 x 3 = 230,400 bytes, to be transferred in 1/30 sec. • At a transfer rate of 1 byte/μs, one frame takes 0.23 sec (230,400 μs), which is more than 1/30 sec. – Too slow: we have to increase the transfer rate by about 7 times. • We can increase bandwidth in two ways: – increase the clock rate of the bus (e.g., to 2 MHz), – or increase the amount of data transferred per clock cycle (e.g., 4 bytes).
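The "about 7 times" figure above can be checked with a small C helper (a sketch; the function name is mine, not from the slides):

```c
#include <assert.h>

/* Speedup needed to move `bytes` bytes within `deadline` seconds,
   given the current transfer rate in bytes/second. */
double speedup_needed(double bytes, double deadline, double rate) {
    double transfer_time = bytes / rate;  /* time at the current rate */
    return transfer_time / deadline;      /* how many times faster we must go */
}
```

For one 320 x 240 x 3-byte frame at 1 byte/μs (10^6 bytes/sec) against a 1/30 sec deadline, the required speedup works out to 6.912, which the slide rounds to 7.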
  • 59. Bus bandwidth • T: # bus cycles. • P: time/bus cycle. • Total time for transfer: t = TP. • D: data payload length per transfer. • O1 + O2 = overhead O (cycles before and after the payload). • W: width of the bus (units transferred per cycle). • Basic transfer of N units: Tbasic(N) = (D+O)N/W
  • 60. Bus burst transfer bandwidth • T: # bus cycles. • P: time/bus cycle. • Total time for transfer: t = TP. • D: data payload length per transfer. • O1 + O2 = overhead O, paid once per burst. • B: number of transfers per burst. • W: bus width. • Burst transfer of N units: Tburst(N) = (BD+O)N/(BW)
  • 61. Memory aspect ratios • The same capacity can be organized with different widths: 64 M x 1, 16 M x 4, or 8 M x 8.
  • 62. Memory access times • Memory component access times come from the chip data sheet. – Page modes allow faster access for successive transfers on the same page. • If data doesn’t fit naturally into physical words: – A = [(E/w) mod W] + 1
  • 63. Bus performance bottlenecks • Transfer 320 x 240 video frame @ 30 frames/sec = 612,000 bytes/sec. • Is performance bottleneck bus or memory? memory CPU
  • 64. Bus performance bottlenecks, cont’d. • Bus: assume 1 MHz bus, D=1, O=3: – Tbasic = (1+3) x 612,000/2 = 1,224,000 cycles = 1.224 sec. • Memory: try burst mode B=4, width W=0.5, 10 ns cycle: – Tmem = (4*1+4) x 612,000/(4*0.5) = 2,448,000 cycles = 0.02448 sec.
  • 65. Performance spreadsheet
  bus: clock period 1.00E-06, W = 2, D = 1, O = 3, N = 612000, T_basic = 1224000 cycles, t = 1.22E+00 sec
  memory: clock period 1.00E-08, W = 0.5, D = 1, O = 4, B = 4, N = 612000, T_mem = 2448000 cycles, t = 2.45E-02 sec
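The spreadsheet numbers can be reproduced directly from the slides' two bandwidth formulas (a sketch; the function names are mine):

```c
#include <assert.h>

/* Tbasic(N) = (D + O) * N / W : cycles to move N units at W units/cycle,
   with O cycles of overhead per D-unit transfer. */
double t_basic(double D, double O, double N, double W) {
    return (D + O) * N / W;
}

/* Tburst(N) = (B*D + O) * N / (B*W) : B transfers share one overhead. */
double t_burst(double B, double D, double O, double N, double W) {
    return (B * D + O) * N / (B * W);
}
```

With the slide's parameters, t_basic(1, 3, 612000, 2) gives 1,224,000 bus cycles (1.224 sec at 1 μs/cycle) and t_burst(4, 1, 4, 612000, 0.5) gives 2,448,000 memory cycles (0.02448 sec at 10 ns/cycle), so the bus is the bottleneck.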
  • 66. Parallelism • Speed things up by running several units at once. • DMA provides parallelism if CPU doesn’t need the bus: – DMA + bus. – CPU. Transfer with DMA
  • 68. Components for embedded programs In this section, we study in detail the process of programming embedded processors. The creation of embedded programs is at the heart of embedded system design. Embedded code must not only provide rich functionality, it must also often run at a required rate to meet system deadlines, fit into the allowed amount of memory, and meet power consumption requirements. Designing code that simultaneously meets multiple design constraints is a considerable challenge, but luckily there are techniques and tools that we can use to help us through the design process. we consider code for three structures or components that are commonly used in embedded software: 1. The state machine, 2. The circular buffer, 3. The queue.  State machines are well suited to reactive systems such as user interfaces;  Circular buffers and Queues are useful in digital signal processing.
  • 69. Software State Machine • State machine keeps internal state as a variable, changes state based on inputs. • Uses: – control-dominated code; – reactive systems. When inputs appear intermittently rather than as periodic samples, it is often convenient to think of the system as reacting to those inputs. The reaction of most systems can be characterized in terms of the input received and the current state of the system. This leads naturally to a finite-state machine
  • 70. State machine example (Seat belt controller) idle buzzer seated belted no seat/- seat/timer on no belt and no timer/- no belt/timer on belt/- belt/ buzzer off Belt/buzzer on no seat/- no seat/ buzzer off The controller’s job is to turn on a buzzer if a person sits in a seat and does not fasten the seat belt within a fixed amount of time. This system has three inputs and one output. The inputs are a sensor for the seat to know when a person has sat down, a seat belt sensor that tells when the belt is fastened, and a timer that goes off when the required time interval has elapsed. The output is the buzzer.
  • 71. C implementation #define IDLE 0 #define SEATED 1 #define BELTED 2 #define BUZZER 3 switch (state) { case IDLE: if (seat) { state = SEATED; timer_on = TRUE; } break; case SEATED: if (belt) state = BELTED; else if (timer) state = BUZZER; break; …
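The slide's switch statement is truncated; a complete reaction function, following the state diagram on slide 70, might look like this (the output-flag names timer_on and buzzer_on are illustrative, not from the slide):

```c
#include <assert.h>

enum { IDLE, SEATED, BELTED, BUZZER };

/* One reaction step of the seat belt controller: given the current
   state and the three inputs, return the next state and set outputs. */
int next_state(int state, int seat, int belt, int timer,
               int *timer_on, int *buzzer_on) {
    switch (state) {
    case IDLE:
        if (seat) { state = SEATED; *timer_on = 1; }
        break;
    case SEATED:
        if (belt) state = BELTED;        /* belt fastened before timeout */
        else if (timer) state = BUZZER;  /* timer expired first */
        break;
    case BELTED:
        if (!seat) state = IDLE;         /* occupant left the seat */
        else if (!belt) state = SEATED;  /* belt unfastened again */
        break;
    case BUZZER:
        if (belt) state = BELTED;        /* belt fastened: stop buzzing */
        else if (!seat) state = IDLE;    /* occupant left: stop buzzing */
        break;
    }
    *buzzer_on = (state == BUZZER);      /* buzzer sounds only in BUZZER */
    return state;
}
```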
  • 72. Circular buffer • Commonly used in signal processing: – new data constantly arrives; – each datum has a limited lifetime. • Use a circular buffer to hold the data stream. – Example: an FIR filter, which for each sample must emit one output that depends on the values of the last n inputs. The circular buffer is a data structure that lets us handle streaming data in an efficient way.
  • 73. Circular buffer x1 x2 x3 x4 x5 x6 t1 t2 t3 Data stream x1 x2 x3 x4 Circular buffer x5 x6 x7
  • 74. Circular buffers • Indexes locate currently used data, current input data: d1 d2 d3 d4 time t1 use input d5 d2 d3 d4 time t1+1 use input
  • 75. Circular buffer implementation: FIR filter
  int circ_buffer[N], circ_buffer_head = 0;
  int c[N]; /* coefficients */
  …
  int ibuf, ic;
  for (f=0, ibuf=circ_buffer_head, ic=0; ic<N; ibuf=(ibuf==N-1 ? 0 : ibuf+1), ic++)
      f = f + c[ic]*circ_buffer[ibuf];
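A runnable sketch of the same loop, adding an insertion routine so the buffer can be exercised (N is shrunk to 4 and the name circ_buffer_add is mine; the head index points at the oldest sample):

```c
#include <assert.h>

#define N 4
static int circ_buffer[N];
static int circ_buffer_head = 0; /* index of the oldest sample */
static int c[N] = {1, 2, 3, 4};  /* filter coefficients */

/* Overwrite the oldest sample with the newest, then advance head. */
void circ_buffer_add(int x) {
    circ_buffer[circ_buffer_head] = x;
    circ_buffer_head = (circ_buffer_head == N - 1 ? 0 : circ_buffer_head + 1);
}

/* FIR tap sum, walking the buffer from the oldest sample with wraparound. */
int fir(void) {
    int f, ibuf, ic;
    for (f = 0, ibuf = circ_buffer_head, ic = 0; ic < N;
         ibuf = (ibuf == N - 1 ? 0 : ibuf + 1), ic++)
        f = f + c[ic] * circ_buffer[ibuf];
    return f;
}
```

After inserting samples 1, 2, 3, 4 the sum is 1*1 + 2*2 + 3*3 + 4*4 = 30; inserting 5 evicts the oldest sample and the next sum is 1*2 + 2*3 + 3*4 + 4*5 = 40.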
  • 76. Queues • Queues are also used in signal processing and event processing. • Queues are used whenever data may arrive and depart at somewhat unpredictable times or when variable amounts of data may arrive. • A queue is often referred to as an elastic buffer, which holds data that arrives irregularly. • One way to build a queue is with a linked list. – This approach allows the queue to grow to an arbitrary size. • Another way to design the queue is to use an array to hold all the data.
  • 77. Buffer-based queues (to manage interrupt-driven data) #define Q_SIZE 32 #define Q_MAX (Q_SIZE-1) int q[Q_SIZE], head, tail; void initialize_queue() { head = tail = 0; } void enqueue(int val) { if (((tail+1)%Q_SIZE) == head) error(); q[tail]=val; if (tail == Q_MAX) tail = 0; else tail++; } int dequeue() { int returnval; if (head == tail) error(); returnval = q[head]; if (head == Q_MAX) head = 0; else head++; return returnval; }
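The same queue, made self-contained and shrunk to a 4-entry ring so the wraparound is easy to exercise (the error handler is a placeholder; note the array needs Q_SIZE elements because indices run up to Q_MAX, and the one-slot-empty full test means the ring holds at most Q_SIZE - 1 items):

```c
#include <assert.h>
#include <stdlib.h>

#define Q_SIZE 4
#define Q_MAX (Q_SIZE - 1)

static int q[Q_SIZE], head, tail;

static void error(void) { abort(); } /* placeholder error handler */

void initialize_queue(void) { head = tail = 0; }

void enqueue(int val) {
    if (((tail + 1) % Q_SIZE) == head) error(); /* queue full */
    q[tail] = val;
    if (tail == Q_MAX) tail = 0; else tail++;
}

int dequeue(void) {
    int returnval;
    if (head == tail) error(); /* queue empty */
    returnval = q[head];
    if (head == Q_MAX) head = 0; else head++;
    return returnval;
}
```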
  • 79. Models of programs • Source code is not a good representation for programs: – clumsy; – leaves much information implicit. • Compilers derive intermediate representations to manipulate and optimize the program. In this section , we develop models for programs that are more general than source code like ALP, C …and so on... Our fundamental model for programs is the control/data flow graph (CDFG).
  • 80. Data flow graph • DFG: data flow graph. • Does not represent control. • Models a basic block: straight-line code with a single entry and a single exit. • Describes the minimal ordering requirements on operations.
  • 81. Single assignment form x = a + b; y = c - d; z = x * y; y = b + d; original basic block x = a + b; y = c - d; z = x * y; y1 = b + d; single assignment form
  • 82. Data flow graph x = a + b; y = c - d; z = x * y; y1 = b + d; single assignment form + - + * DFG a b c d z x y y1
  • 83. DFGs and partial orders • The DFG defines only a partial order on operations: a+b and c-d may execute in either order, as may b+d and x*y. • Any pair of operations not connected by a path in the DFG can be done in either order.
  • 84. Control-data flow graph • CDFG: represents control and data. • Uses data flow graphs as components. • Two types of nodes: – decision; – data flow.
  • 85. Data flow node Encapsulates a data flow graph: Write operations in basic block form for simplicity. x = a + b; y = c + d
  • 87. CDFG example if (cond1) bb1(); else bb2(); bb3(); switch (test1) { case c1: bb4(); break; case c2: bb5(); break; case c3: bb6(); break; } cond1 bb1() bb2() bb3() bb4() test1 bb5() bb6() T F c1 c2 c3
  • 88. for loop for (i=0; i<N; i++) loop_body(); for loop i=0; while (i<N) { loop_body(); i++; } equivalent i<N loop_body() T F i=0
  • 90. Assembly, linking, loading • Assembly and linking are the last steps in the compilation process. • They turn a list of instructions into an image of the program’s bits in memory. • Loading actually puts the program in memory so that it can be executed. • Flow: HLL source → compiler → assembly → assembler → object code → linker → executable binary → loader.
  • 91. • As the figure shows, most compilers do not directly generate machine code, but instead create the instruction-level program in the form of human-readable assembly language. • The assembler’s job is to translate symbolic assembly language statements into bit-level representations of instructions known as object code. • A linker allows a program to be stitched together out of several smaller pieces. The linker operates on the object files created by the assembler and modifies the assembled code to make the necessary links between files. • The linker, which produces an executable binary file. • That file may not necessarily be located in the CPU’s memory, however, unless the linker happens to create the executable directly in RAM. • The program that brings the program into memory for execution is called a loader. Assembly , linking, loading
  • 92. Assemblers • Major tasks: – generate binary for symbolic instructions; – translate labels into addresses; – handle pseudo-ops (data, etc.). • Generally one-to-one translation. • Assembly labels: ORG 100 label1 ADR r4,c
  • 93. Pseudo-operations • Pseudo-ops do not generate instructions: – ORG sets program location. – EQU generates symbol table entry without advancing PLC. – Data statements define data blocks.
  • 94. Linking • Combines several object modules into a single executable module. • Jobs: – put modules in order; – resolve labels across modules.
  • 95. Dynamic linking • Some operating systems link modules dynamically at run time: – shares one copy of library among all executing programs; – allows programs to be updated with new versions of libraries.
  • 96. COMPILATION TECHNIQUES It is useful to understand how a high-level language program is translated into instructions. Understanding how the compiler works can help you know when you cannot rely on the compiler. Next, because many applications are also performance sensitive, understanding how code is generated can help you meet your performance goals, either by writing high-level code that gets compiled into the instructions you want or by recognizing when you must write your own assembly code. Compilation combines translation and optimization.
  • 97. Compilation • Compiler determines quality of code: – use of CPU resources; – memory access scheduling; – code size.
  • 98. Basic compilation phases HLL parsing, symbol table machine-independent optimizations machine-dependent Optimizations assembly The high-level language program is parsed to break it into statements and expressions. In addition, a symbol table is generated, which includes all the named objects in the program. Simplifying arithmetic expressions is one example of a machine- independent optimization. Instruction –level optimization and code generation
  • 99. Statement translation and optimization • Source code is translated into intermediate form such as CDFG. • CDFG is transformed/optimized. • CDFG is translated into instructions with optimization decisions. • Instructions are further optimized.
  • 100. Compiling an arithmetic expression a*b + 5*(c-d) • The expression DFG has four operation nodes (*, -, *, +) over the inputs a, b, c, d and the constant 5; W, X, Y, Z are temporary variables.
  • 101. Compilation of arithmetic expressions, cont’d. ADR r4,a LDR r1,[r4] ADR r4,b LDR r2,[r4] ADD r3,r1,r2 ADR r4,c LDR r1,[r4] ADR r4,d LDR r5,[r4] SUB r6,r1,r5 MUL r7,r6,#5 ADD r8,r7,r3
  • 102. Similarly for Control code generation if (a+b > 0) x = 5; else x = 7; a+b>0 x=5 x=7
  • 103. Control code generation, cont’d. ADR r5,a LDR r1,[r5] ADR r5,b LDR r2,[r5] ADD r3,r1,r2 BLE label3 LDR r3,#5 ADR r5,x STR r3,[r5] B stmtent label3 LDR r3,#7 ADR r5,x STR r3,[r5] stmtent ...
  • 104. Procedure linkage • Need code to: – call and return; – pass parameters and results. • Parameters and returns are passed on stack. – Procedures with few parameters may use registers. Another major code generation problem is the creation of procedures
  • 105. Procedure stacks proc1(int a) { proc2(5); } • The stack grows as procedures are called: proc1’s frame is pushed, then proc2’s. • SP: stack pointer, defines the end of the current frame. • FP: frame pointer, defines the end of the last frame. • The argument 5 is accessed relative to SP. • When a new procedure is called, the SP and FP are modified to push another frame onto the stack.
  • 106. ARM procedure linkage • APCS (ARM Procedure Call Standard): – r0-r3 pass parameters into procedure. Extra parameters are put on stack frame. – r0 holds return value. – r4-r7 hold register values. – r11 is frame pointer, r13 is stack pointer. – r10 holds limiting address on stack size to check for stack overflows.
  • 107. Data structures • Different types of data structures use different data layouts. • Some offsets into data structure can be computed at compile time, others must be computed at run time. • An array element must in general be computed at run time, since the array index may change. • Let us first consider one-dimensional arrays: The compiler must also translate references to data structures into references to raw memories. In general, this requires address computations.
  • 108. One-dimensional arrays • C array name points to 0th element: a[0] a[1] a[2] a = *(a + 1)
  • 109. Two-dimensional arrays • Row-major layout (as used by C): a[0,0] a[0,1] a[1,0] a[1,1] ... • For an N x M array, element a[i][j] lives at linear offset i*M + j: = a[i*M+j]
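Both address computations can be checked directly in C (a sketch; M is fixed at 4 and the helper names are mine):

```c
#include <assert.h>

#define M 4  /* number of columns in the 2-D array */

/* a[i] is defined as *(a + i): the array name points at element 0. */
int get1d(const int *a, int i) { return *(a + i); }

/* Row-major layout: element [i][j] of an N x M array sits at
   linear offset i*M + j from the start of the array. */
int get2d(const int *flat, int i, int j) { return flat[i * M + j]; }
```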
  • 110. Structures • Fields within structures are at static offsets: struct { int field1; char field2; } mystruct; struct mystruct a, *aptr = &a; • With a 4-byte int, field2 is 4 bytes past the start of the struct: *((char *)aptr + 4)
  • 111. Using your compiler • Understand various optimization levels (-O1, - O2, etc.) • Look at mixed compiler/assembler output. • Modifying compiler output requires care: – correctness; – loss of hand-tweaked code.
  • 112. Interpreters and JIT (Just-In-Time) compilers • Programs are not always compiled and then separately executed. In some cases, it may make sense to translate the program into instructions during execution. Two well-known techniques for on-the-fly translation are interpretation and just-in-time (JIT) compilation. • Interpreter: translates and executes program statements on-the-fly, one statement at a time. • The interpreter sits between the program and the machine. • The interpreter may or may not generate an explicit piece of code to represent the statement. • Because the interpreter translates only a very small piece of the program at any given time, only a small amount of memory is used to hold intermediate representations of the program.
  • 113. JIT compiler: compiles small sections of code into instructions during program execution. Eliminates some translation overhead. Often requires more memory. Best suited for Java environments Interpreters and JIT(Just In-Time)compilers A JIT compiler is somewhere between an interpreter and a stand- alone compiler. A JIT compiler produces executable code segments for pieces of the program. However, it compiles a section of the program (such as a function) only when it knows it will be executed. Unlike an interpreter, it saves the compiled version of the code so that the code does not have to be retranslated the next time it is executed. The JIT compiler usually generates machine code directly rather than building intermediate program representation data structures such as the CDFG.
  • 115. Program design and analysis • Program-level performance analysis. • Optimizing for: – Execution time. – Energy/power. – Program size. • Program validation and testing.
  • 116. Program-level performance analysis • Need to understand performance in detail: – Real-time behavior, not just typical. – On complex platforms. • Program performance is not the same as CPU performance: – Pipeline and cache are only windows into program behavior. – We must analyze the entire program.
  • 117. Complexities of analyzing program performance • The execution time of a program often varies with the input data values because those values select different execution paths in the program. - For example, loops • Cache effects. – The cache’s behavior depends in part on the data values input to the program. • Instruction-level performance variations: – Pipeline interlocks. – Fetch times.
  • 118. How to measure program performance • Simulate execution of the CPU (Simulator). – Makes CPU state visible. – Be careful for some microprocessor performance simulators are not 100% accurate, and simulation of I/O- intensive code may be difficult. – Also measures execution time of program • Measure on real CPU using timer. – A timer connected to the microprocessor bus can be used to measure performance of executing sections of code. – Requires modifying the program to control the timer. • Measure on real CPU using logic analyzer. – By measuring the start and stop times of a code segment – Requires events visible on the pins.
  • 119. Program performance metrics • Average-case execution time. – Typically used in application programming. • Worst-case execution time. – A component in deadline satisfaction. • Best-case execution time. – Task-level interactions can cause best-case program behavior to result in worst-case system behavior.
  • 120. Elements of program performance • Basic program execution time formula: – execution time = program path + instruction timing • Solving these problems independently helps simplify analysis. – Easier to separate on simpler CPUs. • Accurate performance analysis requires: – Assembly/binary code. – Execution platform.
  • 121. Data-dependent paths in an if statement if (a || b) { /* T1 */ if ( c ) /* T2 */ x = r*s+t; /* A1 */ else y=r+s; /* A2 */ z = r+s+u; /* A3 */ } else { if ( c ) /* T3 */ y = r-t; /* A4 */ } a b c path 0 0 0 T1=F, T3=F: no assignments 0 0 1 T1=F, T3=T: A4 0 1 0 T1=T, T2=F: A2, A3 0 1 1 T1=T, T2=T: A1, A3 1 0 0 T1=T, T2=F: A2, A3 1 0 1 T1=T, T2=T: A1, A3 1 1 0 T1=T, T2=F: A2, A3 1 1 1 T1=T, T2=T: A1, A3
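The truth table above can be checked by wrapping the slide's if statement in a small harness that records which assignments execute (the harness and its output encoding are mine):

```c
#include <assert.h>
#include <string.h>

/* Record which assignments (A1..A4) run for inputs a, b, c,
   appending their names to `out`. */
void paths(int a, int b, int c, char *out) {
    out[0] = '\0';
    if (a || b) {            /* T1 */
        if (c)               /* T2 */
            strcat(out, "A1 ");
        else
            strcat(out, "A2 ");
        strcat(out, "A3 ");
    } else {
        if (c)               /* T3 */
            strcat(out, "A4 ");
    }
}
```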
  • 122. Paths in a loop for (i=0, f=0; i<N; i++) f = f + c[i] * x[i]; i=0 f=0 i<N f = f + c[i] * x[i] i = i + 1 N Y Loop exit
  • 123. Instruction timing • Not all instructions take the same amount of time. – Multi-cycle instructions (RISC, Fixed length instruction) – Fetches. • Execution times of instructions are not independent. (many CPUs use register bypassing to speed up instruction sequences when the result of one instruction is used in the next instruction.) – Pipeline interlocks. – Cache effects. • Execution times may vary with operand value. – This is clearly true of floating-point instructions in which a different number of iterations may be required to calculate the result – Some multi-cycle integer operations. Once we know the execution path of the program, we have to measure the execution time of the instructions executed along that path. However , even ignoring cache effects, this technique is simplistic for the reasons summarized below.
  • 124. Measurement-driven performance analysis • Not so easy as it sounds: – Must actually have access to the CPU. – Must know data inputs that give worst/best case performance. – Must make state visible. • Still an important method for performance analysis.
  • 125. Feeding the program • Need to know the desired input values. • May need to write software scaffolding to generate the input values. • Software scaffolding may also need to examine outputs to generate feedback-driven inputs.
  • 126. Trace-driven measurement • Trace-driven: – Instrument the program. – Save information about the path. • Requires modifying the program. • Trace files are large. • Widely used for cache analysis.
  • 127. Physical measurement • In-circuit emulator allows tracing. – Affects execution timing. • Logic analyzer can measure behavior at pins. – Address bus can be analyzed to look for events. – Code can be modified to make events visible. • Particularly important for real-world input streams.
  • 128. CPU simulation • Some simulators are less accurate. • Cycle-accurate simulator provides accurate clock-cycle timing. – Simulator models CPU internals. – Simulator writer must know how CPU works.
  • 129. SimpleScalar FIR filter simulation int x[N] = {8, 17, … }; int c[N] = {1, 2, … }; main() { int i, k, f; for (k=0; k<COUNT; k++) for (i=0; i<N; i++) f += c[i]*x[i]; }
  N | total sim cycles | sim cycles per filter execution
  100 | 25,854 | 259
  1,000 | 155,759 | 156
  10,000 | 1,451,840 | 145
  • 130. Performance optimization motivation • Embedded systems must often meet deadlines. – Faster may not be fast enough. • Need to be able to analyze execution time. – Worst-case, not typical. • Need techniques for reliably improving execution time.
  • 131. Programs and performance analysis • Best results come from analyzing optimized instructions, not high-level language code: – non-obvious translations of HLL statements into instructions; – code may move; – cache effects are hard to predict.
  • 133. Loop optimizations • Loops are important targets for optimization because programs with loops tend to spend a lot of time executing those loops. • There are three important techniques in optimizing loops: – code motion, – induction variable elimination, and – Strength reduction (x*2 -> x<<1).
  • 134. Code motion for (i=0; i<N*M; i++) z[i] = a[i] + b[i]; Code motion lets us move unnecessary code out of a loop. Here the loop test recomputes N*M on every iteration; we can avoid N*M - 1 unnecessary evaluations of that expression by computing it once before the loop.
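The transformation can be written out as before/after versions of the slide's loop (a sketch; the function names are mine, and both versions compute the same result):

```c
#include <assert.h>

/* Before: the loop test recomputes N*M on every iteration. */
void sum_before(const int *a, const int *b, int *z, int N, int M) {
    for (int i = 0; i < N * M; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the loop-invariant N*M is hoisted out. */
void sum_after(const int *a, const int *b, int *z, int N, int M) {
    int limit = N * M; /* computed once, before the loop */
    for (int i = 0; i < limit; i++)
        z[i] = a[i] + b[i];
}
```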
  • 135. Induction variable elimination • Induction variable: A nested loop is a good example of the use of induction variables. • Consider loop: for (i=0; i<N; i++) for (j=0; j<M; j++) z[i,j] = b[i,j]; • Rather than recompute i*M+j for each array in each iteration, share induction variable between arrays, increment at end of loop body. An induction variable is a variable whose value is derived from the loop iteration variable’s value. The compiler often introduces induction variables to help it implement the loop.
  • 136. The compiler uses induction variables to help it address the arrays. The loop can be rewritten in C using induction variables and pointers, where zptr and bptr are pointers to the heads of the z and b arrays and zbinduct is the shared induction variable.
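The rewritten loop the slide describes did not survive into this transcript; a sketch of what it looks like (following the naming in the slide: zptr and bptr are the array heads, zbinduct the shared induction variable; N and M are fixed small here):

```c
#include <assert.h>

#define N 2
#define M 3

/* Copy b into z without recomputing i*M + j for each array reference:
   the shared induction variable zbinduct is stepped instead. */
void copy_with_induction(int *zptr, int *bptr) {
    int i, j, zbinduct;
    for (i = 0; i < N; i++) {
        zbinduct = i * M;                 /* one multiply per row */
        for (j = 0; j < M; j++) {
            *(zptr + zbinduct) = *(bptr + zbinduct);
            zbinduct++;                   /* replaces recomputing i*M + j */
        }
    }
}
```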
  • 137. • Strength reduction helps us reduce the cost of a loop iteration. • Consider the following assignment: y = x * 2; – In integer arithmetic, we can use a left shift rather than a multiplication by 2 (as long as we properly keep track of overflows). – If the shift is faster than the multiply, we probably want to perform the substitution. Strength reduction
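The substitution is easy to state in code (a sketch; for nonnegative integers without overflow the two forms are equivalent):

```c
#include <assert.h>

/* Multiply by 2, as written in the source. */
int double_mul(int x)   { return x * 2; }

/* Strength-reduced form: a left shift by 1. */
int double_shift(int x) { return x << 1; }
```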
  • 138. Performance optimization hints • Use registers efficiently. • Use page mode memory accesses. • Analyze cache behavior: – instruction conflicts can be handled by rewriting code, rescheduling; – conflicting scalar data can easily be moved; – conflicting array data can be moved, padded.
  • 139. PROGRAM-LEVEL ENERGY AND POWER ANALYSIS AND OPTIMIZATION
  • 140. Energy/power optimization • Energy: ability to do work. – Most important in battery-powered systems. • Power: energy per unit time. – Important even in wall-plug systems---power becomes heat.
  • 141. Opportunities for saving power ■ We may be able to replace the algorithms with others that do things in clever ways that consume less power. ■ Memory accesses are a major component of power consumption in many applications. By optimizing memory accesses we may be able to significantly reduce power. ■ We may be able to turn off parts of the system— such as subsystems of the CPU, chips in the system, and so on—when we do not need them in order to save power.
  • 142. Measuring energy consumption • Execute a small loop, measure current: The setup executes the code under test over and over in a loop. By measuring the current flowing into the CPU, we are measuring the power consumption of the complete loop, including both the body and other code. By separately measuring the power consumption of a loop with no body, we can subtract out the loop overhead and determine the power consumed by the code under test alone.
  • 143. Sources of energy consumption • Relative energy per operation (Catthoor et al): – memory transfer: 33 – external I/O: 10 – SRAM write: 9 – SRAM read: 4.4 – multiply: 3.6 – add: 1
  • 144. Cache behavior is important • Energy consumption has a sweet spot as cache size changes: – cache too small: program thrashes, burning energy on external memory accesses; – cache too large: cache itself burns too much power.
  • 145. Optimizing for energy • Use registers efficiently. • Identify and eliminate cache conflicts. • Moderate loop unrolling eliminates some loop overhead instructions. • Eliminate pipeline stalls. • Inlining procedures may help: reduces linkage, but may increase cache thrashing.
  • 146. Efficient loops • General rules: – Don’t use function calls. – Keep loop body small to enable local repeat (only forward branches). – Use unsigned integer for loop counter. – Use <= to test loop counter. – Make use of compiler---global optimization, software pipelining.
  • 147. Program validation and testing • But does it work? • Concentrate here on functional verification. • Major testing strategies: – Black box doesn’t look at the source code. – Clear box (white box) does look at the source code.
  • 148. Clear-box testing • Examine the source code to determine whether it works: – Can you actually exercise a path? – Do you get the value you expect along a path? • Testing procedure: – Controllability: Provide program with inputs. – Execute. – Observability: examine outputs.
  • 149. How much testing is enough? • Exhaustive testing is impractical. • One important measure of test quality---bugs escaping into field. • Good organizations can test software to give very low field bug report rates. • Error injection measures test quality: – Add known bugs. – Run your tests. – Determine % injected bugs that are caught.
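The last step, determining the percentage of injected bugs caught, is a one-line computation (a rough heuristic sketch, not from the slide: a catch rate well below 1.0 suggests real bugs are also escaping):

```c
#include <assert.h>

/* Fraction of seeded bugs the test suite caught. */
double catch_rate(int caught, int injected) {
    return (double)caught / (double)injected;
}
```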