Embedded system Design introduction _ Karakola

Telecommunications Engineering Master
Embedded Systems : Introduction
Higher Institute for Applied Sciences and Technology
Dr. Daoud KARAKOULA

Module Objectives
Obtain experience in hardware/software design of embedded systems
Learn how to move from algorithm to architecture
Learn about interfacing protocols in embedded systems
Introduce high-level programming languages for describing embedded systems (ES)
Introduce modern design issues: SoC, NoC, co-design, …

Syllabus
Embedded Processors and Memory
Embedded Systems IO
Interfacing bus, Protocols, Timers, AD and DA, …
Embedded Communications
Parallel/Serial Communication,
Wireless Communication,
Network Communication, …
Processors and FPGA
Design of Embedded Processors (HDL)

Course material: resources references
Embedded System Design: A Unified Hardware/Software Introduction, by
F. Vahid (UCR) and T. Givargis (UCI)
Embedded System Design, Peter Marwedel
The Art of Designing Embedded Systems, Jack Ganssle
Embedded Systems course by Dr. Amer Baghdadi

What are Embedded Systems?
The embedding of microprocessors into equipment and portable devices started before the appearance of the home computer
Embedded systems consume the majority of the microprocessors made today
☺ Huge application domain
☺ Prototyping boards

Definition
An embedded system is nearly any computing system other than a general-purpose computer (desktop, laptop, or mainframe)
An embedded system is a microprocessor-based system that is built to control a function or range of functions and is not designed to be programmed by the end user

Hardware and Software
Modern design requires a designer to have a unified view of software and hardware
Integrated circuit (IC) capacities
Quality compiler availability
Synthesis technology
[Figure: layered view of an embedded system. Software: application software, OS (resource management), software communication (drivers, interrupts). Hardware: hardware communication network, CPUs (DSP, MCU), IPs, memories.]

Examples of Embedded Systems
Front panel of a microwave oven
simple control
MP3 player
32-bit µP
GPS receiver
16-bit µP
Palm Vx:
32-bit µP Motorola DragonBall EZ

Examples
Canon EOS-3 camera
3 µPs, 32-bit RISC CPU runs auto-focus
Nokia 6620-g :
32-bit RISC CPU ARM-9

Examples
iPhone 3G
ARM11 processor
- 64-bit data-path
- 8-stage pipeline
- Can vary in clock speed up to 700MHz or more
- ARM Intelligent Energy Manager (reduce power consumption 25-50%)
- Features vector floating point coprocessor
- ARM Jazelle enabled for embedded Java execution

Characteristics of Embedded Systems
Single-functioned
Real-time operation
Physical size and weight
Low manufacturing cost
Does not use the general-purpose processors found in desktop computers
Must work with restricted memory
Low power
- Power consumption is critical in battery-powered devices

Design Challenges
• How much hardware do we need?
What is the word size of the CPU? The size of the memory?
• How to minimize power?
Reduce memory accesses
• How to speed up our design?
Introduce parallelism, pipelining
• How to reduce the NRE (non-recurring engineering) cost?
NRE cost is the one-time cost of designing the system
• Expertise with both software and hardware is needed to optimize design metrics
• Improving one metric may worsen others
[Figure: competing design metrics: size, performance, power, NRE cost]

Architecture of Embedded Systems

Processors
Desired functionality:
total = 0
for i = 1 to N loop
    total += M[i]
end loop
[Figure: the desired functionality can be mapped onto a general-purpose processor, an application-specific processor, or a single-purpose processor]

Introduction
Processor
Digital circuit that performs computation tasks
Controller and datapath
General-purpose: variety of computation tasks
Single-purpose: one particular computation task
Custom single-purpose: non-standard task
A custom single-purpose processor may be
Fast, small, low power
But with high NRE, longer time-to-market, less flexibility
[Figure: digital camera chip containing a lens, CCD, A2D and D2A converters, CCD preprocessor, pixel coprocessor, µProcessor, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD controller, display controller, and multiplier/accumulator]

Combinational logic: basic logic gates
Two-input gates (inputs x, y; output F):
x y   AND  OR  XOR  NAND  NOR  XNOR
0 0    0   0    0    1     1    1
0 1    0   1    1    1     0    0
1 0    0   1    1    1     0    0
1 1    1   1    0    0     0    1
One-input gates (input x; output F):
x   Buffer  Inverter
0     0        1
1     1        0
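The gate truth tables can be checked mechanically. The following is a minimal C sketch, assuming inputs are restricted to 0 and 1; the function names (`gate_and`, `gate_nand`, and so on) are illustrative and do not come from the slides.

```c
#include <stdio.h>

/* Each gate takes inputs that are 0 or 1 and returns 0 or 1. */
int gate_buf(int x)         { return x; }
int gate_not(int x)         { return !x; }
int gate_and(int x, int y)  { return x & y; }
int gate_or(int x, int y)   { return x | y; }
int gate_xor(int x, int y)  { return x ^ y; }
int gate_nand(int x, int y) { return gate_not(gate_and(x, y)); }
int gate_nor(int x, int y)  { return gate_not(gate_or(x, y)); }
int gate_xnor(int x, int y) { return gate_not(gate_xor(x, y)); }

/* Print one row per input combination for the two-input gates. */
void print_truth_tables(void)
{
    printf("x y AND OR XOR NAND NOR XNOR\n");
    for (int x = 0; x <= 1; x++) {
        for (int y = 0; y <= 1; y++) {
            printf("%d %d  %d  %d   %d    %d   %d    %d\n", x, y,
                   gate_and(x, y), gate_or(x, y), gate_xor(x, y),
                   gate_nand(x, y), gate_nor(x, y), gate_xnor(x, y));
        }
    }
}
```

NAND, NOR and XNOR are written here as an inverter applied to AND, OR and XOR, which is how the inverted gates relate to the plain ones.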

Combinational logic: basic functions
Comparator (n-bit): inputs A, B (n bits each); outputs I, E, S
I = 1 if A < B
E = 1 if A = B
S = 1 if A > B
Adder (n-bit): inputs A, B (n bits each); outputs Sum (n bits), C
Sum = A + B (first n bits)
C = (n+1)'th bit of A + B (C: carry)
With input Cin: Sum = A + B + Cin
Decoder: inputs E(log n - 1) … E0; outputs Q0 … Qn-1
Q0 = 1 if E = 0..00
Q1 = 1 if E = 0..01
…
Qn-1 = 1 if E = 1..11
With enable input en: if en = 0, output = 0..00
ALU (n bits, m ops): inputs A, B (n bits each), S0 … S(log m - 1); output Q (n bits)
Q = A op B, where op is determined by S
May have status outputs: carry, zero, etc.
Mux (m x 1): inputs E(m-1) … E0 (n bits each), S0 … S(log m - 1); output Q (n bits)
Q = E0 if S = 0..00
Q = E1 if S = 0..01
…
Q = Em-1 if S = 1..11
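As a rough behavioural model of these blocks, the C sketch below fixes the width at n = 8 bits and uses a 4-input mux and a 2-to-4 decoder. These sizes, the type names and the function names are assumptions chosen for illustration only.

```c
#include <stdint.h>

/* 8-bit adder with carry-in: sum keeps the low 8 bits, carry is the 9th bit. */
typedef struct { uint8_t sum; int carry; } add_result;

add_result adder8(uint8_t a, uint8_t b, int cin)
{
    unsigned wide = (unsigned)a + (unsigned)b + (unsigned)(cin & 1);
    add_result r;
    r.sum = (uint8_t)(wide & 0xFFu);
    r.carry = (int)((wide >> 8) & 1u);
    return r;
}

/* Comparator outputs: i (A < B), e (A = B), s (A > B). Exactly one is 1. */
typedef struct { int i, e, s; } cmp_result;

cmp_result comparator8(uint8_t a, uint8_t b)
{
    cmp_result r;
    r.i = a < b;
    r.e = a == b;
    r.s = a > b;
    return r;
}

/* 4 x 1 multiplexer: the 2-bit select s picks one of the four inputs. */
uint8_t mux4(const uint8_t e[4], unsigned s)
{
    return e[s & 3u];
}

/* 2-to-4 decoder with enable: one-hot output, or all zeros when en = 0. */
uint8_t decoder2to4(unsigned e, int en)
{
    return en ? (uint8_t)(1u << (e & 3u)) : (uint8_t)0;
}
```

For example, `adder8(200, 100, 0)` yields sum 44 with carry 1, because 300 does not fit in 8 bits.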

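Sequential logic: basic functions
D flip-flop (D-FF): inputs D, clk, Init; output Q
Q+ = 0 if Init = 1, D on the clock edge, Q otherwise
Register (n-bit): inputs E (n bits), load, clk, Init; output Q (n bits)
Q+ = 0 if Init = 1, E if load = 1 on the clock edge, Q otherwise
Shift register (n-bit): inputs E, Shift, clk, Init; output Q
Q = LSB; content is 0 if Init = 1; if Shift = 1 on the clock edge, the content is shifted and E is stored in the MSB
Counter (n-bit): inputs en, clk, Init; output Q (n bits)
Q+ = 0 if Init = 1, Q + 1 if en = 1 on the clock edge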
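The next-state rules of these sequential blocks can be modelled by calling one function per clock edge. This is a sketch under assumptions: 8-bit width, Init taking priority over the other inputs, and illustrative names (`reg8`, `counter_tick`, and so on) that are not from the slides.

```c
#include <stdint.h>

/* State held by an 8-bit sequential block. */
typedef struct { uint8_t q; } reg8;

/* Counter: one clock edge. Init clears, otherwise en increments (wraps at 255). */
void counter_tick(reg8 *c, int init, int en)
{
    if (init)
        c->q = 0;
    else if (en)
        c->q = (uint8_t)(c->q + 1);
}

/* Parallel-load register: one clock edge. Keeps its value unless load = 1. */
void register_tick(reg8 *r, int init, int load, uint8_t e)
{
    if (init)
        r->q = 0;
    else if (load)
        r->q = e;
}

/* Shift register: one clock edge. Content shifts right, serial input e enters the MSB. */
void shift_tick(reg8 *r, int init, int shift, int e)
{
    if (init)
        r->q = 0;
    else if (shift)
        r->q = (uint8_t)((r->q >> 1) | ((e & 1) << 7));
}

/* Serial output of the shift register: its least significant bit. */
int shift_out(const reg8 *r)
{
    return r->q & 1;
}
```

Calling `counter_tick` three times with `en = 1` after an `init` pulse leaves the counter at 3.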

Custom single-purpose processor basic model
[Figure: controller + datapath. The controller receives external control inputs and state signals from the datapath; it produces external control outputs and control signals for the datapath. The datapath receives external data inputs and produces external data outputs.]
[Figure: a view inside the controller and datapath. Controller: state register and combinational logic (control logic and next state). Datapath: registers and functional units.]

Example: Greatest Common Divisor
(a) black-box view: GCD block with inputs clk, go_i, x_i, y_i and output d_o
(b) desired functionality
First, write the algorithm
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
GCD(42, 8): the loop visits 9 (x, y) pairs
Evolution of (x, y): ?
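To experiment with the algorithm in software, lines 5 to 9 can be wrapped in an ordinary C function. This is a sketch: the name `gcd_sub` and the `steps` counter are additions for illustration, and the handshake on `go_i` is left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Subtraction-based GCD, mirroring lines 5 to 9 of the algorithm.
 * Both inputs must be non-zero, otherwise the loop would never terminate.
 * If steps is not NULL, it receives the number of subtractions performed.
 */
unsigned gcd_sub(unsigned x, unsigned y, unsigned *steps)
{
    unsigned n = 0;

    assert(x > 0 && y > 0);
    while (x != y) {
        if (x < y)
            y = y - x;
        else
            x = x - y;
        n++;
    }
    if (steps != NULL)
        *steps = n;
    return x;
}
```

With `unsigned n;`, the call `gcd_sub(42, 8, &n)` returns 2 and sets `n` to 8: eight subtractions, hence nine (x, y) pairs when the starting pair is counted.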

Example: Greatest Common Divisor
Convert the algorithm to a “complex” state machine
Known as FSMD: finite-state machine with datapath
Can use templates to perform such conversion
(b) state diagram (FSMD): one state per numbered line of the algorithm, plus join states (suffix -J)
1: if 1 go to 2 (the !1 exit is never taken)
2: if !go_i go to 2-J; if !(!go_i) go to 3
2-J: go to 2
3: x = x_i, go to 4
4: y = y_i, go to 5
5: if x != y go to 6; if !(x != y) go to 9
6: if x < y go to 7; if !(x < y) go to 8
7: y = y - x, go to 6-J
8: x = x - y, go to 6-J
6-J: go to 5-J
5-J: go to 5
9: d_o = x, go to 1-J
1-J: go to 1
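One way to see what the FSMD does clock by clock is to simulate it, one state transition per call. The sketch below assumes `go_i` is held at 1 and both inputs are non-zero; the enum and function names are illustrative, not part of the course material.

```c
/* One enum value per FSMD state; the J suffix marks join states. */
typedef enum { S1, S2, S2J, S3, S4, S5, S6, S7, S8, S6J, S5J, S9, S1J } gcd_state;

typedef struct {
    gcd_state state;
    unsigned x, y, d_o;
} gcd_fsmd;

/* Advance the FSMD by one clock cycle for the given inputs. */
void fsmd_step(gcd_fsmd *m, int go_i, unsigned x_i, unsigned y_i)
{
    switch (m->state) {
    case S1:  m->state = S2; break;                        /* while (1)        */
    case S2:  m->state = go_i ? S3 : S2J; break;           /* while (!go_i)    */
    case S2J: m->state = S2; break;
    case S3:  m->x = x_i; m->state = S4; break;            /* x = x_i          */
    case S4:  m->y = y_i; m->state = S5; break;            /* y = y_i          */
    case S5:  m->state = (m->x != m->y) ? S6 : S9; break;  /* while (x != y)   */
    case S6:  m->state = (m->x < m->y) ? S7 : S8; break;   /* if (x < y)       */
    case S7:  m->y -= m->x; m->state = S6J; break;         /* y = y - x        */
    case S8:  m->x -= m->y; m->state = S6J; break;         /* x = x - y        */
    case S6J: m->state = S5J; break;
    case S5J: m->state = S5; break;
    case S9:  m->d_o = m->x; m->state = S1J; break;        /* d_o = x          */
    case S1J: m->state = S1; break;
    }
}

/* Run from state 1 until d_o has been written; report the cycles used. */
unsigned fsmd_run(unsigned x_i, unsigned y_i, unsigned *cycles)
{
    gcd_fsmd m = { S1, 0, 0, 0 };
    unsigned n = 0;

    do {
        fsmd_step(&m, 1, x_i, y_i);
        n++;
    } while (m.state != S1J);
    *cycles = n;
    return m.d_o;
}
```

Counting cycles this way makes the cost of the join states visible: every pass through the inner loop spends five transitions for a single subtraction.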

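State diagram templates
Assignment statement
a = b
next statement
Template: one state performing a = b, followed by the state of the next statement
Loop statement
while (cond) {
loop-body-statements
}
next statement
Template: a condition state C: goes to the loop-body states if cond and to the next statement if !cond; the loop body ends in a join state J: that returns to C:
Branch statement
if (c1)
c1 stmts
else if (c2)
c2 stmts
else
other stmts
next statement
Template: a condition state C: goes to the c1 stmts if c1, to the c2 stmts if !c1*c2, and to the other stmts if !c1*!c2; all branches meet in a join state J: that leads to the next statement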

Datapath
Creating the datapath
Create a register for any declared variable
Create a functional unit for each arithmetic operation
Connect the ports, registers and functional units
Based on reads and writes
Use multiplexors for multiple sources
Create a unique identifier for each datapath component control input and output
[Figure: GCD datapath. Inputs x_i and y_i (n bits) feed two 2x1 multiplexers selected by x_sel and y_sel, which drive registers x and y (load signals x_ld, y_ld). Functional units: comparator != (5: x != y, output x_neq_y), comparator < (6: x < y, output x_inf_y), subtractor (8: x - y), subtractor (7: y - x). Register d (9: d, load signal d_ld) drives the output d_o.]
Creating the controller’s FSM
Same structure as FSMD
Replace complex actions/conditions with datapath configurations
State code, FSMD state, and the corresponding controller FSM output or input:
0000  1:    no action
0001  2:    !go_i becomes a test on input go_i
0010  2-J:  no action
0011  3:    x = x_i becomes x_sel = 0, x_ld = 1
0100  4:    y = y_i becomes y_sel = 0, y_ld = 1
0101  5:    x != y becomes a test on input x_neq_y
0110  6:    x < y becomes a test on input x_inf_y
0111  7:    y = y - x becomes y_sel = 1, y_ld = 1
1000  8:    x = x - y becomes x_sel = 1, x_ld = 1
1001  6-J:  no action
1010  5-J:  no action
1011  9:    d_o = x becomes d_ld = 1
1100  1-J:  no action
[Figure: controller FSM connected to the datapath (operative unit) through x_sel, y_sel, x_ld, y_ld, d_ld, x_neq_y and x_inf_y]
Splitting into a controller and datapath
Implementation model of the controller: a 4-bit state register (outputs Q, next-state inputs E) and combinational logic. The combinational logic reads go_i, x_neq_y, x_inf_y and the current state, and produces x_sel, y_sel, x_ld, y_ld, d_ld and the next state.
[Figure: control unit (FSM with states 0000 to 1100) beside the operative unit (datapath with multiplexers, registers x, y and d, subtractors and comparators)]
Why split?

Controller state table for the GCD example
Inputs: current state Q3 Q2 Q1 Q0, x_neq_y, x_inf_y, go_i
Outputs: next state Q3+ Q2+ Q1+ Q0+ (E3 E2 E1 E0), x_sel, y_sel, x_ld, y_ld, d_ld
"-" marks a don't-care input, "x" a don't-care output
Q3 Q2 Q1 Q0  x_neq_y x_inf_y go_i    Q3+ Q2+ Q1+ Q0+  x_sel y_sel x_ld y_ld d_ld
0  0  0  0      -       -      -      0   0   0   1     x     x    0    0    0
0  0  0  1      -       -      0      0   0   1   0     x     x    0    0    0
0  0  0  1      -       -      1      0   0   1   1     x     x    0    0    0
0  0  1  0      -       -      -      0   0   0   1     x     x    0    0    0
0  0  1  1      -       -      -      0   1   0   0     0     x    1    0    0
0  1  0  0      -       -      -      0   1   0   1     x     0    0    1    0
0  1  0  1      0       -      -      1   0   1   1     x     x    0    0    0
0  1  0  1      1       -      -      0   1   1   0     x     x    0    0    0
0  1  1  0      -       0      -      1   0   0   0     x     x    0    0    0
0  1  1  0      -       1      -      0   1   1   1     x     x    0    0    0
0  1  1  1      -       -      -      1   0   0   1     x     1    0    1    0
1  0  0  0      -       -      -      1   0   0   1     1     x    1    0    0
1  0  0  1      -       -      -      1   0   1   0     x     x    0    0    0
1  0  1  0      -       -      -      0   1   0   1     x     x    0    0    0
1  0  1  1      -       -      -      1   1   0   0     x     x    0    0    1
1  1  0  0      -       -      -      0   0   0   0     x     x    0    0    0

Completing the GCD custom single-purpose processor design
We finished the datapath
We have a state table for the next state and control logic
What remains is combinational logic design
FSM design entry: schematic, CAD tools (StateCAD), HDL
This is not an optimized design, but we see the basic steps
[Figure: controller (state register, combinational logic for control and new state) and datapath (registers, functional units)]

Optimizing single-purpose processors
Optimization is the task of making design metric values the best possible
Optimization opportunities
original program
FSMD
datapath
FSM
Optimizing the original program
Analyze program attributes and look for areas of possible
improvement
number of computations
size of variable
time and space complexity
operations used
multiplication and division very expensive

Optimizing the original program (cont.)
original program
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
GCD(42, 8): the loop visits 9 (x, y) pairs: (42, 8), (34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6), (2, 4), (2, 2).
optimized program
Replace the subtraction operation(s) with a modulo operation in order to speed up the program
0: int x, y, r;
1: while (1) {
2:   while (!go_i);
     // x must be the larger value
3:   if (x_i >= y_i) {
4:     x = x_i;
5:     y = y_i;
     }
6:   else {
7:     x = y_i;
8:     y = x_i;
     }
9:   while (y != 0) {
10:    r = x % y;
11:    x = y;
12:    y = r;
     }
13:  d_o = x;
   }
GCD(42, 8): the loop visits 3 (x, y) pairs: (42, 8), (8, 2), (2, 0).
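The gain from the modulo version can be measured by counting loop passes in software. The function below is a sketch of the optimized program's inner part; the name `gcd_mod` and the `steps` counter are illustrative additions, and the `go_i` handshake is omitted.

```c
#include <stddef.h>

/*
 * Modulo-based GCD. The larger input is placed in x first, as in the
 * optimized program. If steps is not NULL, it receives the number of
 * modulo operations performed.
 */
unsigned gcd_mod(unsigned x_i, unsigned y_i, unsigned *steps)
{
    unsigned x, y, r, n = 0;

    if (x_i >= y_i) {
        x = x_i;
        y = y_i;
    } else {
        x = y_i;
        y = x_i;
    }
    while (y != 0) {
        r = x % y;
        x = y;
        y = r;
        n++;
    }
    if (steps != NULL)
        *steps = n;
    return x;
}
```

For the inputs 42 and 8 this performs two modulo operations, where the subtraction version needs eight subtractions. The trade-off noted earlier still applies: a modulo unit is much more expensive in hardware than a subtractor.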

Optimizing the FSMD
Areas of possible improvements
merge states
states with constants on transitions can be eliminated, transition taken is
already known
states with independent operations can be merged
separate states
states which require complex operations (a*b*c*d) can be broken into
smaller states to reduce hardware size
scheduling

Optimizing the FSMD (cont.)
Starting from the original FSMD (states 1, 2, 2-J, 3, 4, 5, 6, 7, 8, 6-J, 5-J, 9, 1-J):
eliminate state 1 – transitions have constant values
merge state 2 and state 2-J – no loop operation in between them
merge state 3 and state 4 – assignment operations are independent of one another
merge state 5 and state 6 – transitions from state 6 can be done in state 5
eliminate state 5-J and 6-J – transitions from each state can be done from state 7 and state 8, respectively
eliminate state 1-J – transition from state 1-J can be done directly from state 9
optimized FSMD
int x, y;
2: if !go_i stay in 2; if go_i go to 3
3: x = x_i, y = y_i, go to 5
5: if x < y go to 7; if x > y go to 8; otherwise go to 9
7: y = y - x, go to 5
8: x = x - y, go to 5
9: d_o = x, go to 2

Optimizing the datapath
Sharing of functional units
one-to-one mapping, as done previously, is not necessary
if same operation occurs in different states, they can share a single
functional unit
Multi-functional units
an ALU supports a variety of operations, so it can be shared among operations occurring in different states

Optimizing the FSM
State encoding
task of assigning a unique bit pattern to each state in an FSM
size of state register and combinational logic vary
State minimization
task of merging equivalent states into a single state
two states are equivalent if, for all possible input combinations, they generate the same outputs and transition to the same next state

Introduction to GPP
General-Purpose Processor
Processor designed for a variety of computation tasks
Low unit cost, in part because manufacturer spreads NRE over large
numbers of units
Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
Carefully designed since higher NRE is acceptable
Can yield good performance, size and power
Low NRE cost, short time-to-market/prototype, high flexibility
User just writes software; no processor design
a.k.a. “microprocessor” – “micro” used when they were implemented on one
or a few chips rather than entire rooms

Basic Architecture
Control unit and datapath
Note similarity to single-purpose processor
Key differences
Datapath is general
Control unit doesn’t store the algorithm – the algorithm is “programmed” into the memory
[Figure: processor made of a control unit (controller, PC, IR) and a datapath (ALU, registers), linked by control and status signals, and connected to memory and I/O]

Datapath Operations
Load
Read memory location into register
ALU operation
Input certain registers through ALU, store back in register
Store
Write register to memory location
[Figure: the value 10 is loaded from memory into a register, the ALU adds 1, and the result 11 is stored back to memory]

Control Unit
Control unit: configures the datapath operations
Sequence of desired operations (“instructions”) stored in memory – “program”
Instruction cycle – broken into several sub-operations, each one clock cycle, e.g.:
Fetch instruction: Get next instruction into IR
Decode: Determine what the instruction means
Fetch operands: Move data from memory to datapath register
Execute: Move data through the ALU
Store: Write data from register to memory
Example memory contents used in the figures:
100: load R0, M[500]
101: inc R1, R0
102: store M[501], R1
500: 10
501: …

Control Unit Sub-Operations
Fetch Instruction
Get next instruction into IR
PC: program counter, always points to next instruction
IR: holds the fetched instruction
Example: PC = 100, so IR receives load R0, M[500]

Decode
Determine what the instruction means
Example: the controller decodes load R0, M[500]

Fetch operands
Move data from memory to datapath register
Example: the value 10 at address 500 is moved into R0

Execute
Move data through the ALU
This particular instruction (load R0, M[500]) does nothing during this sub-operation

Store
Write data from register to memory
This particular instruction (load R0, M[500]) does nothing during this sub-operation
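The sub-operations can be mimicked by a small interpreter for the three-instruction example (load, inc, store). This is a behavioural sketch only: the `instr` structure, the separate program array and the function names are assumptions made for illustration, whereas the slides keep program and data in one memory.

```c
/* The three operations used by the example program. */
enum opcode { OP_LOAD, OP_INC, OP_STORE };

/* Symbolic instruction: destination register, source register, memory address. */
typedef struct { enum opcode op; int rd; int rs; int addr; } instr;

typedef struct {
    unsigned pc;      /* program counter: address of the next instruction */
    instr ir;         /* instruction register: the fetched instruction    */
    int r[2];         /* datapath registers R0 and R1                      */
    int mem[512];     /* data memory                                       */
} cpu;

/* Execute count instructions of program, which is placed at address base. */
void cpu_run(cpu *c, const instr *program, unsigned base, unsigned count)
{
    c->pc = base;
    while (c->pc < base + count) {
        c->ir = program[c->pc - base];  /* fetch instruction into IR */
        c->pc++;                        /* PC now points to the next one */
        switch (c->ir.op) {             /* decode */
        case OP_LOAD:                   /* fetch operand: memory to register */
            c->r[c->ir.rd] = c->mem[c->ir.addr];
            break;
        case OP_INC:                    /* execute: through the ALU (+1) */
            c->r[c->ir.rd] = c->r[c->ir.rs] + 1;
            break;
        case OP_STORE:                  /* store: register to memory */
            c->mem[c->ir.addr] = c->r[c->ir.rs];
            break;
        }
    }
}

/* Run the example program with M[500] = 10 and return M[501]. */
int cpu_demo(void)
{
    static const instr program[] = {
        { OP_LOAD,  0, 0, 500 },  /* 100: load R0, M[500]  */
        { OP_INC,   1, 0, 0   },  /* 101: inc R1, R0       */
        { OP_STORE, 0, 1, 501 },  /* 102: store M[501], R1 */
    };
    cpu c = { 0 };

    c.mem[500] = 10;
    cpu_run(&c, program, 100, 3);
    return c.mem[501];
}
```

The interpreter performs each instruction in one pass of the loop; the real control unit spreads the same work over several clock cycles.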

Instruction Cycles
Each instruction goes through the five sub-operations (fetch inst., decode, fetch operands, exec., store results), one per clock cycle:
PC=100: load R0, M[500], after which R0 = 10
PC=101: inc R1, R0, after which R1 = 11
PC=102: store M[501], R1, after which M[501] = 11
What is the problem with this processor?

Architectural Considerations
Performance can be improved by:
Faster clock (but there’s a limit)
Pipelining: slice up instruction into stages, overlap stages
Multiple ALUs to support more than one instruction stream
Superscalar and VLIW architectures

Clock Frequency
Inverse of clock period
The clock period must be longer than the longest register-to-register delay in the entire processor
Memory access is often the longest

Pipelining: Increasing Instruction Throughput
[Figure: dish cleaning with two available resources (wash, dry). Non-pipelined: each of the 8 dishes is washed and then dried before the next one starts. Pipelined: while one dish is being dried, the next one is already being washed.]
[Figure: pipelined instruction execution. Instructions 1 to 8 pass through the stages fetch inst., decode, fetch ops., execute, store res.; each instruction enters a stage one time step after the previous instruction.]
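The benefit shown in the two timing diagrams can be put into numbers. The sketch below assumes an ideal pipeline: every stage takes exactly one time step and there are no stalls. The function names are illustrative.

```c
/* Time steps needed when each item finishes all k stages before the next starts. */
unsigned long steps_non_pipelined(unsigned long n, unsigned long k)
{
    return n * k;
}

/*
 * Time steps needed when stages overlap: the first item takes k steps,
 * and every following item completes one step after the previous one.
 */
unsigned long steps_pipelined(unsigned long n, unsigned long k)
{
    return (n == 0) ? 0 : k + (n - 1);
}
```

For the dish example (8 dishes, 2 stages) this gives 16 steps without pipelining and 9 with it; for 8 instructions in the 5-stage instruction pipeline it gives 40 and 12.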

Superscalar Architecture
Superscalar
Scalar operation: executing on one or two numbers
Fetch instructions in packets
Static scheduling (at compile time) or dynamic scheduling (at execution time)
Dynamic scheduling needs a complex hardware block to detect the independent instructions
[Figure: sequential instruction flow from cache/memory; multiple instructions are fetched, decoded and ordered, then issued to several functional units (FU) that share the registers]

VLIW Architecture
VLIW (Very Long Instruction Word): long instruction (128-1024 bits) composed of several independent operations (rather than one)
Equivalent to a superscalar architecture with static scheduling
More and more widespread
[Figure: one multi-operation instruction is fetched from cache/memory and dispatched to several functional units (FU) that share the registers]

Superscalar vs. VLIW
Superscalar
HW detects potential parallelism, register renaming
Very complex HW, execution window is limited
e.g. PowerPC, Pentium, AMD K5
VLIW
Parallelism detection at compile time
Simpler hardware, whole program is analyzed
Large registers, large code size (wasted bits in instruction word)
e.g. TMS320C6x (multimedia), IA64 (servers, workstations)

Two Memory Architectures
Princeton (Von Neumann)
A single memory holds program and data
Fewer memory wires
Simple implementation
Harvard
Separate program memory and data memory
Simultaneous program and data memory access
The Von Neumann model is generally the most used
Harvard vs. Princeton
Control signals: more (Harvard), fewer (Princeton)
Speed: computation speed is higher (Harvard), no parallelism (Princeton)

Cache Memory
Memory access may be slow
Cache is a small but fast memory close to the processor
Holds copy of part of memory
Hits and misses
Hit: the memory address is in the cache
Miss: it is not. The cache is updated
Cache: fast/expensive technology, usually on the same chip as the processor
Memory: slower/cheaper technology, usually on a different chip
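Hit and miss behaviour can be illustrated with a toy cache model. The slides do not specify a cache organisation, so this sketch assumes a direct-mapped cache with 8 one-word lines; the names `cache` and `cache_access` are hypothetical.

```c
#include <stdint.h>

#define CACHE_LINES 8u  /* assumed size, chosen only for the illustration */

typedef struct {
    int valid[CACHE_LINES];     /* 1 if the line holds a copy of memory  */
    uint32_t tag[CACHE_LINES];  /* which address block the line holds    */
    unsigned hits, misses;      /* running statistics                    */
} cache;

/*
 * Look up an address. Returns 1 on a hit and 0 on a miss.
 * On a miss the line is updated so that the address is cached afterwards.
 */
int cache_access(cache *c, uint32_t addr)
{
    uint32_t index = addr % CACHE_LINES;  /* line the address maps to      */
    uint32_t tag = addr / CACHE_LINES;    /* identifies the address there  */

    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return 1;
    }
    c->valid[index] = 1;
    c->tag[index] = tag;
    c->misses++;
    return 0;
}
```

Accessing address 500 twice gives a miss followed by a hit. Address 508 maps to the same line as 500 in this model, so accessing it evicts 500 and a later access to 500 misses again.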

Programmer’s View
Programmer doesn’t need detailed understanding of architecture
Instead, needs to know what instructions can be executed
Two levels of instructions:
Assembly level
Structured languages (C, C++, Java, etc.)
Most development today done using structured languages
But, some assembly level programming may still be necessary
Drivers: portion of program that communicates with and/or controls (drives)
another device
Often have detailed timing considerations, extensive bit manipulation
Assembly level may be best for these

Assembly-Level Instructions
Instruction format: opcode operand1 operand2
A program is a sequence of such instructions: Instruction 1, Instruction 2, Instruction 3, Instruction 4, ...
Instruction Set
Defines the legal set of instructions for that processor
Data transfer: memory/register, register/register, I/O, etc.
Arithmetic/logical: move register through ALU and back
Branches: determine next PC value when not just PC+1

Addressing Modes
Addressing mode     Operand field      Register-file contents   Memory contents
Immediate           data               -                        -
Register-direct     register address   data                     -
Register indirect   register address   memory address           data
Direct              memory address     -                        data
Indirect            memory address     -                        memory address, then data

A Simple Instruction Set
Each instruction is two bytes: opcode and first operand in the first byte, second operand in the second byte
Assembly instruction   First byte   Second byte   Operation
MOV Rn, direct         0000 Rn      direct        Rn = M(direct)
MOV direct, Rn         0001 Rn      direct        M(direct) = Rn
MOV @Rn, Rm            0010 Rn      Rm            M(Rn) = Rm
MOV Rn, #immed.        0011 Rn      immediate     Rn = immediate
ADD Rn, Rm             0100 Rn      Rm            Rn = Rn + Rm
SUB Rn, Rm             0101 Rn      Rm            Rn = Rn - Rm
JZ Rn, relative        0110 Rn      relative      PC = PC + relative (only if Rn = 0)

Sample Programs
C program
int total = 0;
for (int i=10; i!=0; i--)
    total += i;
// next instructions...
Equivalent assembly program
0       MOV R0, #0;   // total = 0
1       MOV R1, #10;  // i = 10
2       MOV R2, #1;   // constant 1
3       MOV R3, #0;   // constant 0
4 Loop: JZ R1, Next;  // jump if i = 0
5       ADD R0, R1;   // total += i
6       SUB R1, R2;   // i--
7       JZ R3, Loop;  // jump always (R3 = 0)
  Next: // next instructions...
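To see how the assembly program mirrors the C loop, it can be transliterated back into C, one statement per instruction. This is a sketch: the variables `r0` to `r3` stand for the registers, and `r3` is assumed to hold the constant 0 so that `JZ R3, Loop` always jumps.

```c
/* Register-level transliteration of the assembly program; returns total. */
int sum_countdown(void)
{
    int r0 = 0;   /* MOV R0, #0   : total = 0   */
    int r1 = 10;  /* MOV R1, #10  : i = 10      */
    int r2 = 1;   /* MOV R2, #1   : constant 1  */
    int r3 = 0;   /* assumed constant 0 used for the unconditional jump */

loop:
    if (r1 == 0)  /* JZ R1, Next  : leave the loop when i = 0 */
        goto next;
    r0 += r1;     /* ADD R0, R1   : total += i  */
    r1 -= r2;     /* SUB R1, R2   : i--         */
    if (r3 == 0)  /* JZ R3, Loop  : always taken */
        goto loop;
next:
    return r0;
}
```

The function returns 55, the sum of the integers from 10 down to 1, which is also what the original `for` loop computes.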

Programmer Considerations
Program and data memory space
Embedded processors often very limited
e.g., 64 Kbytes program, 256 bytes of RAM (expandable)
N-bit processor
N-bit ALU, registers, buses, memory data interface
Embedded: 8-bit, 16-bit, 32-bit common
Desktop/servers: 32-bit, 64-bit
Registers: How many are there?
Only a direct concern for assembly-level programmers
I/O
How to communicate with external signals?
Interrupts

Application-Specific Instruction-Set Processors (ASIPs)
General-purpose processors
Sometimes too general to be effective in demanding application
e.g., video processing – requires huge video buffers and operations on large
arrays of data, inefficient on a GPP
But single-purpose processor has high NRE, not programmable
ASIPs – targeted to a particular domain
Contain architectural features specific to that domain
e.g., embedded control, digital signal processing, video processing, network
processing, telecommunications, etc.
Still programmable

A Common ASIP: Digital Signal Processors (DSP)
For signal processing applications
Large amounts of digitized data, often streaming
Data transformations must be applied fast
e.g., cell-phone voice filter, digital TV, music synthesizer
DSP features
Several instruction execution units
Single-cycle multiply-accumulate instruction, other instructions
Efficient vector operations – e.g., add two arrays
Vector ALUs, loop buffers, etc.
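The multiply-accumulate operation is the core of most filters. The loop below is a plain C sketch of what a DSP performs with one multiply-accumulate instruction per element; the function name `mac_dot` is illustrative.

```c
/*
 * Dot product of two arrays of length n.
 * Each loop pass is one multiply-accumulate step: acc = acc + a[i] * b[i].
 */
long mac_dot(const int *a, const int *b, int n)
{
    long acc = 0;

    for (int i = 0; i < n; i++)
        acc += (long)a[i] * (long)b[i];
    return acc;
}
```

On a general-purpose processor each pass costs several instructions (two loads, a multiply, an add, loop overhead), which is the overhead that DSP features such as single-cycle multiply-accumulate and loop buffers aim to remove.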

Another Common ASIP: Microcontroller
For embedded control applications
Reading sensors, setting actuators
Mostly dealing with events (bits): data is present, but not in huge amounts
e.g., VCR, disk drive, digital camera (assuming SPP for image compression),
washing machine, microwave oven
Microcontroller features
On-chip peripherals
Timers, analog-digital converters, serial communication, etc.
Tightly integrated for programmer, typically part of register space
On-chip program and data memory
Direct programmer access to many of the chip’s pins
Specialized instructions for bit-manipulation and other low-level operations

Trend: Even More Customized ASIPs
In the past, microprocessors were acquired as chips
Today, we increasingly acquire a processor as Intellectual Property (IP)
e.g., synthesizable VHDL model
Opportunity to add custom datapath hardware and a few custom instructions, or delete a few instructions
Can have significant performance, power and size impacts
Problem: need compiler/debugger for customized ASIP
Remember, most development uses structured languages
One solution: automatic compiler/debugger generation
e.g., www.tensillica.com
Another solution: retargettable compilers
e.g., www.improvsys.com (customized VLIW architectures)

Selecting a Microprocessor
Issues
Technical: speed, power, size, cost
Other: development environment, prior expertise, licensing, etc.
Speed: how to evaluate a processor’s speed?
Clock speed – but instructions per cycle may differ
Instructions per second – but work per instr. may differ
Dhrystone: Synthetic benchmark, developed in 1984. Dhrystones/sec.
MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital’s VAX 11/780). A.k.a.
Dhrystone MIPS. Commonly used today.
So, 750 MIPS = 750*1757 = 1,317,750 Dhrystones per second
SPEC: set of more realistic benchmarks, but oriented to desktops
EEMBC – EDN Embedded Benchmark Consortium, www.eembc.org
Suites of benchmarks: automotive, consumer electronics, networking, office automation,
telecommunications

Presentation of the elementary processor
8-bit general-purpose processor
Based on an accumulator register called ACCU (8 bits)
Four instruction types
Each instruction is coded with 8 bits: two for the operation type (opcode) and 6 bits to code the operand or the address of the operand in the memory (depending on the operation type)
Mnemonic   Instruction coding   Description
NOR        00AAAAAA             ACCU = ACCU NOR Mem[AAAAAA]
ADD        01AAAAAA             ACCU = ACCU + Mem[AAAAAA], update Carry
STA        10AAAAAA             Mem[AAAAAA] = ACCU
JCC        11DDDDDD             If Carry = 0, PC = DDDDDD; else clear Carry (Carry = 0)
[Source of the instruction set: http://www.tuhh.de/~setb0209/cpu/ by T. Böscke]

Example of a test program
Address  Memory content, binary (hex)  Instruction in assembler  Comments
000000 : 00001000 (0x08) NOR 0b001000 ; ACCU = ACCU NOR M[001000]
000001 : 01000111 (0x47) ADD 0b000111 ; ACCU = ACCU + M[000111] (Carry)
000010 : 10000110 (0x86) STA 0b000110 ; M[000110] = ACCU
000011 : 11000100 (0xC4) JCC 0b000100 ; If Carry = 0 then PC = 000100 Else clear Carry
000100 : 11000100 (0xC4) JCC 0b000100 ; PC = 000100 (Carry is already cleared!)
000101 : 00000000 (0x00)
000110 : 00000000 (0x00)
000111 : 01111110 (0x7E)
001000 : 11111111 (0xFF)
001001 : 00000000 (0x00)
… … Data…
111111 : 00000000 (0x00)
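The test program can be traced with a small behavioural emulator written from the instruction table. This is a sketch under assumptions: registers start at zero after reset, Carry is bit 8 of the addition result, and the names `mcpu`, `mcpu_step` and `mcpu_demo` are invented here; it is not the reference implementation of the processor.

```c
#include <stdint.h>

typedef struct {
    uint8_t mem[64];  /* 6-bit address space                  */
    uint8_t accu;     /* 8-bit accumulator                    */
    uint8_t pc;       /* 6-bit program counter                */
    int carry;        /* carry flag                           */
} mcpu;

/* Fetch, decode and execute one instruction. */
void mcpu_step(mcpu *c)
{
    uint8_t ir = c->mem[c->pc & 0x3F];    /* fetch                          */
    uint8_t arg = ir & 0x3F;              /* 6-bit address or jump target   */
    unsigned sum;

    c->pc = (uint8_t)((c->pc + 1) & 0x3F);
    switch (ir >> 6) {                    /* decode the 2-bit opcode        */
    case 0:                               /* NOR */
        c->accu = (uint8_t)~(c->accu | c->mem[arg]);
        break;
    case 1:                               /* ADD, update Carry */
        sum = (unsigned)c->accu + (unsigned)c->mem[arg];
        c->accu = (uint8_t)(sum & 0xFFu);
        c->carry = (int)((sum >> 8) & 1u);
        break;
    case 2:                               /* STA */
        c->mem[arg] = c->accu;
        break;
    default:                              /* JCC */
        if (c->carry == 0)
            c->pc = arg;
        else
            c->carry = 0;
        break;
    }
}

/* Load the test program, run a fixed number of steps, return M[000110]. */
uint8_t mcpu_demo(void)
{
    mcpu c = { {0}, 0, 0, 0 };

    c.mem[0] = 0x08;  /* NOR 0b001000 */
    c.mem[1] = 0x47;  /* ADD 0b000111 */
    c.mem[2] = 0x86;  /* STA 0b000110 */
    c.mem[3] = 0xC4;  /* JCC 0b000100 */
    c.mem[4] = 0xC4;  /* JCC 0b000100 (loops on itself) */
    c.mem[7] = 0x7E;
    c.mem[8] = 0xFF;
    for (int i = 0; i < 10; i++)  /* the program ends in an endless loop */
        mcpu_step(&c);
    return c.mem[6];
}
```

In this model the NOR with 0xFF clears ACCU, the ADD loads 0x7E without producing a carry, and STA writes 0x7E to address 000110 before the program parks itself in the JCC loop at address 000100.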

Processor design (1/3)
Considering the basic template architecture
Considering the instruction set, the number of registers, and any architectural specifications/constraints
And using the previously presented design methodology
[Figure: processor template with a control unit (controller, PC, IR) and an operative unit (ALU, registers), linked by command and status signals, and connected to memory and I/O]

Algorithm – FSMD ?
Clear PC, IR, Carry, Registers;
while (1) {
    Fetch Inst (get one instruction);
    Decode the instruction;
    if ( CodeOp=00 or CodeOp=01 )
    {
        Fetch Operand (get the operand);
        if CodeOp=00
            Execute NOR (ACCU = ACCU NOR M[AAAAAA]);
        else
            Execute ADD (ACCU = ACCU + M[AAAAAA]
                         Update Carry);
    }
    else if CodeOp=10
    {
        Execute STA (Mem[AAAAAA] = ACCU);
    }
    else
    {
        Execute JCC (if Carry=0
                         PC=DDDDDD
                     else
                         Carry=0);
    }
}

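[Figure: control unit and datapath of the elementary processor. Control unit: FSM with inputs Rst, CodeOp (IR[7:6]) and C, and outputs selALU, ldACCU, ldC, clrC, ldR1, ldIR, ldPC, incPC, clrPC, selADR, enM, weM. Datapath: ALU (NOR or ADD; selALU = '0' for NOR, selALU = '1' for ADD), ACCU, carry flag C, register R1, IR, PC, and a multiplexer controlled by selADR on the 6-bit memory address. Memory ports: Adr [5:0], DataIn [7:0], DataOut [7:0].]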

It is time to exercise!
