Specification Languages:
Part 2
Marc Engels
e-mail: marc.engels@flandersmake.be
Specification Languages
 Part 1: Specification Models
 Part 2: Model based system design
 Show how the models of part 1 can be used for
architectural design
 Provide hands-on experience with SystemC v2.3.2
(released in October 2017).
 Introduce OO techniques for design of hardware systems
 Part 3: Project
Course Material for part 2
 Prerequisite:
 part 1 of specification languages
 C++ (good tutorial at www.cplusplus.com)
 Coding and debugging programs
 RTL description of synchronous digital circuits
 Material for part 2:
 Slides with notes.
 IEEE Standard SystemC Language Reference Manual, IEEE
Std 1666-2011.
Model Based System
Design
Class 1: constructing a
functional model
Functional modeling in
SystemC
 Introduction to design of digital embedded systems
 SystemC introduction
 SystemC functional model syntax
 Exercise 1: building a functional model in SystemC
Consumer devices become increasingly more intelligent
… as well as professional equipment
Characteristics of embedded
systems
 Optimize for power, cost, and size
 Robust design
 Provide the ability for evolution and mass customization
 Minimize time to market
 Some functionality might be safety-critical
 Interfacing with the real world, leading to real time constraints
Embedded systems combine various types of real-time behavior
[Figure: a real-world process coupled to a processing block. Labels: sensors, signal conditioning, ADC, processing, DAC, actuator powering, actuators, user; the interfaces carry events, signals and actions.]
Digital embedded systems combine hard- and software
[Figure: example system. Labels: µP or DSP core, RAM, ROM, NVM, memories, conf. logic, peripheral, user interface, modem, buffers, protocol, video/graphics processor, speech processing, analysis of channel; plus analog, sensors and actuators.]
Design flow for digital embedded systems
[Figure: design flow. Labels: system functionality (functional requirements, performance requirements), architecture template (architectural requirements), non-functional requirements, mapping, dedicated architecture, C-code.]
Function to architecture conversion follows three axes
From system functionality to dedicated architecture:
• Computations: operations → operators (resource allocation, scheduling)
• Data: variables, arrays, floating point → memories, fixed point (memory allocation, address generation, word sizing)
• Communication: point-to-point, queues → busses, detailed protocol (bus allocation, introduce arbiters, include protocols)
SystemC bridges gap between function and architecture
[Figure: languages placed between system functionality and dedicated architecture. Labels: MATLAB, C/C++, VHDL, Verilog, SystemC.]
What is SystemC?
 A modeling framework in C++ for the refinement of a system from a functional
description into an architecture
 Contributions:
 hardware modeling with C++: OCAPI (IMEC) and SCENIC (Synopsys/UC
Irvine)
 fixed-point data types: Frontier Design
 hardware-software co-design: CoWare (IMEC/CoWare)
 Language first standardized in December 2005 as IEEE 1666, revised in 2011 as
IEEE 1666-2011
 Extensions of SystemC:
 Verification library.
 Transaction level modeling library (integrated in IEEE 1666-2011).
 Analog and mixed-signal modeling.
 More info: www.accellera.org
Which tools are available for SystemC?
• Open source simulation library available
• Open source translators from Verilog or VHDL to SystemC
• Commercial synthesis tools:
  • Cadence (Stratus HLS).
  • Mentor (Catapult C).
  • NEC (CyberWorkBench).
  • SystemCrafter (SC).
  • Xilinx (Vivado Design Suite).
SystemC language architecture
[Figure: layered architecture, from top to bottom]
• User Application
• Libraries for specific models of computation and/or methodologies, e.g. TLM interfaces, bus models, SystemC verification library
• Core Language: modules, ports, exports, processes, interfaces, channels, events, event-driven simulation kernel
• Pre-defined Channels: signal, clock, fifo, mutex, semaphore
• Utilities: report handling, tracing
• Data-types: 4-valued logic type, 4-valued logic vectors, bit-vectors, finite-precision integers, limited-precision integers, fixed-point types
• C++ language
SystemC core language
[Figure: classes of the core language: sc_module, sc_port, sc_prim_channel, sc_process, sc_interface, sc_event, sc_export.]
Kahn Process Networks in SystemC
[Figure: two processes connected by a FIFO.]
• (Modules to structure design)
• Functional processes
• First-In-First-Out queues
• Simulation engine
Modules are used for structurally partitioning the functionality
• Each module has its own class, derived from the sc_module class.
• Every constructor of a module class shall have exactly one parameter of class sc_module_name.
• It is good practice to make this name for an instance of the module the same as the C++ variable name through which the module is referenced.
• A module can be hierarchical or contain processes. In the latter case, the SC_HAS_PROCESS(class_name) macro is used to indicate that the module contains processes.
Example of a functional model of an adder
With macros:
    SC_MODULE(adder) {
        // define ports
        // define processes, internal data, etc.
        SC_CTOR(adder) {
            // body of constructor;
            // process declaration, sensitivities, etc.
        }
    };
Explicit:
    class adder : public sc_module {
    public:
        // define ports
        // define processes, internal data, etc.
        SC_HAS_PROCESS(adder);
        adder(sc_module_name name) : sc_module(name) {
            // body of constructor;
            // process declaration, sensitivities, etc.
        }
    };
Ports are used to communicate
with a FIFO channel
 General port definition: sc_port<interface>
 Predefined ports are: sc_fifo_in<T> and sc_fifo_out<T>.
 sc_fifo_in<T> is derived from sc_port<sc_fifo_in_if<T>,0> with interface
functions read(), nb_read(), and num_available().
 sc_fifo_out<T> is derived from sc_port<sc_fifo_out_if<T>,0> with interface
functions write(), nb_write(), and num_free().
 blocking read and write interface functions (automatic synchronization with
implicit wait() operations)
int a = f1.read(); // read a token
f1.write(a); // write a token
 Inspecting queues
int a = f1.num_available(); // number of tokens in a queue
int a = f1.num_free(); // number of free places in a queue
Example of a functional model of an adder (continued)
    SC_MODULE(adder) {
        sc_fifo_in<int> a, b;
        sc_fifo_out<int> c;
        // define processes, internal data, etc.
        SC_CTOR(adder) {
            // body of constructor;
            // process declaration, sensitivities, etc.
        }
    };
SC_THREAD processes are used
to model functional processes
 SC_THREAD processes run forever once started.
 SC_THREAD processes can be suspended by means of the
wait(event) function. In functional modeling the wait
statements are hidden in the read() and write() functions to the
queues.
 Multiple processes per module are possible
 Processes can also be dynamically created.
Example of a functional model of an adder (continued)
    SC_MODULE(adder) {
        sc_fifo_in<int> a, b;
        sc_fifo_out<int> c;
        void compute() {
            while (true) {
                int valuea = a.read();    // blocks until a token is available
                int valueb = b.read();
                c.write(valuea + valueb); // blocks while the output queue is full
            }
        }
        SC_CTOR(adder) {
            SC_THREAD(compute);
        }
    };
Define the main program
 The systemc library must be included in the main program:
 #include <systemc.h>
 In sc_main() the following actions are taken:
 Instantiate channels with:
• sc_fifo<T> ("name", length); // default length 16
• e.g. sc_fifo<int> f1("f1", 2);
 Instantiate the modules.
 Bind ports of modules to channels:
• Positional
• named.
 Call sc_start() to start simulation and run until end of any
activity.
Example of a functional model of an adder (continued)
    int sc_main(int argc, char *argv[]) {
        // elaboration phase: everything up to the sc_start() call
        sc_fifo<int> fifo_a, fifo_b, fifo_c; // channel instantiation
        …                                    // instantiate signal generation and evaluation module
        adder my_adder("my_adder");          // module instantiation
        my_adder.a(fifo_a);                  // binding of port to channel
        my_adder.b(fifo_b);
        my_adder.c(fifo_c);
        …                                    // other modules and test bench, which drive fifo_a and fifo_b
        sc_start();                          // start simulation
        return 0;
    }
Modules can also be used to create hierarchy
[Figure: module superfunc containing func1 and func2, two instances of the sc_module "function", connected by the internal queue d.]
    SC_MODULE(superfunc) {
        // IO ports
        sc_fifo_in<float> in;
        sc_fifo_out<float> out;
        // internal queues
        sc_fifo<float> d;
        // internal modules
        function func1;   // member instance, named in the constructor initializer list
        function *func2;  // pointer, allocated in the constructor
        // Module constructor
        SC_CTOR(superfunc) : func1("func1") {
            func1.in(in);
            func1.out(d);
            func2 = new function("func2");
            func2->in(d);
            func2->out(out);
        }
    };
Simulation engine
 In an un-timed model, the simulator only advances in delta-
cycles:
 If it is started to run for a finite amount of time, it will never
stop.
 We therefore run it until no events are present: sc_start();
 Ways of stopping the simulator:
 Terminate a process (return from SC_THREAD): the
simulator will stop due to the lack of events.
 Call sc_stop() when a termination condition is fulfilled.
Goal of this exercise
 use a simplified JPEG block diagram to practice functional
modeling
 develop a functional process that fits into a system
 simulate a functional model
 observe the overall behavior of a system
What is JPEG?
 “JPEG” stands for
“Joint Photographic Experts Group”
 “JPEG” is a standard for color image compression
 “JPEG” is widely used (e.g. on the WWW)
 More information?
 http://www.jpeg.org/
(Partial) JPEG: a simple block diagram
JPEG-ENCODER (parameters: width, height, #bits):
Original Image → R2B → DCT → Quantize (+table) → ZIGZAG SCAN → RUN-LENGTH ENCODER
JPEG-DECODER (parameters: width, height, #bits):
RUN-LENGTH DECODER → ZIGZAG SCAN → Normalize (+table) → IDCT → B2R → Reconstructed Image
2D Discrete Cosine Transform
• Non-optimized equation
• DCT can be separated in consecutive 1-D operations
• There are many optimized DCT-algorithms available

$$F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j)\, \cos\frac{(2i+1)u\pi}{16}\, \cos\frac{(2j+1)v\pi}{16}$$

$$f(i,j) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\, C(v)\, F(u,v)\, \cos\frac{(2i+1)u\pi}{16}\, \cos\frac{(2j+1)v\pi}{16}$$

$$\text{where } C(l) = \begin{cases} \dfrac{1}{\sqrt{2}} & l = 0 \\ 1 & l \neq 0 \end{cases}$$
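The non-optimized forward equation can be evaluated directly with four nested loops. The following is a minimal sketch, not the code of the exercise library; the names Block, dct_c, dct8x8 and check_constant_block are illustrative assumptions.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Block = std::array<std::array<double, 8>, 8>;

// Normalization factor C(l): 1/sqrt(2) for l == 0, 1 otherwise.
inline double dct_c(int l) { return l == 0 ? 1.0 / std::sqrt(2.0) : 1.0; }

// Direct (non-optimized) evaluation of the forward 8x8 DCT.
Block dct8x8(const Block& f) {
    const double pi = std::acos(-1.0);
    Block F{};
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int i = 0; i < 8; ++i)
                for (int j = 0; j < 8; ++j)
                    sum += f[i][j] * std::cos((2 * i + 1) * u * pi / 16.0)
                                   * std::cos((2 * j + 1) * v * pi / 16.0);
            F[u][v] = 0.25 * dct_c(u) * dct_c(v) * sum;
        }
    }
    return F;
}

// Sanity check: a constant block only has a DC coefficient, F(0,0) = 8 * value.
void check_constant_block() {
    Block f{};
    for (auto& row : f) row.fill(1.0);
    const Block F = dct8x8(f);
    assert(std::fabs(F[0][0] - 8.0) < 1e-9);
    assert(std::fabs(F[2][5]) < 1e-9);
}
```

This costs 64 multiply-accumulate steps per coefficient, which is why the separable and optimized variants mentioned above are preferred in practice.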
Quantization
 Each DCT coefficient is divided by the coefficient amplitude
that is just detectable by the human eye (table)
 The result is rounded to an integer
 This reduces the number of bits needed to represent the DCT
coefficient
 The quantization is the place where information of the image
might be lost, resulting in lossy compression.
Quantization Table
N =
16  11  10  16  24  40  51  61
12  12  14  19  26  58  60  55
14  13  16  24  40  57  69  56
14  17  22  29  51  87  80  62
18  22  37  56  68 109 103  77
24  35  55  64  81 104 113  92
49  64  78  87 103 121 120 101
72  92  95  98 112 100 103  99
The coefficients are zigzag
scanned
0 1 5 6 14 15 27 28
2 4 7 13 16 26 29 42
3 8 12 17 25 30 41 43
9 11 18 24 31 40 44 53
10 19 23 32 39 45 52 54
20 22 33 38 46 51 55 60
21 34 37 47 50 56 59 61
35 36 48 49 57 58 62 63
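The table gives, for every (row, column) position of the 8x8 block, its index in the scanned stream. It can be generated by walking the anti-diagonals in alternating directions. A minimal sketch follows; the names zigzag_table, zigzag_scan and check_zigzag are illustrative and not taken from the exercise modules.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

using IndexTable = std::array<std::array<int, 8>, 8>;

// Builds the table shown above: pos[row][col] = position in the zigzag stream.
IndexTable zigzag_table() {
    IndexTable pos{};
    int index = 0;
    for (int s = 0; s <= 14; ++s) {            // s = row + col selects one anti-diagonal
        const int lo = std::max(0, s - 7);     // first valid row on this diagonal
        const int hi = std::min(s, 7);         // last valid row on this diagonal
        if (s % 2 == 0) {
            for (int r = hi; r >= lo; --r) pos[r][s - r] = index++;  // walk upwards
        } else {
            for (int r = lo; r <= hi; ++r) pos[r][s - r] = index++;  // walk downwards
        }
    }
    return pos;
}

// Reorders an 8x8 block of coefficients into the 64-element zigzag stream.
std::array<int, 64> zigzag_scan(const std::array<std::array<int, 8>, 8>& block) {
    const IndexTable pos = zigzag_table();
    std::array<int, 64> out{};
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            out[pos[r][c]] = block[r][c];
    return out;
}

// Spot checks against the table on the slide.
void check_zigzag() {
    const IndexTable t = zigzag_table();
    assert(t[0][0] == 0 && t[0][1] == 1 && t[1][0] == 2);
    assert(t[0][7] == 28 && t[7][0] == 35 && t[7][7] == 63);
}
```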
(Simplified) Run-length coding
 Send the DC value “as is”
 Represent the high frequency data with (zero run-length,
amplitude) combinations.
 End the stream with EOB (= 63).
 Example:
 in: 79, 0, -2, -1, 3, -1, 0, 0, -1, 0, 0, 0, …
 out: 79, 1,-2, 0,-1, 0, 3, 0,-1,2,-1, 63
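The simplified scheme above can be sketched as a small encoder. This is an illustration only, assuming the input is one zigzag-scanned block; the names rl_encode, EOB and check_rl_example are hypothetical and this is not the rl_enc module of the exercise library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int EOB = 63;  // end-of-block marker, as on the slide

// Encodes one zigzag-scanned block: DC value first, then (zero run-length, amplitude)
// pairs for every non-zero AC coefficient, then EOB. Trailing zeros are not sent.
std::vector<int> rl_encode(const std::vector<int>& zz) {
    assert(!zz.empty());
    std::vector<int> out;
    out.push_back(zz[0]);                      // DC value "as is"
    int run = 0;                               // zeros seen since the last non-zero value
    for (std::size_t k = 1; k < zz.size(); ++k) {
        if (zz[k] == 0) {
            ++run;
        } else {
            out.push_back(run);
            out.push_back(zz[k]);
            run = 0;
        }
    }
    out.push_back(EOB);
    return out;
}

// Reproduces the example on the slide.
void check_rl_example() {
    std::vector<int> in(64, 0);
    const int head[] = {79, 0, -2, -1, 3, -1, 0, 0, -1};
    for (int k = 0; k < 9; ++k) in[k] = head[k];
    const std::vector<int> expected = {79, 1, -2, 0, -1, 0, 3, 0, -1, 2, -1, EOB};
    assert(rl_encode(in) == expected);
}
```

With 63 AC coefficients per block, a run in front of a non-zero value is at most 62, so the value 63 in a run-length position cannot be confused with a real run.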
How to start?
 Download exercise files from http://www.icorsi.ch/
 Follow installation instructions of exercises.
 you will find:
 In /exercises/exercise1/: main.cpp to start from
 In /exercises/modules/: library with JPEG encoder modules
{r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules
{b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules {src,snk,test}.{h,cpp}
 In /exercises/images/: test images
 In /exercises/add2systemc additional functions (df_fork, fifo_stat)
 Things to be done:
 make rl_dec.h and rl_dec.cpp
 complete the main.cpp with the modules.
 Compile and execute the application.
 Inspect the number of reads and writes in the fifos
 Visualize resulting image
 Test if you can launch the application in the debugger.
 Optional: make a hierarchy for the encoder and decoder.
Using SystemC on
Linux/Cygwin
 Use g++ (I used version 4.5.3).
 Make a workspace in Eclipse:
 Add your source files to the project.
 Add libmodules.a
 Add libadd2systemc.a (for next exercises).
 Add libsystemc.a
 Put the right include paths and linker paths
 Build your application from within Eclipse.
 Execute your application from within Eclipse.
 Exercise1.exe -i ../images/mountain.pgm -o result.pgm
Model Based System
Design
Class 2: Fixed-point
refinement
Fixed point refinement
 Fixed word length optimization
 Overflow and quantization
 MSB determination
 LSB determination
 Fixed word length support in SystemC
 Exercise 2: fixed point refinement of IDCT
Fixed point refinement is one of the steps in architectural design
[Figure: the three refinement axes from system functionality to dedicated architecture (computations, data, communication), with the data axis (floating point to fixed point, word sizing) as the topic of this class.]
Finite word lengths are a must for DSP applications
Floating-point (3 bytes mantissa + 1 byte exponent)
• powerful
• expensive (storage & ops)
Fixed-point
• minimum area
• low power
• high speed
[Figure: multiplier with an 8-bit and a 6-bit operand and a 14-bit result.]
How to model a fixed-point signal?
• Total number of bits: WL
• Integer bits: IWL (leaving WL-IWL fractional bits)
• Value representation
  • 2's complement (i=-1)
  • unsigned (i=1)
[Figure: example word with bit weights from MSB to LSB: i·2^3, 2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3; WL spans the whole word, IWL the integer part.]
How do we quantize?
[Figure: four transfer curves of fixed-point (fxp) versus floating-point (flp) value: truncate (floor), round, magnitude truncate, ceil.]
What happens on an overflow?
[Figure: two transfer curves of fxp versus flp value around the maximum value: wrap-around and saturation.]
Saturation Hardware
[Figure: VALUE is compared with MAX_VAL and MIN_VAL by two comparators; two multiplexers select VALUE, MAX_VAL or MIN_VAL as RESULT.]
During design we must specify fixed-point formats for signals
[Figure: a floating-point algorithm (multipliers, adder, delay z-1) between an ADC and a DAC; word lengths 8 and 7 are given at the converters, all internal signals are marked "?".]
Fixed-point refinement is a complex optimization problem
Minimize overall cost:
• minimal word lengths
• truncate and wrap-around
MSB determination:
• goal: avoid unwanted overflows
• method: find min, max signal values
• result: MSB position, value representation, overflow
LSB determination:
• goal: keep required precision
• method: evaluate difference between flp and fxp behavior
• result: LSB position, quantization
[Figure: word format with the safe range on the MSB side and the quantization on the LSB side.]
MSB determination can be based on range calculations
• Put range (min, max) on inputs
• Propagate range over the operators
• This gives a safe (pessimistic) estimate
[Figure: example with input x, constant 12, multiplier output m, delayed input d (z-1) and adder output y. Range info [0,255] on x and d; range calculation gives [0,3060] for m and [0,3315] for y.]
Range propagation is a simple calculation
Operator | min_c | max_c
c=a+b | min_a+min_b | max_a+max_b
c=a-b | min_a-max_b | max_a-min_b
c=a*b | MIN(min_a*min_b, min_a*max_b, max_a*min_b, max_a*max_b) | MAX(min_a*min_b, min_a*max_b, max_a*min_b, max_a*max_b)
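The three rules translate directly into interval arithmetic. The sketch below is illustrative only (the names Range, add, sub, mul and check_range_example are assumptions); it reproduces the example with a [0,255] input multiplied by 12 and added to a [0,255] delayed input.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>

struct Range {
    double min;
    double max;
};

// c = a + b
Range add(Range a, Range b) { return {a.min + b.min, a.max + b.max}; }

// c = a - b
Range sub(Range a, Range b) { return {a.min - b.max, a.max - b.min}; }

// c = a * b: the extremes are among the four corner products.
Range mul(Range a, Range b) {
    const auto p = {a.min * b.min, a.min * b.max, a.max * b.min, a.max * b.max};
    return {std::min(p), std::max(p)};
}

// Example: m = 12 * x and y = m + d, with x and d both in [0, 255].
void check_range_example() {
    const Range x{0.0, 255.0};
    const Range m = mul(x, Range{12.0, 12.0});
    const Range y = add(m, x);
    assert(m.min == 0.0 && m.max == 3060.0);
    assert(y.min == 0.0 && y.max == 3315.0);
}
```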
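Range calculations can get unstable with feedback
[Figure: feedback loop with input X(n), multiplier with coefficient a, adder, delay z-1, feedback signal F(n) and output Y(n); the plot of value versus sample n shows maxF and minF diverging.]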
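Collecting signal statistics from simulations is an alternative
• Perform simulation with realistic stimuli.
• Collect minimum and maximum value on each signal during the simulation
• This gives an optimistic, stimuli dependent estimate
[Figure: the example circuit (x, constant 12, multiplier output m, delay z-1 output d, adder output y, quantizer q1) driven by stimuli, with "? min, max" monitored on the signals.]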
Combine both methods for accurate MSB determination
name | signal statistic: min | max | MSB1 | range propagation: min | max | MSB2
signal1 | -1.5 | 1.6 | 2 | -1.9 | 1.9 | 2
signal2 | -1.3 | 1.4 | 2 | -2.1 | 2.1 | 3
signal3 | -1.2 | 1.2 | 2 | -22.0 | 22.0 | 6
signal4 | -1.2 | 1.2 | 2 | -∞ | ∞ | ∞
• If MSB1 == MSB2: wrap-around (MSB1)
• If MSB1 < MSB2: wrap-around (MSB2)
• If MSB1 << MSB2: saturation (MSB1)
• If MSB2 is ∞: saturation (MSB1)
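The MSB columns are consistent with reading "MSB" as the number of integer bits, sign bit included, of the smallest 2's complement format that covers the range. Under that assumption the value can be computed as sketched below; the function name integer_word_length and the return value -1 for unbounded ranges are illustrative choices, and formats with fewer than one integer bit are not considered.

```cpp
#include <cassert>
#include <cmath>

// Smallest integer word length iwl (>= 1, sign bit included) such that
// -2^(iwl-1) <= min_v and max_v < 2^(iwl-1). Returns -1 for an unbounded range.
int integer_word_length(double min_v, double max_v) {
    if (!std::isfinite(min_v) || !std::isfinite(max_v)) return -1;
    int iwl = 1;
    while (min_v < -std::ldexp(1.0, iwl - 1) || max_v >= std::ldexp(1.0, iwl - 1)) {
        ++iwl;
    }
    return iwl;
}

// Checks against the rows of the table above.
void check_msb_table() {
    assert(integer_word_length(-1.5, 1.6) == 2);    // signal1, statistics
    assert(integer_word_length(-2.1, 2.1) == 3);    // signal2, range propagation
    assert(integer_word_length(-22.0, 22.0) == 6);  // signal3, range propagation
}
```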
Quantization effects can be modeled as additive noise
[Figure: a quantizer Q (B bits) between input and output is replaced by an adder that adds a noise signal to the input.]
Noise is approximated by a statistical model with the following assumptions:
• the noise is uncorrelated to the input.
• the noise is white.
• the probability distribution is uniform.
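Each quantization effect has mean and variance
• Rounding with step ∆:
$$m_n = 0 \quad\text{and}\quad \sigma_n^2 = \frac{\Delta^2}{12}$$
• Truncation with step ∆:
$$m_n = -\frac{\Delta}{2} \quad\text{and}\quad \sigma_n^2 = \frac{\Delta^2}{12}$$
• Magnitude truncation with step ∆:
$$m_n = 0 \quad\text{and}\quad \sigma_n^2 = \frac{\Delta^2}{3}$$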
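This results in an equivalent linear network
[Figure: the example circuit (input x, constant 12, multiplier output m, delay z-1 output d, adder output y) with quantizers Q1 and Q2, next to the same circuit in which Q1 and Q2 are replaced by adders injecting the noise sources e1(t) and e2(t).]
$$y(t) = \bigl(12\,x(t) + x(t-1)\bigr) + \bigl(12\,e_1(t) + e_2(t) + e_1(t-1)\bigr)$$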
… but quantization is a non-linear operation
[Figure: first-order feedback loop with input X(n), output Y(n), coefficient -0.96, delay z-1 and a quantizer Q (B bits).]
x(0) = 14, x(n) = 0 for n > 0
round to nearest integer
with rounding: ...
without rounding: ...
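The difference between the two responses can be reproduced with a few lines of C++. The sketch assumes that the loop computes y(n) = x(n) + Q(-0.96·y(n-1)), with Q rounding to the nearest integer; the exact position of Q in the figure is an assumption, and the names simulate_loop and check_limit_cycle are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Impulse response of y(n) = x(n) + Q(a * y(n-1)) with x(0) = x0 and x(n) = 0 for n > 0.
// With quantize == false, Q is the identity (ideal linear filter).
std::vector<double> simulate_loop(double x0, double a, int steps, bool quantize) {
    std::vector<double> y;
    double prev = 0.0;
    for (int n = 0; n < steps; ++n) {
        const double x = (n == 0) ? x0 : 0.0;
        double feedback = a * prev;
        if (quantize) feedback = std::round(feedback);  // round to nearest integer
        prev = x + feedback;
        y.push_back(prev);
    }
    return y;
}

// With rounding the output gets stuck in a limit cycle between +12 and -12,
// without rounding it decays towards zero.
void check_limit_cycle() {
    const std::vector<double> q = simulate_loop(14.0, -0.96, 300, true);
    const std::vector<double> ideal = simulate_loop(14.0, -0.96, 300, false);
    assert(q[0] == 14.0 && q[1] == -13.0 && q[2] == 12.0 && q[3] == -12.0);
    assert(std::fabs(q[299]) == 12.0);
    assert(std::fabs(ideal[299]) < 0.01);
}
```

Under these assumptions the rounded response never dies out, because 0.96·12 = 11.52 rounds back to 12. This behavior cannot be predicted by the additive noise model.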
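LSB determination is based on simulations
[Figure: the same stimuli drive a reference version and a fixed-point version (with quantizers Q) of the example circuit (x, constant 12, m, z-1, y); the outputs are compared. Flow: all fixed-point → simulate → output ok? If no, iterate; if yes, done.]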
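Signal to quantization noise ratio (SQNR)
$$SQNR = 10 \log_{10}\left(\frac{m_s^2 + \sigma_s^2}{m_e^2 + \sigma_e^2}\right)$$
[Figure: signal x with mean m_s and standard deviation σ_s passes through quantizer Q giving x_Q; the difference e between x and x_Q has mean m_e and standard deviation σ_e.]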
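Since m² + σ² equals the mean square of a signal, the SQNR can be estimated from two simulation traces. A minimal sketch, assuming the reference and the quantized samples are available as equally long vectors; the names sqnr_db and check_sqnr are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SQNR in dB between a reference trace and its quantized counterpart.
// Uses mean squares, i.e. m_s^2 + sigma_s^2 and m_e^2 + sigma_e^2.
double sqnr_db(const std::vector<double>& ref, const std::vector<double>& quantized) {
    assert(ref.size() == quantized.size() && !ref.empty());
    double signal_power = 0.0;
    double noise_power = 0.0;
    for (std::size_t k = 0; k < ref.size(); ++k) {
        const double e = ref[k] - quantized[k];
        signal_power += ref[k] * ref[k];
        noise_power += e * e;
    }
    // Identical traces give a zero noise power and therefore an infinite SQNR.
    return 10.0 * std::log10(signal_power / noise_power);
}

// A 10 % amplitude error corresponds to 20 dB.
void check_sqnr() {
    const std::vector<double> ref = {1.0, -1.0, 1.0, -1.0};
    const std::vector<double> fxp = {0.9, -0.9, 0.9, -0.9};
    assert(std::fabs(sqnr_db(ref, fxp) - 20.0) < 1e-6);
}
```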
LSB selection optimizes cost and performance
quantization set | SQNR pi | SQNR accu | SQNR pix | SQNR coeffs | SQNR block | SQNR temp block | SQNR blocki | cost | SNR | PSNR
0 | 208 | 253 | Inf | 184 | Inf | 225 | Inf | 787968 | 27,64 | 31,49
1 | 45,5 | 59,76 | Inf | 174 | Inf | Inf | Inf | 759296 | 27,48 | 31,33
2 | 45,5 | 59,76 | 25,15 | 174 | Inf | Inf | Inf | 759296 | 22,66 | 26,51
3 | 45,5 | 59,76 | 38,77 | 174 | Inf | Inf | Inf | 759296 | 26,91 | 30,75
4 | 45,5 | 59,76 | 47,3 | 30,88 | Inf | Inf | Inf | 230912 | 27,35 | 31,19
5 | 45,5 | 59,8 | 47,3 | 30,88 | 29,38 | Inf | Inf | 230912 | 27,34 | 31,19
6 | 45,5 | 61,4 | 47,3 | 30,88 | 29,38 | -1,93 | Inf | 41472 | 20,47 | 24,32
7 | 45,5 | 59,8 | 47,3 | 30,88 | 29,38 | Inf | Inf | 72192 | 27,34 | 31,19
8 | 45,5 | 60,23 | 47,3 | 30,88 | 29,38 | 16,73 | Inf | 56832 | 26,96 | 30,8
9 | 45,5 | 59,88 | 47,3 | 30,88 | 29,38 | 31,86 | Inf | 67072 | 27,31 | 31,16
SystemC introduces a number
of specific data types
Type Description
sc_logic 4 value {0,1,X,Z} single bit
sc_int 1 to 64 bit signed integer
sc_uint 1 to 64 bit unsigned integer
sc_bigint Arbitrary size signed integer
sc_biguint Arbitrary size unsigned integer
sc_bv Arbitrary sized 2 value vector
sc_lv Arbitrary sized 4 value vector
sc_fixed Signed fixed point
sc_ufixed Unsigned fixed point
sc_fix Untemplated signed fixed point
sc_ufix Untemplated unsigned fixed point
SystemC templated fixed-point
types
 Two fixed point templates
 sc_fixed <wl, iwl, q_mode, o_mode, n_bits> x; // signed
 sc_ufixed <wl, iwl, q_mode, o_mode, n_bits> y; // unsigned
 Parameters:
 wl = number of bits
 iwl = number of integer bits
 q_mode = quantization method (SC_RND / SC_TRN /
SC_TRN_ZERO / ...)
 o_mode = overflow method (SC_SAT / SC_WRAP / … )
 n_bits = number of saturated bits in case of wrapping (default 0)
 If quantization and overflow are not specified, the defaults (SC_TRN and
SC_WRAP) are used
Fixed point lengths
sc_fixed <5, 7> v;   bits: X X X X X 0 0   range [ -64 , 60 ]
sc_fixed <5, 3> v;   bits: X X X . X X   range [ -4 , 3.75 ]
sc_fixed <5, -2> v;   bits: . S S X X X X X (S = sign extension)   range [ -0.125 , 0.1171875 ]
Quantization methods
Both types have range [ 0 , 7.75 ] and precision 0.25; the value 3.1875 is 011.0011 in binary.
sc_ufixed <5, 3, SC_RND> v;
v = 3.1875   →   0 1 1 0 1 = 3.25   (quantization error 0.0625)
sc_ufixed <5, 3, SC_TRN> v;
v = 3.1875   →   0 1 1 0 0 = 3.0   (quantization error 0.1875)
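The two results can be checked without the SystemC library by scaling to the LSB, rounding, and scaling back. The sketch below is a plain C++ emulation and not the SystemC implementation; QMode, quantize and check_quantization_example are illustrative names, and the three modes are meant to mimic SC_RND (add half an LSB, then truncate), SC_TRN (truncate towards minus infinity) and SC_TRN_ZERO (truncate towards zero).

```cpp
#include <cassert>
#include <cmath>

enum class QMode { Rnd, Trn, TrnZero };

// Quantizes value to a grid with frac_bits fractional bits (LSB = 2^-frac_bits).
double quantize(double value, int frac_bits, QMode mode) {
    const double scaled = std::ldexp(value, frac_bits);  // value expressed in LSBs
    double q = 0.0;
    switch (mode) {
        case QMode::Rnd:     q = std::floor(scaled + 0.5); break;
        case QMode::Trn:     q = std::floor(scaled);       break;
        case QMode::TrnZero: q = std::trunc(scaled);       break;
    }
    return std::ldexp(q, -frac_bits);
}

// sc_ufixed<5,3> has 2 fractional bits: 3.1875 becomes 3.25 (RND) or 3.0 (TRN).
void check_quantization_example() {
    assert(quantize(3.1875, 2, QMode::Rnd) == 3.25);
    assert(quantize(3.1875, 2, QMode::Trn) == 3.0);
}
```

The modes only differ for negative values and ties: for -3.1875, truncation gives -3.25 while truncation towards zero gives -3.0.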
Overflow handling
Both types have range [ -16 , 15 ].
sc_fixed <5, 5, SC_RND, SC_SAT> v;
v = 18;   →   0 1 1 1 1 = 15
sc_fixed <5, 5, SC_RND, SC_WRAP> v;
v = 18;   →   1 0 0 1 0 = -14
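Both overflow behaviors are easy to emulate on plain integers. A minimal sketch for a signed word of wl bits; saturate, wrap and check_overflow_example are illustrative names and not SystemC functions.

```cpp
#include <algorithm>
#include <cassert>

// Saturation: clip to the representable range [-2^(wl-1), 2^(wl-1) - 1].
long long saturate(long long v, int wl) {
    const long long lo = -(1LL << (wl - 1));
    const long long hi = (1LL << (wl - 1)) - 1;
    return std::min(std::max(v, lo), hi);
}

// Wrap-around: keep the wl least significant bits and reinterpret them as 2's complement.
long long wrap(long long v, int wl) {
    const long long span = 1LL << wl;
    long long r = ((v % span) + span) % span;  // now in [0, span)
    if (r >= span / 2) r -= span;              // upper half represents negative values
    return r;
}

// The 5-bit example: 18 saturates to 15 and wraps to -14.
void check_overflow_example() {
    assert(saturate(18, 5) == 15);
    assert(wrap(18, 5) == -14);
}
```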
Fixed-point simulation
• operations in floating-point
• quantization and overflow handling during assignment
sc_fixed <4,3> a;   // lsb precision 0.5
sc_fixed <4,1> b;   // lsb precision 0.125
sc_fixed <4,2> c;   // lsb precision 0.25
a = 1.6;     // 1.6 is quantized to 1.5
b = 0.9;     // 0.9 is quantized to 0.875
c = a * b;   // 1.5 * 0.875 = 1.3125 is quantized to 1.25
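The numbers in this example can be reproduced with a small stand-in for the assignment step. The helper below is a hypothetical emulation of assigning a double to sc_fixed<wl, iwl> with the default SC_TRN and SC_WRAP modes; it is not part of SystemC, and to_fixed and check_fixed_assignment are illustrative names.

```cpp
#include <cassert>
#include <cmath>

// Emulates "sc_fixed<wl, iwl> x = v;" with truncation (floor) and wrap-around.
double to_fixed(double v, int wl, int iwl) {
    const int frac_bits = wl - iwl;
    const double scaled = std::floor(std::ldexp(v, frac_bits));  // truncate to the LSB grid
    const double span = std::ldexp(1.0, wl);                     // number of code words
    double r = std::fmod(scaled, span);
    if (r < 0.0) r += span;                                      // now in [0, span)
    if (r >= span / 2.0) r -= span;                              // 2's complement wrap
    return std::ldexp(r, -frac_bits);
}

// The multiplication itself is done in floating point; only the assignments quantize.
void check_fixed_assignment() {
    const double a = to_fixed(1.6, 4, 3);
    const double b = to_fixed(0.9, 4, 1);
    const double c = to_fixed(a * b, 4, 2);
    assert(a == 1.5);
    assert(b == 0.875);
    assert(c == 1.25);
}
```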
SystemC fixed point types with
non-static arguments
 Fixed point parameter values
 sc_fxtype_params my_type(wl,iwl,q_mode,o_mode,n_bits);
 x = my_type.wl();      // read the word length
 my_type.iwl(x-2);      // set the integer word length
 Two non-static fixed point types
 sc_fix x(my_type); // signed
 sc_ufix y(my_type); // unsigned
 For arrays, these types are used with a context
 sc_fxtype_context my_context(my_type);
 sc_fix z[64];
 Remark: for fixed point simulations, include in every file
 #define SC_INCLUDE_FX
 #include <systemc.h>
Goal of this exercise
 Perform fixed point refinement for all the internal variables of
the IDCT in the JPEG example
 determine the MSB to avoid internal overflows without overflow
logic.
 determine the LSB to have no more than 0.5 dB degradation on
the PSNR of the resulting image
How to start?
 You find:
In .../exercises/exercise2/ : the functional model with a fixed point IDCT
implementation; types-file datatypes_original.txt
In /exercises/modules/: library of JPEG-encoder modules
{r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules
{b2r,idct,normalize,zz_dec}.{h,cpp} and testbench modules {src,snk,test}.{h,cpp}
Special fixed point support functions of directory
…/exercises/add2systemc/ are used
In /exercises/images/: test images
 Things to do:
inspect the code to understand the behavior
Make the application
change datatypes.txt file
syntax: exercise2 -i <inputfile> -o <outputfile> -t <typefile>
Model Based System Design
Class 3: Communication
Refinement
Communication refinement
 Communication refinement
 Communication refinement in SystemC
 Exercise 3: communication refinement for
the JPEG decoder
Communication refinement is one of the steps in architectural design
[Figure: the three refinement axes from system functionality to dedicated architecture (computations, data, communication), with the communication axis (point-to-point queues to busses and detailed protocols) as the topic of this class.]
Functional models use FIFO
communication
 Queues guarantee consistent data passing
 Implementation could become expensive for large sizes
 communication must be optimized
[Figure: Process1 and Process2 connected by a queue with (infinite) storage.]
Many communications can be reduced to a single register
[Figure: Process1 and Process2 connected by a wire.]
 Output of functions is registered
 No extra implementation cost
 No storage for data
 Consistency of communication needs to be guaranteed
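Example of correct wired communication
[Figure: Process 1 and Process 2 connected by a wire. Process 1 cycles through the states filt1, filt2, filt3 and filt4, with write() in filt4. Process 2 starts with w=0, increments w (w++) while w<4, and once w=4 cycles through op1 (with read()), op2, op3 and op4.]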
Communication is perfectly aligned
Clock cycle | Process 1 | Process 2
1 | filt1 | w=1
2 | filt2 | w=2
3 | filt3 | w=3
4 | filt4, write() | w=4
5 | filt1 | read(), op1
6 | filt2 | op2
7 | filt3 | op3
8 | filt4, write() | op4
9 | filt1 | read(), op1
10 | filt2 | op2
… | … | …
We have to guarantee the condition that every write() comes before a read()
Small changes to design can
result in errors
 Increase (decrease) the number of operations in process 1 (2):
the same data will be consumed twice.
 Decrease (increase) the number of operations in process 1 (2):
data will be lost
 If the number of initial wait operations in process 2 is too low,
we will use undefined data
 If the number of initial wait operations in process 2 is too high,
we will lose the first data elements
Example of wrong wired communication
[Figure: Process 1 and Process 2 connected by a wire. Process 1 cycles through filt1, filt2, filt3 and filt4, with write() in filt4. Process 2 cycles through op1 (with read()) and op2 only.]
The example results in undesired behavior
Clock cycle | Process 1 | Process 2
1 | filt1 | read(), op1
2 | filt2 | op2
3 | filt3 | read(), op1
4 | filt4, write() | op2
5 | filt1 | read(), op1
6 | filt2 | op2
7 | filt3 | read(), op1
8 | filt4, write() | op2
9 | filt1 | read(), op1
10 | filt2 | op2
… | … | …
Adapt cycle budget or introduce handshake protocol
Simple handshake protocol is
more robust
 The flag “a” (ask) indicates that the receiver is ready to read
data in the next cycle.
 The flag “r” (ready) indicates that data has been written
 Safe communication requires at least two cycles.
Simple handshake protocol is more robust
[Figure: two state machines connected by the flags r and a.
Transmitter: filt1 (r=0) → filt2 → filt3 → if (a==1) { filt4; write(); r=1 }, waiting while !a.
Receiver: a=1; if (r==1) { read(); op1; a=0 }, waiting with a=1 while !r; then op2 with a=1.]
… and effectively synchronizes the communication
Clock cycle | r | Process 1 | a | Process 2
1 | r=0 | filt1 | a=1 |
2 | r=0 | filt2 | a=1 |
3 | r=0 | filt3 | a=1 |
4 | r=1 | filt4, write() | a=1 |
5 | r=0 | filt1 | a=0 | read(), op1
6 | r=0 | filt2 | a=1 | op2
7 | r=0 | filt3 | a=1 |
8 | r=1 | filt4, write() | a=1 |
9 | r=0 | filt1 | a=0 | read(), op1
10 | r=0 | filt2 | a=1 | op2
… | … | … | … | …
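… also when receiver is slower than transmitter
[Figure: two state machines connected by the flags r and a.
Transmitter: filt1 (r=0) → if (a==1) { filt2; write(); r=1 }, waiting while !a.
Receiver: a=1; if (r==1) { read(); op1; a=0 }, waiting with a=1 while !r; then op2; then op3 with a=1.]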
… but introduces then one extra wait cycle at receiver
Cycle | r | Process 1 | a | Process 2
1 | r=0 | filt1 | a=1 |
2 | r=1 | filt2, write() | a=1 |
3 | r=0 | filt1 | a=0 | read(), op1
4 | r=0 | | a=0 | op2
5 | r=0 | | a=1 | op3
6 | r=1 | filt2, write() | a=1 |
7 | r=0 | filt1 | a=0 | read(), op1
8 | r=0 | | a=0 | op2
9 | r=0 | | a=1 | op3
10 | r=1 | filt2, write() | a=1 |
… | … | … | … | …
The extra wait cycle can be avoided by already putting a=1 during op2
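Most general protocol: 4-phase handshake protocol
[Figure: two state machines synchronized by the signals Req and Ack. States on the requesting side: Execute, Req=1, Get Data, Req=0. States on the acknowledging side: Ack=0, Put Data (Data), Ack=1, Ack=0. Transitions are guarded by Req and Ack.]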
Multiple variations on these
handshake protocols exist
 Instead of signal levels, the protocol can be based on signal
transitions.
 The protocol can be simplified if both systems run on the same
clock.
 Protocols can be simplified if one knows that the receiver or
the transmitter is fastest.
 Synchronization can be performed on the basis of a block:
 Set-up communication for first element of a block
 Next, communicate every cycle
 Some protocols are based on typical FIFO signals: full and
empty.
In some cases buffered communication is required
[Figure: process1 and process2 connected by two queues, Q1 and Q2.]
Cycle | process1 | process2
1 | write(Q1) |
2 | write(Q1) |
3 | write(Q2) |
4 | | read(Q2)
5 | | read(Q1)
6 | | read(Q1)
Queue size can be determined by monitoring the maximum
number of elements in a queue during simulation.
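Queues must be introduced explicitly in hardware
[Figure: a FIFO process (size N, controlled by an fsm) placed between Process1 and Process2; both sides are connected to it by a wired handshake protocol with the flags r and a.]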
Several communications can also be multiplexed on a bus
[Figure: the point-to-point connections Process1 to Process2 and Process3 to Process4 are replaced by a shared bus with an arbiter; each process is connected to the bus through the flags r and a.]
Bus and arbiter classes can be reused!
Communication refinement
results in behavioral model
 Model that defines the relative ordering of inputs and outputs
 A clock signal is used for ordering
 Pins are accurate to the final implementation
 Internal resources are not mapped on clock cycles
(scheduling) and functional units (resource binding)
In SystemC behavioral models
use (clocked) threads
 Modeled with thread processes SC_THREAD or with clocked thread
processes SC_CTHREAD
 Every module has a clock input:
 sc_in_clk clk;
 The SC_THREAD process is made static sensitive to a clock edge
 sensitive << clk.pos();
 To separate clock cycles wait() statements are used.
 A synchronous or asynchronous reset signal can be specified:
 reset_signal_is(reset, true);
 async_reset_signal_is(reset, true);
 Simulation must be run for a finite time (or will not stop!) or halted
explicitly.
Behavioral models communicate via standard signals
 All inputs and outputs are standard signals
 Define signals with:
 sc_signal<T> a;
 Predefined ports for sc_signal<T> channels:
 sc_in<T> with interface function read() or assignment operator.
 sc_out<T> with interface function write() or assignment operator.
 sc_inout<T> that combines both interface functions.
Clocks in SystemC
• Create clock
  • sc_clock clock1("clock_label", period, time_unit, duty_ratio, offset, offset_time_unit, first_value);
  • sc_clock clock2("clock_label", period, time_unit, duty_ratio);
  • sc_clock clock3("clock_label", period, time_unit);
• Clock Binding
  • f1.clk( clock1 );
• Clocks are typically defined in sc_main();
• Example
sc_clock clock1("clock1", 20, SC_NS, 0.5, 2, SC_NS, true);
[Figure: resulting clock waveform with edges at 2, 12, 22, 32 and 42 ns.]
Example: summing 3 values on an input
    SC_MODULE(sum3) {
        sc_in_clk CLOCK;
        sc_in<bool> RESET;
        sc_in<unsigned> A;
        sc_out<unsigned> D;
        void compute();
        SC_CTOR(sum3) {
            SC_CTHREAD(compute, CLOCK.pos());
            reset_signal_is(RESET, true);
        }
    };
    void sum3::compute() {
        unsigned tmp;
        // reset section
        while (true) { // main loop
            tmp = A.read();
            wait(); // end first I/O cycle
            tmp += A.read();
            wait(); // end second I/O cycle
            tmp += A.read();
            D.write(tmp);
            wait(); // end third I/O cycle
        }
    }
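Gradual Communication refinement (1/2)
[Figure: starting point is Process1 and Process2 connected by a queue. In the refined version the functional processes keep their queues Q1 and Q2, and converters translate between the queues and the clocked r/a handshake, forming Behavioral_process1 and Behavioral_process2.]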
Gradual Communication refinement (2/2)
[Figure: next steps. First Process2 is replaced by a behavioral Process2 with a clocked r/a handshake, while Process1 still uses queue Q1 and converter C1 (Behavioral_process1); finally Process1 is also replaced by a behavioral Process1, so that both communicate directly through r and a.]
Converter SystemC code
template <class T> SC_MODULE(FF2P) {
sc_fifo_in<T> input;
sc_out<T> output;
sc_in<bool> ask;
sc_out<bool> ready;
sc_in_clk clk;
SC_CTOR(FF2P) {
SC_THREAD(process);
sensitive << clk.pos();
}
void process() {
T value;
enum ctrl_state {READINPUT, WRITEOUTPUT};
ctrl_state state;
// reset cycle
ready.write(false); state = READINPUT; wait();
while(true) {
if (state == READINPUT) {
ready.write(false); value = input.read();
state = WRITEOUTPUT;
} else {
if (ask.read() == true) {
output.write(value); ready.write(true);
state = READINPUT;
} else {
ready.write(false); state = WRITEOUTPUT;
};
};
wait();
}
return;
}
};
template <class T> SC_MODULE(P2FF) {
sc_fifo_out<T> output;
sc_in<T> input;
sc_in<bool> ready;
sc_out<bool> ask;
sc_in_clk clk;
SC_CTOR(P2FF) {
SC_THREAD(process);
sensitive << clk.pos();
}
void process() {
T value;
enum ctrl_state {READINPUT, WRITEOUTPUT};
ctrl_state state;
// reset cycle
ask.write(true); state = READINPUT; wait();
while(true) {
if (state == READINPUT) {
if (ready.read() == true) {
value = input.read(); ask.write(false);
output.write(value); state = WRITEOUTPUT;
} else {
ask.write(true); state = READINPUT;
};
} else {
ask.write(true); state = READINPUT;
};
wait();
}
return;
}
};
Exercise 3: communication
refinement for the JPEG decoder
 Goal: Replace the FIFO between the run-length encoder and decoder by
a handshake protocol
 You will find:
 In /exercises/exercise3/ : solution of exercise2
 In /exercises/modules/: JPEG-encoder modules
{r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules
{b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules
{src,snk,test}.{h,cpp}
 In /exercises/images/: test images
 In /exercises/add2systemc: FIFO to protocol conversion functions
{FF2P, P2FF}.h
 Things to be done:
 Introduce a handshake protocol between rl_enc and rl_dec.
 introduce refined versions of rl_dec in jpeg_dec.h and main.cpp.
 simulate and verify correct operation.
Model Based System
Design
Class 4: computation
refinement
Computation refinement in
SystemC
 Computation refinement
 Computation refinement in SystemC
 Exercise 4: computation refinement of a JPEG decoder
108
RTL refinement is the 3rd
step in
architectural design
[Figure: refinement from System Functionality to System Architecture along three axes]
 Computations: operations → operators (resource allocation, scheduling)
 Data: variables, arrays, floating point → memories, fixed point (memory allocation, address generation, word sizing)
 Communication: point-to-point queues → busses, detailed protocol (bus allocation, introduce arbiters, include protocols)
109
For synthesis, all blocks need
to be transformed to RTL
 Transformation is a gradual refinement process
 switch a behavioral block with an RTL block
 verify by system simulation
[Figure: a test bench around a system built from blocks func1, beh2/RTL2, beh3/RTL3 and beh4/RTL4, with labels S1, S2 and S3; each behavioral block can be swapped for its RTL version.]
110
Behavioral model can be
represented by an FSM
void process_behavioral() { // SC_CTHREAD
  ask.write(true);
  while (ready.read() == false) { wait(); }
  wait();
  while (true) {
    ask.write(false);
    x = input.read();
    wait();
    d = x * b1;
    y = d * b2;
    output.write(y);
    ask.write(true);
    while (ready.read() == false) { wait(); }
    wait();
  }
}
[Figure: equivalent FSM. State 1: ask=1, loop while !ready, advance on ready. State 2: ask=0, x=input. State 3: ask=1, d = x * b1, y = d * b2, output = y, loop while !ready, return to state 2 on ready.]
111
Behavioral to RTL: scheduling of
operations in FSM
[Figure: two FSMs side by side, with transitions guarded by ready / !ready. Before scheduling: (ask=1) → (ask=0, x=input) → (ask=1, d = x * b1, y = d * b2, output = y). After scheduling: (ask=1) → (ask=0, x=input, d = x * b1) → (ask=1, y = d * b2, output = y).]
112
Rescheduled FSM is
represented in RTL code
[Figure: rescheduled FSM, with transitions guarded by ready / !ready: (ask=1) → (ask=0, x=input, d = x * b1) → (ask=1, y = d * b2, output = y). It is equivalent to the following code.]
void process_rtl() { // SC_CTHREAD
  ask.write(true);
  while (ready.read() == false) { wait(); }
  wait();
  while (true) {
    ask.write(false);
    x = input.read();
    d = x * b1;       // first multiplication moved before the clock edge
    wait();
    ask.write(true);
    y = d * b2;
    output.write(y);
    while (ready.read() == false) { wait(); }
    wait();
  }
}
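The rescheduled FSM can also be written with an explicit state register, which is closer to what a synthesis tool derives from the wait() statements. This is a simplified sketch in plain C++ rather than SystemC: one call to `step()` stands for one clock cycle, the state names and the `valid` flag are hypothetical, and the extra wait() after each ready loop is folded into the state transition.

```cpp
#include <cassert>

// Explicit-state sketch of the rescheduled FSM; one step() call is one clock cycle.
struct RtlFsm {
    enum State { WAIT_READY, LOAD, STORE };
    State state = WAIT_READY;
    int b1, b2;           // coefficients, chosen by the caller
    int d = 0;            // register between the two multiplications
    bool ask = true;      // handshake output
    int output = 0;       // data output
    bool valid = false;   // hypothetical flag: true in the cycle that writes output

    RtlFsm(int c1, int c2) : b1(c1), b2(c2) {}

    void step(bool ready, int input) {
        valid = false;
        switch (state) {
        case WAIT_READY:          // ask=1, stay here until ready
            ask = true;
            if (ready) state = LOAD;
            break;
        case LOAD:                // ask=0, x=input, d=x*b1
            ask = false;
            d = input * b1;
            state = STORE;
            break;
        case STORE:               // ask=1, y=d*b2, output=y
            ask = true;
            output = d * b2;
            valid = true;
            state = WAIT_READY;
            break;
        }
    }
};
```

Each state performs at most one multiplication, which is what allows a single multiplier to be shared between the two products.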
113
RTL description corresponds to
a datapath
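[Figure: possible mapping of the RTL description onto a datapath, with labels: multiplier (*), coefficients b1 and b2, values x, d and y, constants 1 and 0 selecting ask, input ready, and three D flip-flops.]
RT description introduces synthesis decisions:
 register inference
 resource sharing
 parallelism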
114
… and a controller
[Figure: controller made of a state register, a next-state function and an output function. It receives the external inputs and the status signals from the datapath, and produces the external outputs and the control signals for the datapath. Further labels in the figure: ins0, ins1, ins2 and c0, c1, c2.]
 control: steers the register transfers in datapath
115
Critical path of combinatorial
logic is crucial
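[Figure: a clocked process in which the input (in) feeds combinatorial logic (multiplexers, adders, multipliers, etc.) that computes calc and drives the output (out). The accompanying timing diagram of clock, in, calc and out marks the critical path.]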
116
Pipelining reduces the critical
path
[Figure: area versus data insertion interval (DII) for a chain of adders. Variants shown: non-pipelined (DII = critical path), word pipelined and bit pipelined (split between lsb and msb). Remaining figure labels: DII = operator delay, DII = operator delay/2, 1-bit operator delay.]
117
Multiplexing reduces the area
of the solution
[Figure: area versus data insertion interval (DII). Non-pipelined datapath with two adders: DII = critical path. Muxed DSP with one shared adder: DII = 2 x critical path. Further along the same trade-off: processor architecture, e.g. VLIW.]
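The trade-off between area and DII can be made concrete with a small calculation. The sketch below assumes two chained additions with the same operator delay and area; the register and multiplexer areas, the neglected multiplexer delay, and the value DII = operator delay for the word-pipelined case are assumptions for illustration, not figures from the slides.

```cpp
#include <cassert>

struct Option { double area; double dii; };

// Two chained additions; t_op and a_op are the delay and area of one adder.
// a_reg and a_mux are the areas of one pipeline register and one multiplexer.

// Both adders in one combinatorial path: DII = critical path = 2 * t_op.
Option non_pipelined(double t_op, double a_op) {
    return {2 * a_op, 2 * t_op};
}

// A register between the adders halves the critical path (assumed DII = t_op).
Option word_pipelined(double t_op, double a_op, double a_reg) {
    return {2 * a_op + a_reg, t_op};
}

// One adder reused in two cycles: DII = 2 x critical path, mux delay ignored.
Option multiplexed(double t_op, double a_op, double a_reg, double a_mux) {
    return {a_op + a_reg + a_mux, 2 * t_op};
}
```

With such a model, pipelining buys a shorter DII for a little extra area, while multiplexing saves area whenever a register plus a multiplexer is cheaper than the operator it replaces.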
118
Real embedded systems show
architectural variability
E.g. Robot Vision System
[Figure: processing chain observing an object: CCD camera → line delay → Sobel operator → edge detector → feature extractor → pattern recognizer → robot controller, under global control and communication. The blocks use different architecture styles: hardwired control with memories and data path; muxed DP's; a µcoded processor (µ-code ROM, PC logic, µ-code control, RAM, programmable function units, off-chip memory); and an array type (modular array of processing elements with control).]
119
Area can be estimated at a
high level
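Source: Gajski
Total circuit area = Control Unit (CU) + Datapath (DP)
 Control Unit (CU) = State_reg + logic
 State_reg: area is a function of # states
 logic: area is a function of # states, # ctrl_lines, # states in which each ctrl_line is active
 Datapath (DP) = Storage + func_units (FU) + Muxes + wires
 Storage: area is a function of # bits and # words of each storage
 func_units (FU): area is a function of # bits and type of each FU
 Muxes: area is a function of # sources of muxes
 wires: area is a function of # DP connections, # DP components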
120
Standard cell data can be
used to derive parameters
type               name        width
2 input MUX        mxi2v0x1    3.08
2 input NMUX       mxn2v0x1    3.52
2 input AND        an2v4x2     2.20
3 input AND        an3v4x2     3.08
4 input AND        an4v4x2     3.52
2-bit half adders  ha2v0x2     5.28
Q flip-flop        dfnt1v0x2   7.92
…                  …           …
Source: www.vlsitechnology.org
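As an illustration of how such cell data feeds a high-level estimate, the sketch below sums cell widths for two simple structures. Only the two width constants come from the table; the helper names, the binary state encoding and the choice of structures are assumptions, and wiring as well as control logic are ignored.

```cpp
#include <cassert>

// Cell widths copied from the table (units as given by the cell library).
const double W_MUX2 = 3.08;   // 2 input MUX, mxi2v0x1
const double W_DFF  = 7.92;   // Q flip-flop, dfnt1v0x2

// Number of flip-flops for a binary-encoded state register (assumed encoding).
int state_bits(int n_states) {
    int bits = 0;
    while ((1 << bits) < n_states) ++bits;
    return bits;
}

// Total cell width of the state register of a controller with n_states states.
double state_register_width(int n_states) {
    return state_bits(n_states) * W_DFF;
}

// Total cell width of an n-bit register with a 2-input mux in front of each bit.
double muxed_register_width(int bits) {
    return bits * (W_MUX2 + W_DFF);
}
```

For the three-state FSM of the earlier example this gives two flip-flops for the state register; an 8-bit multiplexed register comes to 8 x (3.08 + 7.92) width units.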
121
Storage: Registers vs. memories
Registers:
 Inferred by synthesis.
 Larger size per storage bit.
 No overhead.
 Fast & parallel.
 Best for < 1 kbit of storage.
Memories:
 Not synthesized, but created by memory generators.
 Smaller size per storage bit.
 Fixed overhead.
 Slow & serial.
 Best for > 1 kbit of storage.
122
Computation refinement in
SystemC
 Computation refinement
 Computation refinement in SystemC
 Exercise 4: computation refinement of a JPEG decoder
123
RTL design is modeled with modules
and processes
An sc_module is an identifiable hardware unit.
A module can contain multiple processes that run in parallel.
Signals are used to communicate between (executions of) processes.
Variables are used inside a single execution of a process.
124
Restrictions (1/2) in SystemC
Synthesizable Subset (draft 1.3)
 Modules
 Exactly one constructor.
 Processes
 Only SC_CTHREAD and SC_METHOD are supported;
SC_THREAD is not supported.
 In an SC_CTHREAD there must be a wait() statement before
the infinite loop or as the first statement in this loop.
 At most one clock signal is allowed per process.
 The reset behavior is specified in the process, not in the
constructor of the modules.
 Between two clock events, at most one assignment to a
signal is supported.
 Processes communicate through signals, not shared
variables.
125
Restrictions (2/2) in SystemC
Synthesizable Subset (draft 1.3)
 Datatypes:
 No floating point.
 Char is implemented as signed char, all integer types are
2’s complement.
 Pointers are not supported.
 Untemplated fixed point types are not supported.
 No division operator for fixed point types.
 No global variables but global constants are OK.
 Functions:
 No new, delete, or sizeof operators.
 Destructors have no effect.
 Exception handling is not supported.
126
Example: relation Synthesizable
SystemC and VHDL
SystemC
#include "systemc.h"
SC_MODULE(dff) {
  sc_in<bool> din;
  sc_in_clk clock;
  sc_out<bool> dout;
  void doit(); // Member function
  SC_CTOR(dff) {
    SC_CTHREAD(doit, clock.pos());
  }
};
void dff::doit() { // Process body
  while (true) {
    wait();
    dout.write(din.read());
  }
}
VHDL
entity dff is
  port ( din, clock : in bit; dout : out bit );
end dff;
architecture dff of dff is
begin
  doit : process(clock) -- Sensitivity List
  begin
    if (clock'event and clock='1') then
      dout <= din;
    end if;
  end process;
end dff;
127
Signals for communication
between processes
 Declaration
 Scalar Signal: sc_signal<sc_uint<32 > > a;
 Vector Signal: sc_signal<sc_logic> a[32];
 Signals use request-update mechanism: write takes effect after a delta-cycle
 When you assign a value to a signal or port, the value on the right side is
not transferred to the left side until the process halts. This means that the
signal value as seen by other processes is not updated immediately, but it
is deferred.
 When you assign a value to a variable, the value on the right side is
immediately transferred to the left side of the assignment statement.
 SystemC supports resolved ports and signals
 Multi-valued logic type: 0, 1, Z, X
 Allow multiple drivers
128
Signals can infer registers
w = x
y1 = w * 10
z = x // writing at the end of cycle
wait()
y2 = z * 10 // reading at the beginning of cycle
Simulation (x = undefined):
cycle   1    2    3    4
x       1    2    3    x
y1      10   20   30   x
z       x    1    2    3
y2      x    10   20   30
[Figure: synthesis result. The variable w is a plain connection from x to a multiplication by 10 that produces y1. The signal z is a D flip-flop, clocked by clock, between x and a multiplication by 10 that produces y2.]
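The one-cycle delay introduced by the signal can be reproduced with a few lines of plain C++. This is a sketch of the semantics only, not SystemC code: `z_cur` and `z_nxt` are hypothetical names for the visible and the requested value of signal z, and 0 stands in for the undefined value at start-up.

```cpp
#include <cassert>
#include <vector>

struct Trace { std::vector<int> y1, y2; };

// Replays the slide example for a sequence of x values, one per clock cycle.
Trace simulate(const std::vector<int>& xs) {
    Trace t;
    int z_cur = 0;   // value of signal z visible to readers (0 = undefined here)
    int z_nxt = 0;   // value requested by the most recent write
    for (int x : xs) {
        t.y2.push_back(z_cur * 10);  // reads z as stored at the previous clock edge
        int w = x;                   // variable: assignment takes effect immediately
        t.y1.push_back(w * 10);
        z_nxt = x;                   // signal: the write is only requested here
        z_cur = z_nxt;               // update at the end of the cycle
    }
    return t;
}
```

For the inputs 1, 2, 3 the y1 trace follows x in the same cycle, while y2 lags by one cycle, which is the behavior of a register.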
129
Random Access Memory is
modeled with a behavioral model
// ram_asyn.h - asynchronous RAM
#include "systemc.h"
SC_MODULE(ram_asyn) {
  sc_in<sc_uint<6> > addr;
  sc_in<bool> rwb;
  sc_in<int> datain;
  sc_out<int> dout;
  int memdata[64]; // local memory storage
  void ramaction();
  SC_CTOR(ram_asyn) {
    SC_METHOD(ramaction);
    sensitive << addr << datain << rwb;
    for (int i = 0; i < 64; i++) { memdata[i] = 0; } // clear the memory
  }
};
[Figure: block "Asynchronous RAM (64)" with inputs address, datain and rwb, and output dataout.]
130
SystemC has a 4-step
simulation engine
1: Initialize
2: Iterative execution of
functional, behavioral & RTL
processes until no activity
3: Update primitive channels
4: Go back to 2
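[Figure: example system mixing abstraction levels: a functional block (1), a behavioral block (2) and an RT block (3), connected through channels labeled q1, q3, q4 and s1 to s4, with P2FF and FF2P converters between FIFO and protocol interfaces.]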
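The four steps can be sketched as an evaluate/update loop in plain C++. This is a toy model rather than the SystemC kernel: the `Signal` and `Kernel` types are hypothetical, simulated time and events are left out, and a process sensitive to several changing signals could be queued more than once.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Signal {
    int cur = 0, nxt = 0;
    std::vector<int> readers;   // indices of the processes sensitive to this signal
    void write(int v) { nxt = v; }
    int read() const { return cur; }
};

struct Kernel {
    std::vector<Signal> signals;
    std::vector<std::function<void(Kernel&)>> processes;
    std::vector<int> runnable;

    // Runs delta cycles until no process is runnable; returns their count.
    int run() {
        int deltas = 0;
        while (!runnable.empty()) {
            std::vector<int> now;
            now.swap(runnable);
            for (int p : now) processes[p](*this);   // step 2: evaluate
            for (Signal& s : signals) {              // step 3: update
                if (s.nxt != s.cur) {
                    s.cur = s.nxt;
                    for (int r : s.readers) runnable.push_back(r);
                }
            }
            ++deltas;                                // step 4: back to step 2
        }
        return deltas;
    }
};

// Chain a -> b = a + 1 -> c = b * 2; returns the settled value of c.
int settle_chain(int a_value, int& deltas) {
    Kernel k;
    k.signals.resize(3);              // 0: a, 1: b, 2: c
    k.signals[0].readers = {0};       // process 0 is sensitive to a
    k.signals[1].readers = {1};       // process 1 is sensitive to b
    k.processes.push_back([](Kernel& s) { s.signals[1].write(s.signals[0].read() + 1); });
    k.processes.push_back([](Kernel& s) { s.signals[2].write(s.signals[1].read() * 2); });
    k.signals[0].cur = k.signals[0].nxt = a_value;   // step 1: initialize
    k.runnable = {0, 1};              // every process runs once at the start
    deltas = k.run();
    return k.signals[2].read();
}
```

In this sketch a change of a needs two delta cycles to reach c, because the second process only sees the new value of b after the update step.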
131
Measuring performance
 const sc_time& sc_time_stamp(): returns the current time
during simulation.
 Following functions are defined for sc_time:
 double to_seconds(): converts the time into seconds
 void print(): prints the time on the screen
 If the clock period is known, the number of clock cycles can
be calculated.
 Throughput ≥ Datarate/Simulation_time
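A minimal sketch of the calculation, in plain C++: the simulated time would come from sc_time_stamp().to_seconds(), while the function names and the example figures used here are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Clock cycles elapsed, given simulated time and clock period in seconds.
// The quotient is rounded to the nearest integer to absorb floating-point error.
long cycles_elapsed(double sim_time_s, double clock_period_s) {
    return std::lround(sim_time_s / clock_period_s);
}

// Average throughput in items per second over the whole simulation.
double throughput(long items, double sim_time_s) {
    return items / sim_time_s;
}
```

For example, 1 ms of simulated time with a 10 ns clock period corresponds to 100000 clock cycles.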
132
Dump signals for wave plotting
sc_signal< sc_int<32> > signal1;
sc_signal<bool> signal2;
sc_trace_file *tracefile;
tracefile = sc_create_vcd_trace_file(tracefilename); // ".vcd" is appended to the name
sc_trace(tracefile, signal1, "signal1");
sc_trace(tracefile, signal2, "signal2");
// ... run the simulation ...
sc_close_vcd_trace_file(tracefile);
133
Computation refinement in
SystemC
 Computation refinement
 Computation refinement in SystemC
 Exercise 4: computation refinement of a JPEG decoder
134
How to start?
 Goal: refine run-length decoder in RTL model.
 You will find:
 In /exercises/exercise4/ : solution of exercise3
 In /exercises/modules/: JPEG-encoder modules
{r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules
{b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules
{src,snk,test}.{h,cpp}
 In /exercises/images/: test images
 In /exercises/add2systemc: behavioral RAM models.
 Things to be done:
 Make RTL model of run-length decoder.
 Draw the FSM of the RTL model.
 Introduce the RTL model in jpeg_dec.h and integrate it in main.cpp.
 Simulate and verify correct operation with the gtkwave viewer.
 Estimate the needed hardware for this RTL model.

Digital design with Systemc

  • 1. Specification Languages: Part 2 Marc Engels e-mail: marc.engels@flandersmake.be
  • 2. 2 Specification Languages  Part 1: Specification Models  Part 2: Model based system design  Show how the models of part 1 can be used for architectural design  Provide hands-on experience with SystemC v2.3.2 (released in October 2017).  Introduce OO techniques for design of hardware systems  Part 3: Project
  • 3. 3 Course Material for part 2  Prerequisite:  part 1 of specification languages  C++ (good tutorial at www.cplusplus.com)  Coding and debugging programs  RTL description of synchronous digital circuits  Material for part 2:  Slides with notes.  IEEE Standard SystemC Language Reference Manual, IEEE Std 1666-2011.
  • 4. Model Based System Design Class 1: constructing a functional model Marc Engels e-mail: marc.engels@flandersmake.be
  • 5. 5 Functional modeling in SystemC  Introduction to design of digital embedded systems  SystemC introduction  SystemC functional model syntax  Exercise 1: building a functional model in SystemC
  • 7. 7 … as well as professional equipment
  • 8. 8 Characteristics of embedded systems  Optimize for power, cost, and size  Robust design  Provide the ability for evolution and mass customization  Minimize time to market  Some functionality might be safety-critical  Interfacing with the real world, leading to real time constraints
  • 9. 9 Sensors Actuators Real world process Processing Embedded systems combine various types of real-time behavior ADC DAC event signal signal action user Signal conditioning Actuator Powering
  • 10. 10 Digital embedded systems combine hard- and software User interface NVM ROM µPorDSPcore RAM Conf. Logic Memories Peripheral Mo- dem buffers Video/ Graphics processor Protocol Speech Processing Analysis of channel + analog, sensors and actuators
  • 11. 11 Design flow for digital embedded systems System Functionality Functional Requirements Performance Requirements Architecture Template Architectural Requirements Mapping Dedicated Architecture C-code Non-functional Requirements
  • 12. 12 Function to architecture conversion follows three axes ComputationsComputations operations DataData variables, arrays floating point memories fixed point operators CommunicationCommunication point-to-point queues busses detailed protocol resource allocation scheduling memory allocation address generation word sizing bus allocation introduce arbiters include protocols System Functionality Dedicated Architecture
  • 13. 13 Functional modeling in SystemC  Introduction to design of digital embedded systems  SystemC introduction  SystemC functional model syntax  Exercise 1: building a functional model in SystemC
  • 14. 14 SystemC bridges gap between function and architecture MATLAB C/C++ VHDL Verilog SystemC System Functionality Dedicated Architecture
  • 15. 15 What is SystemC?  A modeling framework in C++ for the refinement of a system from a functional description into an architecture  Contributions:  hardware modeling with C++: OCAPI (IMEC) and SCENIC (Synopsys/UC Irvine)  fixed-point data types: Frontier Design  hardware-software co-design: CoWare (IMEC/CoWare)  Language first standardized in December 2005 as IEEE 1666, revised in 2011 as IEEE 1666-2011  Extensions of SystemC:  Verification library.  Transaction level modeling library (integrated in IEEE 1666-2011).  Analog and mixed-signal modeling.  More info: www.accellera.org
  • 16. 16 Which tools are available for SystemC?  Open source simulation library available  Open source translators from Verilog or VHDL to SystemC  Commercial synthesis tools:  Cadence (Stratus HLS).  Mentor (Catapult C).  NEC (CyberWorkBench).  SystemCrafter (SC).  Xilinx (Vivado Design Suite).
  • 17. 17 SystemC language architecture C++ language Core Language Modules Ports Exports Processes Interfaces Channels Events Event-driven simulation kernel Data-types 4-valued logic type 4-valued logic vectors Bit-vectors Finite-Precision integers Limited-Precision integers Fixed-Point types Pre-defined Channels Signal, Clock, fifo, Mutex, Semaphore. Libraries for Specific Models of Computation and/or methodologies, e.g. TLM interfaces, bus models, SystemC verification library Utilities Report Handling, Tracing User Application
  • 19. 19 Functional modeling in SystemC  Introduction to design of digital embedded systems  SystemC introduction  SystemC functional model syntax  Exercise 1: building a functional model in SystemC
  • 20. 20 Kahn Process Networks in SystemC  (Modules to structure design)  Functional processes  First-In-First-Out queues  Simulation engine [Figure: processes connected by FIFO queues]
  • 21. 21 Modules are used for structural partitioning of the functionality  Each module has its own class, derived from the sc_module class.  Every constructor of a module class shall have exactly one parameter of class sc_module_name.  It is good practice to make this name for an instance of the module the same as the C++ variable name through which the module is referenced.  A module can be hierarchical or contain processes. In the latter case, the SC_HAS_PROCESS(class_name) macro, with the module's class name as unquoted argument, is used to indicate that the module contains processes.
  • 22. 22 Example of a functional model of an adder
With macros:
SC_MODULE(adder) {
  // define ports
  // define processes, internal data, etc.
  SC_CTOR(adder) {
    // body of constructor;
    // process declaration, sensitivities, etc.
  }
};
Explicit:
class adder : public sc_module {
public:
  // define ports
  // define processes, internal data, etc.
  SC_HAS_PROCESS(adder);
  adder(sc_module_name name) : sc_module(name) {
    // body of constructor;
    // process declaration, sensitivities, etc.
  }
};
  • 23. 23 Ports are used to communicate with a FIFO channel  General port definition: sc_port<interface>  Predefined ports are: sc_fifo_in<T> and sc_fifo_out<T>.  sc_fifo_in<T> is derived from sc_port<sc_fifo_in_if<T>,0> with interface functions read(), nb_read(), and num_available().  sc_fifo_out<T> is derived from sc_port<sc_fifo_out_if<T>,0> with interface functions write(), nb_write(), and num_free().  blocking read and write interface functions (automatic synchronization with implicit wait() operations) int a = f1.read(); // read a token f1.write(a); // write a token  Inspecting queues int a = f1.num_available(); // number of tokens in a queue int a = f1.num_free(); // number of free places in a queue
  • 24. 24 Example of a functional model of an adder (continued) SC_MODULE(adder) { sc_fifo_in<int> a,b; sc_fifo_out<int> c; //define processes, internal data, etc. SC_CTOR(adder) { // body of constructor; // process declaration, sensitivities, etc. }; };
  • 25. 25 SC_THREAD processes are used to model functional processes  SC_THREAD processes run forever once started.  SC_THREAD processes can be suspended by means of the wait(event) function. In functional modeling the wait statements are hidden in the read() and write() functions to the queues.  Multiple processes per module are possible  Processes can also be dynamically created.
  • 26. 26 Example of a functional model of an adder (continued)
SC_MODULE(adder) {
  sc_fifo_in<int> a, b;
  sc_fifo_out<int> c;
  void compute() {
    while (true) {
      int valuea = a.read();
      int valueb = b.read();
      c.write(valuea + valueb);
    }
  }
  SC_CTOR(adder) {
    SC_THREAD(compute);
  }
};
  • 27. 27 Define the main program  The systemc library must be included in the main program:  #include <systemc.h>  In sc_main() the following actions are taken:  Instantiate channels with: • sc_fifo<T> (”name”, length); // default length 16 • e.g. sc_fifo<int> f1(”f1”,2);  Instantiate the modules.  Bind ports of modules to channels: • Positional • named.  Call sc_start() to start simulation and run until end of any activity.
  • 28. 28 Example of a functional model of an adder (continued)
int sc_main(int argc, char *argv[]) {
  // elaboration phase: everything before sc_start()
  sc_fifo<int> fifo_a, fifo_b, fifo_c; // channel instantiation
  … // instantiate signal generation and evaluation module
  adder my_adder("my_adder"); // module instantiation
  my_adder.a(fifo_a); // binding of port to channel
  my_adder.b(fifo_b);
  my_adder.c(fifo_c);
  … // other modules and test bench, which drive fifo_a and fifo_b.
  sc_start(); // start simulation
  return 0;
}
  • 29. 29 Modules can also be used to create hierarchy (superfunc contains func1 and func2, both instances of the module function, connected by the internal queue d)
SC_MODULE(superfunc) {
  // IO ports
  sc_fifo_in<float> in;
  sc_fifo_out<float> out;
  // internal queues
  sc_fifo<float> d;
  // internal modules
  function func1;
  function *func2;
  // Module constructor
  SC_CTOR(superfunc) : func1("func1") {
    func1.in(in);
    func1.out(d);
    func2 = new function("func2");
    func2->in(d);
    func2->out(out);
  }
};
  • 30. 30 Simulation engine  In an un-timed model, the simulator only advances in delta cycles:  If it is started to run for a finite amount of simulated time, it will never stop, because simulated time does not advance.  We therefore run it until no events are present: sc_start();  Ways of stopping the simulator:  Terminate a process (return from SC_THREAD): the simulator will stop due to the lack of events.  Call sc_stop() when a termination condition is fulfilled.
  • 31. 31 Functional modeling in SystemC  Introduction to design of digital embedded systems  SystemC introduction  SystemC functional model syntax  Exercise 1: building a functional model in SystemC
  • 32. 32 Goal of this exercise  use a simplified JPEG block diagram to practice functional modeling  develop a functional process that fits into a system  simulate a functional model  observe the overall behavior of a system
  • 33. 33 What is JPEG?  “JPEG” stands for “Joint Photographic Experts Group”  “JPEG” is a standard for color image compression  “JPEG” is widely used (e.g. on the WWW)  More information?  http://www.jpeg.org/
  • 34. 34 (Partial) JPEG: a simple block diagram DCT Quantize (+table) ZIGZAG SCAN RUN-LENGTH ENCODER IDCT Normalize (+table) ZIGZAG SCAN RUN-LENGTH DECODER Original Image Reconstructed Image JPEG-ENCODER JPEG-DECODER R2B B2R Parameters: width, height, #bits Parameters: width, height, #bits
  • 35. 35 2D Discrete Cosine Transform  Non-optimized equation  DCT can be separated in consecutive 1-D operations  There are many optimized DCT-algorithms available
Forward transform: $F(u,v) = \frac{1}{4} C(u) C(v) \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j) \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}$
Inverse transform: $f(i,j) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u) C(v) F(u,v) \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}$
where $C(l) = \frac{1}{\sqrt{2}}$ for $l = 0$ and $C(l) = 1$ for $l \neq 0$.
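The non-optimized forward equation can be sketched directly as four nested loops. This is only an illustrative reference version in plain C++; the names `Block`, `dct_c` and `dct8x8` are assumptions made here and are not taken from the exercise library.

```cpp
#include <array>
#include <cmath>

using Block = std::array<std::array<double, 8>, 8>;

// Normalization factor C(l): 1/sqrt(2) for l == 0, otherwise 1.
inline double dct_c(int l) { return l == 0 ? 1.0 / std::sqrt(2.0) : 1.0; }

// Direct (non-optimized) 8x8 forward DCT, evaluated term by term.
Block dct8x8(const Block& f) {
    const double pi = std::acos(-1.0);
    Block F{};
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int i = 0; i < 8; ++i) {
                for (int j = 0; j < 8; ++j) {
                    sum += f[i][j]
                         * std::cos((2 * i + 1) * u * pi / 16.0)
                         * std::cos((2 * j + 1) * v * pi / 16.0);
                }
            }
            F[u][v] = 0.25 * dct_c(u) * dct_c(v) * sum;
        }
    }
    return F;
}
```

For a constant block, all the energy ends up in the DC coefficient F[0][0], which is a quick sanity check before moving to a separable or otherwise optimized implementation.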
  • 36. 36 Quantization  Each DCT coefficient is divided by the coefficient amplitude that is just detectable by the human eye (table)  The result is rounded to an integer  This reduces the number of bits needed to represent the DCT coefficient  The quantization is the place where information of the image might be lost, resulting in lossy compression.
  • 38. 38 The coefficients are zigzag scanned (each entry is the position of that coefficient in the scanned sequence)
 0  1  5  6 14 15 27 28
 2  4  7 13 16 26 29 42
 3  8 12 17 25 30 41 43
 9 11 18 24 31 40 44 53
10 19 23 32 39 45 52 54
20 22 33 38 46 51 55 60
21 34 37 47 50 56 59 61
35 36 48 49 57 58 62 63
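The zigzag positions listed above follow a regular pattern: the block is walked one anti-diagonal at a time, alternating direction. A minimal sketch that regenerates the table is given below; the function name `make_zigzag_table` is hypothetical and the code is not the zz_enc module of the exercises.

```cpp
#include <algorithm>
#include <array>

using ZigzagTable = std::array<std::array<int, 8>, 8>;

// Returns table[r][c] = position of coefficient (r, c) in the zigzag sequence.
ZigzagTable make_zigzag_table() {
    ZigzagTable table{};
    int index = 0;
    for (int s = 0; s <= 14; ++s) {  // s = r + c selects one anti-diagonal
        const int r_min = std::max(0, s - 7);
        const int r_max = std::min(7, s);
        if (s % 2 == 1) {
            // odd diagonals run from top-right to bottom-left
            for (int r = r_min; r <= r_max; ++r) table[r][s - r] = index++;
        } else {
            // even diagonals run from bottom-left to top-right
            for (int r = r_max; r >= r_min; --r) table[r][s - r] = index++;
        }
    }
    return table;
}
```

A zigzag encoder would then write coefficient (r, c) to output position table[r][c]; the decoder applies the inverse mapping.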
  • 39. 39 (Simplified) Run-length coding  Send the DC value “as is”  Represent the high frequency data with (zero run-length, amplitude) combinations.  End the stream with EOB (= 63).  Example:  in: 79, 0, -2, -1, 3, -1, 0, 0, -1, 0, 0, 0, …  out: 79, 1,-2, 0,-1, 0, 3, 0,-1,2,-1, 63
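The simplified run-length scheme can be sketched as follows. This is an illustration in plain C++ under the rules stated on the slide (DC value as is, (zero run-length, amplitude) pairs, EOB = 63); the names `rl_encode_block` and `kEndOfBlock` are hypothetical, and the code is not the rl_enc module shipped with the exercises.

```cpp
#include <cstddef>
#include <vector>

constexpr int kEndOfBlock = 63;  // EOB marker from the slide

// Encodes one zigzag-scanned block: DC value, then (zero run, amplitude)
// pairs for every non-zero coefficient, then the EOB marker.
std::vector<int> rl_encode_block(const std::vector<int>& coeffs) {
    std::vector<int> out;
    if (coeffs.empty()) return out;
    out.push_back(coeffs[0]);  // DC value is sent "as is"
    int zero_run = 0;
    for (std::size_t k = 1; k < coeffs.size(); ++k) {
        if (coeffs[k] == 0) {
            ++zero_run;
        } else {
            out.push_back(zero_run);
            out.push_back(coeffs[k]);
            zero_run = 0;
        }
    }
    out.push_back(kEndOfBlock);  // trailing zeros are implied by EOB
    return out;
}
```

Applied to a 64-element block starting with 79, 0, -2, -1, 3, -1, 0, 0, -1 followed by zeros, this produces the output sequence of the slide example. The decoder to be written in the exercise performs the inverse expansion.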
  • 40. 40 How to start?  Download exercise files from http://www.icorsi.ch/  Follow installation instructions of exercises.  you will find:  In /exercises/exercise1/: main.cpp to start from  In /exercises/modules/: library with JPEG encoder modules {r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules {b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules {src,snk,test}.{h,cpp}  In /exercises/images/: test images  In /exercises/add2systemc additional functions (df_fork, fifo_stat)  Things to be done:  make rl_dec.h and rl_dec.cpp  complete the main.cpp with the modules.  Compile and execute the application.  Inspect the number of reads and writes in the fifos  Visualize resulting image  Test if you can launch the application in the debugger.  Optional: make a hierarchy for the encoder and decoder.
  • 41. 41 Using SystemC on Linux/Cygwin  Use g++ (I used version 4.5.3).  Make a workspace in Eclipse:  Add your source files to the project.  Add libmodules.a  Add libadd2systemc.a (for next exercises).  Add libsystemc.a  Put the right include paths and linker paths  Build your application from within Eclipse.  Execute your application from within Eclipse.  Exercise1.exe -i ../images/mountain.pgm -o result.pgm
  • 42. Model Based System Design Class 2: Fixed-point refinement Marc Engels e-mail: marc.engels@flandersmake.be
  • 43. 43 Fixed point refinement  Fixed word length optimization  Overflow and quantization  MSB determination  LSB determination  Fixed word length support in SystemC  Exercise 2: fixed point refinement of IDCT
  • 44. 44 Fixed point refinement is one of the steps in architectural design (from System Functionality to Dedicated Architecture)  Computations: operations → operators (resource allocation, scheduling)  Data: variables, arrays, floating point → memories, fixed point (memory allocation, address generation, word sizing)  Communication: point-to-point queues → busses, detailed protocol (bus allocation, introduce arbiters, include protocols)
  • 45. 45 Finite word lengths are a must for DSP applications  Floating-point: powerful, but expensive (storage & ops), e.g. 3 bytes (mantissa) + 1 byte (exponent)  Fixed-point: minimum area, low power, high speed, e.g. an 8-bit by 6-bit multiplication with a 14-bit result
  • 46. 46 How to model a fixed-point signal?  Total number of bits: WL  Integer bits: IWL  Fractional bits: WL-IWL  Value representation: 2's complement (i=-1) or unsigned (i=1) [Figure: bit weights i·2^3 (MSB), 2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3 (LSB)]
  • 47. 47 How do we quantize? [Figure: fxp versus flp transfer curves for four quantization modes: truncate (floor), round, magnitude truncate, ceil]
  • 48. 48 What happens on an overflow? [Figure: fxp versus flp transfer curves for wrap-around and for saturation at the max. value]
  • 51. 51 Fixed-point refinement is a complex optimization problem  Minimize overall cost: minimal word lengths, truncate and wrap-around  MSB determination (safe range): goal: avoid unwanted overflows; method: find min, max signal values; result: MSB position, value representation, overflow  LSB determination (quantization): goal: keep required precision; method: evaluate difference between flp and fxp behavior; result: LSB position, quantization
  • 52. 52 MSB determination can be based on range calculations  Put range (min, max) on inputs  Propagate range over the operators  This gives a safe (pessimistic) estimate  Example: x in [0,255]; d = x delayed by z-1, in [0,255]; m = 12 * x, in [0,3060]; y = m + d, in [0,3315]
  • 53. 53 Range propagation is a simple calculation
c=a+b: minc = mina+minb, maxc = maxa+maxb
c=a-b: minc = mina-maxb, maxc = maxa-minb
c=a*b: minc = MIN(mina*minb, mina*maxb, maxa*minb, maxa*maxb), maxc = MAX(mina*minb, mina*maxb, maxa*minb, maxa*maxb)
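The propagation rules translate almost literally into code. The sketch below is plain C++ for illustration; the type `Range` and the helper names `range_add`, `range_sub` and `range_mul` are assumptions introduced here, not part of SystemC or of the exercise support library.

```cpp
#include <algorithm>

// Interval [min, max] attached to a signal.
struct Range {
    double min;
    double max;
};

// c = a + b
Range range_add(Range a, Range b) { return {a.min + b.min, a.max + b.max}; }

// c = a - b
Range range_sub(Range a, Range b) { return {a.min - b.max, a.max - b.min}; }

// c = a * b: the extremes are among the four corner products.
Range range_mul(Range a, Range b) {
    const double p1 = a.min * b.min;
    const double p2 = a.min * b.max;
    const double p3 = a.max * b.min;
    const double p4 = a.max * b.max;
    return {std::min({p1, p2, p3, p4}), std::max({p1, p2, p3, p4})};
}
```

With an input range [0,255], `range_mul` with the constant range [12,12] gives [0,3060], and adding the input range once more with `range_add` gives [0,3315], matching the ranges annotated on the earlier range-calculation slide.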
  • 54. 54 Range calculations can get unstable with feedback * + a X(n) Y(n) z-1 F(n) sample n maxF minF value
  • 55. 55 * + d m x 12 y stimuli ?min, max q1 Collecting signal statistics from simulations is an alternative Perform simulation with realistic stimuli. Collect minimum and maximum value on each signal during the simulation This gives an optimistic, stimuli dependent estimate z-1
  • 56. 56 Combine both methods for accurate MSB determination
name      signal statistic (min, max, MSB1)    range propagation (min, max, MSB2)
signal1   -1.5, 1.6, 2                         -1.9, 1.9, 2
signal2   -1.3, 1.4, 2                         -2.1, 2.1, 3
signal3   -1.2, 1.2, 2                         -22.0, 22.0, 6
signal4   -1.2, 1.2, 2                         -∞, ∞, ∞
If MSB1 == MSB2: wrap-around (MSB1)
If MSB1 < MSB2: wrap-around (MSB2)
If MSB1 << MSB2: saturation (MSB1)
If MSB2 is ∞: saturation (MSB1)
  • 57. 57 Quantization effects can be modeled as additive noise  A quantizer Q with B bits is replaced by an adder: output = input + noise  Noise is approximated by a statistical model with the following assumptions:  the noise is uncorrelated to the input.  the noise is white.  the probability distribution is uniform.
  • 58. 58 Each quantization effect has mean and variance  Rounding with step ∆: $m_n = 0$ and $\sigma_n^2 = \frac{\Delta^2}{12}$  Truncation with step ∆: $m_n = -\frac{\Delta}{2}$ and $\sigma_n^2 = \frac{\Delta^2}{12}$  Magnitude truncation with step ∆: $m_n = 0$ and $\sigma_n^2 = \frac{\Delta^2}{3}$
  • 59. 59 This results in an equivalent linear network  The quantizers Q1 and Q2 are replaced by additive noise sources e1(t) and e2(t): $y(t) = (12\,x(t) + x(t-1)) + (12\,e_1(t) + e_2(t) + e_1(t-1))$
  • 60. 60 … but quantization is a non-linear operation  X(0) = 14, x(n) = 0 for n > 0  round to nearest integer [Figure: first-order feedback loop with coefficient -0.96, delay z-1 and quantizer Q (B bits); output Y(n) plotted with rounding and without rounding]
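The non-linear effect can be reproduced with a few lines of plain C++. The sketch assumes the loop computes Y(n) = X(n) - 0.96 · Y(n-1) with the quantizer inside the feedback path; this structure is inferred from the labels of the slide figure and is an assumption, as is the function name `simulate_feedback`.

```cpp
#include <cmath>
#include <vector>

// Simulates y(n) = x(n) - 0.96 * y(n-1) for x(0) = 14 and x(n) = 0 otherwise.
// With round_to_integer set, the result is rounded to the nearest integer
// before it is fed back (assumed position of the quantizer).
std::vector<double> simulate_feedback(bool round_to_integer, int num_samples) {
    std::vector<double> y;
    double previous = 0.0;
    for (int n = 0; n < num_samples; ++n) {
        const double x = (n == 0) ? 14.0 : 0.0;
        double value = x - 0.96 * previous;
        if (round_to_integer) value = std::round(value);
        y.push_back(value);
        previous = value;
    }
    return y;
}
```

Under these assumptions the unquantized output decays towards zero, while the rounded output gets stuck oscillating between +12 and -12. The additive-noise model cannot predict such a limit cycle, which is why the LSB determination relies on simulation.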
  • 61. 61 LSB determination is based on simulations All fixed-point simulate output ok yes no * + stimuli 12 x ym QQ * + 12 x ym com pare QQ z-1 z-1
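  • 62. 62 Signal to quantization noise ratio (SQNR)  $SQNR = 10 \log_{10} \left( \frac{m_s^2 + \sigma_s^2}{m_e^2 + \sigma_e^2} \right)$, where the signal x has mean $m_s$ and standard deviation $\sigma_s$, and the error e between x and its quantized version $x_Q$ has mean $m_e$ and standard deviation $\sigma_e$.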
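Since m² + σ² equals the mean of the squared samples, the SQNR can be computed from two sums collected during simulation. The following plain C++ sketch shows this; `sqnr_db` is a hypothetical helper and not one of the add2systemc support functions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SQNR in dB between a floating-point reference and its quantized version.
// Both vectors are assumed to have the same length; the common factor 1/N
// cancels in the ratio, so plain sums of squares are sufficient.
double sqnr_db(const std::vector<double>& reference,
               const std::vector<double>& quantized) {
    double signal_power = 0.0;  // sum of s^2, proportional to m_s^2 + sigma_s^2
    double noise_power = 0.0;   // sum of e^2, proportional to m_e^2 + sigma_e^2
    for (std::size_t n = 0; n < reference.size(); ++n) {
        const double e = reference[n] - quantized[n];
        signal_power += reference[n] * reference[n];
        noise_power += e * e;
    }
    return 10.0 * std::log10(signal_power / noise_power);
}
```

When a non-zero signal is not quantized at all, the noise power is zero and the function returns infinity on IEEE floating-point platforms.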
  • 63. 63 LSB selection optimizes cost and performance
SQNR per signal (pi, accu, pix, coeffs, block, temp block, blocki), cost, SNR and PSNR for each quantization set:
set  pi     accu    pix    coeffs  block  temp block  blocki  cost    SNR    PSNR
0    208    253     Inf    184     Inf    225         Inf     787968  27.64  31.49
1    45.5   59.76   Inf    174     Inf    Inf         Inf     759296  27.48  31.33
2    45.5   59.76   25.15  174     Inf    Inf         Inf     759296  22.66  26.51
3    45.5   59.76   38.77  174     Inf    Inf         Inf     759296  26.91  30.75
4    45.5   59.76   47.3   30.88   Inf    Inf         Inf     230912  27.35  31.19
5    45.5   59.8    47.3   30.88   29.38  Inf         Inf     230912  27.34  31.19
6    45.5   61.4    47.3   30.88   29.38  -1.93       Inf     41472   20.47  24.32
7    45.5   59.8    47.3   30.88   29.38  Inf         Inf     72192   27.34  31.19
8    45.5   60.23   47.3   30.88   29.38  16.73       Inf     56832   26.96  30.8
9    45.5   59.88   47.3   30.88   29.38  31.86       Inf     67072   27.31  31.16
  • 64. 64 Fixed point refinement  Fixed word length optimization  Overflow and quantization  MSB determination  LSB determination  Fixed word length support in SystemC  Exercise 2: fixed point refinement of IDCT
  • 65. 65 SystemC introduces a number of specific data types
Type: Description
sc_logic: 4 value {0,1,X,Z} single bit
sc_int: 1 to 64 bit signed integer
sc_uint: 1 to 64 bit unsigned integer
sc_bigint: Arbitrary size signed integer
sc_biguint: Arbitrary size unsigned integer
sc_bv: Arbitrary sized 2 value vector
sc_lv: Arbitrary sized 4 value vector
sc_fixed: Signed fixed point
sc_ufixed: Unsigned fixed point
sc_fix: Untemplated signed fixed point
sc_ufix: Untemplated unsigned fixed point
  • 66. 66 SystemC templated fixed-point types  Two fixed point templates  sc_fixed <wl, iwl, q_mode, o_mode, n_bits> x; // signed  sc_ufixed <wl, iwl, q_mode, o_mode, n_bits> y; // unsigned  Parameters:  wl = number of bits  iwl = number of integer bits  q_mode = quantization method (SC_RND / SC_TRN / SC_TRN_ZERO / ...)  o_mode = overflow method (SC_SAT / SC_WRAP / … )  n_bits = number of saturated bits in case of wrapping (default 0)  If quantization and overflow are not specified, the defaults (SC_TRN and SC_WRAP) are used
  • 67. 67 Fixed point lengths  sc_fixed <5, 7> v; bit pattern XXXXX00 (two implicit zero LSBs), range [ -64 , 60 ]  sc_fixed <5, 3> v; bit pattern XXX.XX, range [ -4 , 3.75 ]  sc_fixed <5, -2> v; bit pattern .SSXXXXX (two sign-extension bits), range [ -0.125 , 0.1171875 ]
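The representable range follows directly from wl and iwl: the LSB weight is 2^(iwl-wl), and a signed value spans from -2^(iwl-1) up to one LSB below 2^(iwl-1). The helper below is a plain C++ sketch of that arithmetic; `FixedRange` and `fixed_range` are names assumed for this illustration and are not SystemC API.

```cpp
#include <cmath>

struct FixedRange {
    double min;
    double max;
    double lsb;  // weight of the least significant bit
};

// Range of a fixed-point format with wl total bits and iwl integer bits
// (iwl may exceed wl or be negative).
FixedRange fixed_range(int wl, int iwl, bool is_signed) {
    const double lsb = std::ldexp(1.0, iwl - wl);  // 2^(iwl - wl)
    if (is_signed) {
        const double half = std::ldexp(1.0, iwl - 1);  // 2^(iwl - 1)
        return {-half, half - lsb, lsb};
    }
    return {0.0, std::ldexp(1.0, iwl) - lsb, lsb};
}
```

For example, `fixed_range(5, 7, true)` yields [-64, 60] with an LSB weight of 4, and `fixed_range(5, 3, true)` yields [-4, 3.75] with an LSB weight of 0.25.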
  • 68. 68 Quantization methods (sc_ufixed <5, 3, ...>: range [ 0 , 7.75 ], precision = 0.25)  sc_ufixed <5, 3, SC_RND> v; v = 3.1875 (011.0011) is rounded to 3.25 (011.01), quantization error 0.0625  sc_ufixed <5, 3, SC_TRN> v; v = 3.1875 (011.0011) is truncated to 3.0 (011.00), quantization error 0.1875
  • 69. 69 Overflow handling (sc_fixed <5, 5, ...>: range [ -16 , 15 ])  sc_fixed <5, 5, SC_RND,SC_SAT> v; v = 18; saturates to 15 (01111)  sc_fixed <5, 5, SC_RND,SC_WRAP> v; v = 18; wraps around to -14 (10010)
  • 70. 70 Fixed-point simulation  operations in floating-point  quantization and overflow handling during assignment  sc_fixed <4,3> a; sc_fixed <4,1> b; sc_fixed <4,2> c; a = 1.6; b = 0.9; c = a * b;  a: 1.6 is quantized to 1.5 (lsb precision 0.5)  b: 0.9 is quantized to 0.875 (lsb precision 0.125)  c: 1.5 * 0.875 = 1.3125 is quantized to 1.25 (lsb precision 0.25)
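The "compute in floating point, quantize on assignment" principle can be emulated without SystemC. The sketch below is plain C++ and only mimics two quantization modes and two overflow modes; `QMode`, `OMode` and `to_fixed` are hypothetical stand-ins, not the SystemC types, and word lengths below 63 bits are assumed.

```cpp
#include <cmath>

enum class QMode { Round, Truncate };  // stand-ins for SC_RND and SC_TRN
enum class OMode { Saturate, Wrap };   // stand-ins for SC_SAT and SC_WRAP

// Quantizes a floating-point value to wl total bits with iwl integer bits,
// as would happen on assignment to a fixed-point variable.
double to_fixed(double value, int wl, int iwl, bool is_signed,
                QMode q = QMode::Truncate, OMode o = OMode::Wrap) {
    const double lsb = std::ldexp(1.0, iwl - wl);
    const double scaled = value / lsb;
    // Round: half-way cases go towards plus infinity; Truncate: floor.
    long long code = static_cast<long long>(
        q == QMode::Round ? std::floor(scaled + 0.5) : std::floor(scaled));
    const long long span = 1LL << wl;  // number of representable codes
    const long long lo = is_signed ? -(span / 2) : 0;
    const long long hi = lo + span - 1;
    if (code < lo || code > hi) {
        if (o == OMode::Saturate) {
            code = (code < lo) ? lo : hi;
        } else {
            code = ((code - lo) % span + span) % span + lo;  // modulo 2^wl
        }
    }
    return static_cast<double>(code) * lsb;
}
```

With the default modes, `to_fixed(1.6, 4, 3, true)` gives 1.5 and `to_fixed(0.9, 4, 1, true)` gives 0.875; quantizing their floating-point product 1.3125 with `to_fixed(1.3125, 4, 2, true)` gives 1.25, as in the slide example.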
  • 71. 71 SystemC fixed point types with non-static arguments  Fixed point parameter values  sc_fxtype_params my_type(wl,iwl,q_mode,o_mode,n_bits);  x = my_type.wl(); // read a parameter  my_type.iwl(x-2); // set a parameter  Two non-static fixed point types  sc_fix x(my_type); // signed  sc_ufix y(my_type); // unsigned  For arrays, these types are used with a context  sc_fxtype_context my_context(my_type);  sc_fix z[64];  Remark: for fixed point simulations, include in every file  #define SC_INCLUDE_FX  #include <systemc.h>
  • 72. 72 Fixed point refinement  Fixed word length optimization  Overflow and quantization  MSB determination  LSB determination  Fixed word length support in SystemC  Exercise 2: fixed point refinement of IDCT
  • 73. 73 Goal of this exercise  Perform fixed point refinement for all the internal variables of the IDCT in the JPEG example  determine the MSB to avoid internal overflows without overflow logic.  determine the LSB to have no more than 0.5 dB degradation on the PSNR of the resulting image
  • 74. 74 How to start?  You will find:  In .../exercises/exercise2/: the functional model with a fixed point IDCT implementation; types-file datatypes_original.txt  In /exercises/modules/: library of JPEG-encoder modules {r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules {b2r,idct,normalize,zz_dec}.{h,cpp} and testbench modules {src,snk,test}.{h,cpp}  Special fixed point support functions of directory …/exercises/add2systemc/ are used  In /exercises/images/: test images  Things to do:  inspect the code to understand the behavior  Make the application  change datatypes.txt file  syntax: exercise2 -i <inputfile> -o <outputfile> -t <typefile>
  • 75. Model Based System Design Class 3: Communication Refinement Marc Engels e-mail: marc.engels@flandersmake.be
  • 76. 76 Communication refinement  Communication refinement  Communication refinement in SystemC  Exercise 3: communication refinement for the JPEG decoder
  • 77. 77 Communication refinement is one of the steps in architectural design (from System Functionality to Dedicated Architecture)  Computations: operations → operators (resource allocation, scheduling)  Data: variables, arrays, floating point → memories, fixed point (memory allocation, address generation, word sizing)  Communication: point-to-point queues → busses, detailed protocol (bus allocation, introduce arbiters, include protocols)
  • 78. 78 Functional models use FIFO communication  Queues guarantee consistent data passing  Implementation could become expensive for large sizes  communication must be optimized [Figure: Process1 and Process2 connected through (infinite) storage]
  • 79. 79 Many communications can be reduced to a single register  Output of functions is registered  No extra implementation cost  No storage for data  Consistency of communication needs to be guaranteed [Figure: Process1 and Process2 connected by a wire]
  • 80. 80 w=4w=4 Example of correct wired communication wire Process 1Process 1 Process 2Process 2 w=0w=0 w<4w<4 filt1 filt2 filt3 filt4 write() w++ read() op1 op2 op3 op4
  • 81. 81 Communication is perfectly aligned
Clock cycle   Process 1        Process 2
1             filt1            w=1
2             filt2            w=2
3             filt3            w=3
4             filt4 write()    w=4
5             filt1            read() op1
6             filt2            op2
7             filt3            op3
8             filt4 write()    op4
9             filt1            read() op1
10            filt2            op2
…             …                …
We have to guarantee the condition that every write() comes before a read()
  • 82. 82 Small changes to design can result in errors  Increase (decrease) the number of operations in process 1 (2): the same data will be consumed twice.  Decrease (increase) the number of operations in process 1 (2): data will be lost  If the number of initial wait operations in process 2 is too low, we will use undefined data  If the number of initial wait operations in process 2 is too high, we will lose the first data elements
  • 83. 83 Example of wrong wired communication wirefilt1 filt2 filt3 filt4 write() Process 1Process 1 Process 2Process 2 read() op1 op2
  • 84. 84 The example results in undesired behavior
Clock cycle   Process 1        Process 2
1             filt1            read() op1
2             filt2            op2
3             filt3            read() op1
4             filt4 write()    op2
5             filt1            read() op1
6             filt2            op2
7             filt3            read() op1
8             filt4 write()    op2
9             filt1            read() op1
10            filt2            op2
…             …                …
Process 2 reads before valid data has been written and reads the same data more than once. Adapt cycle budget or introduce handshake protocol
  • 85. 85 Simple handshake protocol is more robust  The flag “a” (ask) indicates that the receiver is ready to read data in the next cycle.  The flag “r” (ready) indicates that data has been written  Safe communication requires at least two cycles.
  • 86. 86 !r r a Simple handshake protocol is more robust Process 2Process 2 filt1 r=0 filt2 filt3 if (a==1){ filt4 write() r=1} Process 1Process 1 !a a if (r==1) { read() op1 a=0} op2 a=1 r a=1 r=0
  • 87. 87 … and effectively synchronizes the communication
Clock cycle   Process 1 (sender)     Process 2 (receiver)
1             r=0 filt1              a=1
2             r=0 filt2              a=1
3             r=0 filt3              a=1
4             r=1 filt4 write()      a=1
5             r=0 filt1              a=0 read() op1
6             r=0 filt2              a=1 op2
7             r=0 filt3              a=1
8             r=1 filt4 write()      a=1
9             r=0 filt1              a=0 read() op1
10            r=0 filt2              a=1 op2
…             …                      …
  • 88. 88 r a … also when receiver is slower than transmitter Process 1Process 1 Process 2Process 2 filt1 r=0 If(a==1){ filt2 write() r=1} !a !r If (r==1){ read() op1 a=0 } op2 r op3 a=1 a=1 r=0 a
  • 89. 89 … but introduces then one extra wait cycle at receiver
Cycle   Process 1 (sender)     Process 2 (receiver)
1       r=0 filt1              a=1
2       r=1 filt2 write()      a=1
3       r=0 filt1              a=0 read() op1
4       r=0                    a=0 op2
5       r=0                    a=1 op3
6       r=1 filt2 write()      a=1
7       r=0 filt1              a=0 read() op1
8       r=0                    a=0 op2
9       r=0                    a=1 op3
10      r=0 filt2 write()      a=1
…       …                      …
The extra wait cycle can be avoided by already putting a=1 during op2
  • 90. 90 Most general protocol: 4-phase handshake protocol Ack Ack Ack Req Req Req Req Ack Req Ack Req Req Ack Execute Ack Data Ack Req=1 Get Data Req=0 Ack=0 Put Data Ack=1 Ack=0
  • 91. 91 Multiple variations on these handshake protocols exist  Instead of signal levels, the protocol can be based on signal transitions.  The protocol can be simplified if both systems run on the same clock.  Protocols can be simplified if one knows that the receiver or the transmitter is fastest.  Synchronization can be performed on the basis of a block:  Set-up communication for first element of a block  Next, communicate every cycle  Some protocols are based on typical FIFO signals: full and empty.
  • 92. 92 In some cases buffered communication is required  Queue size can be determined by monitoring the maximum number of elements in a queue during simulation.
Cycle   process1      process2
1       write(Q1)
2       write(Q1)
3       write(Q2)
4                     read(Q2)
5                     read(Q1)
6                     read(Q1)
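Monitoring the maximum occupancy only requires a small wrapper around a queue. The class below is a plain C++ sketch of that idea; `MonitoredQueue` is a hypothetical name and makes no claim about how the support functions of the exercises are implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>

// FIFO that records the largest number of elements it ever held.
template <typename T>
class MonitoredQueue {
public:
    void write(const T& value) {
        queue_.push(value);
        max_occupancy_ = std::max(max_occupancy_, queue_.size());
    }

    // Precondition: the queue is not empty.
    T read() {
        T value = queue_.front();
        queue_.pop();
        return value;
    }

    std::size_t max_occupancy() const { return max_occupancy_; }

private:
    std::queue<T> queue_;
    std::size_t max_occupancy_ = 0;
};
```

Replaying the schedule of the slide (two writes to Q1 and one to Q2, followed by the three reads) reports a maximum occupancy of 2 for Q1 and 1 for Q2, which are the minimum buffer sizes for that particular stimulus.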
  • 93. 93 r a Queues must be introduced explicitly in hardware FIFO process size N fsm Wired handshake protocol Process1 Process2 r a
  • 94. 94 Process1Process1 Process2Process2 Several communications can also be multiplexed on a bus Process3Process3 Process4Process4 Process1Process1 Process3Process3 Process2Process2 Process4Process4 busbus arbiterarbiter r a a r r a a r Bus and arbiter classes can be reused!
  • 95. 95 Communication refinement results in behavioral model  Model that defines the relative ordering of input and outputs  A clock signal is used for ordering  Pins are accurate to the final implementation  Internal resources are not mapped on clock cycles (scheduling) and functional units (resource binding)
  • 96. 96 Communication refinement  Communication refinement  Communication refinement in SystemC  Exercise 3: communication refinement for the JPEG decoder
  • 97. 97 In SystemC behavioral models use (clocked) threads  Modeled with thread processes SC_THREAD or with clocked thread processes SC_CTHREAD  Every module has a clock input:  sc_in_clk clk;  The SC_THREAD process is made statically sensitive to a clock edge  sensitive << clk.pos();  To separate clock cycles wait() statements are used.  A synchronous or asynchronous reset signal can be specified:  reset_signal_is(reset, true);  async_reset_signal_is(reset, true);  Simulation must be run for a finite time (or will not stop!) or halted explicitly.
  • 98. 98 Behavioral models communicate via standard signals  All inputs and outputs are standard signals  Define signals with:  sc_signal<T> a;  Predefined ports for sc_signal<T> channels:  sc_in<T> with interface function read() or assignment operator.  sc_out<T> with interface function write() or assignment operator.  sc_inout<T> that combines both interface functions.
  • 99. 99 Clocks in SystemC  Create clock  sc_clock clock1 ( "clock_label", period, time_unit, duty_ratio, offset, offset_time_unit, first_value );  sc_clock clock2 ( "clock_label", period, time_unit, duty_ratio);  sc_clock clock3 ( "clock_label", period, time_unit);  Clock Binding • f1.clk( clock1 );  Clocks are typically defined in sc_main();  Example: sc_clock clock1 ("clock1", 20, SC_NS, 0.5, 2, SC_NS, true); toggles at 2, 12, 22, 32, 42 ns, starting with a rising edge
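The edge times of the example clock can be checked by hand from period, duty ratio and offset. The plain C++ helper below is only a sketch of that arithmetic, assuming the clock starts with a rising edge; `clock_edge_times` is a hypothetical function and does not use or emulate the SystemC kernel.

```cpp
#include <vector>

// Returns the first num_edges toggle times of a clock that starts with a
// rising edge at start_time (all times in the same unit, e.g. ns).
std::vector<double> clock_edge_times(double period, double duty_ratio,
                                     double start_time, int num_edges) {
    std::vector<double> times;
    double time = start_time;
    bool high_phase = true;  // the first edge is a rising edge
    for (int k = 0; k < num_edges; ++k) {
        times.push_back(time);
        // stay high for duty_ratio * period, low for the remainder
        time += high_phase ? period * duty_ratio : period * (1.0 - duty_ratio);
        high_phase = !high_phase;
    }
    return times;
}
```

For a period of 20, a duty ratio of 0.5 and an offset of 2, the first five toggle times are 2, 12, 22, 32 and 42, which are the values annotated on the slide.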
  • 100. 100 Example: summing 3 values on an input
SC_MODULE(sum3) {
  sc_in_clk CLOCK;
  sc_in<bool> RESET;
  sc_in<unsigned> A;
  sc_out<unsigned> D;
  void compute();
  SC_CTOR(sum3) {
    SC_CTHREAD(compute, CLOCK.pos());
    reset_signal_is(RESET, true);
  }
};
void sum3::compute() {
  unsigned tmp;
  // reset section
  while (true) { // main loop
    tmp = A.read();
    wait(); // end first I/O cycle
    tmp += A.read();
    wait(); // end second I/O cycle
    tmp += A.read();
    D.write(tmp);
    wait(); // end third I/O cycle
  }
}
  • 101. 101 Gradual Communication refinement (1/2) Process1Process1 Process2Process2 queue Process1Process1 Process2Process2 r a Behavioral_process1 Behavioral_process2 clock Converters Q1 Q2
  • 102. 102 Gradual Communication refinement (2/2) Process1Process1 BehavioralBehavioral Process2Process2 C1C1 r a Behavioral_process1 clock Q1 BehavioralBehavioral Process2Process2r a clock BehavioralBehavioral Process1Process1
  • 103. 103 Converter SystemC code
// FIFO to handshake protocol converter
template <class T> SC_MODULE(FF2P) {
  sc_fifo_in<T> input;
  sc_out<T> output;
  sc_in<bool> ask;
  sc_out<bool> ready;
  sc_in_clk clk;
  SC_CTOR(FF2P) {
    SC_THREAD(process);
    sensitive << clk.pos();
  }
  void process() {
    T value;
    enum ctrl_state {READINPUT, WRITEOUTPUT};
    ctrl_state state;
    // reset cycle
    ready.write(false);
    state = READINPUT;
    wait();
    while (true) {
      if (state == READINPUT) {
        ready.write(false);
        value = input.read();
        state = WRITEOUTPUT;
      } else {
        if (ask.read() == true) {
          output.write(value);
          ready.write(true);
          state = READINPUT;
        } else {
          ready.write(false);
          state = WRITEOUTPUT;
        }
      }
      wait();
    }
  }
};
// handshake protocol to FIFO converter
template <class T> SC_MODULE(P2FF) {
  sc_fifo_out<T> output;
  sc_in<T> input;
  sc_in<bool> ready;
  sc_out<bool> ask;
  sc_in_clk clk;
  SC_CTOR(P2FF) {
    SC_THREAD(process);
    sensitive << clk.pos();
  }
  void process() {
    T value;
    enum ctrl_state {READINPUT, WRITEOUTPUT};
    ctrl_state state;
    // reset cycle
    ask.write(true);
    state = READINPUT;
    wait();
    while (true) {
      if (state == READINPUT) {
        if (ready.read() == true) {
          value = input.read();
          ask.write(false);
          output.write(value);
          state = WRITEOUTPUT;
        } else {
          ask.write(true);
          state = READINPUT;
        }
      } else {
        ask.write(true);
        state = READINPUT;
      }
      wait();
    }
  }
};
  • 104. 104 Communication refinement  Communication refinement  Communication refinement in SystemC  Exercise 3: communication refinement for the JPEG decoder
  • 105. 105 Exercise 3: communication refinement for the JPEG decoder  Goal: Replace the FIFO between the run-length encoder and decoder by a handshake protocol  You will find:  In /exercises/exercise3/: solution of exercise2  In /exercises/modules/: JPEG-encoder modules {r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules {b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules {src,snk,test}.{h,cpp}  In /exercises/images/: test images  In /exercises/add2systemc: FIFO to protocol conversion functions {FF2P, P2FF}.h  Things to be done:  Introduce a handshake protocol between rl_enc and rl_dec.  introduce refined versions of rl_dec in jpeg_dec.h and main.cpp.  simulate and verify correct operation.
  • 106. Model Based System Design Class 4: computation refinement Marc Engels e-mail: marc.engels@flandersmake.be
  • 107. 107 Computation refinement in SystemC  Computation refinement  Computation refinement in SystemC  Exercise 4: computation refinement of a JPEG decoder
  • 108. 108 RTL refinement is the 3rd step in architectural design (from System Functionality to System Architecture)  Computations: operations → operators (resource allocation, scheduling)  Data: variables, arrays, floating point → memories, fixed point (memory allocation, address generation, word sizing)  Communication: point-to-point queues → busses, detailed protocol (bus allocation, introduce arbiters, include protocols)
  • 109. 109 For synthesis all blocks need to be transformed to RTL  Transformation is a gradual refinement process  switch a behavioral block with an RTL block  verify by system simulation [Figure: a SYSTEM of blocks func1, beh2/RTL2, beh3/RTL3 and beh4/RTL4, connected by signals S1, S2, S3, inside a TESTBENCH]
  • 110. 110 Behavioral model can be represented by an FSM Process_behavioral{// SC_CTHREAD ask.write(TRUE); while (ready.read() == FALSE) {wait();} wait(); while(TRUE) { ask.write(FALSE); x = input.read(); wait(); d = x * b1; y = d * b2; output.write(y); ask.write(TRUE); while (ready.read() == FALSE) {wait();} wait(); } } = !ready ready !ready ready ask=1 ask=0 x=input ask=1 d = x * b1 y = d * b2 output = y
  • 111. 111 Behavioral to RTL: scheduling of operations in FSM !ready ready !ready ready ready !ready ready !ready ask=1 ask=0 x=input ask=1 d = x * b1 y = d * b2 output = y !ready!ready ask=1 ask=0 x=input d=x*b1 ask=1 y = d * b2 output = y
  • 112. 112 Rescheduled FSM is represented in RTL code = ready !ready ready !ready!ready ask=1 ask=0 x=input d=x*b1 ask=1 y = d * b2 output = y Process_RTL{// SC_CTHREAD ask.write(TRUE); while (ready.read() == FALSE) {wait();} wait(); while(TRUE) { ask.write(FALSE); x = input.read(); d = x * b1; wait(); ask.write(TRUE); y = d * b2; output.write(y); while (ready.read() == FALSE) {wait();} wait(); } }
  • 113. 113 RTL description corresponds to a datapath possiblepossible mappingmapping ** b1b1 b2b2 xx yy dd 11 00 askask RT description introduces synthesis decisions: register inference resource sharing parallelism readyready D QD Q D QD Q D QD Q Process_RTL{// SC_CTHREAD ask.write(TRUE); while (ready.read() == FALSE) {wait();} wait(); while(TRUE) { ask.write(FALSE); x = input.read(); d = x * b1; wait(); ask.write(TRUE); y = d * b2; output.write(y); while (ready.read() == FALSE) {wait();} wait(); } }
  • 114. 114 ready … and a controller StateState registerregister OutputOutput functionfunction control: steers the register transfers in datapathcontrol: steers the register transfers in datapath Next-stateNext-state functionfunction DatapathDatapath ControllerController inputsinputs outputsoutputs controlcontrol statusstatus ins0 ins1 ins2 C0 c1 c2
  • 115. 115 Critical path of combinatorial logic is crucial [Figure: a clocked process in which combinatorial logic (multiplexers, adders, multipliers, etc.) computes calc between in and out; timing diagram of clock, in, calc and out with the critical path marked]
  • 116. 116 Pipelining reduces the critical path [Figure: area versus Data Insertion Interval (DII) for chains of adders; variants labeled non-pipelined, word pipelined and bit pipelined; annotations: DII = critical path, DII = operator delay, DII = operator delay/2 (word split into lsb and msb parts), 1-bit operator delay]
  • 117. 117 Multiplexing reduces the area of the solution [Figure: area versus Data Insertion Interval (DII); non-pipelined adders: DII = critical path; muxed adder: DII = 2 x critical path; other labels: processor architecture (e.g. VLIW), DSP]
  • 118. 118 Real embedded systems show architectural variability. E.g. Robot Vision System [Figure: processing chain with object, CCD camera, line delay, Sobel operator, edge detector, feature extractor, pattern recognizer and robot controller; architecture styles labeled array type, muxed DP's and µcoded processor, under global control and communication; component labels: modular array of processing elements, hardwired control, off-chip memory, memories, data path, control, µ-code ROM, PC logic, µ-code control, RAM, programmable functional units]
  • 119. 119 Area can be estimated at a high level (Source: Gajski)
        Total circuit area = Control Unit (CU) + Datapath (DP); each part is a function of:
        Control Unit (CU):
          State_reg: # states
          logic: # states, # ctrl_lines, # states each ctrl_line is active
        Datapath (DP):
          Storage: # bits and # words of each storage
          func_units (FU): # bits and type of each FU
          Muxes: # sources of muxes
          wires: # DP connections, # DP components
  • 120. 120 Standard cell data can be used to derive parameters (Source: www.vlsitechnology.org)
        type               name        width
        2 input MUX        mxi2v0x1    3.08
        2 input NMUX       mxn2v0x1    3.52
        2 input AND        an2v4x2     2.20
        3 input AND        an3v4x2     3.08
        4 input AND        an4v4x2     3.52
        2-bit half adders  ha2v0x2     5.28
        Q flip-flop        dfnt1v0x2   7.92
        …                  …           …
  • 121. 121 Storage: Registers vs. memories
        Registers:  Inferred by synthesis.  Larger size per storage bit.  No overhead.  Fast & parallel.  Best < 1 kbits storage
        Memories:  Not synthesized, but created by memory generators.  Smaller size per storage bit.  Fixed overhead.  Slow & serial.  Best > 1 kbits storage
  • 122. 122 Computation refinement in SystemC  Computation refinement  Computation refinement in SystemC  Exercise 4: computation refinement of a JPEG decoder
  • 123. 123 RTL design is modeled with modules and processes An sc_module is an identifiable hardware unit. A module can contain multiple processes that run in parallel. Signals are used to communicate between (executions of) processes. Variables are used inside a single execution of a process.
  • 124. 124 Restrictions (1/2) in SystemC Synthesizable Subset (draft 1.3)  Modules  Exactly one constructor.  Processes  Only SC_CTHREAD and SC_METHOD are supported; SC_THREAD is not supported.  In an SC_CTHREAD there must be a wait() statement before the infinite loop or as first statement in this loop.  At most one clock signal is allowed per process.  The reset behavior is specified in the process, not in the constructor of the modules.  Between two clock events, at most one assignment to a signal is supported.  Processes communicate through signals, not shared variables.
  • 125. 125 Restrictions (2/2) in SystemC Synthesizable Subset (draft 1.3)  Datatypes:  No floating point.  Char is implemented as signed char, all integer types are 2’s complement.  Pointers are not supported.  Untemplated fixed point types are not supported.  No division operator for fixed point types.  No global variables but global constants are OK.  Functions:  No new(), delete() and sizeof() functions.  Destructors have no effect.  Exception handling is not supported.
  • 126. 126 Example: relation between synthesizable SystemC and VHDL
    SystemC:
        #include "systemc.h"
        SC_MODULE(dff) {
          sc_in<bool> din;
          sc_in_clk clock;
          sc_out<bool> dout;
          void doit(); // Member function
          SC_CTOR(dff) {
            SC_CTHREAD(doit, clock.pos());
          }
        };
        void dff::doit() { // Process body
          while (true) {
            wait();
            dout.write(din.read());
          }
        }
    VHDL:
        entity dff is
          port ( din, clock : in bit;
                 dout : out bit );
        end dff;
        architecture dff of dff is
        begin
          doit : process(clock) -- Sensitivity List
          begin
            if (clock'event and clock='1') then
              dout <= din;
            end if;
          end process;
        end dff;
  • 127. 127 Signals for communication between processes  Declaration  Scalar Signal: sc_signal<sc_uint<32> > a;  Vector Signal: sc_signal<sc_logic> a[32];  Signals use request-update mechanism: write takes effect after a delta-cycle  When you assign a value to a signal or port, the value on the right side is not transferred to the left side until the process halts. This means that the signal value as seen by other processes is not updated immediately; the update is deferred.  When you assign a value to a variable, the value on the right side is immediately transferred to the left side of the assignment statement.  SystemC supports resolved Ports and Signals  Multi-Valued Logic type: 0, 1, Z, X  Allow multiple drivers
  • 128. 128 Signals can infer registers
        w = x
        y1 = w * 10
        z = x          // writing at the end of cycle
        wait()
        y2 = z * 10    // reading at the beginning of cycle
    Simulation (one column per clock cycle, x = undefined):
        x    1   2   3   x
        y1   10  20  30  x
        z    x   1   2   3
        y2   x   10  20  30
    Synthesis: [Figure: x drives w and a multiplication by 10 that produces y1; z is a D-Q register on x, clocked by clock, followed by a multiplication by 10 that produces y2]
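The one-cycle lag between y1 and y2 follows directly from the request-update mechanism. The sketch below mimics that mechanism in plain C++ (it is not the SystemC `sc_signal` class): `write()` only stores a request, and `update()` makes it visible, here called once per loop iteration to stand for the clock edge. The names `Signal`, `Trace` and `run` are illustrative assumptions, and the undefined start value of z is modeled as 0.

```cpp
#include <vector>

// Hypothetical stand-in for a signal with request-update semantics.
template <typename T>
class Signal {
public:
    explicit Signal(T init) : current_(init), next_(init) {}
    T read() const { return current_; }    // value visible to readers
    void write(T v) { next_ = v; }         // request: not visible yet
    void update() { current_ = next_; }    // update: request becomes visible
private:
    T current_;
    T next_;
};

struct Trace {
    std::vector<int> y1;  // computed through the variable w
    std::vector<int> y2;  // computed through the signal z
};

// Runs one iteration per clock cycle over the input samples xs.
Trace run(const std::vector<int>& xs) {
    Signal<int> z(0);  // assumption: undefined start value modeled as 0
    Trace t;
    for (int x : xs) {
        int y2 = z.read() * 10;  // sees the value written in the previous cycle
        int w = x;               // variable: assignment is immediate
        int y1 = w * 10;
        z.write(x);              // signal: takes effect at the update
        t.y1.push_back(y1);
        t.y2.push_back(y2);
        z.update();              // stands for the clock edge
    }
    return t;
}
```

With inputs 1, 2, 3 the y1 trace is 10, 20, 30, while y2 shows 10 and 20 one cycle later, which is why z needs a register in hardware and w does not.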
  • 129. 129 Random Access Memory is modeled with a behavioral model
        // ram_asyn.h - asynchronous RAM
        #include "systemc.h"
        SC_MODULE(ram_asyn) {
          sc_in<sc_uint<6> > addr;
          sc_in<bool> rwb;
          sc_in<int> datain;
          sc_out<int> dout;
          int memdata[64]; // local memory storage
          void ramaction();
          SC_CTOR(ram_asyn) {
            SC_METHOD(ramaction);
            sensitive << addr << datain << rwb;
            for (int i = 0; i < 64; i++) { memdata[i] = 0; }
          }
        };
    [Figure: Asynchronous RAM (64) with ports address, datain, rwb and dataout]
  • 130. 130 SystemC has a 4-step simulation engine
        1: Initialize
        2: Iterative execution of functional, behavioral & RTL processes until no activity
        3: Update primitive channels
        4: Go back to 2
    [Figure: processes Functional1, behav2 and RT3 connected through channels; labels: q1, s2, q3, q4, s1, s3, s4, P2FF, FF2P]
  • 131. 131 Measuring performance  const sc_time& sc_time_stamp(): returns the current time during simulation.  Following functions are defined for sc_time:  double to_seconds(): converts the time into seconds  void print(): prints the time on the screen  If the clock period is known, the number of clock cycles can be calculated.  Throughput ≥ Datarate/Simulation_time
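The cycle count and throughput mentioned above are simple ratios. The sketch below shows the arithmetic in plain C++; in a SystemC run the elapsed time would come from `sc_time_stamp().to_seconds()`. The helper names `clock_cycles` and `throughput` are hypothetical, and the throughput is taken here as the number of processed items divided by the simulated time, which is an assumption about how the bound on the slide is evaluated.

```cpp
#include <cmath>

// Number of clock cycles that fit in the elapsed simulated time.
long long clock_cycles(double elapsed_s, double clock_period_s) {
    return std::llround(elapsed_s / clock_period_s);
}

// Items (e.g. pixels or blocks) processed per second of simulated time.
double throughput(double items, double elapsed_s) {
    return items / elapsed_s;
}
```

For example, 1 ms of simulated time with a 10 ns clock period corresponds to 100000 clock cycles.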
  • 132. 132 Dump signals for wave plotting
        sc_signal< sc_int<32> > signal1;
        sc_signal<bool> signal2;
        sc_trace_file *tracefile;
        tracefile = sc_create_vcd_trace_file(tracefilename);
        sc_trace(tracefile, signal1, "signal1");
        sc_trace(tracefile, signal2, "signal2");
        sc_close_vcd_trace_file(tracefile);
  • 133. 133 Computation refinement in SystemC  Computation refinement  Computation refinement in SystemC  Exercise 4: computation refinement of a JPEG decoder
  • 134. 134 How to start?  Goal: refine run-length decoder in RTL model.  You will find:  In /exercises/exercise4/: solution of exercise3  In /exercises/modules/: JPEG encoder modules {r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules {b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules {src,snk,test}.{h,cpp}  In /exercises/images/: test images  In /exercises/add2systemc: behavioral RAM models.  Things to be done:  Make RTL model of run-length decoder.  Draw FSM of the RTL model.  Introduce the RTL model in jpeg_dec.h and integrate in main.cpp.  Simulate and verify correct operation with gtkwave viewer.  Estimate the needed hardware for this RTL model.

Editor's Notes

  1. Welcome to the second part of the course on specification languages.
  2. The course on specification languages consists of 3 parts: First, an extensive overview was given of various specification models, ranging from dataflow to finite state machines. In this second part, I will focus on the use of a subset of these models for the architectural design of digital embedded systems. The main goal of this part of the course is to learn how the specification models of part 1 can be used for the architectural design of embedded systems. For this purpose, we will rely on SystemC version 2.3.2, which implements the IEEE 1666-2011 language reference manual (published by the IEEE in January 2012) and for which the simulation library was released in October 2017. SystemC is a class library on top of C++. As such, all object oriented (OO) constructs of C++ can be used in the design of an architecture. These OO techniques can bring the same re-use benefits to architectural design that they have brought to software design. Finally, you will apply the acquired skills in a small, but realistic, project.
  3. As prerequisites for this course, I expect the following: Quite obviously, you should have a good understanding of the first part of this course, and particularly the presented models. Next, as SystemC is based on C++, a decent knowledge of this programming language is also required. Basic OO concepts like classes, inheritance and templates should be familiar to you. If not, review the C++ tutorial at www.cplusplus.com. In general, a structured methodology for developing and debugging programs is essential for executing the exercises and the project. Familiarity with an Integrated Development Environment (IDE) like Eclipse is a benefit. When writing SystemC code, you should be able to describe the hardware that will be generated from this code. Therefore a basic knowledge of register transfer level (RTL) description of synchronous digital circuits is necessary. An RTL description of a circuit consists of registers (e.g. D flip-flops) and combinatorial logic. The registers synchronize the operation of the circuit to the clock signal while the combinatorial logic describes the calculations performed by the circuit. RTL descriptions are used in hardware description languages like Verilog or VHDL. For part 2, the following material is available: The slides with notes, which can be found on icorsi (icorsi.ch). The SystemC language reference manual, which can be downloaded from the IEEE standards website (http://standards.ieee.org/getieee/1666/download/1666-2011.pdf).
  4. In this first class we will focus on the functional modeling of a digital embedded system. A functional model will describe the functionality of the embedded system, independent of the platform or architecture on which this functionality is executed. Therefore it is sometimes called a platform independent model (PIM). In this class we will focus on the data flow modeling paradigm for describing the functional model. At the end of the class, you will be able to program a functional model of a digital embedded system in SystemC.
  5. This class covers 4 topics: a general introduction to the design of digital embedded systems; the role of SystemC in the design of digital embedded systems; the syntax of the SystemC language for functional modeling (with the dataflow paradigm); and finally an exercise to build a functional model in SystemC. Let's start with the general introduction.
  6. Consumer as well as professional equipment is becoming increasingly smarter. A few examples: Your car is being converted into a multimedia theater. The value of the electronics in a car has increased consistently, resulting in almost 100 electronic units in a luxury model. Recently a lot of new safety functions (ABS, ESP, parking sensors, anti-collision systems, etc.) have been introduced. It is hard to find a mobile phone with which you can only make a call. Taking pictures, playing music, surfing the web, reading e-mail, etc. are also features of a state-of-the-art mobile phone. Most phones even have GPS functionality and run office software. Gaming becomes more interactive (e.g. Nintendo Wii, Microsoft Kinect) and mobile. Photography has dramatically changed over the last decade: it has become fully digital. Digital cameras are currently extended with features like wireless connections, automatic picture enhancements (e.g. red eye correction), etc. The era of service robots is coming. Robots to vacuum clean the house, mow the lawn in the garden, etc. are already on the market.
  7. The evolution towards smart products is not limited to consumer devices. We observe, for instance, the same trends in production machines. Harvesters have a growing number of functions for quality control, obstacle detection and precision farming. To realize these smart functions, the electronic control units become increasingly complex. Especially the software content is growing very fast (20% average growth per year). The long term vision for combine harvesters is to evolve towards fully autonomous machines that can work without any operator on board and just receive a command of the job to be done. Many more smart functionalities will be needed to reach this goal. In compressors, functions are introduced to optimize the energy consumption based on the instantaneous demand for air. Weaving looms can adapt their speed to the quality (strength) of the textile fibers. Professional washing machines automatically detect the load, hardness of the water, etc. and adapt their washing program.
  8. To realize these smart functionalities, electronic systems and software have to be embedded in consumer and professional devices. Such embedded systems minimize power, cost and size, and hence work on a minimal platform. For instance, 8-bit and 16-bit processors are still extensively used in embedded devices. They must be robust. For instance, a mobile phone must survive rough treatment. A car has an operation life of 7000 hours and some machines are expected to work up to 100000 hours. Over their lifetime, products are increasingly expected to evolve. Also more variants are designed from the same platform. A typical example is the customization of mobile phones. And the product needs to be on the market before the Christmas shopping. In many cases the system even has safety-critical functionality, think about anti-lock braking systems (ABS) or emergency buttons, which require a guarantee on the reliability of the system. For the development of such safety-critical functions, specific standards have to be followed. The main distinctive characteristic of an embedded system, however, is that it has to interact with the real world, necessitating real-time behavior.
  9. A system is said to be real-time if the correctness of an operation depends not only upon its logical correctness, but also upon the time in which it is performed. In a hard real-time system, the completion of an operation after its deadline is considered useless - ultimately, this may lead to a critical failure of the complete system. A soft real-time system on the other hand will tolerate such lateness, and may respond with decreased service quality (e.g. bank terminal). Depending on the inputs, two types of hard real-time constraints are distinguished in embedded systems: Signal processing systems process inputs that arrive at regular intervals and the system must be ready after a fixed time to process the next input. Signal processing systems typically interact with their environment through sensors (observe the environment) and actuators (control/influence the environment). Sensors are components that translate non-electrical quantities (e.g. temperature, pressure, ...) into electrical quantities (voltage, current). Since most observable quantities are analog signals, sensors usually produce analog electrical signals. In most cases signal conditioning is required to compensate the non-idealities in the sensors and to prepare the sensor signals for the actual signal processing. Because the signal processing is done digitally, an Analog to Digital Converter (ADC) puts the sensor signal in the right format. Actuators perform the reverse operation of sensors: they translate electrical quantities into non-electrical quantities. Also actuators need analog signals and therefore a Digital to Analog Converter (DAC) is needed. Because actuators need to influence the physical environment they often require high power, hence power electronics circuits are introduced to condition the control signal. When the input is an event and the system has to react within a certain time, this is called a reactive system. 
Examples of reactive parts of an embedded system are the interaction with the user or responses to external alarms. As shown on the picture, embedded systems often combine various types of real-time behavior.
  10. An embedded system can be separated into a digital part and an analog part. The analog part contains for instance signal conditioning, ADCs and DACs. In high-frequency applications, like radios or radars, it will be a large part of the embedded system. Also sensors and actuators are part of the embedded system. Traditionally these were discrete external components, but recently they are increasingly integrated, when power permits, in a package and even on chips. The digital part is where the actual “intelligence” is. A growing part of the functionality of embedded systems is implemented in software called “embedded software”. This offers the advantage of increased flexibility (functionality can be changed after production). As a consequence, the digital part of an embedded system consists of 3 components: Programmable processor cores. They can be general-purpose micro-processors or more specialized digital signal processors (DSPs). Volatile and non-volatile memories. Configurable (through parameters) dedicated logic. The digital part can be implemented as a PCB with discrete components, a multi-chip package, an FPGA or a fully integrated chip. In the latter case this is often referred to as a System-on-Chip (or SOC). In these classes we will mainly focus on the design of the configurable logic (on FPGA or chip), although SystemC is also extensively used for the modeling of SOCs.
  11. For the design of a digital embedded system, we use a design flow that consists of the following elements: During the functional design of the system, the designer determines what the system has to do, based on the performance requirements (e.g. bit error rates in communication systems) and functional requirements (e.g. specified protocols). The designer also determines all algorithms. The system functionality is expressed in a platform independent way. A reusable architecture template, or platform, consisting of processors, memories, and dedicated logic, is defined or selected. The architecture template should guarantee architectural requirements (e.g. interface formats) and non-functional requirements (e.g. power or cost). Each function in the functionality is mapped on an element in the architecture template. For the dedicated logic a circuit corresponding to the required functionality is created, resulting in a dedicated architecture. Finally, by means of RTL synthesis the designer generates a gate level netlist. The place and route step then transforms this netlist into a physical layout for this dedicated architecture, which can be manufactured by a foundry. Alternatively, the design is mapped to a configuration file for a programmable platform (e.g. field programmable gate array or FPGA). For the functions mapped on processors, C-code is generated and compiled. The Y-model is represented as a top-down approach, but in a realistic design flow, multiple iterations are performed before reaching the final embedded system.
  12. In this course we concentrate on the architectural design of dedicated logic, where the algorithms are mapped into an optimal architecture. The algorithm will typically be specified in a functional model, e.g. data flow and asynchronous state machines. The architecture needs a timed model, e.g. register transfer level (RTL). To obtain the RTL description, a refinement needs to be done for the computations, communications, and data. The order of these refinements is not fixed. However, it is good practice to take the most important design decisions first. Note that for parts of the system that are implemented in software, the complete refinement does not need to be performed. However, a processor and a memory structure have to be selected. For this purpose, certain refinements, like fixed point, can be useful.
  13. We now take a closer look at the role of SystemC in the design of digital embedded systems.
  14. Traditionally, a system functionality is expressed in MATLAB (SIMULINK/STATEFLOW) or a standard computer language (C/C++). To express the RTL description of the system, VHDL or Verilog is used. As a consequence the transformation from functionality into architecture does not only involve a change in semantics but also in syntax. Moreover, because of the different languages, this transformation cannot be done incrementally. SystemC resolves this issue, by offering a language that can express both functionality and architecture.
  15. SystemC is a C++ library that allows a system to be refined from a functional description into an architecture. Three contributions were essential to the creation of SystemC: The modeling of RTL hardware with C++ was demonstrated in the OCAPI framework of IMEC, as well as the SCENIC project of UC Irvine in cooperation with Synopsys. Frontier Design (an IMEC spin-off) contributed the fixed-point data types. CoWare (another IMEC spin-off) introduced concepts of hardware-software co-design. The SystemC language was first standardized in December 2005 by the IEEE. A revision (IEEE 1666-2011) was made in 2011. More recently a number of extensions of the SystemC language were proposed: The verification library adds random generation and transaction recording. Transaction level modeling is a high-level approach to modeling digital systems where details of communication among modules are separated from the details of the implementation of functional units or of the communication architecture. This extension is included in the revised IEEE standard. The analog and mixed-signal library extends SystemC with the following modeling paradigms: timed data flow, linear signal flow modeling, and electrical linear network modeling. All information about SystemC can be downloaded from the www.accellera.org website.
  16. With respect to tool support, the Accellera Systems Initiative (www.accellera.org) makes an open-source simulation library available. Various academic institutes also offer translators from Verilog or VHDL to SystemC. For synthesis, however, we have to rely on commercial tools.
  17. The classes of the SystemC library fall into four categories: the core language, the SystemC data types, the predefined channels, and the utilities. The core language and the data types may be used independently of one another. At the core of SystemC is a simulation engine containing a process scheduler. Processes are executed in response to the notification of events. Events are notified at specific points in simulated time. In the case of time-ordered events, the scheduler is deterministic. In the case of events occurring at the same point in simulation time, the scheduler is non-deterministic. The scheduler is non-preemptive, which means that once the execution of a process has started, the scheduler cannot interrupt it: the process runs until it returns or suspends itself (e.g. with wait()).
  18. The SystemC core language contains a number of primitives to define parallelism. A system is split in a number of modules (sc_module). A module communicates with the external world through ports (sc_port). Two ports are connected through a channel. SystemC predefines some primitive channels (sc_prim_channel), but more complex channels can be user defined. A channel connects to a port via an export (sc_export). A hierarchical module consists of a structure of other modules. A non-hierarchical module contains one or more processes (sc_process). A process is executed when an event (sc_event) occurs. A process interacts with a channel through an interface (sc_interface), which is a collection of functions that are supported by sc_port.
  19. SystemC contains all necessary constructs to model the functionality of a system. We will focus on activity-oriented models, although SystemC can also express other modeling paradigms. Let’s review these constructs.
  20. SystemC has support to model Kahn process networks, with the limitation of bounded queues. A Kahn process network is a directed network of processes that are interconnected by first-in-first-out (FIFO) queues of infinite size. Each time that a process is executed, tokens are consumed from the input queues and new ones are produced in the output queues. If a token is not present on an input queue, the consumption of the token will block. Kahn process networks exhibit deterministic behavior that does not depend on computation or communication delays. In SystemC the constructs are available to define the processes and the queues. These constructs interact with a simulation engine, which schedules the execution of the processes. The simulation engine stops when there is no longer activity in the network.
  21. Modules are used to partition the functionality in the design. However, you should not use too many modules, as this complicates the design, but also not too few. In general, functionality that is implemented in a different architectural style (e.g. software or dedicated hardware) or on a different location should be in different modules. Every module is derived from the base class sc_module and should have a name, which is used for debugging purposes. The macro SC_HAS_PROCESS(“class name”) indicates that the module is not hierarchical and contains processes.
  22. The slide shows an explicit definition of a module, consisting of the class definition, the SC_HAS_PROCESS macro and the constructor. To compact the definition, two more macros are provided: SC_MODULE(“class name”) is equivalent to the first two lines of the explicit definition. SC_CTOR(“class name”) equals the SC_HAS_PROCESS macro and the first lines of the constructor. It can be used when only a name is passed to the constructor. If you also want to pass parameters, an explicit declaration is needed.
  23. In SystemC the sc_port object is used to communicate with a channel. Ports provide the means by which a module can be coded such that it is independent of the context in which it is instantiated. A port forwards interface method calls to the channel to which the port is bound. For functional modeling, processes communicate through fifo ports. Two port types for the sc_fifo<T> channel, where T is the basic type of the elements in the fifo channel, are supported: Input: sc_fifo_in<T>, which is basically equivalent to sc_port<sc_fifo_in_if<T>,0>, where the first parameter is the input interface of a FIFO and the second parameter specifies that multiple channels can be connected to a FIFO. However, the practical use of these multiple bindings is not clear. Therefore it could be useful to define your own fifo port with a restriction of a single binding. Output: sc_fifo_out<T>, which is equivalent to sc_port<sc_fifo_out_if<T>,0>. Also here, the use of multiple bindings is not recommended. Several functions are associated to the sc_fifo class: read() gets a token from the queue. It blocks when no tokens are available. write() puts a token on a queue. It blocks when there are no free spaces in the queue. There are also inspecting functions available to look at the number of tokens or free spaces.
  24. When we add the definition of the ports to the constructor of the adder we obtain the code on the slide.
  25. The actual computation in the application is performed in the processes. As a consequence, they also define the parallelism in the application. SystemC supports three types of processes. For functional modeling we use the SC_THREAD process. An SC_THREAD process runs forever when started. It can be suspended by a wait(event) function. Often the wait(event) function is implicitly present in the communication functions. Processes are executed on events. These events can be statically or dynamically defined. Static sensitivity is set by means of the variable sensitive of sc_module. Dynamic sensitivity to a certain event is set by wait (event) for an SC_THREAD process. A module can have multiple processes. Processes might be dynamically created during simulation. However, no synthesis support exists for dynamic processes. Therefore, we do not use them in this course.
  26. Adding the definition of an SC_THREAD process to the adder results in the code on the slide. This adder waits for data on both its input queues sequentially and next produces a token on its output queue.
  27. The global structure of the system is defined in the main function. Because main() is already used by the SystemC library, the main function for the user application is sc_main(). In sc_main(), the following actions are taken: Instantiation of the channels. The basic channel that we use in functional modeling is sc_fifo. A FIFO queue is defined by means of the template class sc_fifo<T>. T can take on any basic data type, e.g. int, float, etc. The sc_fifo class declares a finite length buffer of tokens. The default length is 16 elements. The queue also has a name for debugging and statistics retrieval purposes. The constructor for the queue is sc_fifo<T> f1 (“name f1”, length); An sc_fifo can only be written from one process. Instantiation of the modules. A module can be instantiated multiple times. Binding the ports of the modules to the channels. This can be done in two ways: positional or named. Named binding is preferred because it is less prone to errors than positional port binding. Start the simulation.
  28. The sc_main() function for the adder is shown on the slide. Note that the arguments of sc_main() are identical to those of main(). To connect the ports to the channels, named bindings are used.
  29. In a functional model, hierarchy will be used to make the design more readable. The hierarchy is fully transparent: it basically acts as a container for the basic modules, but does not add any functionality or synchronization. The definition of a hierarchical module consists of the definition of the ports and internal queues. Next the internal modules are defined. Care must be taken that the module objects will still exist after execution of the constructor. Two alternatives exist to guarantee this: either construct them when calling the constructor, or create them with the new operator. The constructor creates the two modules and binds the ports to the channels.
  30. In a functional model no notion of time is present. Every action is processed infinitely fast. As a consequence, the simulation kernel only advances in delta cycles of infinitesimally small time units. If we would start the simulation kernel with a finite amount of time to run, it would never reach that time and hence run forever. Therefore we run the simulation kernel until no events are present any more. This is achieved with the sc_start() command. With this approach, there are two ways of stopping the simulation: We can exit an SC_THREAD. By doing so, no events will be produced anymore and the simulation will finally stop because of the lack of events. We can check for a termination condition and explicitly call sc_stop(). This approach was used in the exercise of class 1. When the whole image is processed and written to file, the simulation is explicitly stopped. In general this is also the safest and most elegant way of controlling the simulation.
  31. Finally, let’s exercise what we have learned so far.
  32. The goal of this exercise is to practice functional modeling. We will use a simplified JPEG block diagram for this purpose. A process will be defined and integrated in a JPEG functional model. Next the functional model will be simulated and the overall behavior of the system will be observed.
  33. JPEG stands for “Joint Photographic Experts Group” and is a compression standard for color images. It is widely used. More information can be found on www.jpeg.org
  34. A simplified block diagram of a JPEG encoder and decoder is shown on the slide. First an original image is read and split into 8x8 blocks (R2B). Together with the pixel data, the width, height and number of bits per pixel are extracted from the image. Next, a discrete cosine transform (DCT) is performed on each 8x8 block, resulting in 8x8 DCT coefficients. These DCT coefficients are quantized and reordered in the zigzag scan module. The resulting coefficient stream is run-length encoded. This last block differs from the JPEG standard, where a Huffman encoder is used. In the decoder the inverse operations are performed in the reverse order.
  35. The discrete cosine transform (DCT) is performed on an 8x8 pixel block and returns an 8x8 block of DCT coefficients. Each DCT coefficient indicates the amplitude of a combined horizontal and vertical frequency component. The inverse discrete cosine transform (IDCT) returns pixel values from DCT coefficients. The formal definitions of the DCT and IDCT are shown on the slide. Instead of this straightforward 2D operation, the calculation can be split into consecutive 1D operations, which is more efficient. There is also a large set of optimized DCT algorithms that exploit the regular structure of the cosine values.
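For reference, the 8x8 DCT and IDCT are commonly written as below in the JPEG literature. This is the standard textbook form and the notation on the slide may differ; the symbols f(x,y) for a pixel value, F(u,v) for a DCT coefficient and C(k) for the normalization factor are assumptions of this sketch.

```latex
F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)
         \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}

f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\, C(v)\, F(u,v)
         \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}

C(k) = \begin{cases} 1/\sqrt{2} & k = 0 \\ 1 & k > 0 \end{cases}
```

Because the two cosine factors depend on x and y separately, the double sum can be evaluated as eight 1D transforms over the rows followed by eight 1D transforms over the columns, which is the split into 1D operations mentioned in the note.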
  36. Next the DCT coefficients are quantized. To this end each DCT coefficient is divided by the corresponding value in the quantization table. The result is rounded to the nearest integer, reducing the number of bits needed to represent the DCT coefficient. In the quantization step image information might be lost, resulting in lossy compression.
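The quantization step can be sketched in plain C++ as follows. This is a minimal illustration and not the code of the exercise: the names Block, quantize and dequantize are hypothetical, and the coefficients are assumed to be stored as 64 values per block.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Block = std::array<int, 64>;  // hypothetical: one 8x8 block, 64 entries

// Divide each DCT coefficient by its table entry and round to the nearest integer.
Block quantize(const std::array<double, 64>& dct, const Block& table) {
    Block q{};
    for (std::size_t i = 0; i < 64; ++i)
        q[i] = static_cast<int>(std::lround(dct[i] / table[i]));
    return q;
}

// Decoder side: multiply back. The rounding error is not recovered (lossy step).
std::array<double, 64> dequantize(const Block& q, const Block& table) {
    std::array<double, 64> dct{};
    for (std::size_t i = 0; i < 64; ++i)
        dct[i] = static_cast<double>(q[i]) * table[i];
    return dct;
}
```

With a table entry of 16, a coefficient of 100 is stored as 6 and reconstructed as 96, which shows where the information loss comes from.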
  37. An example of a typical quantization table is shown on the slide. It can be remarked that the quantization values grow for higher horizontal or vertical frequencies. JPEG contains a number of predefined quantization tables. If a custom quantization table is used, it must be sent to the decoder.
  38. The resulting quantized DCT coefficients are next zigzag scanned. This is done in such an order that statistically long sequences of zero coefficients can be expected.
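A possible way to generate the scan order is sketched below, assuming the usual JPEG zigzag pattern that walks the anti-diagonals of the block in alternating directions. The function name zigzag_order and the row-major indexing (index = row * 8 + column) are assumptions of this sketch.

```cpp
#include <algorithm>
#include <array>

// Returns, for each scan position k, the row-major index of the coefficient
// that is visited at that position.
std::array<int, 64> zigzag_order() {
    std::array<int, 64> order{};
    int k = 0;
    for (int s = 0; s <= 14; ++s) {            // s = row + column of the anti-diagonal
        const int r_min = std::max(0, s - 7);
        const int r_max = std::min(7, s);
        if (s % 2 == 1) {                      // odd diagonals: top-right to bottom-left
            for (int r = r_min; r <= r_max; ++r) order[k++] = r * 8 + (s - r);
        } else {                               // even diagonals: bottom-left to top-right
            for (int r = r_max; r >= r_min; --r) order[k++] = r * 8 + (s - r);
        }
    }
    return order;
}
```

The low-frequency coefficients in the top-left corner are visited first, so the high-frequency coefficients, which are most often quantized to zero, end up together at the end of the stream.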
  39. Next we use a non-JPEG run-length coder for our exercise. This coding works as follows: (1) The DC value is sent “as is”. (2) The high-frequency data is split into segments consisting of a number of zeros followed by a non-zero coefficient. Each segment is represented by a pair consisting of the number of consecutive zeros and the value of the non-zero coefficient. (3) When all remaining coefficients of a block are zero, an end-of-block (EOB=63) value is sent.
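The coding rules above can be sketched as follows. This is an illustration under stated assumptions and not the exercise code: the function name run_length_encode is hypothetical, the output is modeled as a flat list of integers, and no EOB is emitted when the last coefficient of the block is non-zero.

```cpp
#include <array>
#include <vector>

constexpr int EOB = 63;  // end-of-block marker, as given in the note

// Input: one block of 64 quantized coefficients in zigzag order.
std::vector<int> run_length_encode(const std::array<int, 64>& zz) {
    std::vector<int> out;
    out.push_back(zz[0]);                      // DC value is sent as is

    int last = 63;                             // index of the last non-zero AC value
    while (last > 0 && zz[last] == 0) --last;  // (0 if all AC values are zero)

    int run = 0;
    for (int i = 1; i <= last; ++i) {
        if (zz[i] == 0) {
            ++run;                             // extend the current run of zeros
        } else {
            out.push_back(run);                // pair: (number of zeros, value)
            out.push_back(zz[i]);
            run = 0;
        }
    }
    if (last < 63) out.push_back(EOB);         // only zeros remain in the block
    return out;
}
```

A run before a non-zero coefficient can be at most 62 zeros long, so the value 63 in a run position cannot be confused with a real run length.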
  40. You will find all files for starting in the exercise1 directory. Perform the actions as indicated on the slide. To obtain information about the number of writes and reads in the FIFOs, use the type fifo_stat<T> instead of sc_fifo<T>. To prevent multiple bindings of a FIFO port, the classes my_fifo_in<T> and my_fifo_out<T> are used in the exercises.
  41. We will make the exercises in a Linux environment, using g++ and Eclipse. Eclipse is an integrated development and debugging environment. In the exercise directory there is a step-by-step guide on how to get started with the exercises in Eclipse. The recent sources of the exercises and libraries can be found at http://www.icorsi.ch/. The libraries have to be compiled before starting the exercise session.
  42. In this second class we will focus on the refinement of the data types of the functional model. In particular, we will explain the definition of fixed-point word lengths for the variables in the functional model. This action is relevant both for mapping on embedded processors with limited data sizes, e.g. 16-bit processors, and for mapping on a dedicated architecture. At the end of the class, you will be able to perform fixed-point refinement on a functional model of an embedded system in SystemC.
  43. This lecture on fixed point refinement consists of three parts: In the first part we introduce the quantization and overflow effects of fixed point representations. We also present some methods to determine the most and least significant bits (MSB and LSB). Next, we introduce the fixed point support in SystemC. This consists of an extensive set of fixed point types. In addition, SystemC also supports 4-valued logic to define bus structures. Finally, we introduce the exercise on fixed point refinement.
  44. Let’s concentrate on the architectural design step that translates an algorithm into an optimal architecture. The algorithm will typically be specified into a functional model, like data flow. The architecture needs a timed model, e.g. register transfer level (RTL). Initially the algorithm will be modeled in floating point. Cost-effective implementation requires, however, a refinement into fixed point types.
  45. Most signal processing algorithms are specified in floating point precision. This is a very powerful signal representation with high accuracy, but it is also expensive in storage and operation cost. For instance, a typical representation of a floating point number is a mantissa of 24 bits and an exponent of 8 bits. As a consequence, a floating point multiplication is equivalent to a 24-bit multiplication and an 8-bit addition. However, many applications, like cable modems and wireless communication devices, require low cost and low power at a high processing speed. As a consequence, the DSP algorithms will be performed in fixed-point arithmetic. With an 8-bit fixed point notation, for instance, the cost drops dramatically because the hardware cost of a multiplication is a quadratic function of its input width. This requires the designer to translate floating point types into fixed point types, using a refinement strategy.
  46. A fixed point type can be defined by three parameters: (1) The total number of bits WL. (2) The position of the binary point, indicated by the number of integer bits IWL. (3) The way in which the value is represented. In the case of a signed number, 2’s complement notation is the most common because it allows easy arithmetic. However, alternatives like sign-magnitude and 1’s complement are also feasible.
  47. If the result of a calculation has more precision than available in the fixed point format, the value has to be quantized. Several ways of quantization exist: (1) Truncate or floor is the cheapest approach because it comes for free in hardware. However, it generally gives the worst performance of the quantization techniques. (2) Magnitude truncate realizes a floor function for positive values and a ceil function for negative values. The technique is natural for sign-magnitude representations. The advantage is a symmetrical behavior around the zero value. (3) Applying the ceil function to the complete range is an alternative which is seldom used. (4) Rounding is the technique with the best performance in most cases. However, it is also the most expensive one. In hardware it requires the addition of 0.5 times the least significant bit, followed by a truncation operation.
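The four quantization methods can be compared with a small plain C++ sketch. The enum QMode and the function quantize are hypothetical names for this illustration; step is the weight of the least significant bit of the target format.

```cpp
#include <cmath>

enum class QMode { Truncate, MagnitudeTruncate, Ceil, Round };

// Quantize x to a multiple of 'step' (the weight of the LSB).
double quantize(double x, double step, QMode mode) {
    double s = x / step;                       // value expressed in units of the LSB
    switch (mode) {
        case QMode::Truncate:          s = std::floor(s);       break;  // toward -infinity
        case QMode::MagnitudeTruncate: s = std::trunc(s);       break;  // toward zero
        case QMode::Ceil:              s = std::ceil(s);        break;  // toward +infinity
        case QMode::Round:             s = std::floor(s + 0.5); break;  // add 0.5 LSB, truncate
    }
    return s * step;
}
```

For x = -0.6 and a step of 0.25, truncation gives -0.75 while the three other methods give -0.5, which illustrates the asymmetry of plain truncation around zero.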
  48. When the result of an operation is larger than the maximum value that can be represented by the fixed point format (overflow), we have two possibilities: Wrap-around: the overflow bits are neglected. For unsigned values, this is equivalent to a modulo operation (see figure on slide). For 2’s complement numbers, a one bit overflow results in the maximum negative number. This is the standard behavior in a hardware implementation. Saturation: when an overflow occurs, the signal is set to the maximum value that can be represented. Additional hardware is necessary to realize this behavior. Remark that a similar situation can occur for the minimum value of a signal. For instance, if the subtraction of two unsigned signals results in a negative value and must be represented in an unsigned format. For such underflow, similar remedies are possible.
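Both overflow behaviors can be sketched for a signed 2’s complement integer of wl bits. The function names wrap and saturate are hypothetical, and the sketch assumes 1 <= wl <= 62 so that all intermediate values fit in a 64-bit integer.

```cpp
#include <algorithm>
#include <cstdint>

// Wrap-around: keep only the wl least significant bits (2's complement).
std::int64_t wrap(std::int64_t v, int wl) {
    const std::int64_t span = std::int64_t{1} << wl;   // number of representable codes
    const std::int64_t half = span >> 1;
    const std::int64_t m = ((v + half) % span + span) % span;  // map into [0, span)
    return m - half;                                   // back to [-half, half - 1]
}

// Saturation: clip to the most positive or most negative representable value.
std::int64_t saturate(std::int64_t v, int wl) {
    const std::int64_t hi = (std::int64_t{1} << (wl - 1)) - 1;
    const std::int64_t lo = -(std::int64_t{1} << (wl - 1));
    return std::max(lo, std::min(v, hi));
}
```

For an 8-bit format, the value 128 wraps to -128 but saturates to 127, and -129 wraps to 127 but saturates to -128.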
  49. When we opt for a saturation strategy, the following hardware is needed. The result of the operation must be compared to the maximum positive and negative numbers. This can be done with an explicit comparator or with the overflow flags from the adders. If overflow or underflow occurs, the result of the operation is replaced by the maximum or minimum value respectively. Note that the hardware complexity of a comparator or multiplexer is comparable to that of an adder. As a consequence, saturation hardware can require a significant amount of area.
  50. Going back to the need for fixed point representations, the designer is faced with the following problem. He obtains a floating point algorithm and needs to translate the floating point types into fixed point types, using a refinement strategy. For each floating point number, a fixed point characteristic (including total and integer word lengths, overflow and rounding behavior) must be chosen. In most situations the input and output formats are defined by the system context (e.g. analog-to-digital converter). Remark that determining these ADC and DAC precisions is an important task in the overall system design.
  51. This fixed-point refinement is a complex optimization problem where the search space grows exponentially with the number of signals. The goal of the optimization is to minimize the overall implementation cost and power consumption. At the same time the performance degradation (e.g. implementation loss for telecom systems) must be small. Remark that it is essential to define a performance degradation bound (e.g. implementation loss for communication systems, visual performance measure for multimedia systems) before starting the fixed point refinement. The optimization problem can be separated in two parts: Determination of the most significant bit (MSB). First, the minimum and maximum signal value must be determined. From this the MSB position, value representation and overflow behavior is selected such that overflows are avoided as much as possible. Determination of the least significant bit (LSB). By evaluating the difference in performance between the fixed and floating point behavior of the algorithm, the LSB position and quantization method are determined for each signal. The goal is to stay within the performance degradation bound. In the next slides we will take a closer look at methods for MSB and LSB determination.
  52. MSB determination can be done by means of range propagation. This analytical method works as follows: On each input signal, the range, i.e. the minimum and maximum values that can occur in the signal, is specified. Next, the signal flow graph of the algorithm is traversed and for each operator, the range of its output is calculated based on its input ranges. Because the method calculates guaranteed bounds on the minimum and maximum signal values, it results in a safe, but sometimes pessimistic, estimation of the MSB position.
  53. Range propagation on the operators is a simple operation. The table on the slides shows the rules for add, subtract and multiply operations.
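The table itself is on the slide; the sketch below uses the standard interval-arithmetic rules for add, subtract and multiply, which are assumed to match it. The names Range, add, sub, mul and signed_integer_bits are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

struct Range { double lo; double hi; };   // minimum and maximum value of a signal

Range add(Range a, Range b) { return {a.lo + b.lo, a.hi + b.hi}; }
Range sub(Range a, Range b) { return {a.lo - b.hi, a.hi - b.lo}; }
Range mul(Range a, Range b) {
    // The extremes of a product are always reached at corners of the input ranges.
    const double p1 = a.lo * b.lo, p2 = a.lo * b.hi, p3 = a.hi * b.lo, p4 = a.hi * b.hi;
    return {std::min({p1, p2, p3, p4}), std::max({p1, p2, p3, p4})};
}

// Smallest number of integer bits (sign bit included) of a 2's complement
// format that covers the range without overflow.
int signed_integer_bits(Range r) {
    int iwl = 1;
    while (r.hi >= std::ldexp(1.0, iwl - 1) || r.lo < -std::ldexp(1.0, iwl - 1)) ++iwl;
    return iwl;
}
```

For example, multiplying a signal in [-2, 3] by a signal in [-4, 5] gives the range [-12, 15], which needs 5 integer bits in a signed format.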
  54. When applied to feedback signals, range propagation can become unstable and cause continuous growth of the minimum and maximum values. An example of such a situation is shown on the slide. In such a situation, a statistical inspection of the real signals is needed to determine a realistic MSB position. Note that, because of the propagation mechanism, all signals within this feedback loop or depending on the output of the feedback loop will suffer from this range explosion. Once saturation logic is introduced at one place in the loop, this problem is solved.
  55. As an alternative to the analytical range propagation method, we can collect the signal statistics during simulations. Because the obtained range information will be stimuli-dependent, this will give an optimistic estimation of the minimum and maximum values. As a consequence, to maximize the confidence in the obtained results, the stimuli set should be large and provide a complete coverage of the algorithm code.
  56. As can be expected, combining both methods gives the best results. Each signal in the system will then be in one of the following situations: (1) Both methods result in the same MSB position. The signal can then safely be specified with the resulting MSB position and wrap-around overflow behavior. (2) When the analytical MSB position is larger than the statistical MSB position, we can make a trade-off between the analytical MSB with wrap-around and the statistical MSB with saturation. In most cases the wrap-around functionality will be the most economical. Only when the statistical MSB position is much smaller can saturation logic be beneficial. (3) In the case of range growth because of feedback, the analytical MSB position cannot be calculated (it goes to infinity). In this case, the statistical MSB position is chosen together with saturation behavior. After introducing the saturation on one signal in the feedback loop, we need to re-simulate to get useful results for the rest of the algorithm. An example of each of these situations is shown on the slide.
  57. When we look at the LSB side, the question arises what the effect is of quantization. Many authors approximate the quantization effect as an additional noise source. They assume that: The noise sequence is a sample of a stationary random process (i.e. whose statistical parameters do not change over time). The noise sequence is uncorrelated with the input sequence. The random variables of the noise process are uncorrelated, i.e. the error is a white-noise process. The probability distribution of the error process is uniform over the range of the quantization error.
  58. The noise process can then be modeled by means of its mean and variance. The expressions for mean and variance for the three most popular quantization methods are shown on the slide, where Δ denotes the quantization step. Rounding and magnitude truncation result in a zero mean, but rounding has the lowest variance. Truncation and rounding have the same variance, but rounding has the lowest mean. As can be expected, rounding introduces the least quantization noise.
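The expressions themselves are on the slide; the block below lists the standard textbook results that follow from the uniform-error assumption, as a sketch that is assumed to match the slide. The symbols are assumptions of this sketch: Δ is the quantization step, the error e is the quantized value minus the exact value, and for magnitude truncation the input is assumed to be distributed symmetrically around zero.

```latex
\text{Truncation:}            \quad \mu_e = -\frac{\Delta}{2}, \qquad \sigma_e^2 = \frac{\Delta^2}{12}

\text{Rounding:}              \quad \mu_e = 0,                 \qquad \sigma_e^2 = \frac{\Delta^2}{12}

\text{Magnitude truncation:}  \quad \mu_e = 0,                 \qquad \sigma_e^2 = \frac{\Delta^2}{3}
```

These values agree with the comparison in the note: rounding and magnitude truncation are unbiased, truncation and rounding share the variance Δ²/12, and rounding is best on both counts.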
  59. Replacing the quantization by an additional noise source results in a linear model of the quantized algorithm. This model can then be analyzed analytically by means of well-developed linear signal processing theory. For many quantization effects, this linear model is a good approximation. It has, for instance, been used to determine the effects of quantizing the signals in FIR filters. As an exercise, calculate the resulting signal-to-noise ratio in the case that x(t) ranges between 0 and 255 with a uniform distribution and both quantization steps round the values to the nearest integer.
  60. However, not all applications are linear. Quantization in non-linear systems can lead to non-intuitive behavior. In infinite impulse response (IIR) filters, for instance, quantization can generate limit cycles. For a stable floating-point IIR filter implementation, the output will decay asymptotically to zero when the input becomes zero. For the same system, implemented with finite precision, the output may continue to oscillate indefinitely with a periodic pattern while the input remains equal to zero. This effect is often referred to as zero-input limit cycle behavior. An example of such behavior is shown on the slide.
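A zero-input limit cycle can be reproduced with a first-order recursion y[n] = Q(a * y[n-1]), where Q rounds to the nearest integer. This is a minimal sketch and not the example of the slide; the function name zero_input_response and the chosen coefficient are assumptions.

```cpp
#include <cmath>
#include <vector>

// First-order IIR filter with zero input and the state rounded to an integer
// after every multiplication. Returns y[0] .. y[steps].
std::vector<long> zero_input_response(double a, long y0, int steps) {
    std::vector<long> y{y0};
    for (int n = 0; n < steps; ++n)
        y.push_back(std::lround(a * static_cast<double>(y.back())));
    return y;
}
```

With a = -0.75 and y[0] = 10, the sequence is 10, -8, 6, -5, 4, -3, 2, -2, 2, -2, and so on: the floating-point response 10 * (-0.75)^n decays to zero, whereas the rounded version keeps oscillating between 2 and -2.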
  61. Non-linear quantization effects are difficult to analyze analytically. Therefore, simulation-based methods are mostly used. To this end, the output of a reference simulation is compared to a simulation with the quantized signals. Again, sufficiently large stimuli sets with complete code coverage must be used.
  62. To get better insight into the optimization trade-off, the difference between the floating-point and fixed-point values (e) and the resulting signal-to-quantization-noise ratio (SQNR) are useful guidance. The SQNR for all signals is calculated as follows: During signal assignments, the statistics (mean, standard deviation) of the error signal as well as of the output signal are collected. At the end of the simulation, the SQNR is calculated for each signal.
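One common way to compute the SQNR of a signal from a floating-point reference trace and its fixed-point counterpart is sketched below. The function name sqnr_db and the use of complete traces instead of running statistics are assumptions of this sketch; the tooling of the exercise may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// SQNR in dB: power of the reference signal divided by the power of the
// error e = reference - fixed. Both traces must have the same length.
double sqnr_db(const std::vector<double>& reference, const std::vector<double>& fixed) {
    double signal_power = 0.0, noise_power = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double e = reference[i] - fixed[i];
        signal_power += reference[i] * reference[i];
        noise_power += e * e;
    }
    if (noise_power == 0.0) return std::numeric_limits<double>::infinity();
    return 10.0 * std::log10(signal_power / noise_power);
}
```

An error with half the amplitude of the signal gives about 6 dB; every additional fractional bit improves the SQNR of a signal by roughly the same amount.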
  63. The optimal LSB is determined by running the simulation multiple times with various quantization sets. For each quantization set, the SQNR per signal, the overall SNR and PSNR, and the cost are calculated. The goal is to find the cheapest solution that realizes the specified performance. This procedure can be automated by means of an optimization routine. When changing the quantization of one signal at a time, the statistics give an impression of the sensitivity of the cost and the performance to the quantization of that signal. As a rule of thumb, the SQNR of a signal should be higher than the overall SNR. Note that the SQNR and SNR statistics depend on the input. As a consequence, the optimization should be performed on a representative set of inputs.
  64. In the next part we discuss the fixed point support in SystemC.
  65. SystemC introduces a number of specific data types, which correspond to data types that are frequently used in Hardware Description Languages (HDLs). These types include sc_logic, a 4-valued representation that can be high (1), low (0), undefined (X) or in a high-impedance (Z) state. Integers can be of arbitrary length with sc_int, sc_uint, sc_bigint and sc_biguint. SystemC also supports logic vectors with 2- or 4-valued logic with sc_bv and sc_lv. sc_fixed and sc_ufixed define fixed point numbers where the characteristics of the number are defined by template arguments. sc_fix and sc_ufix use run-time arguments to define the fixed point characteristics. This is interesting for trying out different quantization settings without recompilation. However, these types cannot be used in synthesis, while the others can.
  66. Two data types provide full flexibility in representing fixed point numbers with static parameters: sc_fixed (signed, 2’s complement numbers) and sc_ufixed (unsigned numbers). The template arguments of these fixed-point types carry the information on the word lengths and the quantization and overflow behavior: (1) wl is the total number of bits. (2) iwl represents the number of integer bits, i.e. the bits to the left of the binary point. (3) q_mode specifies the quantization method to be rounding (SC_RND), flooring (SC_TRN), or magnitude truncate (SC_TRN_ZERO). In addition, some very particular, rarely used quantization modes are specified. (4) o_mode selects the overflow mode to be saturation (SC_SAT), saturation to zero (SC_SAT_ZERO), symmetrical saturation (SC_SAT_SYM), wrap-around (SC_WRAP), or sign-magnitude wrapping (SC_WRAP_SM). (5) n_bits specifies the number of saturated bits in case of wrapping. This allows some special wrapping methods that keep the sign of the signal. By default n_bits is set to 0.
  67. Two of the arguments specified to the fixed point data type were word length (wl) and integer word length (iwl). Word length must be greater than 0. Integer word length can be positive or negative, and larger than the word length. For instance if the word length is specified as 5 bits but the integer word length is 7 then two zeroes will be added to the end of the object. If the integer word length is a negative value then sign bits after the binary point will be extended. For instance if wl = 5 and iwl = -2 then two sign bits will be added to the object. The sign bits are simply the most significant bit of the 5 bit number. By extending the sign bits, the value of the number is maintained.
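The effect of wl and iwl on the representable values can be checked with a few lines of plain C++ that do not need the SystemC library. The struct FixedFormat and the three helper functions are hypothetical names; the sketch assumes a signed 2’s complement format, as used by sc_fixed.

```cpp
#include <cassert>
#include <cmath>

struct FixedFormat { int wl; int iwl; };   // word length and integer word length

// Weight of the least significant bit: 2^(iwl - wl).
double lsb_weight(FixedFormat f) {
    assert(f.wl > 0);                      // the word length must be positive
    return std::ldexp(1.0, f.iwl - f.wl);
}

// Most negative value: -2^(iwl - 1).
double min_value(FixedFormat f) { return -std::ldexp(1.0, f.iwl - 1); }

// Most positive value: 2^(iwl - 1) minus one LSB.
double max_value(FixedFormat f) { return std::ldexp(1.0, f.iwl - 1) - lsb_weight(f); }
```

For wl = 5 and iwl = 7, the LSB has weight 4 and the range is -64 to 60. For wl = 5 and iwl = -2, the LSB has weight 2^-7 and the range is -0.125 to 0.1171875.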
  68. This slide shows an example that illustrates the difference between rounding and flooring. As can be seen, the quantization error of rounding is never larger than that of flooring: it is at most half an LSB, whereas flooring can be off by almost a full LSB.
  69. The slide shows an example with different overflow handling methods: saturation and wrap-around for a two’s complement number. As can be seen, these overflow methods generate very different outputs.
  70. When working with fixed-point arithmetic, it is vital to have an efficient representation of values and an efficient simulation of operations. For this purpose, all operations are performed with floating point arithmetic. Quantization is performed only on assignment. In case an intermediate result needs to be quantized, an explicit assignment has to be used. In the example, the multiplication a*b is a floating-point operation that has two fixed point values as input. During the assignment to c, the floating point result is automatically cast to the specified fixed point type of variable c.
  71. SystemC also allows fixed point types to be defined with non-static arguments: sc_fix (signed, 2’s complement numbers) and sc_ufix (unsigned numbers). Type sc_fxtype_params is used to configure the parameters of types sc_fix and sc_ufix. To set the parameters for these types, declare an object of type sc_fxtype_params, initialize the parameter values as desired, and pass the sc_fxtype_params object as an argument to the sc_fix or sc_ufix declarations. The sc_fxtype_params object takes the same arguments as an object of type sc_fixed: wl (word length), iwl (integer word length), q_mode (quantization mode), o_mode (overflow mode) and n_bits (saturated bits). Any combination of arguments is allowed, but the order cannot be changed. A variable of type sc_fxtype_params can be initialized by another variable of type sc_fxtype_params. One variable of type sc_fxtype_params can also be assigned to another. Individual argument values can be read and written using methods with the same names as the arguments listed above.
  72. We now turn to the exercise, where we will perform fixed point refinement of the IDCT operator in the JPEG decoder.
  73. The goal of this exercise is to get familiar with fixed point refinement by practicing it on the IDCT block of the JPEG decoder. To this end, we will determine the LSB and MSB position for every variable in the IDCT function. By observing the overall behavior it will be possible to optimize the LSB and MSB positions. The MSB should be determined in such a way that overflow is avoided without the introduction of overflow logic. When determining the LSB, the degradation of the image quality (e.g. of the peak signal-to-noise ratio, PSNR) should be kept below 0.5 dB. The PSNR is defined as the ratio between the maximum power of a signal and the power of the corrupting noise. In our case the noise power is the mean squared error (MSE) between the original and the decompressed image. The maximum power of the signal is MAX^2, where MAX is the maximum grey value of a pixel.
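The PSNR definition above can be written out as a short function. The name psnr_db and the default MAX of 255 (8-bit grey values) are assumptions of this sketch; the exercise framework may compute the value in a different place.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// PSNR in dB = 10 * log10(MAX^2 / MSE), with the MSE taken between the
// original and the decompressed image (same number of pixels in both).
double psnr_db(const std::vector<int>& original, const std::vector<int>& decoded,
               int max_value = 255) {
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        const double d = static_cast<double>(original[i]) - decoded[i];
        sum_sq += d * d;
    }
    const double mse = sum_sq / static_cast<double>(original.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();  // identical images
    return 10.0 * std::log10(static_cast<double>(max_value) * max_value / mse);
}
```

An error of one grey level on every pixel gives an MSE of 1 and a PSNR of about 48.13 dB. Comparing the PSNR of the floating-point and fixed-point decoders against the same original image gives the degradation that has to stay below 0.5 dB.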
  74. In this third class we will focus on the refinement of the communication between the modules of the functional model. More in particular we will explain how the FIFO communication channels can be replaced by protocols on simple wires.
  75. This lecture on communication refinement consists of three parts: In the first part we introduce the concept of refining the inter process FIFO communication into real protocols. Next, we review the support in SystemC for communication refinement. Finally we introduce the exercise to practice what we have learned.
  76. In the architectural design process that translates an algorithm into an optimal architecture, communication refinement is an important step. The algorithm will typically be specified into a functional model, like data flow. In this data flow model, the communication between processes is performed via point-to-point queues. The architecture needs a model with explicit protocols. In addition, signals could be multiplexed on a bus to reduce the wiring overhead.
  77. A FIFO is a very robust structure because it guarantees correct processing of the data independently from the processing times of the functions and communication times. However, queues require a large amount of storage and also some addressing hardware. A typical implementation, for instance, would be a memory array with modulo addressing and a read and write pointer. Because of this large implementation cost, the communication must be optimized.
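The typical implementation mentioned above, a memory array with modulo addressing and a read and write pointer, can be sketched in plain C++ as follows. The class name RingFifo and its non-blocking interface are assumptions of this illustration; sc_fifo itself blocks instead of returning false.

```cpp
#include <array>
#include <cstddef>

template <typename T, std::size_t N>
class RingFifo {
public:
    // Returns false when the FIFO is full and the value was not stored.
    bool write(const T& v) {
        if (count_ == N) return false;
        mem_[wr_] = v;
        wr_ = (wr_ + 1) % N;          // modulo addressing of the write pointer
        ++count_;
        return true;
    }
    // Returns false when the FIFO is empty and 'v' was left untouched.
    bool read(T& v) {
        if (count_ == 0) return false;
        v = mem_[rd_];
        rd_ = (rd_ + 1) % N;          // modulo addressing of the read pointer
        --count_;
        return true;
    }
    std::size_t size() const { return count_; }

private:
    std::array<T, N> mem_{};          // storage array
    std::size_t rd_ = 0, wr_ = 0, count_ = 0;
};
```

The storage array, the two pointers and the element counter are exactly the hardware cost that communication refinement tries to avoid.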
  78. Ideally, from an implementation point of view, a FIFO communication could be reduced to a simple wire when the output signal is registered. This requires no storage and no implementation cost for the addressing or protocol. However, consistency of the communication must be guaranteed: Process 2 should not use the data before it is generated and Process 1 should not produce new data before the previous has been read by Process 2.
  79. To analyze the behavior of a wired connection, we represent the two processes with a Synchronous Finite State Machine (FSM). In such a Synchronous FSM the transitions take place on a clock edge. In our analysis we assume that both processes are running on the same clock. Process 1 will perform a filtering operation in 4 cycles and will also write the data in the register in the 4th cycle. Process 2 will initially wait for 4 cycles. Next cycle, it will read the data and perform a first operation, followed by three more cycles of operation. This sequence will be repeated continuously.
  80. If we look at a timing diagram, we see that the timing is guaranteed. Every read() happens after a write() of the signal. Also no data is lost.
  81. However, small changes to the finite state machines of one of the two processes can result in errors: If we increase the number of operations in process 1, process 2 will consume too early and hence twice the same data is used. If we decrease the number of operations in process 2, the same happens. If we decrease the number of operations in process 1, process 2 will be relatively too slow and some data will be overwritten before it has been used. Increasing the number of operations in process 2 will have the same effect. Also remark that the number of initial wait operations in process 2 should not be too low or too high.
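These failure cases can be reproduced with a small cycle-based simulation of a registered wire. This is a simplified sketch under stated assumptions: process 1 writes a new sequence number in the last cycle of every producer_period cycles, process 2 waits initial_wait cycles and then reads once every consumer_period cycles, and a written value becomes visible in the next cycle. All names are hypothetical.

```cpp
struct WireStats {
    int undefined_reads = 0;   // reads before anything was written
    int duplicated = 0;        // reads that return an already consumed value
    int lost = 0;              // values overwritten before they were read
};

WireStats simulate_wire(int producer_period, int consumer_period,
                        int initial_wait, int cycles) {
    WireStats s;
    int reg = 0;               // sequence number held in the register, 0 = undefined
    int last_read = 0;         // last sequence number consumed by process 2
    for (int t = 0; t < cycles; ++t) {
        // Process 2 samples the register value that is visible in this cycle.
        if (t >= initial_wait && (t - initial_wait) % consumer_period == 0) {
            if (reg == 0) ++s.undefined_reads;
            else if (reg == last_read) ++s.duplicated;
            else { s.lost += reg - last_read - 1; last_read = reg; }
        }
        // Process 1 writes in its last cycle; visible from the next cycle on.
        if ((t + 1) % producer_period == 0) ++reg;
    }
    return s;
}
```

With periods of 4 and 4 and an initial wait of 4 cycles, all three counters stay at zero. A consumer period of 2 without initial wait gives two undefined reads followed by every value being used twice, and a producer period of 2 against a consumer period of 4 loses every second value.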
  82. In the slide an example is shown where process 2 has only two states. As a consequence it can be expected that the data produced by process 1 is used multiple times. Because no initial wait operations are present in process 2, we also expect that undefined data will be used.
  83. The expected behavior is confirmed on the time diagram. As can be seen on the diagram, the first two data elements for process 2 are undefined. Next, the read() operation of process 2 uses the same data produced by process 1 twice. To guarantee correct behavior, two approaches exist: (1) Adapt the cycle budget of process 2, for instance by introducing two dummy cycles. However, this breaks the general approach of making modules independent of the environment in which they operate. (2) Introduce a handshake protocol that automatically synchronizes the data transfers. This is the most robust and reliable approach. On the other hand, handshake protocols introduce some overhead and should be performed on larger units.
  84. Many different handshake protocols are feasible. Let’s illustrate the concept with a very simple one with two handshake lines. The handshake line “a” (ask) is generated by the receiver and indicates that the receiver is ready to read in the next cycle. The handshake line “r” (ready) is generated by the transmitter and indicates that it has written data in the cycle in which the flag is raised. At least two cycles are needed for a reliable communication of a value. Note that this protocol is only suited for synchronous designs where both processes run on the same clock.
  85. The finite state machines enhanced with the protocol operations (in red) are shown in this picture. When “a” is set, process 2 waits for the “r” flag to be raised. Next it reads the data, lowers “a”, performs its operations, and sets “a” again for the next sequence. Process 1 performs its operations and then waits for flag “a” before it writes its data and raises flag “r”. The basic assumption of this protocol is that when data is written, it is read in the next cycle.
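The protocol can be tried out with a cycle-based sketch in which all wires are registered, so that a value written in one cycle is observed in the next. This is one possible interpretation of the state machines, not the exact ones of the slide: process 1 needs ops1 operation cycles per value, process 2 needs ops2 operation cycles after each read and raises “a” in its last operation cycle. All names are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Wires { bool a; bool r; int data; };

// Returns the sequence of values that process 2 has read after 'cycles' cycles.
std::vector<int> simulate_handshake(int ops1, int ops2, int cycles) {
    assert(ops1 >= 1 && ops2 >= 1);   // each process needs at least one operation cycle
    Wires cur{true, false, 0};        // the receiver starts by asking for data
    int busy1 = ops1;                 // remaining operation cycles of process 1
    int busy2 = 0;                    // remaining operation cycles of process 2 (0 = waiting)
    int produced = 0;
    std::vector<int> received;
    for (int t = 0; t < cycles; ++t) {
        Wires next = cur;
        // Process 1: compute, then wait for "a", then write data and pulse "r".
        next.r = false;
        if (busy1 > 0) --busy1;
        else if (cur.a) { next.data = ++produced; next.r = true; busy1 = ops1; }
        // Process 2: wait for "r", read and lower "a", compute, raise "a" again.
        if (busy2 > 0) { if (--busy2 == 0) next.a = true; }
        else if (cur.r) { received.push_back(cur.data); next.a = false; busy2 = ops2; }
        cur = next;                   // clock edge: all registers update together
    }
    return received;
}
```

Whatever the values of ops1 and ops2, process 2 receives 1, 2, 3, ... without duplicates or gaps; only the throughput changes, which is the automatic synchronization that the protocol is meant to provide.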
  86. The time diagram shows that the operations of the two processes are automatically synchronized by this protocol.
  87. When we add a state in process 2 and reduce the number of states in process 1 to two, we make the receiving process slower than the transmitting one.
  88. Here too, the protocol synchronizes the two processes automatically. However, after “op3” in process 2, an extra clock cycle is introduced automatically. This is caused by the fact that process 1 has to observe that “a” is raised before it can write the data and raise “r”. The extra cycle can be avoided by raising “a” already during “op2”.
  89. The simple handshake protocol of previous slides is just one of the many possibilities. The most general protocol is the 4-phase handshake protocol that can synchronize two systems, independent of a clock signal. The 4 phase handshake protocol consists of 4 phases: Initially, both request (Req) and acknowledgement (Ack) signals are low. Next, the Req signal is raised and the operation is executed. After the execution of the operation, the Ack signal is raised. Here starts the third phase. When the Ack signal is detected, the Req signal is turned off. This phase continues until the low Req signal is detected and the Ack signal is turned off. The picture on the slide shows the asynchronous FSM for the four-phase handshake protocol. In an asynchronous FSM the transitions are not clocked and happen as soon as the guard statement is valid.
  90. Besides the 4-phase handshake protocol, many other protocols exist. For example, a protocol can be constructed that is based on signal transitions rather than signal levels. Handshake protocols can also be simplified when both systems run on the same clock or when the receiver or transmitter is known to be the fastest. Also, the efficiency of the communication can be improved by block-based handshake protocols. In such a protocol, the communication is set up for the first element of the block. Next, a data element is communicated every cycle. There also exists a set of protocols based on typical FIFO signals.
  91. The replacement of the FIFO by protocols is only possible if no intermediate storage is needed. This is not always the case. For example, the system on the slides needs at least a storage for two data elements on queue 1. In most cases, the number of required data storages can be derived from the maximum number of elements in a queue during functional simulations. Also remark that changing the order in which data is produced in process 1 or consumed in process 2 will change the storage requirements. Another option is to integrate the required storage in one of the two processes and match the production and consumption sequences.
  92. If intermediate storage is needed, a FIFO must be explicitly introduced in hardware. A FIFO will be a module with storage, a finite state machine and communication protocols for the producing and consuming processes. The FIFO structure can be defined once and then reused in many designs.
  93. Up till now, we have considered point-to-point communications. Each channel in the functional model is then mapped to a physical channel in the hardware. However, when this communication structure becomes complicated, it might be advantageous to multiplex multiple communications on a bus structure. Communication with off-chip devices might also take advantage of a bus structure because of the limited number of available pins. The bus can be modeled as a set of multiplexers. To decide when a module is allowed to communicate on this bus, an arbiter is needed. The arbiter communicates with the processes by means of handshake protocols. If we reuse our simple protocol, the arbiter would react to the ask signals from the receiving processes, reserve the bus, and forward the ask signal to the sending process once the bus is free for data transfer. The bus and arbiter are modules that can be designed once and reused in multiple designs.
  94. After communication refinement of a functional model, we obtain a behavioral model. A behavioral model defines the functionality and also the relative ordering of inputs and outputs. To perform this ordering, a clock signal is used. Also, the pins of a module are identical to the final implementation. On the other hand, the internal operations are functionally modeled. They are not mapped on clock cycles and no functional units are allocated. Increasingly, synthesis tools are moving up from register transfer level (RTL) synthesis toward behavioral synthesis. In the latter, the synthesis tool autonomously decides on the number and types of functional units and schedules the operations on these functional units.
  95. We now take a look at the support for communication refinement in SystemC.
  96. Representing behavioral models in SystemC is straightforward. The processes are represented with (clocked) thread processes (SC_CTHREAD or SC_THREAD). To order the inputs and outputs, every module has a clock input. In the case of an SC_THREAD process, it must be made statically sensitive to this clock. To separate clock cycles, wait() statements are used in the SC_THREAD or SC_CTHREAD process. It is possible to assign a synchronous reset signal to the thread processes. If the reset signal is active at a clock event, the current process is stopped and called again from the start of the function. An asynchronous reset is also supported. Remark that because of the introduction of the clock we cannot run until the end of activity (this would never stop). Therefore we must run the simulation for a finite time or halt it explicitly.
  97. Standard signals are used to communicate between behavioral processes. A signal can only be written from one process. For the sc_signal<T> channel, three ports are predefined: sc_in<T> is essentially equivalent to sc_port<sc_signal_in_if<T> >; sc_inout<T> is essentially equivalent to sc_port<sc_signal_inout_if<T> >; sc_out<T> is identical to sc_inout<T>. The write() operation on a signal overwrites the present value. The read() operation reads the current value. Assignment operators are also available for signals. Each of these three ports must be bound to exactly one signal.
  98. Finally, we also need a clock in a behavioral model. SystemC offers a special clock channel (sc_clock), for which you can choose the period, duty ratio, initial offset and first value. An example is shown on the slide.
  99. On the slide an example is shown where three values are read in sequentially and summed. The resulting sum is put on the output. The example is modeled with a clocked thread. It could also be implemented with a thread process.
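The slide itself is not part of this transcript, so the following is only a plain C++ stand-in for the behavior described, not the SystemC code of the course. Each call to `rising_edge` represents one clock edge, which corresponds to one wait() in the clocked thread. The class name `Sum3` and its methods are assumptions introduced here.

```cpp
#include <cassert>

// Cycle-by-cycle model of a process that reads three values on three
// consecutive clock edges and then puts their sum on the output.
class Sum3 {
public:
    // Samples one input value per clock edge. Returns true in the cycle
    // in which a new sum has been written to the output.
    bool rising_edge(int input) {
        assert(count_ < 3);  // invariant of the read sequence
        acc_ += input;
        if (++count_ < 3) return false;
        output_ = acc_;  // third value read: publish the sum
        acc_ = 0;        // and start the next group of three
        count_ = 0;
        return true;
    }

    int output() const { return output_; }

    // Synchronous reset: the sequence restarts from its first read.
    void reset() {
        acc_ = 0;
        count_ = 0;
        output_ = 0;
    }

private:
    int acc_ = 0;
    int count_ = 0;
    int output_ = 0;
};
```

The counter makes explicit which of the three reads is pending, which is the state that a thread process keeps implicitly in its position between wait() statements.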
  100. To replace the queues it is advocated to follow a gradual approach. First, converters (between sc_fifo and protocol) are introduced between the processes.
  101. Next the protocol can be integrated in each process separately. At each moment the correct operation of the system can be validated through simulations.
  102. On the slide we show an example for the converters that translate between a sc_fifo and the simple synchronization protocol and vice versa.
  103. The exercise is intended to get you familiar with communication refinement. We turn again to the simplified JPEG decoder.
  104. The goal of this exercise is to replace the FIFO channel between the run-length encoder and decoder by a handshake protocol. To this end we will add converters between the two blocks to obtain a behavioral model. Next integrate the protocol functionality in the run-length decoder process, integrate the resulting behavioral model in the application, simulate the system, and verify correct operation.
  105. In this 4th class we focus on the refinement of the computations, resulting in an RTL description of the circuit. This model should be synthesizable with an RTL synthesis tool.
  106. The class consists of three parts: First, we describe the conceptual steps to transform from a behavioral into an RTL description of the circuit. Next we introduce the constructs that are available in SystemC to support this RTL modeling. Finally we exercise the new knowledge on the JPEG decoder.
  107. Next to fixed point and communication refinement, computation refinement is an important step in architectural design (from functional model towards RTL model). Remark that the order in which these three steps are performed is not defined. Refinements along these three axes can even be intermixed. There also exist interdependences between these operations. For instance if two operations share a common operator they will use the same word size.
  108. At the start of the computation refinement the embedded system is modeled with behavioral blocks, where both the data types and communications are refined. The test bench has not evolved and is still the original functional model. The RTL modeling can be introduced gradually by replacing individual behavioral blocks with RTL descriptions. The correctness of the system can be verified during this process by simulating the combination of functional, behavioral, and RTL models.
  109. Behavioral models are represented as threads which wait on clock edges to synchronize their inputs and outputs (IO). As a consequence, they can be represented by a clocked finite state machine (FSM). In the slide a Moore-type state machine, whose outputs are only determined by the state, is used.
  110. The transformation from behavioral to RTL can conceptually be represented by the scheduling of operations on this FSM. In this scheduling activity additional states can be introduced. Remark also that the scheduling of the operations can have a major impact on the inter-process communication: additional states can introduce errors in synchronized communication. Protocol-based communication is more robust, but the settings of the protocol signals might have to be adapted. Separation of operator scheduling and communication refinement is a desire in many design flows but is rarely achieved completely.
  111. The resulting FSM can be transformed back into code. The resulting RTL model can be represented either with an SC_METHOD or an SC_CTHREAD. Both can be synthesized into gate level circuits. For simplicity, we will use SC_CTHREADs.
  112. The resulting RTL description defines the datapath of the resulting scheme: the degree of parallelism is defined and, as a consequence, the number of operators is defined. Most synthesis tools automatically identify that operators in different states of the FSM can be shared. To enable this sharing, multiplexers and demultiplexers must be introduced. The RTL description also defines where registers must be inferred. In general, any signal that is generated in one state and must be used in another will be stored in a register. In general, we also register all outputs of the circuit. If a register does not have to be changed each clock cycle, a multiplexing circuit is introduced in front. This is a more robust alternative to the gating of the clock.
  113. The datapath is controlled by a controller. Each cycle it outputs all the necessary control signals for the various (de)multiplexers, which together can be considered as a long instruction word. This instruction word is determined by the output function on the basis of the state and the status inputs of the controller. The next-state function determines, based on the same data, what the next state of the controller is. The process that we described in the previous slides is performed manually for an RTL design. However, it can also be automated with behavioral synthesis.
  114. An important parameter of the datapath is the critical path of the combinatorial logic. The critical path is defined as the longest physical delay between the input (referred to the clock edge) and the output of a combinatorial circuit. The critical path must be smaller than the clock period in order to have correct operation.
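The timing condition can be made concrete with a small calculation. In this sketch the delay of a path is simply the sum of the delays of the cells on it, which ignores wiring and register setup times. The type `Path`, the function names and the nanosecond values used with them are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Delays (in ns) of the cells along one path through the combinatorial
// logic; the numbers fed in are hypothetical, not library data.
using Path = std::vector<double>;

// The critical path is the path with the largest accumulated delay.
inline double critical_path_ns(const std::vector<Path>& paths) {
    double worst = 0.0;
    for (const Path& p : paths)
        worst = std::max(worst, std::accumulate(p.begin(), p.end(), 0.0));
    return worst;
}

// Correct operation requires the critical path to be smaller than the
// clock period.
inline bool meets_timing(const std::vector<Path>& paths,
                         double clock_period_ns) {
    assert(clock_period_ns > 0.0);
    return critical_path_ns(paths) < clock_period_ns;
}
```

For two paths with cell delays {1.0, 2.0} and {0.5, 0.5, 3.0}, the critical path is 4.0 ns, so a 5 ns clock is acceptable while a 4 ns clock is not.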
  115. For signal processing systems, the data insertion interval (DII) determines the architectural style and the selected clock operation. The data insertion interval is defined as the time between two data samples of the signal. The data insertion interval is the inverse of the throughput of a circuit. When the critical path of a maximally parallel architecture is larger than the data insertion interval, pipelining is required. Pipelining reduces the critical path but introduces extra registers and hence increases the area and cost of the architecture. Pipelining can be done down to the operator or even bit-operator level. In pipelined architectures the clock period is normally chosen equal to the data insertion interval.
  116. On the other hand, when the data insertion interval is much larger than the critical path of the maximally parallel architecture, a multiplexed architecture is possible. In such an architecture fewer operators are needed, which reduces the area and cost. However, this operator reduction must be balanced against the increased number of registers and (de)multiplexers. In a multiplexed architecture the clock period is only a fraction of the data insertion interval.
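The choice between the two styles can be sketched as a comparison of the DII with the critical path of the maximally parallel architecture. The function below is a rough, idealized classification and not a design rule from the course: it assumes perfectly balanced pipeline stages, ignores register overhead, and interprets "much larger" as a factor of at least two. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

enum class Style { Pipelined, Parallel, Multiplexed };

struct Plan {
    Style style;
    // Pipelined: minimum number of pipeline stages.
    // Multiplexed: number of critical-path-long cycles that fit in one DII.
    // Parallel: always 1.
    int factor;
};

// dii_ns: data insertion interval.
// cp_ns:  critical path of the maximally parallel architecture.
inline Plan choose_style(double dii_ns, double cp_ns) {
    assert(dii_ns > 0.0 && cp_ns > 0.0);
    if (cp_ns > dii_ns) {
        // The logic is too slow for one sample per clock: cut it into
        // enough stages that each (ideal) stage fits in one DII.
        return {Style::Pipelined, static_cast<int>(std::ceil(cp_ns / dii_ns))};
    }
    int slots = static_cast<int>(std::floor(dii_ns / cp_ns));
    if (slots >= 2) {
        // Several clock cycles per sample are available, so operators
        // can be shared over time.
        return {Style::Multiplexed, slots};
    }
    return {Style::Parallel, 1};
}
```

For a DII of 10 ns and a critical path of 25 ns the sketch suggests a pipeline of at least three stages. For a DII of 100 ns and a critical path of 10 ns it suggests a multiplexed architecture with up to ten cycles per sample.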
  117. In most cases, real embedded systems do not consist of a single architectural style. For example, an image processing system starts with the raw image, which results in a small DII. Hence a fully pipelined architecture can be assumed. After the image filters, edges will be detected and these edges will be grouped into features. Because there are far fewer edges than points in the image, the DII will increase. However, the algorithms that are performed on these edges become more complicated. As a consequence, a dedicated multiplexed architecture will be optimal. Next, the system detects patterns (e.g. objects) and will control a robot to pick the object. Again the DII is larger and the algorithm more complex. Here a general purpose microcontroller will be ideal.
  118. The cost of a circuit is determined by its area. The area of a circuit is the sum of the area of its datapath and control units. Gajski worked out a complete scheme of factors that contribute to these areas. The precise weight of each of these elements is technology dependent.
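A strongly simplified version of such an area estimate is sketched below. It is not Gajski's scheme: the component list, the linear model for the control unit and all names are assumptions made for illustration. The weights in `Technology` stand for the technology-dependent parameters.

```cpp
#include <cassert>

// Area per component in arbitrary units. In practice these weights come
// from the implementation technology; here they are free parameters.
struct Technology {
    double adder;
    double multiplier;
    double reg;
    double mux;
    double control_per_state;  // crude linear model of the controller
};

// Resource counts of one candidate architecture.
struct Resources {
    int adders;
    int multipliers;
    int registers;
    int muxes;
    int states;
};

inline double datapath_area(const Resources& r, const Technology& t) {
    assert(r.adders >= 0 && r.multipliers >= 0 && r.registers >= 0 &&
           r.muxes >= 0);
    return r.adders * t.adder + r.multipliers * t.multiplier +
           r.registers * t.reg + r.muxes * t.mux;
}

inline double control_area(const Resources& r, const Technology& t) {
    assert(r.states >= 0);
    return r.states * t.control_per_state;
}

// Total area is the sum of the datapath and control unit areas.
inline double total_area(const Resources& r, const Technology& t) {
    return datapath_area(r, t) + control_area(r, t);
}
```

Such a model is mainly useful for comparing alternatives, for example a multiplexed architecture with fewer operators but more registers, multiplexers and states against a parallel one.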
  119. To derive the technological parameters for the area model we need to investigate the implementation technology. In the case of a standard cell technology, the library will provide the information for the basic cells. All cells have the same height. As a consequence, the area of a cell is directly proportional to its width. The data on the slide come from the vsclib from www.vlsitechnology.org.
  120. The class consists of three parts: First, we describe the conceptual steps to transform from a behavioral into an RTL description of the circuit. Next we introduce the constructs that are available in SystemC to support this RTL modeling. Finally we exercise the new knowledge on the JPEG decoder.
  121. No new concepts are needed to model circuits at the RTL level: modules, processes and signals are the basic elements. Each sc_module contains one or more processes and is translated into a separate hardware unit. Variables have no data storage capability and are only used for intermediate values in a single execution of a process. If data has to be stored between multiple executions of a process or has to be communicated between processes, signals have to be used.
  122. Synthesis tools put restrictions on the C++ and SystemC constructs that can be used. Currently a standardization of a synthesizable SystemC subset is underway. The slide gives an overview of the restrictions proposed in the draft standard version 1.3 for modules and processes.
  123. On datatypes and functions the following restrictions apply.
  124. At the RTL level there is a strong similarity between SystemC and other Hardware Description Languages (HDLs), like VHDL and Verilog.
  125. Communication between multiple executions of a process or between processes is done by signals.
  126. As a consequence, signals can store data and result in registers. On the slide, an example is shown of a communication between two cycles in a sequential process which results in a register.
  127. Different types of memory can be used in a design. Read Only Memory (ROM) can be specified in SystemC by means of an array of constants. A register file is generated by defining an array of sc_signals, which leads to register inference. In contrast to ROM and register files, Random Access Memory (RAM) blocks are not synthesized, but generated by a RAM generator. Therefore they must be isolated from the rest of the design and modeled by a behavioral model. The slide shows the behavioral model for a single port asynchronous RAM. Synchronous RAM, where the output is generated on a clock edge, is an alternative. Instead of single port RAM, dual port RAM is also sometimes used.
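Since the slide is not included in this transcript, the following plain C++ class only sketches what a behavioral model of a single port asynchronous RAM does. It is not the SystemC model of the course. The 16-bit word width and the names `AsyncRam` and `access` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral model of a single port asynchronous RAM: one address port,
// and the data output follows the addressed word without any clock.
class AsyncRam {
public:
    explicit AsyncRam(std::size_t words) : mem_(words, 0) {
        assert(words > 0);
    }

    // One access through the single port. With write_enable set, data_in
    // is stored first. The return value is the word visible on the data
    // output for this address.
    std::uint16_t access(std::size_t address, bool write_enable,
                         std::uint16_t data_in = 0) {
        assert(address < mem_.size());
        if (write_enable) mem_[address] = data_in;
        return mem_[address];
    }

private:
    std::vector<std::uint16_t> mem_;
};
```

A synchronous variant would instead register the output, so that the word read at an address only appears after the next clock edge.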
  128. A unique feature of SystemC is the capability for mixed simulation of functional, behavioral and RTL models. To this end the simulation engine follows a scheduling approach where functional, behavioral and RTL processes are executed iteratively until no activity remains in the system. Next, the primitive channels are updated and the scheduler goes back to the iterative execution of the processes.
  129. An important aspect of a simulation at the RTL level is the measurement of the performance of the system. SystemC offers a number of supporting functions to obtain the simulation time (virtual time for the circuit), convert it to seconds and print it on the screen. If the period of the clock is known, the simulation time can be converted into clock cycles. Remark that the circuit might require some clock cycles for initialization. As a consequence, the steady-state throughput of the system is normally better than the number of processed data elements divided by the simulation time.
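The conversion from simulation time to clock cycles, and the effect of the initialization cycles on the measured throughput, can be illustrated with a few helper functions. These are plain C++ helpers written for this note, not SystemC functions. The names and the choice of nanoseconds as time unit are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Number of complete clock cycles contained in the simulated time.
inline std::uint64_t elapsed_cycles(std::uint64_t sim_time_ns,
                                    std::uint64_t clock_period_ns) {
    assert(clock_period_ns > 0);
    return sim_time_ns / clock_period_ns;
}

// Average number of data elements per cycle over the whole simulation,
// including the cycles spent on initialization.
inline double measured_throughput(std::uint64_t elements,
                                  std::uint64_t cycles) {
    assert(cycles > 0);
    return static_cast<double>(elements) / static_cast<double>(cycles);
}

// Throughput once the circuit is running, with the initialization
// cycles excluded from the measurement.
inline double steady_state_throughput(std::uint64_t elements,
                                      std::uint64_t cycles,
                                      std::uint64_t init_cycles) {
    assert(cycles > init_cycles);
    return static_cast<double>(elements) /
           static_cast<double>(cycles - init_cycles);
}
```

For example, 1000 ns of simulation with a 10 ns clock corresponds to 100 cycles. If 50 elements were processed and 20 cycles were spent on initialization, the measured throughput is 0.5 elements per cycle, while the steady-state throughput is 0.625.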
  130. SystemC defines a number of auxiliary functions to dump trace files. To open a trace file, the function sc_trace_file *sc_create_vcd_trace_file(char*) is called. To include an sc_signal in the trace file, one uses the sc_trace(sc_trace_file*, sc_signal, char*) function. Finally, the trace file is closed with sc_close_vcd_trace_file(sc_trace_file*). The trace file is stored in VCD (value change dump) format. It can be viewed with trace file viewers such as gtkwave.
  131. The class consists of three parts: First, we describe the conceptual steps to transform from a behavioral into an RTL description of the circuit. Next we introduce the constructs that are available in SystemC to support this RTL modeling. Finally we exercise the new knowledge on the JPEG decoder.
  132. The goal of this exercise is to refine the RL decoder into an RTL model and to integrate it into the system. Normally the RTL model will be synthesized to generate dedicated hardware. In advance we can already estimate the hardware complexity (in number of adders, multipliers, multiplexers, registers, memories, etc.).