This document provides an overview of part 2 of a course on specification languages. It discusses model based system design using SystemC. It introduces object oriented techniques for designing hardware systems and provides hands-on experience with SystemC. The material for part 2 includes slides, the SystemC language reference manual, and an exercise on building a functional model of a JPEG encoder/decoder in SystemC. It discusses key aspects of functional modeling in SystemC including modules, ports, processes, channels and the simulation engine.
2. 2
Specification Languages
Part 1: Specification Models
Part 2: Model based system design
Show how the models of part 1 can be used for architectural design
Provide hands-on experience with SystemC v2.3.2 (released in October 2017).
Introduce OO techniques for design of hardware systems
Part 3: Project
3. 3
Course Material for part 2
Prerequisite:
part 1 of specification languages
C++ (good tutorial at www.cplusplus.com)
Coding and debugging programs
RTL description of synchronous digital circuits
Material for part 2:
Slides with notes.
IEEE Standard SystemC Language Reference Manual, IEEE Std 1666-2011.
5. 5
Functional modeling in SystemC
Introduction to design of digital embedded systems
SystemC introduction
SystemC functional model syntax
Exercise 1: building a functional model in SystemC
8. 8
Characteristics of embedded systems
Optimize for power, cost, and size
Robust design
Provide the ability for evolution and mass customization
Minimize time to market
Some functionality might be safety-critical
Interfacing with the real world, leading to real time constraints
9. 9
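Embedded systems combine various types of real-time behavior
[Figure: block diagram of a real world process connected to a processing block. Labels: sensors, signal conditioning, ADC, processing, DAC, actuator powering, actuators, user; the exchanged items are labeled event, signal, and action.]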
10. 10
Digital embedded systems combine hardware and software
[Figure: block diagram of a digital embedded system. Labels: user interface, NVM, ROM, µP or DSP core, RAM, conf. logic, memories, peripheral, modem, buffers, video/graphics processor, protocol, speech processing, analysis of channel; plus analog, sensors and actuators.]
11. 11
Design flow for digital embedded systems
[Figure: design flow diagram. Labels: functional requirements, system functionality, performance requirements, architecture template, architectural requirements, mapping, non-functional requirements, dedicated architecture, C-code.]
12. 12
Function to architecture conversion follows three axes
Computations: operations -> operators (resource allocation, scheduling)
Data: variables, arrays, floating point -> memories, fixed point (memory allocation, address generation, word sizing)
Communication: point-to-point queues -> busses, detailed protocol (bus allocation, introduce arbiters, include protocols)
Each axis leads from System Functionality to Dedicated Architecture
14. 14
SystemC bridges the gap between function and architecture
[Figure: languages positioned between System Functionality and Dedicated Architecture. Labels: MATLAB, C/C++, VHDL, Verilog, SystemC.]
15. 15
What is SystemC?
A modeling framework in C++ for the refinement of a system from a functional description into an architecture
Contributions:
hardware modeling with C++: OCAPI (IMEC) and SCENIC (Synopsys/UC Irvine)
fixed-point data types: Frontier Design
hardware-software co-design: CoWare (IMEC/CoWare)
Language first standardized in December 2005 as IEEE 1666, revised in 2011 as IEEE 1666-2011
Extensions of SystemC:
Verification library.
Transaction level modeling library (integrated in IEEE 1666-2011).
Analog and mixed-signal modeling.
More info: www.accellera.org
16. 16
Which tools are available for SystemC?
Open source simulation library available
Open source translators from Verilog or VHDL to SystemC
Commercial synthesis tools:
Cadence (Stratus HLS).
Mentor (Catapult C).
NEC (CyberWorkBench).
SystemCrafter (SC).
Xilinx (Vivado Design Suite).
17. 17
SystemC language architecture
[Figure: layered diagram of the SystemC language architecture, containing the blocks listed below.]
User Application
Libraries for specific models of computation and/or methodologies, e.g. TLM interfaces, bus models, SystemC verification library
Pre-defined channels: signal, clock, FIFO, mutex, semaphore
Utilities: report handling, tracing
Core language: modules, ports, exports, processes, interfaces, channels, events, event-driven simulation kernel
Data types: 4-valued logic type, 4-valued logic vectors, bit-vectors, finite-precision integers, limited-precision integers, fixed-point types
C++ language
21. 21
Modules are used for structural partitioning of the functionality
Each module has its own class, derived from the sc_module class.
Every constructor of a module class shall have exactly one parameter of class sc_module_name (it may have further parameters of other classes).
It is good practice to make this name for an instance of the module the same as the C++ variable name through which the module is referenced.
A module can be hierarchical or contain processes. In the latter case, the SC_HAS_PROCESS(class_name) macro is used to indicate that the module contains processes.
22. 22
Example of a functional model of an adder
With macros:
SC_MODULE(adder) {
  // define ports
  // define processes, internal data, etc.
  SC_CTOR(adder) {
    // body of constructor;
    // process declaration, sensitivities, etc.
  }
};
Explicit:
class adder : public sc_module {
public:
  // define ports
  // define processes, internal data, etc.
  SC_HAS_PROCESS(adder);
  adder(sc_module_name name) :
    sc_module(name) {
    // body of constructor;
    // process declaration, sensitivities, etc.
  }
};
23. 23
Ports are used to communicate with a FIFO channel
General port definition: sc_port<interface>
Predefined ports are: sc_fifo_in<T> and sc_fifo_out<T>.
sc_fifo_in<T> is derived from sc_port<sc_fifo_in_if<T>,0> with interface functions read(), nb_read(), and num_available().
sc_fifo_out<T> is derived from sc_port<sc_fifo_out_if<T>,0> with interface functions write(), nb_write(), and num_free().
Blocking read and write interface functions (automatic synchronization with implicit wait() operations):
int a = f1.read(); // read a token
f1.write(a); // write a token
Inspecting queues:
int n = f1.num_available(); // number of tokens in a queue
int m = f1.num_free(); // number of free places in a queue
24. 24
Example of a functional model of an adder (continued)
SC_MODULE(adder) {
  sc_fifo_in<int> a, b;
  sc_fifo_out<int> c;
  // define processes, internal data, etc.
  SC_CTOR(adder) {
    // body of constructor;
    // process declaration, sensitivities, etc.
  }
};
25. 25
SC_THREAD processes are used to model functional processes
An SC_THREAD process is started once and typically runs forever in an infinite loop.
SC_THREAD processes can be suspended by means of the wait(event) function. In functional modeling the wait statements are hidden in the read() and write() functions of the queues.
Multiple processes per module are possible.
Processes can also be dynamically created.
26. 26
Example of a functional model of an adder (continued)
SC_MODULE(adder) {
  sc_fifo_in<int> a, b;
  sc_fifo_out<int> c;
  void compute() {
    while (true) {
      int valuea = a.read();
      int valueb = b.read();
      c.write(valuea + valueb);
    }
  }
  SC_CTOR(adder) {
    SC_THREAD(compute);
  }
};
27. 27
Define the main program
The SystemC library must be included in the main program:
#include <systemc.h>
In sc_main() the following actions are taken:
Instantiate channels with:
• sc_fifo<T> name("name", length); // default length 16
• e.g. sc_fifo<int> f1("f1", 2);
Instantiate the modules.
Bind ports of modules to channels:
• positional
• named.
Call sc_start() to start simulation and run until end of any activity.
28. 28
Example of a functional model of an adder (continued)
int sc_main(int argc, char *argv[]) {
  // elaboration phase: everything up to the call of sc_start()
  sc_fifo<int> fifo_a, fifo_b, fifo_c; // channel instantiation
  … // instantiate signal generation and evaluation module
  adder my_adder("my_adder"); // module instantiation
  my_adder.a(fifo_a); // binding of port to channel
  my_adder.b(fifo_b);
  my_adder.c(fifo_c);
  … // other modules and test bench, which drive fifo_a and fifo_b.
  sc_start(); // start simulation
  return 0;
}
29. 29
Modules can also be used to create hierarchy
SC_MODULE(superfunc) {
  // IO ports
  sc_fifo_in<float> in;
  sc_fifo_out<float> out;
  // internal queues
  sc_fifo<float> d;
  // internal modules; function is itself an sc_module
  function func1;
  function *func2;
  // Module constructor
  SC_CTOR(superfunc) :
    func1("func1") {
    func1.in(in);
    func1.out(d);
    func2 = new function("func2");
    func2->in(d);
    func2->out(out);
  }
};
[Figure: superfunc contains func1 and func2, connected through the internal queue d.]
30. 30
Simulation engine
In an un-timed model, the simulator only advances in delta-cycles:
If it is started to run for a finite amount of time, it will never stop as long as activity continues, because simulated time does not advance.
We therefore run it until no events are present: sc_start();
Ways of stopping the simulator:
Terminate a process (return from SC_THREAD): the simulator will stop due to the lack of events.
Call sc_stop() when a termination condition is fulfilled.
32. 32
Goal of this exercise
use a simplified JPEG block diagram to practice functional modeling
develop a functional process that fits into a system
simulate a functional model
observe the overall behavior of a system
33. 33
What is JPEG?
“JPEG” stands for “Joint Photographic Experts Group”
“JPEG” is a standard for color image compression
“JPEG” is widely used (e.g. on the WWW)
More information?
http://www.jpeg.org/
35. 35
2D Discrete Cosine Transform
Non-optimized equation
DCT can be separated in consecutive 1-D operations
There are many optimized DCT-algorithms available
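Forward transform:
\[ F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j)\, \cos\frac{(2i+1)\,u\,\pi}{16} \cdot \cos\frac{(2j+1)\,v\,\pi}{16} \]
Inverse transform:
\[ f(i,j) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\, C(v)\, F(u,v)\, \cos\frac{(2i+1)\,u\,\pi}{16} \cdot \cos\frac{(2j+1)\,v\,\pi}{16} \]
where
\[ C(l) = \begin{cases} \dfrac{1}{\sqrt{2}} & l = 0 \\ 1 & l \neq 0 \end{cases} \]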
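The non-optimized forward and inverse transforms of this slide can be written down directly as nested loops. The following is a minimal sketch in plain C++, not the dct/idct modules of the exercise library; the names Block, dct_c, dct_basis, dct2d and idct2d are chosen here for illustration only.

```cpp
#include <array>
#include <cmath>

// 8x8 block of samples f(i,j) or coefficients F(u,v) (illustrative type name).
using Block = std::array<std::array<double, 8>, 8>;

// Scale factor C(l): 1/sqrt(2) for l == 0, 1 otherwise.
inline double dct_c(int l) { return l == 0 ? 1.0 / std::sqrt(2.0) : 1.0; }

// Basis term cos((2k+1) * w * pi / 16), shared by both transforms.
inline double dct_basis(int k, int w) {
    static const double pi = std::acos(-1.0);
    return std::cos((2 * k + 1) * w * pi / 16.0);
}

// Forward 2D DCT, evaluated literally (four nested loops, no optimization).
Block dct2d(const Block& f) {
    Block F{};
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int i = 0; i < 8; ++i)
                for (int j = 0; j < 8; ++j)
                    sum += f[i][j] * dct_basis(i, u) * dct_basis(j, v);
            F[u][v] = 0.25 * dct_c(u) * dct_c(v) * sum;
        }
    }
    return F;
}

// Inverse 2D DCT; C(u) and C(v) sit inside the sums here.
Block idct2d(const Block& F) {
    Block f{};
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 8; ++j) {
            double sum = 0.0;
            for (int u = 0; u < 8; ++u)
                for (int v = 0; v < 8; ++v)
                    sum += dct_c(u) * dct_c(v) * F[u][v]
                         * dct_basis(i, u) * dct_basis(j, v);
            f[i][j] = 0.25 * sum;
        }
    }
    return f;
}
```

For a block in which every sample equals 1, only the DC coefficient F(0,0) is non-zero, and applying idct2d to the result of dct2d returns the original block up to floating-point rounding.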
36. 36
Quantization
Each DCT coefficient is divided by the coefficient amplitude that is just detectable by the human eye (table)
The result is rounded to an integer
This reduces the number of bits needed to represent the DCT coefficient
The quantization is the place where information of the image might be lost, resulting in lossy compression.
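The divide-and-round step can be sketched in a few lines of C++. This is an illustration under assumptions: the function names are hypothetical, the table entry is passed in as a plain integer, and the reconstruction by multiplication is the usual counterpart rather than something taken from the exercise's quantize or normalize modules.

```cpp
#include <cmath>

// Divide a DCT coefficient by its table entry and round to the nearest integer.
// table_entry is assumed to be a positive value from the quantization table.
inline int quantize_coeff(double coeff, int table_entry) {
    return static_cast<int>(std::lround(coeff / table_entry));
}

// Approximate reconstruction: multiply the quantized level by the table entry.
inline int dequantize_coeff(int level, int table_entry) {
    return level * table_entry;
}
```

A coefficient of -415.38 with table entry 16 becomes level -26 and is reconstructed as -416; a coefficient of 5.0 with the same entry becomes 0, which is where information is lost.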
39. 39
(Simplified) Run-length coding
Send the DC value “as is”
Represent the high frequency data with (zero run-length, amplitude) combinations.
End the stream with EOB (= 63).
Example:
in: 79, 0, -2, -1, 3, -1, 0, 0, -1, 0, 0, 0, …
out: 79, 1, -2, 0, -1, 0, 3, 0, -1, 2, -1, 63
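The encoding rule of this slide can be sketched as a plain C++ function over one block of 64 coefficients. It is not the rl_enc module of the exercise library; the names rl_encode and kEOB are illustrative, and the sketch assumes that EOB is always appended, also when the last coefficient is non-zero.

```cpp
#include <array>
#include <vector>

constexpr int kEOB = 63;  // end-of-block marker, as on the slide

// Simplified run-length coding of one block of 64 coefficients.
std::vector<int> rl_encode(const std::array<int, 64>& block) {
    std::vector<int> out;
    out.push_back(block[0]);  // DC value is sent "as is"

    // Position of the last non-zero AC coefficient (0 if all are zero).
    int last = 0;
    for (int k = 63; k >= 1; --k) {
        if (block[k] != 0) { last = k; break; }
    }

    int run = 0;  // number of zeros seen since the last non-zero value
    for (int k = 1; k <= last; ++k) {
        if (block[k] == 0) {
            ++run;
        } else {
            out.push_back(run);       // zero run-length
            out.push_back(block[k]);  // amplitude
            run = 0;
        }
    }
    out.push_back(kEOB);  // trailing zeros are replaced by EOB
    return out;
}
```

Applied to a block that starts with 79, 0, -2, -1, 3, -1, 0, 0, -1 and is zero afterwards, the function returns the output sequence of the example.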
40. 40
How to start?
Download exercise files from http://www.icorsi.ch/
Follow installation instructions of exercises.
You will find:
In /exercises/exercise1/: main.cpp to start from
In /exercises/modules/: library with JPEG encoder modules {r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules {b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules {src,snk,test}.{h,cpp}
In /exercises/images/: test images
In /exercises/add2systemc: additional functions (df_fork, fifo_stat)
Things to be done:
Make rl_dec.h and rl_dec.cpp
Complete the main.cpp with the modules.
Compile and execute the application.
Inspect the number of reads and writes in the fifos
Visualize resulting image
Test if you can launch the application in the debugger.
Optional: make a hierarchy for the encoder and decoder.
41. 41
Using SystemC on Linux/Cygwin
Use g++ (I used version 4.5.3).
Make a workspace in Eclipse:
Add your source files to the project.
Add libmodules.a
Add libadd2systemc.a (for next exercises).
Add libsystemc.a
Set the right include paths and linker paths
Build your application from within Eclipse.
Execute your application from within Eclipse:
Exercise1.exe -i ../images/mountain.pgm -o result.pgm
43. 43
Fixed point refinement
Fixed word length optimization
Overflow and quantization
MSB determination
LSB determination
Fixed word length support in SystemC
Exercise 2: fixed point refinement of IDCT
44. 44
Fixed point refinement is one of the steps in architectural design
[Figure: refinement from System Functionality to a Dedicated Architecture along three axes.
Computations: operations become operators (resource allocation, scheduling).
Data: variables, arrays and floating point become memories and fixed point (memory allocation, address generation, word sizing).
Communication: point-to-point links and queues become busses with a detailed protocol (bus allocation, introduce arbiters, include protocols).]
45. 45
Finite word lengths are a must for DSP applications
Floating-point
•powerful
•expensive (storage & ops), e.g. 3 bytes (mantissa) + 1 byte (exponent)
Fixed-point
•minimum area
•low power
•high speed
[Figure: multiplier with an 8-bit and a 6-bit operand and a 14-bit result.]
46. 46
How to model a fixed-point signal?
Total number of bits WL
Integer bits IWL
Fractional bits WL-IWL
Value representation
•2’s complement (i=-1)
•unsigned (i=1)
[Figure: bit weights from MSB to LSB: i·2^3, 2^2, 2^1, 2^0, 2^-1, 2^-2, 2^-3, with the spans WL, IWL and WL-IWL marked.]
47. 47
How do we quantize?
truncate (floor)
round
magnitude truncate
ceil
[Figure: four staircase plots, one per mode, of the fixed-point (fxp) value against the floating-point (flp) value.]
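The four quantization modes can be compared with a small plain C++ helper. This is only a sketch: the enum and function names are chosen for this example, and "round" is assumed to mean rounding to the nearest step with ties going upwards.

```cpp
#include <cmath>

enum class QMode { Truncate, Round, MagnitudeTruncate, Ceil };

// Quantize x to a multiple of `step` (the weight of the LSB).
double quantize(double x, double step, QMode mode) {
    const double s = x / step;  // value expressed in LSB units
    double q = 0.0;
    switch (mode) {
        case QMode::Truncate:          q = std::floor(s);       break;  // towards -infinity
        case QMode::Round:             q = std::floor(s + 0.5); break;  // to nearest
        case QMode::MagnitudeTruncate: q = std::trunc(s);       break;  // towards zero
        case QMode::Ceil:              q = std::ceil(s);        break;  // towards +infinity
    }
    return q * step;
}
```

For x = -1.3 and a step of 0.5, truncation gives -1.5 while magnitude truncation gives -1.0. The two modes only differ for negative values.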
48. 48
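What happens on an overflow?
wrap-around
saturation
[Figure: two plots of the fixed-point (fxp) value against the floating-point (flp) value, one per mode, with the max. value marked.]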
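Both overflow behaviors can be illustrated on integers in plain C++. The sketch assumes a signed 2's complement word of `bits` bits with 2 <= bits <= 62; the function names are hypothetical.

```cpp
#include <cstdint>

// Wrap-around: keep only the lowest `bits` bits, interpreted as 2's complement.
std::int64_t wrap(std::int64_t v, int bits) {
    const std::int64_t span = std::int64_t{1} << bits;  // number of codes
    const std::int64_t half = span >> 1;                // 2^(bits-1)
    const std::int64_t r = ((v + half) % span + span) % span;  // into [0, span)
    return r - half;                                    // back to [-half, half)
}

// Saturation: clip to the largest or smallest representable value.
std::int64_t saturate(std::int64_t v, int bits) {
    const std::int64_t hi = (std::int64_t{1} << (bits - 1)) - 1;
    const std::int64_t lo = -hi - 1;
    return v < lo ? lo : (v > hi ? hi : v);
}
```

For a 4-bit word (range -8 to 7), the value 9 wraps to -7 but saturates to 7. Wrap-around costs no extra logic, whereas saturation keeps the error small at the price of a comparator.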
51. 51
Fixed-point refinement is a complex optimization problem
Minimize overall cost:
minimal word lengths
truncate and wrap-around
MSB determination:
goal: avoid unwanted overflows
method: find min, max signal values
result: MSB position, value representation, overflow
LSB determination:
goal: keep required precision
method: evaluate difference between flp and fxp behavior
result: LSB position, quantization
[Figure: a word in which the MSB side is labelled "safe range" and the LSB side is labelled "quantization".]
52. 52
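MSB determination can be based on range calculations
Put range (min, max) on inputs
Propagate range over the operators
This gives a safe (pessimistic) estimate
[Figure: range calculation on the example network. Input x has range [0,255]; m, the product of x and 12, has range [0,3060]; d, the output of the z-1 delay, has range [0,255]; y, the sum of m and d, has range [0,3315].]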
53. 53
Range propagation is a simple calculation
Operator | minc | maxc
c=a+b | mina+minb | maxa+maxb
c=a-b | mina-maxb | maxa-minb
c=a*b | MIN(mina*minb, mina*maxb, maxa*minb, maxa*maxb) | MAX(mina*minb, mina*maxb, maxa*minb, maxa*maxb)
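These rules translate directly into a small interval type. The following plain C++ sketch uses names (`Range`, `add`, `sub`, `mul`) chosen for this illustration only.

```cpp
#include <algorithm>

struct Range {
    double min;
    double max;
};

// c = a + b
Range add(Range a, Range b) { return {a.min + b.min, a.max + b.max}; }

// c = a - b
Range sub(Range a, Range b) { return {a.min - b.max, a.max - b.min}; }

// c = a * b: the extremes are among the four corner products.
Range mul(Range a, Range b) {
    const double p1 = a.min * b.min;
    const double p2 = a.min * b.max;
    const double p3 = a.max * b.min;
    const double p4 = a.max * b.max;
    return {std::min({p1, p2, p3, p4}), std::max({p1, p2, p3, p4})};
}
```

For an input range [0,255], a multiplication by the constant 12 gives [0,3060], and adding a second signal with range [0,255] gives [0,3315].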
54. 54
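Range calculations can get unstable with feedback
[Figure: feedback loop with input X(n), output Y(n), a z-1 delay, a multiplication by a and an adder producing F(n); a plot of value against sample n shows the bounds maxF and minF.]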
55. 55
Collecting signal statistics from simulations is an alternative
Perform simulation with realistic stimuli.
Collect minimum and maximum value on each signal during the simulation
This gives an optimistic, stimuli dependent estimate
[Figure: the example network (x, quantizer q1, multiplication by 12, adder, z-1 delay, signals m, d and y) driven by stimuli, with min and max recorded on the signals.]
56. 56
Combine both methods for accurate MSB determination
name | signal statistic: min, max, MSB1 | range propagation: min, max, MSB2
signal1 | -1.5, 1.6, 2 | -1.9, 1.9, 2
signal2 | -1.3, 1.4, 2 | -2.1, 2.1, 3
signal3 | -1.2, 1.2, 2 | -22.0, 22.0, 6
signal4 | -1.2, 1.2, 2 | -∞, ∞, ∞
If MSB1 == MSB2: wrap-around (MSB1)
If MSB1 < MSB2: wrap-around (MSB2)
If MSB1 << MSB2: saturation (MSB1)
If MSB2 is ∞: saturation (MSB1)
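The MSB values in the table are consistent with counting the integer bits (sign bit included) of a 2's complement word that just contains the range. The plain C++ sketch below works under that assumption. The names are hypothetical, and the threshold `max_gap` that decides when MSB1 is "much smaller" than MSB2 is an invented parameter, since the slide does not quantify it.

```cpp
#include <cmath>

// Smallest n such that [-2^(n-1), 2^(n-1)) contains [min, max].
// Returns -1 for an unbounded range (the "infinite" MSB of the table).
int integer_bits(double min, double max) {
    if (!std::isfinite(min) || !std::isfinite(max)) return -1;
    int n = 1;
    while (min < -std::ldexp(1.0, n - 1) || max >= std::ldexp(1.0, n - 1)) {
        ++n;
    }
    return n;
}

struct MsbChoice {
    int msb;        // chosen MSB position
    bool saturate;  // true: saturation logic, false: wrap-around
};

// msb1: from simulation statistics, msb2: from range propagation (-1 = unbounded).
MsbChoice choose_msb(int msb1, int msb2, int max_gap) {
    if (msb2 < 0 || msb2 - msb1 > max_gap) {
        return {msb1, true};  // range estimate too pessimistic: saturate at MSB1
    }
    return {msb2, false};     // safe MSB is affordable: plain wrap-around
}
```

With `max_gap` set to 2, signal2 of the table gets wrap-around at MSB 3, while signal3 and signal4 get saturation at MSB 2.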
57. 57
Quantization effects can be modeled as additive noise
[Figure: a quantizer Q with B bits between input and output is replaced by an adder that adds noise to the input.]
Noise is approximated by a statistical model with the following assumptions:
the noise is uncorrelated to the input.
the noise is white.
the probability distribution is uniform.
58. 58
Each quantization effect has mean and variance
Rounding with step ∆: $m_n = 0$ and $\sigma_n^2 = \frac{\Delta^2}{12}$
Truncation with step ∆: $m_n = -\frac{\Delta}{2}$ and $\sigma_n^2 = \frac{\Delta^2}{12}$
Magnitude truncation with step ∆: $m_n = 0$ and $\sigma_n^2 = \frac{\Delta^2}{3}$
59. 59
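This results in an equivalent linear network
[Figure: the example network (x, multiplication by 12, adder, z-1 delay, signals m, d and y) with the quantizers Q1 and Q2, next to the equivalent network in which the quantizers are replaced by the additive noise sources e1(t) and e2(t).]
$$y(t) = \bigl(12\,x(t) + x(t-1)\bigr) + \bigl(12\,e_1(t) + e_2(t) + e_1(t-1)\bigr)$$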
60. 60
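… but quantization is a non-linear operation
[Figure: first-order feedback loop with input X(n), output Y(n), a z-1 delay, a multiplication by -0.96 and a quantizer Q with B bits that rounds to the nearest integer.]
X(0) = 14, X(n) = 0 for n > 0
[Figure: plots of the response with rounding and without rounding.]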
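The effect can be reproduced with a few lines of plain C++. The sketch assumes the recursion Y(n) = Q(X(n) - 0.96·Y(n-1)), with the quantizer placed after the adder; the exact position of Q in the figure is not recoverable, so this placement is an assumption, as is the function name.

```cpp
#include <cmath>

// Returns Y(n) of the loop y = x + a * y_previous for the input
// X(0) = x0, X(k) = 0 for k > 0. With round_result set, the loop
// value is rounded to the nearest integer in every step.
double loop_output(double x0, double a, int n, bool round_result) {
    double y = 0.0;
    for (int k = 0; k <= n; ++k) {
        const double x = (k == 0) ? x0 : 0.0;
        y = x + a * y;
        if (round_result) y = std::round(y);
    }
    return y;
}
```

Without rounding the response decays towards zero (about 0.24 in magnitude after 100 samples). With rounding it gets stuck in a limit cycle, alternating between 12 and -12, because 0.96 times 12 equals 11.52, which rounds back to 12. The linear noise model cannot predict such behavior.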
61. 61
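LSB determination is based on simulations
[Figure: flow chart. The network with quantizers Q and the reference network are simulated with the same stimuli and their outputs are compared; the loop repeats while the output is not ok (no) and ends (yes) when all signals are fixed-point.]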
62. 62
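Signal to quantization noise ratio (SQNR)
$$\mathrm{SQNR} = 10 \log_{10} \frac{m_s^2 + \sigma_s^2}{m_e^2 + \sigma_e^2}$$
[Figure: the signal x (mean ms, standard deviation σs) is quantized by Q to xQ; the error e is the difference between xQ and x and has mean me and standard deviation σe.]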
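Since the squared mean plus the variance equals the mean of the squares, the SQNR can be measured by accumulating squared samples during a simulation. A minimal plain C++ sketch, with a hypothetical function name and assuming two non-empty traces of equal length with a non-zero error:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SQNR in dB between a reference trace x and its quantized version xq.
double sqnr_db(const std::vector<double>& x, const std::vector<double>& xq) {
    double signal_power = 0.0;  // sum of x^2, proportional to m_s^2 + sigma_s^2
    double error_power = 0.0;   // sum of e^2, proportional to m_e^2 + sigma_e^2
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double e = xq[i] - x[i];
        signal_power += x[i] * x[i];
        error_power += e * e;
    }
    // The common factor 1/N cancels in the ratio.
    return 10.0 * std::log10(signal_power / error_power);
}
```

A signal of amplitude 1 with an error of amplitude 0.1 gives a power ratio of 100, which is 20 dB.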
64. 64
Fixed point refinement
Fixed word length optimization
Overflow and quantization
MSB determination
LSB determination
Fixed word length support in SystemC
Exercise 2: fixed point refinement of IDCT
65. 65
SystemC introduces a number
of specific data types
Type Description
sc_logic 4 value {0,1,X,Z} single bit
sc_int 1 to 64 bit signed integer
sc_uint 1 to 64 bit unsigned integer
sc_bigint Arbitrary size signed integer
sc_biguint Arbitrary size unsigned integer
sc_bv Arbitrary sized 2 value vector
sc_lv Arbitrary sized 4 value vector
sc_fixed Signed fixed point
sc_ufixed Unsigned fixed point
sc_fix Untemplated signed fixed point
sc_ufix Untemplated unsigned fixed point
66. 66
SystemC templated fixed-point
types
Two fixed point templates
sc_fixed <wl, iwl, q_mode, o_mode, n_bits> x; // signed
sc_ufixed <wl, iwl, q_mode, o_mode, n_bits> y; // unsigned
Parameters:
wl = number of bits
iwl = number of integer bits
q_mode = quantization method (SC_RND / SC_TRN /
SC_TRN_ZERO / ...)
o_mode = overflow method (SC_SAT / SC_WRAP / … )
n_bits = number of saturated bits in case of wrapping (default 0)
If quantization and overflow not specified the defaults (SC_TRN and
SC_WRAP) are used
67. 67
Fixed point lengths
(X: stored bit, 0: implied zero, S: sign extension)
declaration | bit pattern | range
sc_fixed <5, 7> v; | XXXXX00. | [ -64 , 60 ]
sc_fixed <5, 3> v; | XXX.XX | [ -4 , 3.75 ]
sc_fixed <5, -2> v; | .SSXXXXX | [ -0.125 , 0.1171875 ]
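The range of a signed fixed-point word follows from wl and iwl alone: the LSB weighs 2^(iwl-wl) and the values span from -2^(iwl-1) up to 2^(iwl-1) minus one LSB. The plain C++ sketch below evaluates this; it does not use the SystemC types, and the struct and function names are invented for the example.

```cpp
#include <cmath>

struct FxRange {
    double min;
    double max;
    double lsb;
};

// Range of a signed (2's complement) fixed-point word with
// wl total bits and iwl integer bits; iwl may exceed wl or be negative.
FxRange signed_fixed_range(int wl, int iwl) {
    const double lsb = std::ldexp(1.0, iwl - wl);  // 2^(iwl - wl)
    const double half = std::ldexp(1.0, iwl - 1);  // 2^(iwl - 1)
    return {-half, half - lsb, lsb};
}
```

For wl = 5 and iwl = 7 this gives an LSB of 4 and the range [-64, 60]; for wl = 5 and iwl = 3 it gives an LSB of 0.25 and the range [-4, 3.75].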
70. 70
Fixed-point simulation
operations in floating-point
quantization and overflow handling during assignment
sc_fixed <4,3> a;
sc_fixed <4,1> b;
sc_fixed <4,2> c;
a = 1.6;
b = 0.9;
c = a * b;
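[Figure: quantization (Q) during assignment]
variable | type | lsb precision | assigned value | stored value
a | sc_fixed <4,3> | 0.5 | 1.6 | 1.5
b | sc_fixed <4,1> | 0.125 | 0.9 | 0.875
c | sc_fixed <4,2> | 0.25 | 1.3125 (= a * b) | 1.25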
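The numbers of this example can be checked without SystemC by emulating the default truncation during assignment. The helper below is a sketch with a hypothetical name; overflow handling is deliberately left out because none of the three values overflows.

```cpp
#include <cmath>

// Emulates assignment to a signed fixed-point variable with wl total bits
// and iwl integer bits, using truncation (floor) to the LSB.
double assign_truncated(double x, int wl, int iwl) {
    const double lsb = std::ldexp(1.0, iwl - wl);  // 2^(iwl - wl)
    return std::floor(x / lsb) * lsb;
}
```

Used as `a = assign_truncated(1.6, 4, 3)`, `b = assign_truncated(0.9, 4, 1)` and `c = assign_truncated(a * b, 4, 2)`, this yields 1.5, 0.875 and 1.25. The product a * b itself is computed at full precision (1.3125) and only quantized when it is stored in c.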
71. 71
SystemC fixed point types with
non-static arguments
Fixed point parameter values
sc_fxtype_params my_type(wl,iwl,q_mode,o_mode,n_bits);
x = my_type.wl(); // read the word length
my_type.iwl(x-2); // set the integer word length
Two non-static fixed point types
sc_fix x(my_type); // signed
sc_ufix y(my_type); // unsigned
For arrays, these types are used with a context
sc_fxtype_context my_context(my_type);
sc_fix z[64];
Remark: for fixed point simulations, include in every file
#define SC_INCLUDE_FX
#include <systemc.h>
72. 72
Fixed point refinement
Fixed word length optimization
Overflow and quantization
MSB determination
LSB determination
Fixed word length support in SystemC
Exercise 2: fixed point refinement of IDCT
73. 73
Goal of this exercise
Perform fixed point refinement for all the internal variables of
the IDCT in the JPEG example
determine the MSB to avoid internal overflows without overflow
logic.
determine the LSB to have no more than 0.5 dB degradation on
the PSNR of the resulting image
74. 74
How to start?
You find:
In .../exercises/exercise2/ : the functional model with a fixed point IDCT
implementation; types-file datatypes_original.txt
In /exercises/modules/: library of JPEG-encoder modules
{r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules
{b2r,idct,normalize,zz_dec}.{h,cpp} and testbench modules {src,snk,test}.{h,cpp}
Special fixed point support functions of directory
…/exercises/add2systemc/ are used
In /exercises/images/: test images
Things to do:
inspect the code to understand the behavior
Make the application
change datatypes.txt file
syntax: exercise2 -i <inputfile> -o <outputfile> -t <typefile>
75. Model Based System Design
Class 3: Communication
Refinement
Marc Engels
e-mail: marc.engels@flandersmake.be
77. 77
Communication refinement is one of the steps in architectural design
[Figure: refinement from System Functionality to a Dedicated Architecture along three axes.
Computations: operations become operators (resource allocation, scheduling).
Data: variables, arrays and floating point become memories and fixed point (memory allocation, address generation, word sizing).
Communication: point-to-point links and queues become busses with a detailed protocol (bus allocation, introduce arbiters, include protocols).]
78. 78
Functional models use FIFO communication
Queues guarantee consistent data passing
Implementation could become expensive for large sizes
communication must be optimized
[Figure: Process1 connected to Process2 through a queue with (infinite) storage.]
80. 80
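Example of correct wired communication
[Figure: Process 1 and Process 2 connected by a wire. Process 1 loops over filt1, filt2, filt3 and filt4 and calls write() in the last step. Process 2 starts with w=0, executes w++ while w<4 and, once w=4, loops over op1, op2, op3 and op4, calling read() together with op1.]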
81. 81
Communication is perfectly aligned
Clock cycle | Process 1 | Process 2
1 | filt1 | w=1
2 | filt2 | w=2
3 | filt3 | w=3
4 | filt4 write() | w=4
5 | filt1 | read() op1
6 | filt2 | op2
7 | filt3 | op3
8 | filt4 write() | op4
9 | filt1 | read() op1
10 | filt2 | op2
… | … | …
We have to guarantee the condition that every write() comes before a read()
82. 82
Small changes to design can
result in errors
Increase (decrease) the number of operations in process 1 (2):
the same data will be consumed twice.
Decrease (increase) the number of operations in process 1 (2):
data will be lost.
If the number of initial wait operations in process 2 is too low,
we will use undefined data.
If the number of initial wait operations in process 2 is too high,
we will lose the first data elements.
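These failure modes are easy to reproduce with a cycle-based toy model in plain C++. The model below is an assumption-laden sketch, not the course code: the writer drives value number k at the end of every `writer_period` cycles, the reader samples the wire once per `reader_period` cycles after `initial_wait` cycles, and the value 0 stands for undefined data.

```cpp
#include <vector>

// Returns the sequence of values sampled by the reader.
std::vector<int> simulate_wire(int writer_period, int reader_period,
                               int initial_wait, int cycles) {
    std::vector<int> seen;
    int wire = 0;     // value visible on the wire; 0 = nothing written yet
    int written = 0;  // number of values written so far
    for (int c = 1; c <= cycles; ++c) {
        // The reader samples the value that was on the wire at the start of the cycle.
        if (c > initial_wait && (c - initial_wait - 1) % reader_period == 0) {
            seen.push_back(wire);
        }
        // The writer drives a new value at the end of its last operation.
        if (c % writer_period == 0) {
            wire = ++written;
        }
    }
    return seen;
}
```

With both periods and the initial wait equal to 4, the reader sees 1, 2, ... as intended. A writer period of 5 produces 0, 1, 2, 3, 4, 4 within 25 cycles (undefined data first, later a duplicate); a writer period of 3 produces 1, 2, 4 within 13 cycles (value 3 is lost).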
83. 83
Example of wrong wired communication
[Figure: Process 1 loops over filt1, filt2, filt3 and filt4 and calls write() in the last step, while Process 2 loops over only op1 and op2, calling read() together with op1.]
85. 85
Simple handshake protocol is
more robust
The flag “a” (ask) indicates that the receiver is ready to read
data in the next cycle.
The flag “r” (ready) indicates that data has been written
Safe communication requires at least two cycles.
86. 86
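Simple handshake protocol is more robust
[Figure: state diagrams of the two processes, connected by the flags r and a.
Process 1 (transmitter): filt1 with r=0, then filt2 and filt3; in the last state, if (a==1) { filt4; write(); r=1 }, otherwise (!a) it waits.
Process 2 (receiver): starts with a=1; if (r==1) { read(); op1; a=0 }, otherwise (!r) it waits with a=1; then op2 with a=1.]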
88. 88
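… also when receiver is slower than transmitter
[Figure: state diagrams of the two processes, connected by the flags r and a.
Process 1 (transmitter): filt1 with r=0; then, if (a==1) { filt2; write(); r=1 }, otherwise (!a) it waits.
Process 2 (receiver): starts with a=1; if (r==1) { read(); op1; a=0 }, otherwise (!r) it waits; then op2, then op3 with a=1.]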
89. 89
… but then introduces one extra wait cycle at receiver
Cycle | Process 1 (transmitter) | Process 2 (receiver)
1 | r=0 filt1 | a=1
2 | r=1 filt2 write() | a=1
3 | r=0 filt1 | a=0 read() op1
4 | r=0 | a=0 op2
5 | r=0 | a=1 op3
6 | r=1 filt2 write() | a=1
7 | r=0 filt1 | a=0 read() op1
8 | r=0 | a=0 op2
9 | r=0 | a=1 op3
10 | r=0 filt2 write() | a=1
… | … | …
The extra wait cycle can be avoided by already putting a=1 during op2
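The robustness of the ask/ready handshake can be checked with a cycle-based toy model in plain C++. The sketch assumes that both flags and the data are registered, so a value written in one cycle is seen by the other process in the next cycle. The function name, the work counters and the exact cycle counts are assumptions and do not reproduce the table above cycle by cycle.

```cpp
#include <cassert>
#include <vector>

// tx_work: cycles the transmitter computes before a value is ready (>= 1).
// rx_work: cycles the receiver computes after reading a value (>= 0).
// Returns the values received, in order of arrival.
std::vector<int> simulate_handshake(int tx_work, int rx_work, int cycles) {
    assert(tx_work >= 1);        // safe communication needs at least two cycles per transfer
    bool a = false, r = false;   // ask and ready flags as seen in the current cycle
    int data = 0;                // wire carrying the data
    int tx_count = tx_work;      // remaining work cycles of the transmitter
    int rx_count = 0;            // remaining work cycles of the receiver
    int next_value = 1;
    std::vector<int> received;
    for (int c = 0; c < cycles; ++c) {
        bool a_next = a;
        bool r_next = false;     // ready is a one-cycle pulse
        int data_next = data;
        // Transmitter: work, then write only when the receiver asks.
        if (tx_count > 0) {
            --tx_count;
        } else if (a) {
            data_next = next_value++;
            r_next = true;
            tx_count = tx_work;
        }
        // Receiver: work, then ask and wait for ready.
        if (rx_count > 0) {
            if (--rx_count == 0) a_next = true;
        } else if (r) {
            received.push_back(data);
            a_next = false;
            rx_count = rx_work;
        } else {
            a_next = true;
        }
        a = a_next;
        r = r_next;
        data = data_next;
    }
    return received;
}
```

Whatever the two speeds are, the receiver obtains 1, 2, 3, ... without gaps or duplicates; only the throughput changes.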
90. 90
Most general protocol: 4-phase handshake protocol
[Figure: state diagrams of the two sides, with transitions guarded by Req and Ack. One side: Req=1, Get Data, Req=0, Execute. Other side: Put Data, Ack=1, Ack=0. A timing sketch shows Req, Ack and Data.]
91. 91
Multiple variations on these
handshake protocols exist
Instead of signal levels, the protocol can be based on signal
transitions.
The protocol can be simplified if both systems run on the same
clock.
Protocols can be simplified if one knows that the receiver or
the transmitter is fastest.
Synchronization can be performed on the basis of a block:
Set-up communication for first element of a block
Next, communicate every cycle
Some protocols are based on typical FIFO signals: full and
empty.
92. 92
In some cases buffered communication is required
[Figure: process1 connected to process2 through the queues Q1 and Q2.]
Cycle | process1 | process2
1 | write(Q1) |
2 | write(Q1) |
3 | write(Q2) |
4 | | read(Q2)
5 | | read(Q1)
6 | | read(Q1)
Queue size can be determined by monitoring the maximum number of elements in a queue during simulation.
93. 93
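Queues must be introduced explicitly in hardware
[Figure: Process1 and Process2 connected through a FIFO process of size N that is controlled by an fsm; both sides of the FIFO use the wired handshake protocol with the flags r and a.]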
94. 94
Several communications can also be multiplexed on a bus
[Figure: the separate connections from Process1 to Process2 and from Process3 to Process4 are replaced by a shared bus with an arbiter; each process connects to the bus with the flags r and a.]
Bus and arbiter classes can be reused!
95. 95
Communication refinement
results in behavioral model
Model that defines the relative ordering of input and outputs
A clock signal is used for ordering
Pins are accurate to the final implementation
Internal resources are not mapped on clock cycles
(scheduling) and functional units (resource binding)
97. 97
In SystemC behavioral models
use (clocked) threads
Modeled with thread processes SC_THREAD or with clocked thread
processes SC_CTHREAD
Every module has a clock input:
sc_in_clk clk;
The SC_THREAD process is made static sensitive to a clock edge
sensitive << clk.pos();
To separate clock cycles wait() statements are used.
A synchronous or asynchronous reset signal can be specified:
reset_signal_is(reset, true);
async_reset_signal_is(reset, true);
Simulation must be run for a finite time (or will not stop!) or halted
explicitly.
98. 98
Behavioral models communicate via standard signals
All input and outputs are standard signals
Define signals with:
sc_signal<T> a;
Predefined ports for sc_signal<T> channels:
sc_in<T> with interface function read() or assignment operator.
sc_out<T> with interface function write() or assignment operator.
sc_inout<T> that combines both interface functions.
105. 105
Exercise 3: communication
refinement for the JPEG encoder
Goal: Replace the FIFO between the run-length encoder and decoder by
a handshake protocol
You will find:
In /exercises/exercise3/ : solution of exercise2
In /exercises/modules/: JPEG-encoder modules
{r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules
{b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules
{src,snk,test}.{h,cpp}
In /exercises/images/: test images
In /exercises/add2systemc: FIFO to protocol conversion functions in
add2systemc: {FF2P, P2FF}.h
Things to be done:
Introduce a handshake protocol between rl_enc and rl_dec.
introduce refined versions of rl_dec in jpeg_dec.h and main.cpp.
simulate and verify correct operation.
108. 108
RTL refinement is the 3rd step in architectural design
[Figure: refinement from System Functionality to a System Architecture along three axes.
Computations: operations become operators (resource allocation, scheduling).
Data: variables, arrays and floating point become memories and fixed point (memory allocation, address generation, word sizing).
Communication: point-to-point links and queues become busses with a detailed protocol (bus allocation, introduce arbiters, include protocols).]
110. 110
Behavioral model can be
represented by an FSM
void process_behavioral() { // body of an SC_CTHREAD
    ask.write(true);
    while (ready.read() == false) { wait(); }
    wait();
    while (true) {
        ask.write(false);
        x = input.read();
        wait();
        d = x * b1;
        y = d * b2;
        output.write(y);
        ask.write(true);
        while (ready.read() == false) { wait(); }
        wait();
    }
}
[Figure: equivalent FSM, with states separated by the wait() statements and transitions guarded by ready and !ready. State actions: ask=1; then ask=0, x=input; then d = x * b1, y = d * b2, output = y, ask=1.]
111. 111
Behavioral to RTL: scheduling of operations in FSM
[Figure: two FSMs with transitions guarded by ready and !ready.
Before scheduling: ask=1; then ask=0, x=input; then d = x * b1, y = d * b2, output = y, ask=1.
After scheduling: ask=1; then ask=0, x=input, d=x*b1; then ask=1, y = d * b2, output = y.]
112. 112
Rescheduled FSM is represented in RTL code
[Figure: rescheduled FSM with transitions guarded by ready and !ready. State actions: ask=1; then ask=0, x=input, d=x*b1; then ask=1, y = d * b2, output = y.]
void process_rtl() { // body of an SC_CTHREAD
    ask.write(true);
    while (ready.read() == false) { wait(); }
    wait();
    while (true) {
        ask.write(false);
        x = input.read();
        d = x * b1; // first multiplication moved before the wait()
        wait();
        ask.write(true);
        y = d * b2;
        output.write(y);
        while (ready.read() == false) { wait(); }
        wait();
    }
}
113. 113
RTL description corresponds to a datapath
[Figure: possible mapping onto a datapath: a multiplier (*) with the operands x, d, b1 and b2, registers (D Q flip-flops), the output y, the signal ask (driven with 1 or 0) and the input ready.]
RT description introduces synthesis decisions:
register inference
resource sharing
parallelism
void process_rtl() { // body of an SC_CTHREAD
    ask.write(true);
    while (ready.read() == false) { wait(); }
    wait();
    while (true) {
        ask.write(false);
        x = input.read();
        d = x * b1;
        wait();
        ask.write(true);
        y = d * b2;
        output.write(y);
        while (ready.read() == false) { wait(); }
        wait();
    }
}
114. 114
… and a controller
[Figure: controller and datapath. The controller consists of a state register, a next-state function and an output function; it receives inputs (such as ready) and status from the datapath and produces outputs and control. Labels in the figure: ins0, ins1, ins2, c0, c1, c2.]
control: steers the register transfers in datapath
115. 115
Critical path of combinatorial logic is crucial
[Figure: a clocked process in which combinatorial logic (multiplexers, adders, multipliers, etc.) computes out from in; a timing diagram of clock, in, calc and out marks the critical path.]
116. 116
Pipelining reduces the critical path
[Figure: area plotted against data insertion interval (DII) for adder chains in three variants: non-pipelined, word pipelined and bit pipelined (from lsb to msb). Labels in the figure: DII = critical path, DII = operator delay, DII = operator delay/2, word operator delay, 1-bit operator delay, critical path.]
117. 117
Multiplexing reduces the area of the solution
[Figure: area plotted against data insertion interval (DII). Non pipelined: two adders, DII = critical path. Muxed DSP: one adder, DII = 2 x critical path. Further along the curve: processor architecture, e.g. VLIW.]
118. 118
Real embedded systems show architectural variability
E.g. Robot Vision System
[Figure: processing chain of object, CCD camera, line delay, Sobel operator, edge detector, feature extractor, pattern recognizer and robot controller, under global control and communication. The blocks are mapped on different architecture styles: array type (modular array of processing elements with control), muxed data paths (hardwired control, memories, data path) and µcoded processor (µ-code ROM, PC logic, µ-code control, RAM, programmable function units, off-chip memory).]
119. 119
Area can be estimated at a high level
Source: Gajski
Total circuit = Control Unit (CU) + Datapath (DP)
component | area is a function of
CU: State_reg | # states
CU: logic | # states, # ctrl_lines, # states each ctrl_line is active
DP: Storage | # bits and # words of each storage
DP: func_units (FU) | # bits and type of each FU
DP: Muxes | # sources of muxes
DP: wires | # DP connections, # DP components
120. 120
Standard cell data can be
used to derive parameters
type name width
2 input MUX mxi2v0x1 3.08
2 input NMUX mxn2v0x1 3.52
2 input AND an2v4x2 2.20
3 input AND an3v4x2 3.08
4 input AND an4v4x2 3.52
2-bit half adders ha2v0x2 5.28
Q flip-flop dfnt1v0x2 7.92
… … …
Source: www.vlsitechnology.org
121. 121
Storage: Registers vs. memories
Registers:
Inferred by synthesis.
Larger size per storage bit.
No overhead.
Fast & parallel.
Best < 1 kbits storage
Memories:
Not synthesized, but created by memory generators.
Smaller size per storage bit.
Fixed overhead.
Slow & serial.
Best > 1 kbits storage
123. 123
RTL design is modeled with modules
and processes
A sc_module is an identifiable hardware unit.
A module can contain multiple processes that run in parallel.
Signals are used to communicate between (executions of) processes.
Variables are used inside a single execution of a process.
124. 124
Restrictions (1/2) in SystemC
Synthesizable Subset (draft 1.3)
Modules
Exactly one constructor.
Processes
Only SC_CTHREAD and SC_METHOD are supported;
SC_THREAD is not supported.
In a SC_CTHREAD there must be a wait() statement before
the infinite loop or as first statement in this loop.
At most one clock signal is allowed per process.
The reset behavior is specified in the process, not in the
constructor of the modules.
Between two clock events, at most one assignment to a
signal is supported.
Processes communicate through signals, not shared
variables.
125. 125
Restrictions (2/2) in SystemC
Synthesizable Subset (draft 1.3)
Datatypes:
No floating point.
Char is implemented as signed char, all integer types are
2’s complement.
Pointers are not supported.
Untemplated fixed point types are not supported.
No division operator for fixed point types.
No global variables but global constants are OK.
Functions:
No new(), delete() and sizeof() functions.
Destructors have no effect.
Exception handling is not supported.
126. 126
Example: relation Synthesizable SystemC and VHDL
SystemC
#include "systemc.h"
SC_MODULE(dff) {
    sc_in<bool> din;
    sc_in_clk clock;
    sc_out<bool> dout;
    void doit(); // Member function
    SC_CTOR(dff) {
        SC_CTHREAD(doit, clock.pos());
    }
};
void dff::doit() { // Process body
    while (true) {
        wait();
        dout.write(din.read());
    }
}
VHDL
entity dff is
    port ( din, clock : in bit; dout : out bit );
end dff;
architecture dff of dff is
begin
    doit : process(clock) -- Sensitivity List
    begin
        if (clock'event and clock='1') then
            dout <= din;
        end if;
    end process;
end dff;
127. 127
Signals for communication
between processes
Declaration
Scalar Signal: sc_signal<sc_uint<32 > > a;
Vector Signal: sc_signal<sc_logic> a[32];
Signals use request-update mechanism: write takes effect after a delta-cycle
When you assign a value to a signal or port, the value on the right side is
not transferred to the left side until the process halts. This means that the
signal value as seen by other processes is not updated immediately, but it
is deferred.
When you assign a value to a variable, the value on the right side is
immediately transferred to the left side of the assignment statement.
SystemC supports resolved Ports and Signals
Multi-Valued Logic type : 0, 1, Z, X
Allow multiple drivers
128. 128
Signals can infer registers
w = x // w is a variable
y1 = w * 10
z = x // z is a signal: writing at the end of cycle
wait()
y2 = z * 10 // reading at the beginning of cycle
Simulation
clock cycle | 1 | 2 | 3 | 4
x | 1 | 2 | 3 | x
y1 | 10 | 20 | 30 | x
z | x | 1 | 2 | 3
y2 | x | 10 | 20 | 30
Synthesis
[Figure: x is multiplied by 10 to give y1 (the variable w becomes a wire); x is also stored in a D flip-flop (the signal z), whose output is multiplied by 10 to give y2.]
Random Access Memory is
modeled with a behavioral model
// ram_asyn.h - asynchronous RAM
#include "systemc.h"
SC_MODULE(ram_asyn) {
    sc_in<sc_uint<6> > addr;
    sc_in<bool> rwb;
    sc_in<int> datain;
    sc_out<int> dout;
    int memdata[64]; // local memory storage
    void ramaction();
    SC_CTOR(ram_asyn) {
        SC_METHOD(ramaction);
        sensitive << addr << datain << rwb;
        for (int i = 0; i < 64; i++) { memdata[i] = 0; }
    }
};
[Figure: block "Asynchronous RAM (64)" with the inputs address, datain and rwb and the output dataout.]
130. 130
SystemC has a 4-step simulation engine
1: Initialize
2: Iterative execution of functional, behavioral & RTL processes until no activity
3: Update primitive channels
4: Go back to 2
[Figure: a mixed-level model with a functional, a behavioral and an RT process, connected through queues (q1, q3, q4) and signals (s1, s2, s3, s4) with P2FF and FF2P converters in between.]
131. 131
Measuring performance
const sc_time& sc_time_stamp(): returns the current time
during simulation.
Following functions are defined for sc_time:
double to_seconds(): converts the time into seconds
void print(): prints the time on the screen
If the clock period is known, the number of clock cycles can
be calculated.
Throughput ≥ Datarate/Simulation_time
134. 134
How to start?
Goal: refine run-length decoder in RTL model.
You will find:
In /exercises/exercise4/ : solution of exercise3
In /exercises/modules/: JPEG-encoder modules
{r2b,dct,quantize,zz_enc,rl_enc}.{h,cpp}, JPEG decoder modules
{b2r,idct,normalize,zz_dec}.{h,cpp} and test bench modules {src,snk,test}.{h,cpp}
In /exercises/images/: test images
In /exercises/add2systemc: behavioral RAM models.
Things to be done:
Make RTL model of run-length decoder.
draw FSM of the RTL model.
introduce the RTL model in jpeg_dec.h and integrate in main.cpp.
simulate and verify correct operation with gtkwave viewer.
Estimate the needed hardware for this RTL model.
Editor's Notes
Welcome to the second part of the course on specification languages.
The course on specification languages consists of 3 parts:
First, an extensive overview was given of various specification models, ranging from dataflow to finite state machines.
In this second part, I will focus on the use of a subset of these models for the architectural design of digital embedded systems. The main goal of this part of the course is to learn how the specification models of part 1 can be used for the architectural design of embedded systems. For this purpose, we will rely on SystemC version 2.3.2, which was standardized by the IEEE in January 2012 (IEEE 1666-2011 language reference manual) and for which the simulation library was released in April 2014. SystemC is a class library on top of C++. As such, all object oriented (OO) constructs of C++ can be used in the design of an architecture. These OO techniques can bring the same re-use benefits to architectural design that they have brought to software design.
Finally, you will apply the acquired skills in a small, but realistic, project.
As prerequisites for this course, I expect the following:
Quite obviously, you should have a good understanding of the first part of this course, and particularly of the presented models.
Next, as SystemC is based on C++, also a decent knowledge of this programming language is required. Basic OO concepts like classes, inheritance and templates should be familiar to you. If not, review the C++ tutorial at www.cplusplus.com.
In general, a structured methodology for developing and debugging programs is essential for executing the exercises and the project. Familiarity with an Integrated Development Environment (IDE) like Eclipse is a benefit.
When writing SystemC code, you should be able to describe the hardware that will be generated from this code. Therefore a basic knowledge of register transfer level (RTL) description of synchronous digital circuits is necessary. An RTL description of a circuit consists of registers (e.g. D flip-flops) and combinatorial logic. The registers synchronize the operation of the circuit to the clock signal while the combinatorial logic describes the calculations performed by the circuit. RTL descriptions are used in hardware description languages like Verilog or VHDL.
For part 2, the following material is available:
The slides with notes can be found on the icorsi (icorsi.ch).
The SystemC language reference manual, which can be downloaded from the IEEE standards website (http://standards.ieee.org/getieee/1666/download/1666-2011.pdf)
In this first class we will focus on the functional modeling of a digital embedded system. A functional model will describe the functionality of the embedded system, independent of the platform or architecture on which this functionality is executed. Therefore it is sometimes called a platform independent model (PIM). In this class we will focus on the data flow modeling paradigm for describing the functional model.
At the end of the class, you will be able to program a functional model of a digital embedded system in SystemC.
This class covers 4 topics:
A general introduction to the design of digital embedded systems
The role of SystemC in the design of digital embedded systems
The syntax of the SystemC language for functional modeling (with the dataflow paradigm)
And finally an exercise to build a functional model in SystemC
Let's start with the general introduction.
Consumer as well as professional equipment is becoming increasingly smarter. A few examples:
Your car is being converted into a multimedia theater. The value of the electronics in a car has increased consistently, resulting in almost 100 electronic units in a luxury model. Recently a lot of new safety functions (ABS, ESP, parking sensors, anti-collision systems, etc.) have been introduced.
It is hard to find a mobile phone with which you can only make a call. Taking pictures, playing music, surfing the web, reading e-mail, etc. are also features of a state-of-the-art mobile phone. Most phones even have GPS functionality and run office software.
Gaming becomes more interactive (e.g. Nintendo Wii, Microsoft Kinect) and mobile.
Photography has dramatically changed over the last decade: it has become fully digital. Digital cameras are currently extended with features like wireless connections, automatic picture enhancements (e.g. red eye correction), etc.
The era of service robots is coming. Robots to vacuum clean the house, mow the lawn in the garden, etc. are already on the market.
The evolution towards smart products is not limited to consumer devices. We observe, for instance, the same trends in production machines.
Harvesters have a growing number of functions for quality control, obstacle detection and precision farming. To realize these smart functions, the electronic control units become increasingly complex. Especially the software content is growing very fast (20% average growth per year). The long term vision for combine harvesters is to evolve towards fully autonomous machines that can work without any operator on board and just receive a command of the job to be done. Many more smart functionalities will be needed to reach this goal.
In compressors functions are introduced to optimize the energy consumption based on the instantaneous demand of air.
Weaving looms can adapt their speed to the quality (strength) of the textile fibers.
Professional washing machines automatically detect the load, hardness of the water, etc. and adapt their washing program.
To realize these smart functionalities, electronic systems and software have to be embedded in consumer and professional devices. Such embedded systems must minimize power, cost and size, and hence work on a minimal platform. For instance, 8-bit and 16-bit processors are still extensively used in embedded devices. They must be robust. For instance, a mobile phone must survive rough treatment. A car has an operational life of 7000 hours and some machines are expected to work up to 100000 hours. Over their lifetime, products are increasingly expected to evolve. Also, more variants are designed from the same platform. A typical example is the customization of mobile phones. Time to market matters as well: the product needs to be on the market before the Christmas shopping season. In many cases the system even has safety-critical functionality, think of an anti-lock braking system (ABS) or emergency buttons, which requires a guarantee on the reliability of the system. For the development of such safety-critical functions, specific standards have to be followed. The main distinctive characteristic of an embedded system, however, is that it has to interact with the real world, necessitating real-time behavior.
A system is said to be real-time if the correctness of an operation depends not only upon its logical correctness, but also upon the time in which it is performed. In a hard real-time system, the completion of an operation after its deadline is considered useless - ultimately, this may lead to a critical failure of the complete system. A soft real-time system on the other hand will tolerate such lateness, and may respond with decreased service quality (e.g. bank terminal).
Depending on the inputs, two types of hard real-time constraints are distinguished in embedded systems:
Signal processing systems process inputs that arrive at regular intervals and the system must be ready after a fixed time to process the next input. Signal processing systems typically interact with their environment through sensors (observe the environment) and actuators (control/influence the environment). Sensors are components that translate non-electrical quantities (e.g. temperature, pressure, ...) into electrical quantities (voltage, current). Since most observable quantities are analog signals, sensors usually produce analog electrical signals. In most cases signal conditioning is required to compensate the non-idealities in the sensors and to prepare the sensor signals for the actual signal processing. Because the signal processing is done digitally, an Analog to Digital Converter (ADC) puts the sensor signal in the right format. Actuators perform the reverse operation of sensors: they translate electrical quantities into non-electrical quantities. Also actuators need analog signals and therefore a Digital to Analog Converter (DAC) is needed. Because actuators need to influence the physical environment they often require high power, hence power electronics circuits are introduced to condition the control signal.
When the input is an event and the system has to react within a certain time, this is called a reactive system. Examples of reactive parts of an embedded system are the interaction with the user or responses to external alarms.
As shown on the picture, embedded systems often combine various types of real-time behavior.
An embedded system can be separated into a digital part and an analog part. The analog part contains for instance signal conditioning, ADCs and DACs. In high-frequency applications, like radios or radars, it will be a large part of the embedded system. Also sensors and actuators are part of the embedded system. Traditionally these were discrete external components, but recently they are increasingly integrated, when power permits, in a package and even on chips.
The digital part is where the actual “intelligence” is. A growing part of the functionality of embedded systems is implemented in software called “embedded software”. This offers the advantage of increased flexibility (functionality can be changed after production). As a consequence, the digital part of an embedded system consists of 3 components:
Programmable processor cores. They can be general-purpose micro-processors or more specialized digital signal processors (DSPs).
Volatile and non-volatile memories.
Configurable (through parameters) dedicated logic.
The digital part can be implemented as a PCB with discrete components, a multi-chip package, an FPGA or a fully integrated chip. In the latter case this is often referred to as a System-on-Chip (or SOC).
In these classes we will mainly focus on the design of the configurable logic (on FPGA or chip), although SystemC is also extensively used for the modeling of SOCs.
For the design of a digital embedded system, we use a design flow that consists of the following elements:
During the functional design of the system, the designer determines what the system has to do, based on the performance requirements (e.g. bit error rates in communication systems) and functional requirements (e.g. specified protocols). He also determines all algorithms. The system functionality is expressed in a platform independent way.
A reusable architecture template, or platform, consisting of processors, memories, and dedicated logic, is defined or selected. The architecture template should guarantee architectural requirements (e.g. interface formats) and non-functional requirements (e.g. power or cost).
Each function in the functionality is mapped on an element in the architecture template.
For the dedicated logic a circuit corresponding to the required functionality is created, resulting in a dedicated architecture. Finally, by means of RTL-synthesis the designer generates a gate level netlist. By the place and route step this netlist is next transformed into a physical layout for this dedicated architecture, which can be manufactured by a foundry. Alternatively, the design is mapped to a configuration file for a programmable platform (e.g. field programmable gate array or FPGA).
For the functions mapped on processors, C-code is generated and compiled.
The Y-model is represented as a top-down approach, but in a realistic design flow, multiple iterations are performed before reaching the final embedded system.
In this course we concentrate on the architectural design of dedicated logic, where the algorithms are mapped onto an optimal architecture. The algorithm will typically be specified as a functional model, e.g. data flow and asynchronous state machines. The architecture needs a timed model, e.g. register transfer level (RTL). To obtain the RTL description, the computations, communications, and data need to be refined. The order of these refinements is not fixed. However, it is good practice to take the most important design decisions first. Note that for parts of the system that are implemented in software, the complete refinement does not need to be performed. However, a processor and a memory structure have to be selected. For this purpose, certain refinements, such as fixed-point refinement, can be useful.
We now take a closer look at the role of SystemC in the design of digital embedded systems.
Traditionally, a system functionality is expressed in MATLAB (SIMULINK/STATEFLOW) or a standard computer language (C/C++). To express the RTL description of the system, VHDL or Verilog is used. As a consequence the transformation from functionality into architecture does not only involve a change in semantics but also in syntax. Moreover, because of the different languages, this transformation cannot be done incrementally. SystemC resolves this issue, by offering a language that can express both functionality and architecture.
SystemC is a C++ library that allows a system to be refined from a functional description into an architecture.
Three contributions were essential to the creation of SystemC:
The modeling of RTL hardware with C++ was demonstrated in the OCAPI framework of IMEC, as well as the SCENIC project of UC Irvine in cooperation with Synopsys.
Frontier Design (an IMEC spin-off) contributed to the fixed-point data types.
CoWare (another IMEC spin-off) introduced concepts of hardware-software co-design.
The SystemC language was first standardized in December 2005 by the IEEE. A revision (IEEE 1666-2011) was made in 2011.
More recently a number of extensions of the SystemC language were proposed:
Verification library adds random generator and transaction recording.
Transaction level modeling, a high-level approach to modeling digital systems where details of communication among modules are separated from the details of the implementation of functional units or of the communication architecture. This extension is included in the revised IEEE standard.
Analog and mixed-signal library extends SystemC with the following modeling paradigms: timed data flow, linear signal flow modeling, and electrical linear network modeling.
All information about SystemC can be downloaded from the www.accellera.org website.
With respect to tool support, the Accellera System Initiative (www.accellera.org) makes an open-source simulation library available. Various academic institutes also offer translators from Verilog or VHDL to SystemC. For synthesis however, we have to rely on commercial tools.
The classes of the SystemC library fall into four categories: the core language, the SystemC data types, the predefined channels, and the utilities. The core language and the data types may be used independently of one another.
At the core of SystemC is a simulation engine containing a process scheduler. Processes are executed in response to the notification of events. Events are notified at specific points in simulated time. In the case of time-ordered events, the scheduler is deterministic. In the case of events occurring at the same point in simulation time, the scheduler is non-deterministic. The scheduler is non-preemptive, which means that once the execution of a process has started, the scheduler cannot interrupt it: the process runs until it returns or suspends itself by calling wait().
The SystemC core language contains a number of primitives to define parallelism. A system is split into a number of modules (sc_module). A module communicates with the external world through ports (sc_port). Two ports are connected through a channel. SystemC predefines some primitive channels (sc_prim_channel), but more complex channels can be user defined. A channel that is located inside a module can be made accessible from outside that module via an export (sc_export).
A hierarchical module consists of a structure of other modules. A non-hierarchical module contains one or more processes (sc_process). A process is executed when an event (sc_event) is notified. A process interacts with a channel through an interface (sc_interface), which is a collection of functions that the channel implements and that are called through an sc_port.
SystemC contains all necessary constructs to model the functionality of a system. We will focus on activity-oriented models, although SystemC can also express other modeling paradigms. Let’s review these constructs.
SystemC has support to model Kahn process networks, with the limitation of bounded queues. A Kahn process network is a directed network of processes that are interconnected by first-in-first-out (FIFO) queues of infinite size. Each time a process is executed, tokens are consumed from the input queues and new ones are produced in the output queues. If a token is not present on an input queue, the consumption of the token will block. Kahn process networks exhibit deterministic behavior that does not depend on computation or communication delays. In SystemC the constructs are available to define the processes and the queues. These constructs interact with a simulation engine, which schedules the execution of the processes. The simulation engine stops when there is no longer any activity in the network.
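The behavior described above can be sketched in plain C++ without the SystemC library. This is only an illustration under stated assumptions: the names BoundedFifo, fire_adder and run_until_idle are hypothetical, and blocking reads and writes are approximated by a process that simply does not fire when a token or a free space is missing. The loop that runs until no process can fire mimics a simulation engine that stops when there is no more activity.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Bounded FIFO queue: a plain C++ stand-in for a channel such as sc_fifo<T>.
template <typename T>
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity = 16) : capacity_(capacity) {
        assert(capacity_ > 0);
    }
    std::size_t num_available() const { return tokens_.size(); }
    std::size_t num_free() const { return capacity_ - tokens_.size(); }
    void write(const T& token) {
        assert(num_free() > 0);  // a real process would block here instead
        tokens_.push_back(token);
    }
    T read() {
        assert(num_available() > 0);  // a real process would block here instead
        T token = tokens_.front();
        tokens_.pop_front();
        return token;
    }

private:
    std::size_t capacity_;
    std::deque<T> tokens_;
};

// One firing of an adder process: consumes one token from each input queue
// and produces one token on the output queue. Returns false if it would block.
bool fire_adder(BoundedFifo<int>& a, BoundedFifo<int>& b, BoundedFifo<int>& sum) {
    if (a.num_available() == 0 || b.num_available() == 0 || sum.num_free() == 0) {
        return false;
    }
    const int left = a.read();
    const int right = b.read();
    sum.write(left + right);
    return true;
}

// Keeps firing the processes until none of them can make progress any more.
// Returns the total number of firings.
std::size_t run_until_idle(const std::vector<std::function<bool()>>& processes) {
    std::size_t firings = 0;
    bool active = true;
    while (active) {
        active = false;
        for (const auto& process : processes) {
            if (process()) {
                active = true;
                ++firings;
            }
        }
    }
    return firings;
}
```

Because every process only reads from and writes to FIFO queues, the tokens that end up in the output queue do not depend on the order in which the processes are tried.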
Modules are used to partition the functionality in the design. However, you should not use too many modules, as this complicates the design, but also not too few. In general, functionality that is implemented in a different architectural style (e.g. software or dedicated hardware) or on a different location should be in different modules.
Every module is derived from the base class sc_module and should have a name, which is used for debugging purposes.
The macro SC_HAS_PROCESS(“class name”) indicates that the module is not hierarchical and contains processes.
The slide shows an explicit definition of a module, consisting of the class definition, the SC_HAS_PROCESS macro and the constructor.
To compact the definition, two more macros are provided:
SC_MODULE(“class name”) is equivalent to the first two lines of the explicit definition.
SC_CTOR(“class name”) equals the SC_HAS_PROCESS macro and the first lines of the constructor. It can be used when only a name is passed to the constructor. If you also want to pass parameters, an explicit declaration is needed.
In SystemC the sc_port object is used to communicate with a channel. Ports provide the means by which a module can be coded such that it is independent of the context in which it is instantiated. A port forwards interface method calls to the channel to which the port is bound.
For functional modeling, processes communicate through FIFO ports. Two port types are supported for the sc_fifo<T> channel, where T is the type of the elements in the FIFO channel:
Input: sc_fifo_in<T>, which is basically equivalent to sc_port<sc_fifo_in_if<T>,0>, where the first parameter is the input interface of a FIFO and the second parameter (0) specifies that an arbitrary number of channels can be bound to the port. However, the practical use of these multiple bindings is not clear. Therefore it could be useful to define one's own FIFO port with a restriction to a single binding.
Output: sc_fifo_out<T>, which is equivalent to sc_port<sc_fifo_out_if<T>,0>. Here too, the use of multiple bindings is not recommended.
Several functions are associated with the sc_fifo class:
read() gets a token from the queue. It blocks when no tokens are available.
write() puts a token on a queue. It blocks when there are no free spaces in the queue.
There are also inspection functions to look at the number of available tokens (num_available()) or free spaces (num_free()).
When we add the definition of the ports to the constructor of the adder we obtain the code on the slide.
The actual computation in the application is performed in the processes. As a consequence, they also define the parallelism in the application.
SystemC supports three types of processes. For functional modeling we use the SC_THREAD process. An SC_THREAD process is started once at the beginning of the simulation and typically contains an infinite loop, so that it stays alive for the whole simulation. It can be suspended by a wait(event) function. Often the wait(event) function is implicitly present in the communication functions.
Processes are executed on events. The events a process reacts to can be defined statically or dynamically. Static sensitivity is set by means of the variable sensitive of sc_module. Dynamic sensitivity to a certain event is set by wait(event) for an SC_THREAD process.
A module can have multiple processes.
Processes might be dynamically created during simulation. However, no synthesis support exists for dynamic processes. Therefore, we do not use them in this course.
Adding the definition of an SC_THREAD process to the adder results in the code on the slide. This adder waits for data on both its input queues sequentially and next produces a token on its output queue.
The global structure of the system is defined in the main function. Because main() is already used by the SystemC library, the main function for the user application is sc_main().
In sc_main(), the following actions are taken:
Instantiation of the channels. The basic channel that we use in functional modeling is sc_fifo. A FIFO queue is defined by means of the template class sc_fifo<T>. T can take on any basic data type, e.g. int, float, etc. The sc_fifo class declares a finite length buffer of tokens. The default length is 16 elements. The queue also has a name for debugging and statistics retrieval purposes. The constructor for the queue is sc_fifo<T> f1 (“name f1”, length); An sc_fifo can only be written from one process.
Instantiation of the modules. A module can be instantiated multiple times.
Binding the ports of the modules to the channels. This can be done in two ways: positional or named. Named binding is preferred because it is less prone to errors than positional port binding.
Start the simulation.
The sc_main() function for the adder is shown on the slide.
Note that the arguments of sc_main() are identical to those of main().
To connect the ports to the channels, named bindings are used.
In a functional model hierarchy will be used to make the design more readable. The hierarchy is fully transparent: it basically acts as a container for the basic modules, but does not add any functionality or synchronization.
The definition of a hierarchical module consists of the definition of the ports and internal queues. Next the internal modules are defined. Care must be taken that the module objects still exist after the constructor has finished. Two alternatives exist to guarantee this: either declare them as data members that are constructed together with the hierarchical module, or create them dynamically with new.
The constructor creates the two modules and binds the ports to the channels.
In a functional model no notion of time is present. Every action is executed in zero simulated time. As a consequence, the simulation kernel only advances in delta cycles, which are infinitesimally small time steps. If we started the simulation kernel with a finite amount of time to run, it would never reach that time and hence run forever. Therefore we run the simulation kernel until no events are present any more. This is achieved with the sc_start() command.
With this approach, there are two ways of stopping the simulation:
We can exit an SC_THREAD. By doing so, no events will be produced any more and the simulation will eventually stop because of the lack of events.
We can check for a termination condition and explicitly call sc_stop(). This approach was used in the exercise of class 1. When the whole image is processed and written to file, the simulation is explicitly stopped. In general this is also the safest and most elegant way of controlling the simulation.
Finally, let’s exercise what we have learned so far.
The goal of this exercise is to practice functional modeling. We will use a simplified JPEG block diagram for this purpose. A process will be defined and integrated in a JPEG functional model. Next the functional model will be simulated and the overall behavior of the system will be observed.
JPEG stands for “Joint Photographic Experts Group” and is a compression standard for color images. It is widely used. More information can be found on www.jpeg.org
A simplified block diagram of a JPEG encoder and decoder is shown on the slide.
First, the original image is read and split into 8x8 blocks (R2B). Together with the pixel data, the width, height and number of bits per pixel are also extracted from the image.
Next, on each 8x8 block, a discrete cosine transform (DCT) is performed, resulting in 8x8 DCT coefficients. These DCT coefficients are quantized and reorganized in the zigzag scan module. The resulting coefficient stream is run-length encoded. This last block differs from the JPEG standard, where a Huffman encoder is used.
In the decoder the reverse operations are performed in the reverse order.
The discrete cosine transform (DCT) is performed on an 8x8 pixel block and returns an 8x8 block of DCT coefficients. Each DCT coefficient indicates the amplitude of a horizontal and vertical frequency component. The inverse discrete cosine transform (IDCT) returns pixel values from DCT coefficients. The formal definitions of the DCT and IDCT are shown on the slide. Instead of this straightforward 2D operation, the calculation can be split into consecutive 1D operations, which is more efficient. There is also a large set of optimized DCT algorithms that exploit the regular structure of the cosine values.
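The split into consecutive 1D operations can be sketched as follows. This is a minimal, unoptimized illustration in plain C++, assuming the usual JPEG scaling of the 8-point DCT (a factor 1/2 per dimension and 1/sqrt(2) for the zero-frequency term); the names dct_1d, dct_2d and Block are hypothetical and are not taken from the exercise sources.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kSize = 8;
using Block = std::array<std::array<double, kSize>, kSize>;

// 8-point 1D DCT with JPEG scaling: F(u) = 1/2 C(u) sum_x f(x) cos((2x+1) u pi / 16),
// where C(0) = 1/sqrt(2) and C(u) = 1 otherwise.
std::array<double, kSize> dct_1d(const std::array<double, kSize>& in) {
    const double pi = std::acos(-1.0);
    std::array<double, kSize> out{};
    for (std::size_t u = 0; u < kSize; ++u) {
        const double cu = (u == 0) ? std::sqrt(0.5) : 1.0;
        double acc = 0.0;
        for (std::size_t x = 0; x < kSize; ++x) {
            acc += in[x] * std::cos((2.0 * static_cast<double>(x) + 1.0) *
                                    static_cast<double>(u) * pi / 16.0);
        }
        out[u] = 0.5 * cu * acc;
    }
    return out;
}

// 2D DCT computed as 1D DCTs on all rows followed by 1D DCTs on all columns.
Block dct_2d(const Block& pixels) {
    Block rows{};
    for (std::size_t r = 0; r < kSize; ++r) {
        rows[r] = dct_1d(pixels[r]);
    }
    Block result{};
    for (std::size_t c = 0; c < kSize; ++c) {
        std::array<double, kSize> column{};
        for (std::size_t r = 0; r < kSize; ++r) {
            column[r] = rows[r][c];
        }
        const std::array<double, kSize> transformed = dct_1d(column);
        for (std::size_t r = 0; r < kSize; ++r) {
            result[r][c] = transformed[r];
        }
    }
    return result;
}
```

The row-column form needs 16 one-dimensional transforms of 8 points each, instead of evaluating a 64-term double sum for each of the 64 coefficients.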
Next the DCT coefficients are quantized. To this end each DCT coefficient is divided by the corresponding value in the quantization table.
The result is rounded to the nearest integer, reducing the number of bits needed to represent the DCT coefficient.
In the quantization step image information might be lost, resulting in lossy compression.
An example of a typical quantization table is shown on the slide. It can be remarked that the quantization values grow for higher horizontal or vertical frequencies.
JPEG contains a number of predefined quantization tables. If a custom quantization table is used, it must be sent to the decoder.
The resulting quantized DCT coefficients are next zigzag scanned. This is done in such an order that statistically long sequences of zero coefficients can be expected.
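The quantization and zigzag steps can be sketched in a few lines of plain C++. The function names quantize, zigzag_order and zigzag_scan are hypothetical, the block is assumed to be stored row by row in an array of 64 values, and the quantization table values are supplied by the caller.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kN = 8;

// Quantization: divide the DCT coefficient by the corresponding table entry
// and round the result to the nearest integer.
int quantize(double coefficient, int table_value) {
    assert(table_value > 0);
    return static_cast<int>(std::lround(coefficient / table_value));
}

// Returns, for each position in the zigzag sequence, the row-major index
// (row * 8 + column) of the coefficient that is read at that position.
std::array<std::size_t, kN * kN> zigzag_order() {
    std::array<std::size_t, kN * kN> order{};
    std::size_t k = 0;
    for (std::size_t s = 0; s <= 2 * (kN - 1); ++s) {  // s = row + column
        const std::size_t lo = (s < kN) ? 0 : s - (kN - 1);  // first valid row
        const std::size_t hi = (s < kN) ? s : kN - 1;        // last valid row
        for (std::size_t i = lo; i <= hi; ++i) {
            // Even anti-diagonals are walked upwards, odd ones downwards.
            const std::size_t row = (s % 2 == 0) ? hi - (i - lo) : i;
            order[k++] = row * kN + (s - row);
        }
    }
    return order;
}

// Reorders a quantized block (row-major) into the zigzag sequence.
std::array<int, kN * kN> zigzag_scan(const std::array<int, kN * kN>& block) {
    const std::array<std::size_t, kN * kN> order = zigzag_order();
    std::array<int, kN * kN> out{};
    for (std::size_t k = 0; k < kN * kN; ++k) {
        out[k] = block[order[k]];
    }
    return out;
}
```

The scan starts at the DC coefficient and moves towards the highest frequencies, which are the ones most likely to have been quantized to zero.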
Next we use a non-JPEG run-length coder for our exercise. This coding works as follows:
The DC value is sent “as is”
The high frequency data is split into segments consisting of a number of zeros followed by a non-zero coefficient. Each segment is represented by a pair consisting of the number of consecutive zeros and the value of the non-zero coefficient.
When all remaining coefficients for a block are zero, an end of block (EOB=63) value is sent.
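The three rules above can be sketched as follows. This is an illustration only: the function name run_length_encode is hypothetical, and the exact stream layout of the exercise is not specified here, so two assumptions are made. The end-of-block marker is emitted as a single value 63 in the place of a zero count (a genuine count can be at most 62), and no marker is emitted when the last coefficient of the block is non-zero.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int kEndOfBlock = 63;  // EOB value from the text

// Encodes one block of 64 zigzag-scanned coefficients.
std::vector<int> run_length_encode(const std::array<int, 64>& zigzag) {
    std::vector<int> stream;
    stream.push_back(zigzag[0]);  // the DC value is sent "as is"

    // Position of the last non-zero AC coefficient (0 if there is none).
    std::size_t last = 0;
    for (std::size_t i = 1; i < zigzag.size(); ++i) {
        if (zigzag[i] != 0) {
            last = i;
        }
    }

    int zeros = 0;
    for (std::size_t i = 1; i <= last; ++i) {
        if (zigzag[i] == 0) {
            ++zeros;
        } else {
            stream.push_back(zeros);      // number of zeros before the coefficient
            stream.push_back(zigzag[i]);  // value of the non-zero coefficient
            zeros = 0;
        }
    }

    // Assumption: EOB is only needed when trailing zeros remain.
    if (last + 1 < zigzag.size()) {
        stream.push_back(kEndOfBlock);
    }
    return stream;
}
```

For a block with DC value 10 followed by the coefficients 0, 0, 5, 3 and then only zeros, this sketch produces the stream 10, 2, 5, 0, 3, 63.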
You will find all files for starting in the exercise1 directory.
Perform the actions as indicated on the slide.
To obtain information about the number of writes and reads in the FIFOs, use the type fifo_stat<T> instead of sc_fifo<T>.
To prevent multiple bindings of a FIFO port, the classes my_fifo_in<T> and my_fifo_out<T> are used in the exercises.
We will make the exercises in a Linux environment, using g++ and Eclipse. Eclipse is an integrated development and debugging environment. In the exercise directory there is a step-by-step guide of how to get started with the exercises in Eclipse.
The recent sources of the exercises and libraries can be found at http://www.icorsi.ch/
Libraries have to be compiled before starting the exercise session.
In this second class we will focus on the refinement of the data types of the functional model. More in particular we will explain the definition of fixed-point word lengths for the variables in the functional model. This action is relevant both for mapping on embedded processors with limited data sizes, e.g. 16-bit processors, or for mapping on a dedicated architecture.
At the end of the class, you will be able to perform fixed point refinement on a functional model of an embedded system in SystemC.
This lecture on fixed point refinement consists of three parts:
In the first part we introduce the quantization and overflow effects of fixed point representations. We also present some methods to determine the most and least significant bits (MSB and LSB).
Next, we introduce the fixed point support in SystemC. This consists of an extensive set of fixed point types. In addition, SystemC also supports 4-valued logic to define bus structures.
Finally, we introduce the exercise on fixed point refinement.
Let’s concentrate on the architectural design step that translates an algorithm into an optimal architecture. The algorithm will typically be specified into a functional model, like data flow. The architecture needs a timed model, e.g. register transfer level (RTL). Initially the algorithm will be modeled in floating point. Cost-effective implementation requires, however, a refinement into fixed point types.
Most signal processing algorithms are specified in floating point precision. This is a very powerful signal representation with high accuracy, but it is also expensive in storage and operation cost. For instance, a typical representation of a floating point number is a mantissa of 24 bits and an exponent of 8 bits. As a consequence, a floating point multiplication is equivalent to a 24-bit multiplication and an 8-bit addition.
However, many applications, like cable modems and wireless communication devices, require low cost and low power for a high processing speed. As a consequence, the DSP algorithms will be performed in fixed-point arithmetic. With an 8-bit fixed point notation, for instance, the cost will drop dramatically as the hardware cost for a multiplication is a quadratic function of its input width.
This requires the designer to translate floating point types into fixed point types, using a refinement strategy.
A fixed point type can be defined by three parameters:
The total number of bits WL.
The position of the binary point, indicated by the number of integer bits IWL.
The way in which the value is represented. In the case of a signed number, 2’s complement notation is the most common because it allows easy arithmetic. However, alternatives like sign-magnitude and 1’s complement are also feasible.
If the result of a calculation has more precision than available in the fixed point format, the value has to be quantized. Several ways of quantization exist:
Truncate or floor is the cheapest approach because it comes for free in hardware: the surplus bits are simply dropped. However, it generally gives the worst performance of the quantization techniques.
Magnitude truncate realizes a floor function for positive values and a ceil function for negative values. The technique is natural for sign magnitude representations. The advantage is a symmetrical behavior around the zero value.
Applying the ceil function to the complete range is an alternative which is seldom used.
Rounding is the technique with the best performance for most cases. However, it is also the most expensive one. In hardware this requires the addition of half the weight of the least significant bit (0.5 LSB), followed by a truncation operation.
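The four quantization methods listed above can be compared with a small plain C++ sketch. The names QuantMode and quantize are hypothetical; the parameter lsb is the weight of the least significant bit of the target format, and rounding is implemented exactly as described, by adding half an LSB and truncating.

```cpp
#include <cassert>
#include <cmath>

enum class QuantMode { Truncate, MagnitudeTruncate, Ceil, Round };

// Quantizes value to a multiple of lsb, the weight of the least significant bit.
double quantize(double value, double lsb, QuantMode mode) {
    assert(lsb > 0.0);
    const double scaled = value / lsb;
    double q = 0.0;
    switch (mode) {
        case QuantMode::Truncate:           // floor over the whole range
            q = std::floor(scaled);
            break;
        case QuantMode::MagnitudeTruncate:  // floor for positive, ceil for negative values
            q = std::trunc(scaled);
            break;
        case QuantMode::Ceil:               // ceil over the whole range
            q = std::ceil(scaled);
            break;
        case QuantMode::Round:              // add half an LSB, then truncate
            q = std::floor(scaled + 0.5);
            break;
    }
    return q * lsb;
}
```

With lsb = 0.25, the value -1.3 becomes -1.5 with truncation and -1.25 with the three other methods, which shows that only truncation moves negative values away from zero.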
When the result of an operation is larger than the maximum value that can be represented by the fixed point format (overflow), we have two possibilities:
Wrap-around: the overflow bits are discarded. For unsigned values, this is equivalent to a modulo operation (see figure on slide). For 2’s complement numbers, exceeding the maximum positive value by one LSB results in the most negative number. This is the standard behavior in a hardware implementation.
Saturation: when an overflow occurs, the signal is set to the maximum value that can be represented. Additional hardware is necessary to realize this behavior.
Remark that a similar situation can occur for the minimum value of a signal. For instance, if the subtraction of two unsigned signals results in a negative value and must be represented in an unsigned format. For such underflow, similar remedies are possible.
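The difference between the two overflow strategies can be made concrete with a short plain C++ sketch that operates on integer values, i.e. on the bit pattern of a fixed point number with the binary point ignored. The function names are hypothetical and wl is the total word length in bits.

```cpp
#include <cassert>
#include <cstdint>

// Wrap-around for a wl-bit 2's complement number: the overflow bits are dropped.
std::int64_t wrap_signed(std::int64_t value, int wl) {
    assert(wl >= 1 && wl <= 32);
    const std::int64_t modulus = std::int64_t{1} << wl;
    std::int64_t r = value % modulus;  // in (-modulus, modulus)
    if (r < 0) {
        r += modulus;                  // now in [0, modulus)
    }
    if (r >= modulus / 2) {
        r -= modulus;                  // upper half represents negative numbers
    }
    return r;
}

// Saturation for a wl-bit 2's complement number.
std::int64_t saturate_signed(std::int64_t value, int wl) {
    assert(wl >= 1 && wl <= 32);
    const std::int64_t max = (std::int64_t{1} << (wl - 1)) - 1;
    const std::int64_t min = -max - 1;
    if (value > max) return max;
    if (value < min) return min;
    return value;
}

// Wrap-around for a wl-bit unsigned number: a modulo operation.
std::int64_t wrap_unsigned(std::int64_t value, int wl) {
    assert(wl >= 1 && wl <= 32);
    const std::int64_t modulus = std::int64_t{1} << wl;
    const std::int64_t r = value % modulus;
    return (r < 0) ? r + modulus : r;
}

// Saturation for a wl-bit unsigned number; negative results saturate to zero.
std::int64_t saturate_unsigned(std::int64_t value, int wl) {
    assert(wl >= 1 && wl <= 32);
    const std::int64_t max = (std::int64_t{1} << wl) - 1;
    if (value > max) return max;
    if (value < 0) return 0;
    return value;
}
```

For an 8-bit signed format, 127 + 1 wraps to -128 but saturates to 127, so a wrapped overflow produces a much larger error than a saturated one.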
When we opt for a saturation strategy, the following hardware is needed. The result of the operation must be compared to the maximum positive and negative numbers. This can be done with an explicit comparator or with the overflow flags from the adders. If overflow or underflow is reached, the result of the operation is replaced by the maximum or minimum value respectively. Note that the hardware complexity of a comparator or multiplexer is comparable to that of an adder. As a consequence, saturation hardware can require a significant amount of area.
Going back to the need for fixed point representations, the designer is faced with the following problem. He obtains a floating point algorithm and needs to translate the floating point types into fixed point types, using a refinement strategy. For each floating point number, a fixed point characteristic (including total and integer word lengths, overflow and rounding behavior) must be chosen. In most situations the input and output formats are defined by the system context (e.g. analog-to-digital converter). Remark that determining these ADC and DAC precisions is an important task in the overall system design.
This fixed-point refinement is a complex optimization problem where the search space grows exponentially with the number of signals. The goal of the optimization is to minimize the overall implementation cost and power consumption. At the same time the performance degradation (e.g. implementation loss for telecom systems) must be small. Remark that it is essential to define a performance degradation bound (e.g. implementation loss for communication systems, visual performance measure for multimedia systems) before starting the fixed point refinement.
The optimization problem can be separated in two parts:
Determination of the most significant bit (MSB). First, the minimum and maximum signal value must be determined. From this the MSB position, value representation and overflow behavior are selected such that overflows are avoided as much as possible.
Determination of the least significant bit (LSB). By evaluating the difference in performance between the fixed and floating point behavior of the algorithm, the LSB position and quantization method are determined for each signal. The goal is to stay within the performance degradation bound.
In the next slides we will take a closer look at methods for MSB and LSB determination.
MSB determination can be done by means of range propagation. This analytical method works as follows:
On each input signal, the range, i.e. the minimum and maximum values that occur in the signal, is specified.
Next, the signal flow graph of the algorithm is traversed and for each operator, the range of its output is calculated based on its input ranges.
Because the method calculates worst-case bounds on the minimum and maximum signal values, it results in a safe, but sometimes pessimistic, estimation of the MSB position.
Range propagation on the operators is a simple operation. The table on the slides shows the rules for add, subtract and multiply operations.
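A minimal sketch of such propagation rules in plain C++ is given below. It uses standard interval arithmetic, which is assumed here to correspond to the rules in the table; the names Range, add, sub, mul and integer_bits_signed are hypothetical. The last function derives the MSB position as the smallest number of integer bits (sign bit included) of a 2's complement format that covers the range.

```cpp
#include <algorithm>
#include <cassert>

struct Range {
    double min;
    double max;
};

Range add(Range a, Range b) { return {a.min + b.min, a.max + b.max}; }

Range sub(Range a, Range b) { return {a.min - b.max, a.max - b.min}; }

Range mul(Range a, Range b) {
    // The extremes of a product always occur at a combination of the bounds.
    const double p[] = {a.min * b.min, a.min * b.max, a.max * b.min, a.max * b.max};
    return {*std::min_element(p, p + 4), *std::max_element(p, p + 4)};
}

// Smallest integer word length (sign bit included) such that
// -2^(iwl-1) <= min and max < 2^(iwl-1).
int integer_bits_signed(Range r) {
    assert(r.min <= r.max);
    int iwl = 1;
    double limit = 1.0;  // equals 2^(iwl-1)
    while (r.min < -limit || r.max >= limit) {
        limit *= 2.0;
        ++iwl;
    }
    return iwl;
}
```

Applying add repeatedly to an accumulator, as in acc = add(acc, x) with x in [-1, 1], widens the range by 2 at every iteration, which is the unbounded growth that occurs when range propagation is applied to a feedback signal.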
When applied to feedback signals, range propagation can become unstable and cause continuous growth of the minimum and maximum values. An example of such a situation is shown on the slide. In such a situation, a statistical inspection of the real signals will be needed to determine a realistic MSB position.
Note that, because of the propagation mechanism, all signals within this feedback loop or depending on the output of the feedback loop will suffer from this range explosion. Once saturation logic is introduced at one place in the loop, this problem is solved.
As an alternative to the analytical range propagation method, we can collect the signal statistics during simulations. Because the obtained range information will be stimuli-dependent, this will give an optimistic estimation of the minimum and maximum values. As a consequence, to maximize the confidence in the obtained results, the stimuli set should be large and provide a complete coverage of the algorithm code.
As can be expected, combining both methods gives the best results. Each signal in the system will then be in one of the following situations:
Both methods result in the same MSB position. Quite logically, the signal can safely be specified with the resulting MSB position and wrap-around overflow behavior.
When the analytical MSB position is larger than the statistical MSB position, we can make a trade-off between the analytical MSB with wrap-around and the statistical MSB with saturation. In most cases the wrap-around functionality will be the most economical. Only when the statistical MSB position is much smaller can saturation logic be beneficial.
In the case of a range growth because of feedback, the analytical MSB position cannot be calculated (it goes to infinity). In this case, the statistical MSB position is chosen together with a saturation behavior. After introducing the saturation on one signal in the feedback loop, we need to re-simulate to get useful results for the rest of the algorithm.
An example of each of these situations is shown on the slide.
When we look at the LSB side, the question arises what the effect is of quantization. Many authors approximate the quantization effect as an additional noise source. They assume that:
The noise sequence is a sample of a stationary random process (i.e. whose statistical parameters do not change over time).
The noise sequence is uncorrelated with the input sequence.
The random variables of the noise process are uncorrelated, i.e. the error is a white-noise process.
The probability distribution of the error process is uniform over the range of the quantization error.
The noise process can then be modeled by means of its mean and variance. The expressions for mean and variance for the three most popular quantization methods are shown on the slide, as a function of the quantization step. Rounding and magnitude truncation result in a zero mean, but rounding has the lower variance. Truncation and rounding have the same variance, but truncation has a non-zero mean whereas rounding has zero mean. As can be expected, rounding introduces the least quantization noise.
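These properties can be checked numerically with a short plain C++ sketch. It assumes a quantization step of 1, an error defined as the quantized value minus the original value, and an input that is swept uniformly over a range that is symmetric around zero; the names NoiseStats and quantization_noise are hypothetical. Under these assumptions the well-known results are a mean of -1/2 and a variance of 1/12 for truncation, a mean of 0 and a variance of 1/12 for rounding, and a mean of 0 and a variance of 1/3 for magnitude truncation.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct NoiseStats {
    double mean;
    double variance;
};

// Measures mean and variance of the error e = Q(x) - x for a quantizer Q with
// step 1, using evenly spaced samples of x in the interval [lo, hi).
NoiseStats quantization_noise(const std::function<double(double)>& quantizer,
                              double lo, double hi, int samples) {
    assert(samples > 0 && lo < hi);
    const double step = (hi - lo) / samples;
    double sum = 0.0;
    double sum_sq = 0.0;
    for (int i = 0; i < samples; ++i) {
        const double x = lo + (i + 0.5) * step;
        const double e = quantizer(x) - x;
        sum += e;
        sum_sq += e * e;
    }
    const double mean = sum / samples;
    return {mean, sum_sq / samples - mean * mean};
}
```

A call such as quantization_noise([](double x) { return std::floor(x); }, -8.0, 8.0, 160000) measures the truncation case; replacing the lambda body by std::floor(x + 0.5) or std::trunc(x) gives rounding and magnitude truncation.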
Replacing the quantization by an additional noise source results in a linear model of the quantized algorithm. This can then be analytically analyzed by means of well-developed linear signal processing theory. For many quantization effects, this linear model is a good approximation. It has, for instance been used to determine the effects of quantizing the signals in FIR filters.
As an exercise, calculate the resulting signal to noise ratio in the case that:
x(t) ranges between 0 and 255 with a uniform distribution.
both quantization steps round the values to the nearest integer.
However, not all applications are linear. Quantization in non-linear systems can lead to non-intuitive behavior. In infinite impulse response (IIR) filters, for instance, quantization can generate limit cycles. For a stable floating-point IIR filter implementation, the output will decay asymptotically to zero when the input becomes zero. For the same system, implemented with finite precision, the output may continue to oscillate indefinitely with a periodic pattern while the input remains equal to zero. This effect is often referred to as zero-input limit cycle behavior. An example of such behavior is shown on the slide.
Non-linear quantization effects are difficult to analyze analytically. Therefore, mostly simulation based methods are used. To this end the output of a reference simulation is compared to a simulation with the quantized signals. Again, sufficiently large stimuli sets, which provide complete code coverage, must be used.
To get better insight into the optimization trade-off, the difference between the floating-point and fixed-point values (e) and the resulting signal-to-quantization-noise ratio (SQNR) are useful guidance.
The SQNR for all signals is calculated as follows:
During signal assignments the statistics (mean, standard deviation) for the error signal as well as for the output signal are collected.
At the end of the simulation, the signal-to-quantization-noise ratio (SQNR) is calculated for each signal.
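The two steps above can be sketched with a small monitor class in plain C++. The class name SqnrMonitor and its member functions are hypothetical, and the SQNR is assumed here to be the ratio of the mean signal power to the mean error power, expressed in dB; the exercise library may use a different definition, e.g. one based on variances.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Collects statistics for one signal during simulation.
class SqnrMonitor {
public:
    // Called at every assignment, with the floating point (reference) value
    // and the corresponding fixed point (quantized) value.
    void record(double reference, double quantized) {
        const double e = quantized - reference;
        signal_power_sum_ += reference * reference;
        error_sum_ += e;
        error_power_sum_ += e * e;
        ++count_;
    }

    // Mean of the error signal.
    double error_mean() const {
        assert(count_ > 0);
        return error_sum_ / static_cast<double>(count_);
    }

    // SQNR in dB, assumed to be 10 log10(signal power / error power).
    double sqnr_db() const {
        assert(count_ > 0 && error_power_sum_ > 0.0);
        return 10.0 * std::log10(signal_power_sum_ / error_power_sum_);
    }

private:
    double signal_power_sum_ = 0.0;
    double error_sum_ = 0.0;
    double error_power_sum_ = 0.0;
    std::size_t count_ = 0;
};
```

One monitor is attached to every signal that is being refined, so that the SQNR values can be compared per signal after the run.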
The optimal LSB is determined by running the simulation multiple times with various quantization sets. For each quantization set, the SQNR per signal, the overall SNR and PSNR, and the cost are calculated. The goal is to find the cheapest solution that realizes the specified performance. This procedure can be automated by means of an optimization routine.
When changing the quantization for one signal at a time, the statistics give an impression of the sensitivity of the cost and the performance to the quantization of that signal. As a rule of thumb, the SQNR of a signal should be higher than the overall SNR.
Remark that the SQNR and SNR statistics are dependent on the input. As a consequence, the optimization should be performed on a representative set of inputs.
In the next part we discuss the fixed point support in SystemC
SystemC introduces a number of specific data types, which correspond to data types that are frequently used in Hardware Description Languages (HDLs). These types include sc_logic, a 4-valued representation that can be high (1), low (0), undefined (X) or in a high-impedance (Z) state. Integers of a specified width are available as sc_int and sc_uint (up to 64 bits) and as sc_bigint and sc_biguint (arbitrary width). SystemC also supports vectors of 2-valued or 4-valued logic with sc_bv and sc_lv. sc_fixed and sc_ufixed define fixed point numbers where the characteristics of the number are defined by template parameters. sc_fix and sc_ufix use run-time arguments to define the fixed point characteristics. This is interesting for trying out different quantization settings without recompilation. However, these run-time types cannot be used in synthesis, while the others can.
Two data types provide full flexibility in representing fixed point numbers with static parameters: sc_fixed (signed, 2’s complement numbers) and sc_ufixed (unsigned numbers). The template parameters of these fixed-point types carry the information on the word lengths and the quantization and overflow behavior:
wl is the total number of bits
iwl represents the number of integer bits, i.e. left from the binary point.
q_mode specifies the quantization method to be rounding (SC_RND), flooring (SC_TRN), or magnitude truncate (SC_TRN_ZERO). In addition, some very particular, rarely used quantization modes are specified.
o_mode selects the overflow mode to be saturation (SC_SAT), saturation to zero (SC_SAT_ZERO), symmetrical saturation (SC_SAT_SYM), wrap-around (SC_WRAP), or sign-magnitude wrapping (SC_WRAP_SM).
n_bits specifies the number of saturated bits in case of wrapping. This makes it possible to create some special wrapping methods that keep the sign of the signal. By default, n_bits is 0.
Two of the arguments of the fixed-point data type are the word length (wl) and the integer word length (iwl). The word length must be greater than 0. The integer word length can be positive or negative, and may be larger than the word length.
For instance if the word length is specified as 5 bits but the integer word length is 7 then two zeroes will be added to the end of the object.
If the integer word length is a negative value then sign bits after the binary point will be extended. For instance if wl = 5 and iwl = -2 then two sign bits will be added to the object. The sign bits are simply the most significant bit of the 5 bit number. By extending the sign bits, the value of the number is maintained.
This slide shows an example that illustrates the difference between rounding and flooring. As can be seen, the quantization error of rounding is never larger than that of flooring.
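The difference between the two modes can be tried out with a small plain C++ sketch (not the SystemC types themselves). The function names are hypothetical; frac_bits stands for the number of bits to the right of the binary point (wl - iwl), and rounding is modeled as adding half an LSB before flooring, which is how SC_RND is commonly described.

```cpp
#include <cassert>
#include <cmath>

// Flooring (SC_TRN-like): drop all bits below the LSB.
double quantize_floor(double x, int frac_bits) {
    return std::ldexp(std::floor(std::ldexp(x, frac_bits)), -frac_bits);
}

// Rounding (SC_RND-like): add half an LSB, then drop the bits below the LSB.
double quantize_round(double x, int frac_bits) {
    return std::ldexp(std::floor(std::ldexp(x, frac_bits) + 0.5), -frac_bits);
}
```

With one fractional bit, 1.75 is floored to 1.5 but rounded to 2.0, so the error drops from 0.25 to 0.25 in the other direction; for -1.25 flooring gives -1.5 and rounding gives -1.0.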
The slide shows an example with different overflow handling methods: saturation and wrap-around for a two’s complement number. As can be seen, these overflow methods generate very different outputs.
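As a rough illustration, the two overflow methods can be written for plain integers in C++. This is a sketch and not the SystemC implementation; the function names are hypothetical, and wl is the word length of a two's complement number.

```cpp
#include <cassert>
#include <cstdint>

// Saturation: clip to the largest or smallest representable value.
std::int64_t saturate(std::int64_t value, int wl) {
    assert(wl >= 1 && wl <= 62);
    const std::int64_t max_value = (std::int64_t{1} << (wl - 1)) - 1;
    const std::int64_t min_value = -(std::int64_t{1} << (wl - 1));
    if (value > max_value) return max_value;
    if (value < min_value) return min_value;
    return value;
}

// Wrap-around: keep only the wl least significant bits.
std::int64_t wrap(std::int64_t value, int wl) {
    assert(wl >= 1 && wl <= 62);
    const std::int64_t modulus = std::int64_t{1} << wl;
    std::int64_t r = value % modulus;      // in (-modulus, modulus)
    if (r < 0) r += modulus;               // in [0, modulus)
    if (r >= modulus / 2) r -= modulus;    // reinterpret as signed
    return r;
}
```

For a 4-bit number (range -8 to 7), the value 9 saturates to 7 but wraps to -7, which shows how far apart the two results can be.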
When working with fixed-point arithmetic, it is vital to have an efficient representation of values and simulation of operations. For this purpose, all operations are performed with floating point arithmetic. Only on assignment, the quantization is performed. In case an intermediate result needs to be quantized, an explicit assignment operation has to be used.
In the example above, the multiplication a*b is a floating-point operation that takes two fixed-point values as input. During the assignment to c, the floating-point result is automatically cast to the fixed-point type of variable c.
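The idea of computing in floating point and quantizing only on assignment can be mimicked in plain C++. The helper assign_fixed below is hypothetical and only a stand-in for the assignment to a fixed-point variable; it assumes flooring and wrap-around, which are the default quantization and overflow modes of sc_fixed.

```cpp
#include <cassert>
#include <cmath>

// Models "c = value" for a signed fixed-point c with word length wl and
// integer word length iwl: floor to the LSB, then wrap to wl bits.
double assign_fixed(double value, int wl, int iwl) {
    assert(wl > 0 && wl < 32);
    const int frac_bits = wl - iwl;
    long long scaled =
        static_cast<long long>(std::floor(std::ldexp(value, frac_bits)));
    const long long modulus = 1LL << wl;
    scaled %= modulus;
    if (scaled < 0) scaled += modulus;
    if (scaled >= modulus / 2) scaled -= modulus;
    return std::ldexp(static_cast<double>(scaled), -frac_bits);
}
```

For instance, assign_fixed(1.25 * 1.5, 4, 2) evaluates the product 1.875 in floating point and only then quantizes it to 1.75. To quantize an intermediate result, the helper would have to be called explicitly on that intermediate value, which mirrors the explicit assignment mentioned above.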
SystemC also allows fixed-point types to be defined with non-static arguments: sc_fix (signed, 2’s complement numbers) and sc_ufix (unsigned numbers). The type sc_fxtype_params is used to configure the parameters of sc_fix and sc_ufix. To set the parameters for these types, declare an object of type sc_fxtype_params, initialize the parameter values as desired, and pass the sc_fxtype_params object as an argument to the sc_fix or sc_ufix declarations.
The sc_fxtype_params object has the same arguments passed to an object of type sc_fixed. These include:
• wl - word length
• iwl - integer word length
• q_mode - quantization mode
• o_mode - overflow mode
• n_bits - saturated bits
Any combination of arguments is allowed, but the order cannot be changed. A variable of type sc_fxtype_params can be initialized by another variable of type sc_fxtype_params. One variable of type sc_fxtype_params can also be assigned to another.
Individual argument values can be read and written using methods with the same name as the arguments shown above.
We now turn to the exercise, where we will perform fixed point refinement of the IDCT operator in the JPEG decoder.
The goal of this exercise is to become familiar with fixed-point refinement by practicing it on the IDCT block of the JPEG decoder. To this end, we will determine the LSB and MSB value for every variable in the IDCT function. By observing the overall behavior, it will be possible to optimize the LSB and MSB values. The MSB should be determined in such a way that overflow is avoided without introducing overflow logic. The LSB should be determined such that the impact on the image quality (e.g. the peak signal-to-noise ratio, PSNR) is kept below 0.5 dB. The PSNR is defined as the ratio between the maximum power of a signal and the power of the corrupting noise. In our case the noise is the mean squared error (MSE) between the original and the decompressed image. The maximum power of the signal is MAX^2, where MAX is the maximum grey value of a pixel.
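A minimal sketch of the PSNR measurement in plain C++ is given below. It assumes the usual decibel form PSNR = 10·log10(MAX^2 / MSE), 8-bit grey values (MAX = 255), and images stored as flat arrays; the function name psnr_db is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// PSNR in dB between an original and a decoded grey-scale image.
double psnr_db(const std::vector<unsigned char>& original,
               const std::vector<unsigned char>& decoded,
               double max_value = 255.0) {
    assert(!original.empty() && original.size() == decoded.size());
    double squared_error_sum = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        const double diff = static_cast<double>(original[i]) -
                            static_cast<double>(decoded[i]);
        squared_error_sum += diff * diff;
    }
    const double mse = squared_error_sum / static_cast<double>(original.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();
    return 10.0 * std::log10(max_value * max_value / mse);
}
```

In the exercise, the PSNR of the floating-point decoder and of the fixed-point decoder would both be measured against the original image, and the difference between the two values compared with the 0.5 dB budget.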
In this third class we focus on the refinement of the communication between the modules of the functional model. In particular, we explain how the FIFO communication channels can be replaced by protocols on simple wires.
This lecture on communication refinement consists of three parts:
In the first part we introduce the concept of refining the inter process FIFO communication into real protocols.
Next, we review the support in SystemC for communication refinement.
Finally we introduce the exercise to practice what we have learned.
In the architectural design process that translates an algorithm into an optimal architecture, communication refinement is an important step. The algorithm will typically be specified into a functional model, like data flow. In this data flow model, the communication between processes is performed via point-to-point queues. The architecture needs a model with explicit protocols. In addition, signals could be multiplexed on a bus to reduce the wiring overhead.
A FIFO is a very robust structure because it guarantees correct processing of the data independently from the processing times of the functions and communication times. However, queues require a large amount of storage and also some addressing hardware. A typical implementation, for instance, would be a memory array with modulo addressing and a read and write pointer. Because of this large implementation cost, the communication must be optimized.
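The typical implementation mentioned above, a memory array with modulo addressing and a read and a write pointer, could look as follows in plain C++. This is only a software sketch under the assumption of a fixed depth N; the class name RingFifo is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t N>
class RingFifo {
public:
    // Returns false when the FIFO is full.
    bool write(const T& value) {
        if (count_ == N) return false;
        data_[wr_] = value;
        wr_ = (wr_ + 1) % N;   // modulo addressing of the write pointer
        ++count_;
        return true;
    }

    // Returns false when the FIFO is empty.
    bool read(T& out) {
        if (count_ == 0) return false;
        out = data_[rd_];
        rd_ = (rd_ + 1) % N;   // modulo addressing of the read pointer
        --count_;
        return true;
    }

    std::size_t size() const { return count_; }

private:
    std::array<T, N> data_{};
    std::size_t rd_ = 0, wr_ = 0, count_ = 0;
};
```

The storage array, the two pointers and the full/empty bookkeeping are exactly the hardware cost that communication refinement tries to avoid.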
Ideally, from an implementation point of view, a FIFO communication could be reduced to a simple wire when the output signal is registered. This requires no storage and no implementation cost for the addressing or protocol. However, consistency of the communication must be guaranteed: Process 2 should not use the data before it is generated and Process 1 should not produce new data before the previous has been read by Process 2.
To analyze the behavior of a wired connection, we represent the two processes with a Synchronous Finite State Machine (FSM). In such a Synchronous FSM the transitions take place on a clock edge. In our analysis we assume that both processes are running on the same clock. Process 1 will perform a filtering operation in 4 cycles and will also write the data in the register in the 4th cycle. Process 2 will initially wait for 4 cycles. Next cycle, it will read the data and perform a first operation, followed by three more cycles of operation. This sequence will be repeated continuously.
If we look at a timing diagram, we see that the timing is guaranteed. Every read() happens after a write() of the signal. Also no data is lost.
However, small changes to the finite state machines of one of the two processes can result in errors:
If we increase the number of operations in process 1, process 2 will consume too early and hence the same data is used twice.
If we decrease the number of operations in process 2, the same happens.
If we decrease the number of operations in process 1, process 2 will be relatively too slow and some data will be overwritten before it has been used.
Increasing the number of operations in process 2 will have the same effect.
Also note that the number of initial wait operations in process 2 should be neither too low nor too high.
In the slide an example is shown where process 2 has only two states. As a consequence it can be expected that the data produced by process 1 is used multiple times. Because no initial wait operations are present in process 2, we also expect that undefined data will be used.
The expected behavior is confirmed by the timing diagram. As can be seen in the diagram, the first two data elements read by process 2 are undefined. After that, the read() operation of process 2 uses each data element produced by process 1 twice.
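The effects discussed above can be reproduced with a small cycle-based model in plain C++. It is a sketch under simplifying assumptions: process 1 writes the values 1, 2, 3, ... into the register in the last cycle of its period, a written value becomes visible in the next cycle, process 2 reads in the first cycle of its period after an initial wait, and the value 0 marks undefined data. The function name simulate_wire is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns the sequence of values that process 2 reads from the register.
std::vector<int> simulate_wire(int producer_period, int consumer_period,
                               int initial_wait, int cycles) {
    assert(producer_period > 0 && consumer_period > 0 && initial_wait >= 0);
    std::vector<int> reads;
    int reg = 0;          // 0 = nothing written yet (undefined)
    int next_value = 1;
    for (int t = 0; t < cycles; ++t) {
        // Process 2 samples the register at the start of its period.
        if (t >= initial_wait && (t - initial_wait) % consumer_period == 0)
            reads.push_back(reg);
        // Process 1 writes in the last cycle of its period; the new value
        // is visible from the next cycle on.
        if (t % producer_period == producer_period - 1)
            reg = next_value++;
    }
    return reads;
}
```

With simulate_wire(4, 4, 4, 13) every value is read exactly once (1, 2, 3). With simulate_wire(4, 2, 0, 12), the two-state process 2 without initial wait, the reads are 0, 0, 1, 1, 2, 2: two undefined values followed by duplicated data. With simulate_wire(2, 4, 2, 11) process 2 is too slow and only sees 1, 3, 5, so the values 2 and 4 are lost.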
To guarantee correct behavior, two approaches exist:
Adapt the cycle budget of process 2, for instance by introducing two dummy cycles. However, this breaks the general approach of making modules independent from the environment in which they operate.
Introduce a handshake protocol that automatically synchronizes on the data transfers. This is the most robust and reliable approach. On the other hand, handshake protocols introduce some overhead and should be performed on larger units.
Many different handshake protocols are feasible. Let’s illustrate the concept with a very simple one with two handshake lines. The handshake line “a” (ask) is generated by the receiver and indicates that the receiver is ready to read in the next cycle. The handshake line “r” (ready) is generated by the transmitter and indicates that it has written data in the cycle in which the flag is raised. At least two cycles are needed for reliable communication of a value. Note that this protocol is only suited for synchronous designs where both processes run on the same clock.
The finite state machines enhanced with the protocol operations (in red) are shown in this picture. When “a” is set, process 2 waits for the “r” flag to be raised. Next it reads the data, lowers “a”, performs its operations, and sets “a” again for the next sequence. Process 1 performs its operations and then waits for flag “a” before it writes its data and raises flag “r”. The basic assumption of this protocol is that data that is written is read in the next cycle.
The timing diagram shows that the operation of the two processes is automatically synchronized by this protocol.
When we add a state in process 2 and reduce the number of states in process 1 to two, we make the receiving process slower than the transmitting one.
Here too, the protocol synchronizes the two processes automatically. However, after “op3” in process 2, an extra clock cycle is introduced automatically. This is because process 1 has to observe that “a” is raised before it can write the data and raise “r”. The extra cycle can be avoided by raising “a” already during “op2”.
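A cycle-based sketch of this two-line protocol in plain C++ is shown below. It is one possible reading of the state machines on the slides, not a copy of them: all signals are assumed to be registered (a value driven in cycle t is visible in cycle t + 1), both processes perform at least one operation per data element, and the names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns the values received by process 2 when process 1 sends 1, 2, 3, ...
std::vector<int> simulate_handshake(int producer_ops, int consumer_ops,
                                    int cycles) {
    assert(producer_ops >= 1 && consumer_ops >= 1);
    bool a = false, r = false;   // "ask" (receiver) and "ready" (transmitter)
    int data = 0;
    int next_value = 1;
    int p_left = producer_ops;   // operations left before process 1 may write
    int c_left = 0;              // 0 means process 2 is waiting for "r"
    std::vector<int> received;

    for (int t = 0; t < cycles; ++t) {
        bool a_next = a;
        bool r_next = false;     // "r" is high for one cycle only
        int data_next = data;

        // Process 1 (transmitter): operate, then wait for "a" and write.
        if (p_left > 0) {
            --p_left;
        } else if (a) {
            data_next = next_value++;
            r_next = true;
            p_left = producer_ops;
        }

        // Process 2 (receiver): wait for "r", read, lower "a", operate,
        // and raise "a" again during the last operation.
        if (c_left > 0) {
            --c_left;
            a_next = (c_left == 0);
        } else if (r) {
            received.push_back(data);
            a_next = false;
            c_left = consumer_ops;
        } else {
            a_next = true;
        }

        a = a_next; r = r_next; data = data_next;   // clock edge
    }
    return received;
}
```

Whatever the two operation counts are, the received sequence is 1, 2, 3, ... without gaps or duplicates; only the number of cycles per transfer changes.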
The simple handshake protocol of previous slides is just one of the many possibilities. The most general protocol is the 4-phase handshake protocol that can synchronize two systems, independent of a clock signal. The 4 phase handshake protocol consists of 4 phases:
Initially, both request (Req) and acknowledgement (Ack) signals are low.
Next, the Req signal is raised and the operation is executed.
After the execution of the operation, the Ack signal is raised. Here starts the third phase.
When the Ack signal is detected, the Req signal is turned off. This phase continues until the low Req signal is detected and the Ack signal is turned off.
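The four phases can be written down as two small guard-driven state machines in plain C++. The sketch assumes that one side (here called the requester) drives Req and the other side (the responder) executes the operation and drives Ack; all names are hypothetical. No clock is involved: a transition fires as soon as its guard holds.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Link { bool req = false; bool ack = false; };

// Requester: raises Req when idle and work is pending, lowers it on Ack.
bool requester_step(Link& l, bool& pending) {
    if (!l.req && !l.ack && pending) { l.req = true; pending = false; return true; }
    if (l.req && l.ack) { l.req = false; return true; }
    return false;
}

// Responder: executes the operation and raises Ack, lowers it when Req is low.
bool responder_step(Link& l, int& operations_done) {
    if (l.req && !l.ack) { ++operations_done; l.ack = true; return true; }
    if (!l.req && l.ack) { l.ack = false; return true; }
    return false;
}

// Runs one complete transfer and records every (Req, Ack) combination.
std::vector<std::pair<bool, bool>> run_transfer() {
    Link l;
    bool pending = true;
    int operations_done = 0;
    std::vector<std::pair<bool, bool>> trace;
    trace.emplace_back(l.req, l.ack);
    bool fired = true;
    while (fired) {
        fired = requester_step(l, pending);
        if (!fired) fired = responder_step(l, operations_done);
        if (fired) trace.emplace_back(l.req, l.ack);
    }
    return trace;
}
```

The recorded trace is (0,0), (1,0), (1,1), (0,1), (0,0), i.e. the four phases followed by the return to the initial state.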
The picture on the slide shows the asynchronous FSM for the four-phase handshake protocol. In an asynchronous FSM the transitions are not clocked and happen as soon as the guard statement is valid.
Besides the 4-phase handshake protocol, many other protocols exist.
For example a protocol can be constructed that is based on signal transitions rather than signal levels.
Handshake protocols can also be simplified when both systems run on the same clock, or when the receiver or the transmitter is known to be the faster of the two.
Also, the efficiency of the communication can be improved by block based handshake protocols. In such a protocol, the communication is set-up for the first element of the block. Next, a data element is communicated every cycle.
There also exists a set of protocols based on typical FIFO signals.
The replacement of the FIFO by protocols is only possible if no intermediate storage is needed. This is not always the case. For example, the system on the slides needs storage for at least two data elements on queue 1. In most cases, the required number of storage locations can be derived from the maximum number of elements in a queue during functional simulations.
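Deriving the storage requirement from a functional simulation amounts to tracking the peak occupancy of the queue. A minimal plain C++ sketch, assuming the simulation log is available as a string of events ('w' for a write into the queue, 'r' for a read from it; this log format and the function name are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns the maximum number of elements that were in the queue at once.
std::size_t max_occupancy(const std::string& events) {
    std::size_t current = 0, peak = 0;
    for (char e : events) {
        if (e == 'w') {
            ++current;
            peak = std::max(peak, current);
        } else if (e == 'r' && current > 0) {
            --current;
        }
    }
    return peak;
}
```

For the log "wrwrwr" a single storage location suffices, while "wwrrwr" needs two. The same amount of data thus leads to different storage requirements depending on the production and consumption order.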
Also note that changing the order in which data is produced in process 1 or consumed in process 2 will change the storage requirements.
Another option is to integrate the required storage in one of the two processes and match the production and consumption sequences.
If intermediate storage is needed, a FIFO must be explicitly introduced in hardware. A FIFO will be a module with storage, a finite state machine, and communication protocols for the producing and consuming processes.
The FIFO structure can be defined once and next reused in many designs.
Up till now, we have considered point-to-point communications. Each channel in the functional model is then mapped to a physical channel in the hardware.
However, when this communication structure becomes complicated it might be advantageous to multiplex multiple communications on a bus structure. Communication with off-chip devices might also take advantage of a bus structure because of the limited amount of available pins.
The bus can be modeled as a set of multiplexers. To decide when a module is allowed to communicate on this bus, an arbiter is needed. The arbiter works with handshake protocols with the processes. If we reuse our simple protocol, the arbiter would react on the ask signals from the receiving processes and reserve and transfer this ask signal to the sending process when the bus is free for data transfer.
The bus and arbiter are modules that can be designed once and reused in multiple designs.
After communication refinement of a functional model, we obtain a behavioral model. A behavioral model defines the functionality and also the relative ordering of inputs and outputs. To perform this ordering, a clock signal is used. Also, the pins of a module are identical to the final implementation. On the other hand, the internal operations are functionally modeled. They are not mapped on clock cycles and no functional units are allocated.
Synthesis tools are increasingly moving up from register transfer level (RTL) synthesis toward behavioral synthesis. In the latter, the synthesis tool autonomously decides on the number and types of functional units and schedules the operations on these functional units.
We now take a look at the support for communication refinement in SystemC.
Representing behavioral models in SystemC is straightforward. The processes are represented with (clocked) thread processes (SC_CTHREAD or SC_THREAD). To order the inputs and outputs, every module has a clock input. In the case of an SC_THREAD process, it must be made statically sensitive to this clock.
To separate clock cycles, wait() statements will be used in the SC_THREAD or SC_CTHREAD process.
It is possible to assign a synchronous reset signal to the thread processes. In the case that the reset signal is active at a clock event, the current process will be stopped, and called again from the start of the function. Also an asynchronous reset is supported.
Note that because of the introduction of the clock we cannot run until the end of activity (the simulation would never stop). Therefore we must run the simulation for a finite time or halt it explicitly.
Standard signals are used to communicate between behavioral processes. A signal can only be written from one process.
For the sc_signal<T> channel, three ports are predefined:
sc_in<T> is essentially equivalent to sc_port<sc_signal_in_if<T> >
sc_inout<T> is essentially equivalent to sc_port<sc_signal_inout_if<T> >
sc_out<T> is identical to sc_inout<T>
The write() operation on a signal overwrites the present value. The read() operation reads the current value. The assignment operators are also available for signals. Each of these three ports must be bound to exactly one signal.
Finally, we also need a clock in a behavioral model. SystemC offers special clock functions, where you can choose the period, duty cycle, initial offset and first value. An example is shown on the slide.
On the slide an example is shown where three values are read in sequentially and summed. The resulting sum is put on the output.
The example is modeled with a clocked thread. It could also be implemented with a thread process.
To replace the queues it is advocated to follow a gradual approach. First, converters (between sc_fifo and protocol) are introduced between the processes.
Next the protocol can be integrated in each process separately.
At each moment the correct operation of the system can be validated through simulations.
On the slide we show an example for the converters that translate between a sc_fifo and the simple synchronization protocol and vice versa.
The exercise is intended to get you familiar with communication refinement. We turn again to the simplified JPEG decoder.
The goal of this exercise is to replace the FIFO channel between the run-length encoder and decoder by a handshake protocol. To this end, we will add converters between the two blocks to obtain a behavioral model. Next, we will integrate the protocol functionality in the run-length decoder process, integrate the resulting behavioral model in the application, simulate the system, and verify correct operation.
In this fourth class we focus on the refinement of the computations, resulting in an RTL description of the circuit. This model should be synthesizable with an RTL synthesis tool.
The class consists of three parts:
First, we describe the conceptual steps to transform from a behavioral into an RTL description of the circuit.
Next we introduce the constructs that are available in SystemC to support this RTL modeling.
Finally we exercise the new knowledge on the JPEG decoder.
Next to fixed-point and communication refinement, computation refinement is an important step in architectural design (from the functional model towards the RTL model). Note that the order in which these three steps are performed is not fixed. Refinements along these three axes can even be intermixed. There are also interdependencies between these operations. For instance, if two operations share a common operator, they will use the same word size.
At the start of the computation refinement, the embedded system is modeled with behavioral blocks in which both the data types and the communication are refined. The test bench has not evolved and is still the original functional model.
The RTL modeling can be introduced gradually by replacing individual behavioral blocks with RTL descriptions. The correctness of the system can be verified during this process by simulating the combination of functional, behavioral, and RTL models.
Behavioral models are represented as threads which wait on clock edges to synchronize their inputs and outputs (IO).
As a consequence, they can be represented by a clocked finite state machine (FSM). In the slide a Moore-type state machine, whose outputs are only determined by the state, is used.
The transformation from behavioral to RTL can conceptually be represented by the scheduling of operations on this FSM. In this scheduling activity additional states can be introduced.
Remark also that the scheduling of the operations can have major impact on the inter-process communication:
Additional states can introduce errors in synchronized communication.
Protocol-based communication is more robust, but the settings of the protocol signals might have to be adapted.
Separation of operator scheduling and communication refinement is a desire in many design flows but is rarely achieved completely.
The resulting FSM can be transformed back into code. The resulting RTL model can be represented with either an SC_METHOD or an SC_CTHREAD. Both can be synthesized into gate-level circuits. For simplicity, we will use SC_CTHREAD processes.
The resulting RTL description defines the datapath of the resulting scheme:
The degree of parallelism is defined, and as a consequence so is the number of operators. Most synthesis tools automatically identify that operators in different states of the FSM can be shared. To enable this sharing, multiplexers and demultiplexers must be introduced.
The RTL description also defines where registers must be inferred. In general, any signal that is generated in one state and must be used in another will be stored in a register. In general, we also register all outputs of the circuit. If a register does not have to be changed each clock cycle, a multiplexing circuit is introduced in front. This is a more robust alternative to the gating of the clock.
The datapath is controlled by a controller. Each cycle it outputs all the necessary control signals for the various (de)multiplexers, which together can be considered a long instruction word. This instruction word is determined by the output function on the basis of the state and the status inputs of the controller. Based on the same data, the next-state function determines the next state of the controller.
The process described in the previous slides is performed manually for an RTL design. However, it can also be automated with behavioral synthesis.
An important parameter of the datapath is the critical path of the combinatorial logic. The critical path is defined as the longest physical delay between the input (referred to the clock edge) and the output of a combinatorial circuit. The critical path must be smaller than the clock period in order to have correct operation.
For signal processing systems, the data insertion interval (DII) determines the architectural style and the selected clock operation. The data insertion interval is defined as the time between two data samples of the signal. The data insertion interval is the inverse of the throughput of a circuit.
When the critical path of a maximum parallel architecture is larger than the data insertion interval, pipelining is the consequence. Pipelining reduces the critical path but introduces extra registers and hence increases the area and cost of the architecture. Pipelining can be done down to the operator or even bit-operator level.
In pipelined architectures the clock period is normally chosen equal to the data insertion interval.
On the other hand, when the data insertion interval is much larger than the critical path of the maximum parallel architecture, a multiplexed architecture is possible. In such an architecture fewer operators are needed, which reduces the area and cost. However, this operator reduction must be balanced against the increased number of registers and (de)multiplexers. In a multiplexed architecture the clock period is only a fraction of the data insertion interval.
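The choice between the architectural styles can be summarized in a very rough rule, sketched here in plain C++. The thresholds are assumptions made for illustration only: "much larger" is taken as a factor of at least 2, the number of pipeline stages is estimated as the critical path divided by the DII (rounded up, ignoring register overhead), and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <string>

struct StyleChoice {
    std::string style;  // "pipelined", "multiplexed" or "parallel"
    int factor;         // pipeline stages or time-sharing factor
};

// critical_path_ns: critical path of the maximum parallel architecture.
// dii_ns: data insertion interval.
StyleChoice choose_style(double critical_path_ns, double dii_ns) {
    assert(critical_path_ns > 0.0 && dii_ns > 0.0);
    if (critical_path_ns > dii_ns) {
        // Too slow for one sample per DII: cut the critical path.
        const int stages = static_cast<int>(std::ceil(critical_path_ns / dii_ns));
        return {"pipelined", stages};
    }
    const int share = static_cast<int>(std::floor(dii_ns / critical_path_ns));
    if (share >= 2) {
        // Plenty of time per sample: reuse operators over several cycles.
        return {"multiplexed", share};
    }
    return {"parallel", 1};
}
```

For example, a critical path of 30 ns with a DII of 10 ns suggests about three pipeline stages, whereas a critical path of 5 ns with a DII of 40 ns leaves room to share each operator over up to eight clock cycles.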
In most cases, real embedded systems do not consist of a single architectural style. For example, an image processing system starts with the raw image, which results in a small DII. Hence a fully pipelined architecture can be assumed. After the image filters, edges will be detected and these edges will be grouped into features. Because there are far fewer edges than points in the image, the DII will increase. However, the algorithms that are performed on these edges become more complicated. As a consequence a dedicated multiplexed architecture will be optimal. Next, the system detects patterns (e.g. objects) and controls a robot to pick up the object. Again the DII is larger and the algorithm more complex. Here a general-purpose microcontroller will be ideal.
The cost of a circuit is determined by its area. The area of a circuit is the sum of the areas of its datapath and control units. Gajski worked out a complete scheme of factors that contribute to these areas. The precise weight of each of these elements is technology dependent.
To derive the technological parameters for the area model we need to investigate the implementation technology. In the case of a standard cell technology, the library will provide the information for the basic cells. The height of these cells is identical for all cells. As a consequence the area of a cell is directly proportional to its width.
The data on the slide come from the vsclib from www.vlsitechnology.org.
No new concepts are needed to model circuits at the RTL level: modules, processes and signals are the basic elements.
Each sc_module contains one or more processes and is translated into a separate hardware unit.
Variables have no data storage capability and are only used for intermediate values in a single execution of a process. If data has to be stored between multiple executions of a process or has to be communicated between processes, signals have to be used.
Synthesis tools put restrictions on the C++ and SystemC constructs that can be used. Currently a standardization of a synthesizable SystemC subset is underway. The slide gives an overview of the restrictions proposed in the draft standard version 1.3 for modules and processes.
On datatypes and functions the following restrictions apply.
At the RTL level there is a strong similarity between SystemC and other Hardware Description Languages (HDLs), like VHDL and Verilog.
Communication between multiple executions of a process or between processes is done by signals.
As a consequence, signals can store data and result in registers.
On the slide, an example is shown of a communication between two cycles in a sequential process which results in a register.
Different types of memory can be used in a design. Read Only Memory (ROM) can be specified in SystemC by means of an array of constants. A Register File is generated by defining an array of sc_signals which lead to register inference.
In contrast to ROM and register files, Random Access Memory (RAM) blocks are not synthesized, but generated by a RAM generator. Therefore they must be isolated from the rest of the design and modeled by a behavioral model.
The slide shows the behavioral model for a single-port asynchronous RAM. Synchronous RAM, where the output is generated on a clock edge, is an alternative.
Instead of single-port RAM, dual-port RAM is sometimes used.
A unique feature of SystemC is the capability for mixed simulation of functional, behavioral and RTL models. To this end the simulation engine follows a scheduling approach where functional, behavioral and RTL processes are executed iteratively until no activity remains in the system. Next, the primitive channels are updated and the scheduler goes back to the iterative execution of the processes.
An important aspect of a simulation at the RTL level, is the measurement of the performance of the system. SystemC offers a number of supporting functions to obtain the simulation time (virtual time for the circuit), convert it to seconds and print it on the screen.
If the period of the clock is known, the simulation time can be converted into clock cycles.
Note that the circuit might require some clock cycles for initialization. As a consequence, the steady-state throughput of the system is normally better than the number of processed data items divided by the total simulation time.
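The conversion from simulation time to clock cycles, and the effect of the initialization cycles on the measured throughput, can be sketched in plain C++. The simulation time is assumed to be available as a number in nanoseconds (in SystemC it could be obtained from sc_time_stamp()); the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Number of clock cycles that fit in the elapsed simulation time.
long long to_clock_cycles(double sim_time_ns, double clock_period_ns) {
    assert(clock_period_ns > 0.0);
    return std::llround(sim_time_ns / clock_period_ns);
}

// Processed data items per clock cycle, optionally excluding the cycles
// that were spent on initialization.
double items_per_cycle(long long items, long long total_cycles,
                       long long init_cycles = 0) {
    assert(total_cycles > init_cycles);
    return static_cast<double>(items) /
           static_cast<double>(total_cycles - init_cycles);
}
```

With a 10 ns clock, a simulation time of 1000 ns corresponds to 100 cycles. If 50 items were processed, the naive figure is 0.5 items per cycle; when 20 of those cycles were initialization, the steady-state figure is 0.625 items per cycle.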
SystemC defines a number of auxiliary functions to dump tracefiles.
To open a tracefile the function sc_trace_file *sc_create_vcd_trace_file(char*) is called.
To include a sc_signal into the tracefile, one uses the sc_trace(sc_trace_file*, sc_signal, char*) function.
Finally, the trace file is closed with sc_close_vcd_trace_file(sc_trace_file*).
The tracefile is stored in vcd (value change dump) format. It can be viewed by trace file viewers, like gtkwave.
The goal of this exercise is to refine the RL decoder into an RTL model and to integrate it into the system.
Normally the RTL model will be synthesized to generate dedicated hardware. In advance we can already estimate the hardware complexity (in number of adders, multipliers, multiplexers, registers, memories, etc.).