High-Performance Timing Simulation of Embedded Software ∗

Jürgen Schnerr (2), Oliver Bringmann (1), Alexander Viehl (1) and Wolfgang Rosenstiel (1,2)

(1) FZI Forschungszentrum Informatik, Haid-und-Neu-Str. 10–14, 76131 Karlsruhe, Germany
(2) Universität Tübingen, Sand 13, 72076 Tübingen, Germany

{jschnerr, bringmann, viehl, rosenstiel}@fzi.de

ABSTRACT
This paper presents an approach for cycle-accurate simulation of embedded software by
integration in an abstract SystemC model. Compared to existing simulation-based approaches,
we present a hybrid method that resolves performance issues by combining the advantages of
simulation-based and analytical approaches. In a first step, cycle-accurate static execution
time analysis is applied at each basic block of a cross-compiled binary program using static
processor models. After that, the determined timing information is back-annotated into
SystemC for fast simulation of all effects that cannot be resolved statically. This allows
the consideration of data dependencies during run-time and the incorporation of branch
prediction and cache models by efficient source code instrumentation. The major benefit of
our approach is that the generated code can be executed very efficiently on the simulation
host with approximately 90% of the speed of the untimed software without any code
instrumentation.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids;
C.4 [Performance of Systems]: Measurement Techniques

General Terms
Design, Measurement, Performance

Keywords
Software Timing, Simulation Acceleration, Virtual Prototypes

∗ This work has been partially supported by the BMBF project VISION under grant number
01M3078B and by the DFG projects CaNoC under grant number BR 2321/1-1 and ASOC under grant
number RO 1030/14-1 and 2.

Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed for
profit or commercial advantage and that copies bear this notice and the full citation on the
first page. To copy otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
DAC 2008, June 8–13, 2008, Anaheim, California, USA
Copyright 2008 ACM 978-1-60558-115-6/08/0006...5.00

1. INTRODUCTION
In the future, new system functionality will be realized less by the sum of single
components, and more by cooperation, interconnection and distribution of these components,
thereby leading to distributed embedded systems. Also, new applications and innovations arise
more and more from a distribution of functionality as well as from the combination of
previously independent functions. Therefore, in the future this distribution will play an
important part in the increase of the product value.

The system responsibility of the supplier is also currently increasing. This means that the
supplier will not only be responsible for the designed subsystem, but additionally for the
integration of the subsystem in the context of the entire system. This integration is
becoming more complex: today, requirements of single components are validated; in the future,
the requirement validation of the entire system has to be achieved with regard to the
designed component.

What this means is that changes in the product area will lead to a paradigm change in design.
Even in the design stage, the impact of a component on an entire system has to be considered.
A comprehensive modeling of distributed systems, and an early analysis and simulation of the
system integration, have to be considered.

Therefore, a methodical design process of distributed embedded systems must be applied. This
can be implemented using a comprehensive modeling of distributed systems and using
platform-independent development of the application software (UML, Matlab/Simulink, C++).

What is also important is the early inclusion of the intended target platform in model-based
system design (UML), the mapping of function blocks on platform components, and the use of
virtual prototypes for the abstract modeling of the target architecture.

An early evaluation of the target platform means that the application software can be
evaluated with consideration to the target platform. Hence, an optimization of the target
platform with consideration given to the application software, performance requirements,
power dissipation and reliability can take place.

An early analysis of the system integration also means early verification based on virtual
prototypes and exposure of integration faults using virtual prototypes. After that, a
seamless transition to the real prototype can take place.

1.1 Performance Analysis of Distributed Embedded Systems
The main question of performance analysis of distributed embedded systems is: What is the
timing behavior of a system and how can it be determined? The central issue is that
computation has no timing behavior as long as the target platform is unknown because the
target platform has a major effect on timing.

Nevertheless, the specification can contain global performance requirements. The fulfillment
of global performance requirements depends on local timing behavior of system parts. A
solution for this problem is an early inclusion of the target architecture.

Several analytical and simulative approaches for performance analysis have previously been
proposed. In this paper a hybrid approach for performance analysis will be presented. This
hybrid approach combines the advantages of analytical and simulative approaches.

Analytical approaches are based on a formal analysis of the corner cases based on a system
model. The approaches can be divided into two categories: black-box approaches and white-box
approaches. Furthermore, both approaches can be categorized depending on the level of system
abstraction and with regard to the used model of computation.

Black-box approaches consider functional system components as black boxes and abstract from
their internal behavior. This can be an abstract task model [15] with task activation and
event streams on task level representing activation patterns. Using event stream propagation,
the determination of a fix-point takes place. For this, no modification of the event streams
has to be done. Examples of black-box approaches are the real-time calculus [20] and the
system-level composition by event stream propagation as it is used in SymTA/S [5].

White-box approaches include an abstract control flow representation of each process into the
system model. Then a global performance and communication analysis considering
(data-dependent) control structures of all processes can take place. For this analysis, an
extraction of the control flow from application software or UML models [23] needs to be done.
Then, an environment modeling using event models or processes can take place. Examples of
white-box approaches are the communication dependency analysis [18] and the control flow
based extraction of hierarchical event streams [1].

Analytical approaches that rely only on best-case and worst-case values create results that
are often too pessimistic, such that risk estimation for real scenarios is difficult to carry
out. Different analytic approaches try to tackle this issue by considering probabilities of
timing quantities in white-box system analysis.

A hybrid approach to combining simulation and formal analysis for tightening bounds of system
level performance analysis was presented in [10]. The objectives are to determine timing
characteristics of not formally specified components by simulation and to integrate
simulation results into a framework for formal performance analysis. In comparison to this
approach, we focus on a fast timing simulation of embedded software. The results determined
using our approach might be included in system level performance methodologies with the
benefit of high accuracy and time savings in the simulation stage.

Analytic performance risk quantification based on profiled execution times is presented in
[22]. The model is derived from real implementations. Although it is able to represent the
temporal behavior of communication, computation and synchronization, data-dependent timing
effects cannot be detected thoroughly.

Simulative approaches perform a simulation of the entire communication infrastructure and the
processing elements. If necessary, this simulation includes hardware IP.

A simulation of a network between communicating C/C++ processes can be done using a network
simulator such as OPNET [13], Simulink or SystemC [6]. Timing annotation of such a network
simulation is possible, but the exact timing behavior of the software is missing. To obtain
this timing behavior, it is necessary to do a simulation of the software execution on the
target processor. For this simulation, the binary code for the target platform component is
required.

This binary code can run in an instruction set simulator (ISS). An ISS implements an abstract
modeling of the instruction execution. It does not consider bus behavior modeling. Such an
ISS can be implemented either as interpreter or as binary code translator. Binary code
translation can be implemented in two different ways: either as static or as dynamic
compilation, also called just-in-time (JIT) compilation [12]. An ISS is used in several
commercial solutions, like the CoWare Processor Designer [3], CoMET from VaST Systems
Technology [21] or Synopsys Virtual Platforms [19].

Furthermore, the binary code can be executed using a processor model that includes the
complete processor (functional units, pipelines, caches, registers, counters, etc.). Such a
model can have several levels of accuracy. For example, it can be a transaction level model
or a register transfer model.

In addition to the simulation of the processor, a simulation of peripheral components and
custom hardware must be done, either as a co-simulation with HDL simulators or using SystemC.

An abstract processor model with an integrated RTOS model using task scheduling was presented
in [16]. Additionally, a processor model using neural networks for execution cycle estimation
was presented in [14]. A transaction level approach for the performance evaluation of SoC
architectures was presented in [24].

A hybrid model for fast simulation that allows switching between native code execution and
ISS-based simulation was presented in [9]. Another approach using a hybrid model was shown in
[17]. This approach is based on the translation of object code into annotated binary code for
the target processor. For the cycle-accurate execution of the annotated code on this
processor, special hardware is needed.

2. PROPOSED HYBRID APPROACH
Here we will present a hybrid approach for performance simulation of embedded software.
Hybrid approaches consist of a combination of analytic and simulative approaches with the
objective of gaining simulation speed without a significant loss of accuracy.

Static worst case/best case execution time (WCET/BCET) analysis abstracts the influence of
data dependencies on the software execution time. Due to this, BCET/WCET analysis delivers
very good results for entire basic blocks, but it is too pessimistic across basic block
boundaries. Further, the effects of concurrent cache usage of different applications on
multi-core architectures lead to even wider bounds. An analytic solution for this issue is
still unknown. The objective of the presented approach is the reduction of pessimism that is
contained in WCET/BCET boundaries.

Simulative techniques that consider an application with concrete input data and the target
architecture can be used to determine the timing behavior of software on the underlying
architecture. The proposed approach tries to prevent repeated time-consuming interpretation
and repeated timing determination of all executed binary code instructions on the target
architecture.

The hybrid approach provided in this paper applies back-annotation of WCET/BCET values. These
values are determined statically on basic block level in binary code that was generated from
C source code. Additionally, the timing impact of data-dependent architectural properties
like branch prediction is also considered effectively. The tool that implements the proposed
methodology generates SystemC code. This code can be compiled for any host machine to be used
afterward for a target-platform-independent simulation.

Communication calls in the automatically created SystemC models are encapsulated in TLM [4]
communication primitives. This way, a clean and standardized ability to integrate timed
embedded software in virtual SystemC prototypes is given.
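
To make the idea of back-annotation on basic block level more concrete, the following minimal sketch shows what an annotated C function could look like. It is an illustration only, not output of the described tool: the function sum_annotated, the counter simulated_cycles and all cycle values are hypothetical, and generateCycles is reduced to a plain counter update, whereas a SystemC model would advance simulation time at this point.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter standing in for the simulated time of the source processor. */
uint64_t simulated_cycles = 0;

/* Stand-in for the cycle generation call; here it only accumulates cycles. */
void generateCycles(unsigned c) { simulated_cycles += c; }

/* Annotated version of a simple array sum. The cycle values per basic block
 * are made-up numbers that a static analysis of the binary code would supply. */
int sum_annotated(const int *v, int n)
{
    int s = 0;
    int i = 0;
    assert(n >= 0);
    generateCycles(3);      /* basic block 1: function entry (assumed 3 cycles) */
    while (i < n) {
        s += v[i];
        i++;
        generateCycles(5);  /* basic block 2: loop body (assumed 5 cycles) */
    }
    generateCycles(2);      /* basic block 3: function exit (assumed 2 cycles) */
    return s;
}
```

The functional behavior is unchanged, while the simulated time depends on the input data: in this sketch a call with n elements accounts for 3 + 5n + 2 cycles.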
One major advantage of the presented methodology is in the area of multi-core processors with
shared caches. Whereas static analysis has no knowledge of concurrent cache usage of
different applications and the impact on execution time, the presented methodology is able to
handle these issues. How this is done will be described in more detail in Section 2.4.3.

Another possibility would be a translation of binary code into annotated SystemC code.
However, this approach has some major disadvantages. One main drawback is, for example, that
the same problems that have to be solved in static compilation (binary translation) have to
be solved here (e.g. addresses of calculated branch targets have to be determined). Another
disadvantage is that the automatically generated code is not very easily read by humans.

2.1 Back-annotation of WCET/BCET values
In this section, we will describe our approach in more detail. Figure 1 shows an overview of
this approach.

  [Flow chart]
  C Source Code -> C Compiler -> Binary Code
  Binary Code and Processor Description -> Back-annotation Tool
    Analysis of source code
    Analysis of binary code
      construction of intermediate representation
      building of basic blocks
      static cycle calculation
    find correspondences between C source code and binary code
    Back-annotation
      insertion of cycle generation code
      insertion of dynamic correction code
  -> Annotated SystemC Program

Figure 1: Back-annotation of WCET/BCET values

First, the C source code has to be taken and translated using an ordinary C (cross)-compiler
into binary code for the embedded processor (source processor). After that, our
back-annotation tool reads the object file and a description of the used source processor.
This description contains information about the instruction set, the pipelines, the caches
and the branch prediction of the source processor. Using this description, the object code is
decoded and translated into an intermediate representation consisting of a list of objects.
Each of these objects represents one intermediate instruction.

In a next step, the basic blocks of this program are determined using the list containing the
translated program. As a result, a list of basic blocks is built.

After that, a static calculation of the number of cycles each basic block would have taken on
the source processor is made using the pipeline description. This calculation is described in
more detail in Section 2.3.

Subsequently the back-annotation correspondences between the C source code and the binary
code are identified. Then, the back-annotation takes place. This is done by code insertion
for the cycle generation and for the dynamic correction of the cycle generation. The
structure and functionality of this code are described in Section 2.2.

Not every impact of the processor architecture on the number of cycles can be predicted
statically. Therefore, if dynamic, data-dependent effects (e.g. branch prediction and caches)
have to be taken into account, additional code needs to be added. Further details concerning
this code are described in Section 2.4.

During the back-annotation, the C program is transformed into a cycle-accurate SystemC
program that can be compiled on the target processor.

One advantage of this approach is a fast execution of the annotated code as the C source code
does not need major changes for the annotation. Also, the generated SystemC code can be
easily used within a SystemC simulation environment. A difficulty in using this approach is
that finding corresponding parts of the binary code in the C source code is hard if the
compiler optimizes or changes the structure of the binary code too much. If this happens,
recompilation techniques [2] have to be used to find the correspondences.

2.2 Annotation of the SystemC code
On the left side of Figure 2 there is the annotation of a piece of C code that corresponds to
a basic block. The right side of this figure shows the cache model and the branch prediction
model that are used during run-time.

  Annotation of C code for a basic block:

    C code corresponding to a basic block
      c = statically predicted number of cycles;
      C code corresponding to the cache analysis blocks of the basic block
      c = c + cycleCalculationICache(tag, iStart, iEnd);
      c = c + cycleCalculationForConditionalBranch();
      generateCycles(c);

  Architectural Model: Cache Model, Branch Prediction Model

Figure 2: General principle for a basic block annotation

At the beginning of the annotated basic block, the annotation tool adds a cycle counter c
that contains the statically determined number of cycles this basic block would use on the
source processor. How this number is calculated is described in more detail in Section 2.3.
In modern processor architectures, the impact of the processor architecture on the number of
executed cycles cannot be completely predicted statically. Especially the branch prediction
and the caches of a processor have a significant impact on the number of used cycles.
Therefore, the statically determined number of cycles has to be corrected dynamically. The
division of the basic block for the calculation of additional cycles of instruction cache
misses, as shown in Figure 2, is explained in Section 2.4.2.

If there is a conditional branch at the end of a basic block, branch prediction has to be
considered and possible correction cycles have to be added. This is described in more detail
in Section 2.4.1.

As shown in Figure 2, the back-annotation tool adds a call to a function that performs cycle
generation at the end of each basic block. This call generates the number of cycles this
basic block would need on the source processor.

In order to guarantee both the fastest possible execution of the code and the highest
possible accuracy, it is possible to choose different accuracy levels of the generated code
that parameterize the annotation tool. The first and fastest one is a purely static
prediction. The second one additionally includes modeling of the branch prediction. The third
one also takes dynamic inclusion of instruction caches into account.

The cycle calculation in these different detail levels will be discussed in more detail in
the following text.

2.3 Static cycle calculation of a basic block
In modern architectures pipeline effects, superscalarity and caches have an important impact
on the execution time. Therefore, a calculation of the execution time of a basic block by
summing up the execution or latency times of the single instructions of this block is too
inaccurate.

Therefore, the incorporation of a pipeline model per basic block becomes necessary [11]. This
model helps to predict pipeline effects and the effects of superscalarity statically. For the
generation of this model, information about the instruction set and the pipelines of the used
processor is needed. This information is contained in the processor description that is used
by the annotation tool. With regard to this, the tool can determine which instructions of the
basic block will be executed in parallel on a superscalar processor and which combinations of
instructions in the basic block will cause pipeline stalls.

With the information gained by the basic block modeling, a prediction is carried out. This
prediction determines the number of cycles the basic block would have needed on the source
processor.

The next section will show how this kind of prediction is improved during run-time, and a
cache model is included.

2.4 Dynamic correction of cycle prediction
As previously described, the actual cycle count a processor needs for executing a sequence of
instructions cannot be predicted correctly in all cases. This is the case if, for example, a
conditional branch at the end of a basic block produces a pipeline flush or if additional
delays occur due to cache misses in instruction caches. The combination of static analysis
and dynamic execution provides a well-suited solution for that problem, since statically
unpredictable effects of branch and cache behavior can be determined during execution. This
is done by inserting appropriate function calls into the translated basic blocks. These calls
interact with the architectural model in order to determine the additional number of cycles
caused by mispredicted branch and cache behavior. At the end of each basic block the
generation of previously calculated cycles (static cycles plus correction cycles) takes place
(Figure 2).

2.4.1 Branch prediction
Conditional branches have different cycle times depending on four different cases resulting
from the combination of predicted and mispredicted branches, as well as taken and non-taken
branches. A correctly predicted branch needs fewer cycles for execution than a mispredicted
one. Furthermore, additional cycles can be needed if a correctly predicted branch is taken.
This problem is solved by implementing a model of the branch prediction and by a comparison
of the predicted branch behavior with the executed branch behavior. If dynamic branch
prediction is used, a model of the underlying state machine is implemented and its results
are compared with the executed branch behavior. The cycle count of each possible case is
calculated and added to the cycle count for the entire basic block before the next basic
block is entered.

2.4.2 Instruction cache
Figure 2 shows that, for the simulation of the instruction cache, every basic block of the
translated program has to be divided into several cache analysis blocks. This has to be done
until the tag changes or the basic block ends. After that, a function call to the cache
handling model is added. This code uses a cache model to find out possible cache hits or
misses.

The cache simulation will be explained in more detail in the next few paragraphs. This
explanation will start with a description of the cache model.

  [Diagram with three columns]
  C Program:   C_stmnt1, C_stmnt2, C_stmnt3, C_stmnt4, cycleCalcICache
  Binary Code: asm_inst 1 ... asm_inst l, asm_inst l+1 ... asm_inst 2l, asm_inst 2l+1 ... asm_inst n
  Cache Model: v, tag, lru, data

Figure 3: Correspondence C – assembler – cache line

The cache model
The cache model, as can be seen on the right side of Figure 3, contains data space that is
used for the administration of the cache. In this space, the valid bit, the cache tag and the
least recently used (lru) information (containing the replacement strategy) for each cache
set during run-time is saved.

The number of cache tags and the corresponding number of valid bits that are needed depend on
the associativity of the cache (e.g. for a two-way set associative cache, two sets of tags
and valid bits are needed).

Cache analysis blocks
In the middle of Figure 3, the C source code corresponding to a basic block is divided into
several smaller blocks, the so-called cache analysis blocks. These blocks are needed for the
consideration of the effects of instruction caches. Each one of these blocks contains that
part of a basic block that fits into a single cache line. As every machine language
instruction in such a cache analysis block has the same tag and the same cache index, the
addresses of the instructions can be used to determine how a basic block has to be divided
into cache analysis blocks. This is because each address consists of tag information and
cache index.

The cache index information (iStart to iEnd in Figure 2) is used to determine at which cache
position the instruction with this address is cached. The tag information is used to
determine which address was cached, as there can be multiple addresses with the same cache
index. Therefore, a changed cache tag can be easily determined during the traversal of the
binary code with respect to the cache parameters. The block offset information is not needed
for the cache simulation, as no real caching of data takes place.

After the tag has changed or at the end of a basic block, a function call that handles the
simulated cache and the calculation of the additional cycles of cache misses is added to this
block. More details about this function are described in the next section.

Cycle calculation code
As previously mentioned, each cache analysis block is characterized by a combination of tag
and cache set index information. At the end of each basic block, a call to a function is
included. During run-time, this function should determine whether the different cache
analysis blocks which the basic block consists of are in the simulated cache or not. This
way, cache misses are detected.

The function is shown in Figure 4. It has the tag and the range of cache set indices (iStart
to iEnd) as parameters.
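
As an illustration of the cache model and the lookup described above, the following C sketch implements a two-way set associative instruction cache that stores only valid bits, tags and lru information. It is a sketch under stated assumptions rather than the tool's implementation: the cache geometry (32-byte lines, 128 sets), the miss penalty of 10 cycles and the helper names addrIndex and addrTag are hypothetical. In this sketch the additional cycles are accumulated over all set indices from iStart to iEnd, which is one possible reading of the cycle correction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed cache parameters; real values would come from the processor description. */
enum { LINE_BITS = 5, SET_BITS = 7, NUM_SETS = 1 << SET_BITS, WAYS = 2, MISS_PENALTY = 10 };

/* Administration data per cache set: no instruction data is stored. */
typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    unsigned lru;                 /* way that was least recently used */
} CacheSet;

CacheSet icache[NUM_SETS];        /* zero-initialized: all entries invalid */

/* An address consists of tag, cache index and block offset; the offset is ignored. */
uint32_t addrIndex(uint32_t addr) { return (addr >> LINE_BITS) & (uint32_t)(NUM_SETS - 1); }
uint32_t addrTag(uint32_t addr)   { return addr >> (LINE_BITS + SET_BITS); }

/* Returns the additional cycles caused by instruction cache misses. */
unsigned cycleCalculationICache(uint32_t tag, unsigned iStart, unsigned iEnd)
{
    unsigned extra = 0;
    assert(iStart <= iEnd && iEnd < (unsigned)NUM_SETS);
    for (unsigned index = iStart; index <= iEnd; index++) {
        CacheSet *set = &icache[index];
        int hit = -1;
        for (int w = 0; w < WAYS; w++)
            if (set->valid[w] && set->tag[w] == tag)
                hit = w;
        if (hit >= 0) {
            /* cache hit: only renew the lru information */
            set->lru = 1u - (unsigned)hit;
        } else {
            /* cache miss: overwrite the least recently used way */
            unsigned victim = set->lru;
            set->tag[victim] = tag;
            set->valid[victim] = true;
            set->lru = 1u - victim;
            extra += MISS_PENALTY;
        }
    }
    return extra;
}
```

With these assumed parameters, the address 0x80001040 maps to set index 2 with tag 0x80001. A first call for this tag over the sets 2 to 3 reports two misses, and an immediate second call reports none.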


int cycleCalculationICache( tag, iStart, iEnd )
{
  for index = iStart to iEnd
    if tag is found in index and valid bit is set then
    { // cache hit
      renew lru information
      return 0
    }
    else
    { // cache miss
      use lru information to determine tag to overwrite
      write new tag
      set valid bit of written tag
      renew lru information
      return additional cycles needed for cache miss
    }
  end for
}

Figure 4: Function for cache cycle correction

To find out if there is a cache hit or a cache miss, the function checks whether the tag of
each cache analysis block can be found in the specified set and whether the valid bit for the
found tag is set. If the tag can be found and the valid bit is set, the block is already
cached (cache hit) and no additional cycles are needed. Only the lru information has to be
renewed.

In all other cases, the lru information has to be used to determine which tag has to be
overwritten. After that, the new tag has to be written instead of the found old one, and the
valid bit for this tag has to be set. The lru information has to be renewed as well. In a
last step, the additional cycles are returned and added to the cycle correction counter.

2.4.3 Consideration of task switches
In modern embedded systems, software performance simulation has to deal with task switching
and multiple interrupts. Cooperative task scheduling can already be handled by the previously
mentioned approach since the presented cache model is able to cope with non-preemptive task
switches. Interrupts, cooperative, and non-preemptive task scheduling can be handled
similarly because task preemption is usually implemented by using software interrupts.
Therefore, the incorporation of interrupts is discussed in the following.

Software interrupts had to be included in the SystemC model. This has been achieved by
automatic insertion of dedicated preemption points after cycle calculation. This approach
provides the integration of different user-defined task scheduling policies, and a task
switch generates a software interrupt. Since cycle calculation is completed before a task
switch is executed and a global cache and branch prediction model is used, no other changes
are necessary. A minor deviation of the cycle count at certain processes can occur because
the actual task switch is carried out with a small delay caused by the projection of task
preemption at binary code level to C/C++ source code level. Nevertheless, the total cycle
count is still correct. The accuracy can be increased by insertion of cycle calculation code
after each C/C++ statement.

If the additional delay caused by the context switch itself has to be included, the (binary)
code of the context switch routine can be treated like any other code.

3. EXPERIMENTAL RESULTS
In order to test the execution speed and the accuracy of the translated code, a few examples
were compiled using a C compiler into object code for the Infineon TriCore processor [7].
This object code was also used to generate annotated SystemC code from the C code as
described in Section 2.1. As reference, the execution speed and the cycle count of the
TriCore code have been measured on a TriCore TC10GP evaluation board and on a TriCore ISS [8].

The examples consist of two filters (fir, ellip) and two programs that are part of audio
decoding routines (dpcm, subband).

  [Bar chart "Speed". Y-axis: Million Instructions Per Second, 0 to 160. X-axis: dpcm, fir,
  ellip, subband. Series: TriCore Evaluation Board, Annotated SystemC 1, Annotated SystemC 2,
  TriCore ISS.]

Figure 5: Comparison of speed

Figure 5 shows the comparison of the execution speed of the generated code with the execution
speed of the TriCore evaluation board and the ISS. The execution speed in this figure is
given in million TriCore instructions per second. The Athlon 64 processor running the SystemC
code and the ISS had a clock rate of 2.4 GHz. The TriCore processor of the evaluation board
ran at 48 MHz.

Using the annotated SystemC code, two different types of annotation have been used: the first
one generates the cycles after the execution of each basic block, the second one adds cycles
to a cycle counter after each basic block. In the second variant, the cycles are only
generated when it is necessary (e.g. when communication with the hardware takes place). This
is much more efficient and is depicted in Figure 5.

The execution speed of the TriCore processor ranges from 36.8 to 50.8 million instructions
per second, whereas the execution speed of the annotated SystemC models with immediate cycle
generation ranges from 3.5 to 5.7 million simulated TriCore instructions per second. This
means that the execution speed of the SystemC model is only about ten times slower than the
speed of a real processor. The execution speed of the annotated SystemC code with on-demand
cycle generation ranges from 11.2 to 149.9 million TriCore instructions per second.

In order to compare the SystemC execution speed with the execution speed of a conventional
ISS, the same examples were run using the TriCore ISS. The result was an execution speed
ranging from 1.5 to 2.4 million instructions per second. This means our approach delivers a
speed increase of up to 91%.

A comparison of the number of simulated cycles of the generated SystemC code using branch
prediction and cache simulation with the number of executed cycles of the TriCore evaluation
board is shown in Figure 6. The deviation of the cycle counts of the translated programs
(with branch prediction and caches included) compared to the measured cycle count from the
evaluation board ranges from 4% for the program fir to 7% for the program dpcm. This is in
the same range as it is using a conventional ISS.
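
The difference between the two annotation variants compared in Figure 5 can be sketched in plain C as follows. This is only an illustration: the names advanceTime, addCycles, flushCycles and hwWrite are hypothetical, and advanceTime is a stand-in for whatever synchronization with the simulation kernel the generated SystemC code performs, which is assumed here to be the costly step.

```c
#include <assert.h>
#include <stdint.h>

uint64_t sim_time_cycles = 0;  /* time visible to the rest of the simulation */
unsigned sync_calls = 0;       /* number of synchronizations with the kernel */
uint64_t pending_cycles = 0;   /* cycles counted but not yet generated */

/* Stand-in for advancing simulation time (assumed to be expensive). */
static void advanceTime(uint64_t c) { sim_time_cycles += c; sync_calls++; }

/* Variant 1: generate the cycles after each basic block. */
void generateCycles(unsigned c) { advanceTime(c); }

/* Variant 2: only add the cycles to a counter after each basic block ... */
void addCycles(unsigned c) { pending_cycles += c; }

/* ... and generate them when it is necessary. */
void flushCycles(void)
{
    if (pending_cycles > 0) {
        advanceTime(pending_cycles);
        pending_cycles = 0;
    }
}

/* Hypothetical hardware access: time must be up to date before communicating. */
void hwWrite(uint32_t *reg, uint32_t value)
{
    flushCycles();
    *reg = value;
}
```

For 100 basic blocks of 7 cycles each followed by one hardware access, both variants arrive at the same simulated time of 700 cycles, but the first one synchronizes 100 times and the second one only once.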


[Bar chart "Accuracy": cycles (0 to 37500) for the programs dpcm, fir, ellip and subband; series: TriCore Evaluation Board, Annotated SystemC, TriCore ISS]

Figure 6: Comparison of cycle-accuracy

4. OUTLOOK
As clock frequencies cannot be increased as linearly as the number of cores, modern processor architectures can consist of more than one core to meet computational demands. The different cores can share architectural resources such as data caches to speed up access to common data. Therefore, access conflicts and coherency protocols have a potential impact on the run-times of the tasks running on the cores.

The incorporation of multiple cores is directly supported by our SystemC approach. Parallel tasks can easily be assigned to different cores, and the code instrumentation by cycle information can be carried out independently. However, shared caches can have a significant impact on the number of executed cycles. This can be solved by inclusion of a shared cache model that executes global cache coherence protocols like the MESI protocol. A clock calculation after each C/C++ statement is strongly recommended here to increase the accuracy.

5. CONCLUSIONS
This paper presented a hybrid approach for high-performance timing simulation of embedded software. The approach was implemented in an automated design flow. The methodology is based on the generation of SystemC code out of the original C code and back-annotation of statically determined cycle information into the generated code. Additionally, the impact of data dependencies on the software run-time is analytically handled during simulation. Promising experimental results from the application of the implemented design flow were presented. These results show a high execution performance of the timed embedded software model as well as good accuracy. Furthermore, the created SystemC models representing the timed embedded software can be easily integrated into SystemC virtual prototypes owing to the generated TLM interfaces.

6. REFERENCES
[1] K. Albers, F. Bodmann, and F. Slomka. Hierarchical Event Streams and Event Dependency Graphs: A New Computational Model for Embedded Real-Time Systems. In Proceedings of the 18th Euromicro Conference on Real-Time Systems (ECRTS), pages 97–106, 2006.
[2] C. Cifuentes. Reverse Compilation Techniques. PhD thesis, Queensland University of Technology, 19. Nov. 1994.
[3] CoWare Inc. CoWare Processor Designer. http://www.coware.com/PDF/products/ProcessorDesigner.pdf.
[4] A. Donlin. Transaction Level Modeling: Flows and Use Models. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 75–80, 2004.
[5] R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst. System Level Performance Analysis – the SymTA/S Approach. IEE Proceedings Computers and Digital Techniques, 152(2):148–166, March 2005.
[6] IEEE Computer Society. IEEE Standard SystemC® Language Reference Manual, Mar. 2006.
[7] Infineon Technologies AG. TC10GP Unified 32-bit Microcontroller-DSP – User’s Manual, 2000.
[8] Infineon Technologies Corp. TriCore™ 32-bit Unified Processor Core – Volume 1: v1.3 Core Architecture, 2005.
[9] S. Kraemer, L. Gao, J. Weinstock, R. Leupers, G. Ascheid, and H. Meyr. HySim: A Fast Simulation Framework for Embedded Software Development. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 75–80, 2007.
[10] S. Künzli, F. Poletti, L. Benini, and L. Thiele. Combining Simulation and Formal Methods for System-Level Performance Analysis. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pages 236–241, 2006.
[11] S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, S.-M. Moon, and C. S. Kim. An Accurate Worst Case Timing Analysis for RISC Processors. IEEE Transactions on Software Engineering, 21(7):593–604, 1995.
[12] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann. A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation. In Proceedings of the 39th Design Automation Conference (DAC), pages 22–27, 2002.
[13] OPNET Technologies, Inc. http://www.opnet.com.
[14] M. Oyamada, F. Wagner, M. Bonaciu, W. Cesário, and A. Jerraya. Software Performance Estimation in MPSoC Design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 38–43, 2007.
[15] K. Richter, M. Jersak, and R. Ernst. A Formal Approach to MpSoC Performance Verification. Computer, 36(4):60–67, 2003.
[16] G. Schirner, A. Gerstlauer, and R. Dömer. Abstract, Multifaceted Modeling of Embedded Processors for System Level Design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 384–389, 2007.
[17] J. Schnerr, O. Bringmann, and W. Rosenstiel. Cycle Accurate Binary Translation for Simulation Acceleration in Rapid Prototyping of SoCs. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pages 792–797, 2005.
[18] A. Siebenborn, A. Viehl, O. Bringmann, and W. Rosenstiel. Control-Flow Aware Communication and Conflict Analysis of Parallel Processes. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 32–37, 2007.
[19] Synopsys, Inc. Synopsys Virtual Platforms. http://www.synopsys.com/products/designware/virtual_platforms.html.
[20] L. Thiele, S. Chakraborty, and M. Naedele. Real-time Calculus for Scheduling Hard Real-Time Systems. In IEEE International Symposium on Circuits and Systems (ISCAS), volume 4, pages 101–104, 2000.
[21] VaST Systems Technology. CoMET®. http://www.vastsystems.com/docs/CoMET_mar2007.pdf.
[22] A. Viehl, M. Schwarz, O. Bringmann, and W. Rosenstiel. Probabilistic Performance Risk Analysis at System-Level. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 185–190, 2007.
[23] A. Viehl, T. Schönwald, O. Bringmann, and W. Rosenstiel. Formal Performance Analysis and Simulation of UML/SysML Models for ESL Design. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pages 242–247, 2006.
[24] T. Wild, A. Herkersdorf, and G.-Y. Lee. TAPES – Trace-Based Architecture Performance Evaluation with SystemC. Design Automation for Embedded Systems, 10(2–3):157–179, Sept. 2005.
