


High-Performance Timing Simulation of Embedded Software ∗

Jürgen Schnerr (2), Oliver Bringmann (1), Alexander Viehl (1) and Wolfgang Rosenstiel (1, 2)

(1) FZI Forschungszentrum Informatik, Haid-und-Neu-Str. 10–14, 76131 Karlsruhe, Germany
(2) Universität Tübingen, Sand 13, 72076 Tübingen, Germany

{jschnerr, bringmann, viehl, rosenstiel}@fzi.de

ABSTRACT
This paper presents an approach for cycle-accurate simulation of embedded software by integration in an abstract SystemC model. Compared to existing simulation-based approaches, we present a hybrid method that resolves performance issues by combining the advantages of simulation-based and analytical approaches. In a first step, cycle-accurate static execution time analysis is applied at each basic block of a cross-compiled binary program using static processor models. After that, the determined timing information is back-annotated into SystemC for fast simulation of all effects that cannot be resolved statically. This allows the consideration of data dependencies during run-time and the incorporation of branch prediction and cache models by efficient source code instrumentation. The major benefit of our approach is that the generated code can be executed very efficiently on the simulation host with approximately 90% of the speed of the untimed software without any code instrumentation.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids; C.4 [Performance of Systems]: Measurement Techniques

General Terms
Design, Measurement, Performance

Keywords
Software Timing, Simulation Acceleration, Virtual Prototypes

∗ This work has been partially supported by the BMBF project VISION under grant number 01M3078B and by the DFG projects CaNoC under grant number BR 2321/1-1 and ASOC under grant number RO 1030/14-1 and 2.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 2008, June 8–13, 2008, Anaheim, California, USA
Copyright 2008 ACM 978-1-60558-115-6/08/0006...5.00

1. INTRODUCTION
In the future, new system functionality will be realized less by the sum of single components, and more by cooperation, interconnection and distribution of these components, thereby leading to distributed embedded systems. Also, new applications and innovations arise more and more from a distribution of functionality as well as from the combination of previously independent functions. Therefore, in the future this distribution will play an important part in the increase of the product value.

The system responsibility of the supplier is also currently increasing. This means that the supplier will not only be responsible for the designed subsystem, but additionally for the integration of the subsystem in the context of the entire system. This integration is becoming more complex: today, requirements of single components are validated; in the future, the requirement validation of the entire system has to be achieved with regard to the designed component.

What this means is that changes in the product area will lead to a paradigm change in design. Even in the design stage, the impact of a component on an entire system has to be considered. A comprehensive modeling of distributed systems, and an early analysis and simulation of the system integration, have to be considered.

Therefore, a methodical design process of distributed embedded systems must be applied. This can be implemented using a comprehensive modeling of distributed systems and using platform independent development of the application software (UML, Matlab/Simulink, C++).

What is also important is the early inclusion of the intended target platform in model based system design (UML), the mapping of function blocks on platform components, and the use of virtual prototypes for the abstract modeling of the target architecture.

An early evaluation of the target platform means that the application software can be evaluated with consideration to the target platform. Hence, an optimization of the target platform with consideration given to the application software, performance requirements, power dissipation and reliability can take place.

An early analysis of the system integration also means early verification based on virtual prototypes and exposure of integration faults using virtual prototypes. After that, a seamless transition to the real prototype can take place.

1.1 Performance Analysis of Distributed Embedded Systems
The main question of performance analysis of distributed embedded systems is: What is the timing behavior of a system and how can it be determined? The central issue is that computation has no timing behavior as long as the target platform is unknown because the target platform has a major effect on timing.

Nevertheless, the specification can contain global performance requirements. The fulfillment of global performance requirements depends on local timing behavior of system parts. A solution for this problem is an early inclusion of the target architecture.

Several analytical and simulative approaches for performance analysis have previously been proposed. In this paper a hybrid approach for performance analysis will be presented. This hybrid approach combines the advantages of analytical and simulative approaches.

Analytical approaches are based on a formal analysis of the corner cases based on a system model. The approaches can be divided into two categories: black-box approaches and white-box approaches. Furthermore, both approaches can be categorized depending on the level of system abstraction and with regard to the used model of computation.

Black-box approaches consider functional system components as black boxes and abstract from their internal behavior.

This can be an abstract task model [15] with task activation and event streams on task level representing activation patterns. Using event stream propagation, a fixed point is determined. For this, no modification of the event streams has to be done. Examples of black-box approaches are the real-time calculus [20] and the system-level composition by event stream propagation as it is used in SymTA/S [5].

White-box approaches include an abstract control flow representation of each process in the system model. Then a global performance and communication analysis considering (data-dependent) control structures of all processes can take place. For this analysis, an extraction of the control flow from application software or UML models [23] needs to be done. Then, an environment modeling using event models or processes can take place. Examples of white-box approaches are the communication dependency analysis [18] and the control flow based extraction of hierarchical event streams [1].

Analytical approaches that rely only on best-case and worst-case bounds produce results that are often too pessimistic, such that risk estimation for real scenarios is difficult to carry out. Different analytic approaches try to tackle this issue by considering probabilities of timing quantities in white-box system analysis.

A hybrid approach to combining simulation and formal analysis for tightening bounds of system level performance analysis was presented in [10]. The objectives are to determine timing characteristics of not formally specified components by simulation and to integrate simulation results into a framework for formal performance analysis. In comparison to this approach, we focus on a fast timing simulation of embedded software. The results determined using our approach might be included in system level performance methodologies with the benefit of high accuracy and time savings in the simulation stage.

Analytic performance risk quantification based on profiled execution times is presented in [22]. The model is derived from real implementations. Although it is able to represent the temporal behavior of communication, computation and synchronization, data-dependent timing effects cannot be detected thoroughly.

Simulative approaches perform a simulation of the entire communication infrastructure and the processing elements. If necessary, this simulation includes hardware IP.

A simulation of a network between communicating C/C++ processes can be done using a network simulator such as OPNET [13], Simulink or SystemC [6]. Timing annotation of such a network simulation is possible, but the exact timing behavior of the software is missing. To obtain this timing behavior, it is necessary to do a simulation of the software execution on the target processor. For this simulation, the binary code for the target platform component is required.

This binary code can run in an instruction set simulator (ISS). An ISS implements an abstract modeling of the instruction execution. It does not consider bus behavior modeling. Such an ISS can be implemented either as interpreter or as binary code translator. Binary code translation can be implemented in two different ways: either as static or as dynamic compilation, also called just-in-time (JIT) compilation [12]. An ISS is used in several commercial solutions, like the CoWare Processor Designer [3], CoMET from VaST Systems Technology [21] or Synopsys Virtual Platforms [19].

Furthermore, the binary code can be executed using a processor model that includes the complete processor (functional units, pipelines, caches, registers, counters, etc.). Such a model can have several levels of accuracy. For example, it can be a transaction level model or a register transfer model.

In addition to the simulation of the processor, a simulation of peripheral components and custom hardware must be done, either as a co-simulation with HDL simulators or using SystemC.

An abstract processor model with an integrated RTOS model using task scheduling was presented in [16]. Additionally, a processor model using neural networks for execution cycle estimation was presented in [14]. A transaction level approach for the performance evaluation of SoC architectures was presented in [24].

A hybrid model for fast simulation that allows switching between native code execution and ISS-based simulation was presented in [9]. Another approach using a hybrid model was shown in [17]. This approach is based on the translation of object code into annotated binary code for the target processor. For the cycle-accurate execution of the annotated code on this processor, special hardware is needed.

2. PROPOSED HYBRID APPROACH
Here we will present a hybrid approach for performance simulation of embedded software. Hybrid approaches consist of a combination of analytic and simulative approaches with the objective of gaining simulation speed without a notable loss of accuracy.

Static worst case/best case execution time (WCET/BCET) analysis abstracts the influence of data dependencies on the software execution time. Due to this, BCET/WCET analysis delivers very good results for entire basic blocks, but it is too pessimistic across basic block boundaries. Further, the effects of concurrent cache usage of different applications on multi-core architectures lead to even wider bounds. An analytic solution for this issue is still unknown. The objective of the presented approach is the reduction of pessimism that is contained in WCET/BCET boundaries.

Simulative techniques that consider an application with concrete input data and the target architecture can be used to determine the timing behavior of software on the underlying architecture. The proposed approach tries to prevent repeated time-consuming interpretation and repeated timing determination of all executed binary code instructions on the target architecture.

The hybrid approach provided in this paper applies back-annotation of WCET/BCET values. These values are determined statically on basic block level in binary code that was generated from C source code. Additionally, the timing impact of data-dependent architectural properties like branch prediction is also considered effectively. The tool that implements the proposed methodology generates SystemC code. This code can be compiled for any host machine to be used afterward for a target platform independent simulation.

Communication calls in the automatically created SystemC models are encapsulated in TLM [4] communication primitives. This way, a clean and standardized ability to integrate timed embedded software in virtual SystemC prototypes is given.

One major advantage of the presented methodology is in the area of multi-core processors with shared caches. Whereas static analysis has no knowledge of concurrent cache usage of different applications and the impact on execution time, the presented methodology is able to handle these issues. How this is done will be described in more detail in Section 2.4.3.

Another possibility would be a translation of binary code into annotated SystemC code. However, this approach has some major disadvantages. One main drawback is, for example, that the same problems that have to be solved in static compilation (binary translation) have to be solved here (e.g. addresses of calculated branch targets have to be determined). Another disadvantage is that the automatically generated code is not easy for humans to read.

2.1 Back-annotation of WCET/BCET values
In this section, we will describe our approach in more detail. Figure 1 shows an overview of this approach.

[Figure 1: Back-annotation of WCET/BCET values. Flow: C Source Code -> C Compiler -> Binary Code. The Back-annotation Tool takes the C source code, the binary code and the Processor Description as inputs. Analysis of the binary code: construction of intermediate representation, building of basic blocks, static cycle calculation. Analysis of the source code: find correspondences between C source code and binary code. Back-annotation: insertion of cycle generation code, insertion of dynamic correction code. Output: Annotated SystemC Program.]

First, the C source code has to be taken and translated using an ordinary C (cross)-compiler into binary code for the embedded processor (source processor). After that, our back-annotation tool reads the object file and a description of the used source processor. This description contains information about the instruction set, the pipelines, the caches and the branch prediction of the source processor. Using this description, the object code is decoded and translated into an intermediate representation consisting of a list of objects. Each of these objects represents one intermediate instruction.

In a next step, the basic blocks of this program are determined using the list containing the translated program. As a result, a list of basic blocks is built.

After that, a static calculation of the number of cycles each basic block would have taken on the source processor is made using the pipeline description. This calculation is described in more detail in Section 2.3.

Subsequently, the back-annotation correspondences between the C source code and the binary code are identified. Then, the back-annotation takes place. This is done by code insertion for the cycle generation and for the dynamic correction of the cycle generation. The structure and functionality of this code are described in Section 2.2.

Not every impact of the processor architecture on the number of cycles can be predicted statically. Therefore, if dynamic, data-dependent effects (e.g. branch prediction and caches) have to be taken into account, additional code needs to be added. Further details concerning this code are described in Section 2.4.

During the back-annotation, the C program is transformed into a cycle-accurate SystemC program that can be compiled on the target processor.

One advantage of this approach is a fast execution of the annotated code, as the C source code does not need major changes for the annotation. Also, the generated SystemC code can be easily used within a SystemC simulation environment. One difficulty of this approach is that finding corresponding parts of the binary code in the C source code is difficult if the compiler optimizes or changes the structure of the binary code too much. If this happens, recompilation techniques [2] have to be used to find the correspondences.

2.2 Annotation of the SystemC code
On the left side of Figure 2 there is the annotation of a piece of C code that corresponds to a basic block. The right side of this figure shows the cache model and the branch prediction model that are used during run-time.

At the beginning of the annotated basic block, the annotation tool adds a cycle counter c that contains the statically determined number of cycles this basic block would use on the source processor. How this number is calculated is described in more detail in Section 2.3. In modern processor architectures, the impact of the processor architecture on the number of executed cycles cannot be completely predicted statically. Especially the branch prediction and the caches of a processor have a significant impact on the number of used cycles. Therefore, the statically determined number of cycles has to be corrected dynamically. The division of the basic block for the calculation of additional cycles of instruction cache misses, as shown in Figure 2, is explained in Section 2.4.2.

If there is a conditional branch at the end of a basic block, branch prediction has to be considered and possible correction cycles have to be added. This is described in more detail in Section 2.4.1.

As shown in Figure 2, the back-annotation tool adds a call to a function that performs cycle generation at the end of each basic block. This call generates the number of cycles this basic block would need on the source processor.

[Figure 2: General principle for a basic block annotation.
  Left, annotation of C code for a basic block (C code corresponding to a basic block):
      c=statically predicted number of cycles ;
      C code corresponding to the cache analysis blocks of the basic block
      c=c+cycleCalculationICache( tag,iStart,iEnd );
      c=c+cycleCalculationForConditionalBranch();
      generateCycles(c);
  Right, Architectural Model: Cache Model, Branch Prediction Model.]

In order to guarantee both – as fast as possible execution of the code as well as the highest possible accuracy – it is possible to
choose different accuracy levels of the generated code that parame-             several cache analysis blocks. This has to be done until the tag
terize the annotation tool. The first and fastest one is a purely static         changes or the basic block ends. After that, a function call to the
prediction. The second one additionally includes modeling of the                cache handling model is added. This code uses a cache model to
branch prediction. The third one also takes dynamic inclusion                   find out possible cache hits or misses.
of instruction caches into account.                                                The cache simulation will be explained in more detail in the next
   The cycle calculation in these different detail levels will be dis-          few paragraphs. This explanation will start with a description of
cussed in more detail in the following text.                                    the cache model.
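
The three accuracy levels described above can be illustrated with a minimal C++ sketch of one annotated basic block. It follows the call order shown in Figure 2, but everything beyond the function names taken from that figure is an assumption made for illustration: the `Accuracy` enum, the fixed penalty values, the boolean parameters that stand in for the architectural model, and the plain counter that replaces a SystemC wait so that the sketch runs without SystemC.

```cpp
#include <cstdint>

// Hypothetical accuracy levels mirroring the three levels named in the text.
enum class Accuracy { StaticOnly, WithBranchPrediction, WithInstructionCache };

// Simulated time in source-processor cycles. A SystemC model would advance
// simulation time here; a plain counter keeps the sketch self-contained.
static std::uint64_t g_simulatedCycles = 0;
void generateCycles(unsigned c) { g_simulatedCycles += c; }

// Placeholder correction models. The penalties (3 and 10 cycles) are assumed
// values, and the bool arguments replace the real branch/cache models.
unsigned cycleCalculationForConditionalBranch(bool mispredicted) {
    return mispredicted ? 3u : 0u;
}
unsigned cycleCalculationICache(bool miss) { return miss ? 10u : 0u; }

// One annotated basic block: the original C code plus inserted bookkeeping.
int annotatedBlock(int x, Accuracy level, bool mispredicted, bool icacheMiss) {
    unsigned c = 5;                 // statically predicted cycles (assumed)
    int y = x * 2 + 1;              // original C code of the basic block
    if (level == Accuracy::WithInstructionCache)
        c += cycleCalculationICache(icacheMiss);          // level 3 only
    if (level != Accuracy::StaticOnly)
        c += cycleCalculationForConditionalBranch(mispredicted); // levels 2, 3
    generateCycles(c);              // emitted at the end of the basic block
    return y;
}
```

With these assumed numbers, a mispredicted branch combined with an instruction cache miss costs 5 cycles at the static level, 8 with branch prediction modeling, and 18 with the instruction cache included, while the functional result of the block is the same at every level.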

[Figure 3: Correspondence C – assembler – cache line. Columns: C Program (C_stmnt1, C_stmnt2, C_stmnt3, C_stmnt4, cycleCalcICache); Binary Code (asm_inst 1, asm_inst l, asm_inst l+1, asm_inst 2l, asm_inst 2l+1, asm_inst n); Cache Model (v, tag, lru, data).]

2.3     Static cycle calculation of a basic block
   In modern architectures, pipeline effects, superscalarity and caches have an important impact on the execution time. Therefore, a calculation of the execution time of a basic block by summing up the execution or latency times of the single instructions of this block is too inaccurate.
   Therefore, the incorporation of a pipeline model per basic block becomes necessary [11]. This model helps to predict pipeline effects and the effects of superscalarity statically. For the generation of this model, information about the instruction set and the pipelines of the used processor is needed. This information is contained in the processor description that is used by the annotation tool. With regard to this, the tool can determine which instructions
                                                                                The cache model
of the basic block will be executed in parallel on a superscalar                The cache model, as can be seen on the right side of Figure 3,
processor and which combinations of instructions in the basic block             contains data space that is used for the administration of the cache.
will cause pipeline stalls.                                                     In this space, the valid bit, the cache tag and the least recently used
   With the information gained by the basic block modeling, a pre-              (lru) information (containing the replacement strategy) for each
diction is carried out. This prediction determines the number of                cache set during the run-time is saved.
cycles the basic block would have needed on the source processor.                  The number of cache tags and the corresponding number of valid bits
   The next section will show how this kind of prediction is im-                that are needed depends on the associativity of the cache (e.g. for
proved during run-time, and a cache model is included.                          a two-way set associative cache, two sets of tags and valid bits are
                                                                                needed).
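
The cache model described above can be sketched as a small C++ class that stores only administrative state (valid bit, tag and LRU information) per way of each set, and no instruction data. This is a minimal illustration under stated assumptions, not the tool's implementation: the class name `ICacheModel`, the method `missCycles`, the timestamp-based LRU bookkeeping and the fixed miss penalty are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Administrative state of one way in a set; no cached data is stored.
struct CacheWay {
    bool valid = false;
    std::uint32_t tag = 0;
    std::uint64_t lastUse = 0;   // LRU information: larger means more recent
};

class ICacheModel {
public:
    ICacheModel(unsigned numSets, unsigned ways, unsigned missPenalty)
        : numSets_(numSets), ways_(ways), missPenalty_(missPenalty),
          lines_(static_cast<std::size_t>(numSets) * ways) {}

    // Extra cycles for fetching a cache analysis block identified by its tag
    // and set index: 0 on a hit, the (assumed) miss penalty on a miss.
    unsigned missCycles(std::uint32_t tag, unsigned setIndex) {
        CacheWay* set =
            &lines_[static_cast<std::size_t>(setIndex % numSets_) * ways_];
        ++clock_;
        CacheWay* victim = &set[0];
        for (unsigned w = 0; w < ways_; ++w) {
            if (set[w].valid && set[w].tag == tag) {   // hit: refresh LRU
                set[w].lastUse = clock_;
                return 0;
            }
            // Invalid ways keep lastUse == 0, so they are chosen first;
            // otherwise the least recently used way becomes the victim.
            if (set[w].lastUse < victim->lastUse) victim = &set[w];
        }
        victim->valid = true;                          // miss: replace victim
        victim->tag = tag;
        victim->lastUse = clock_;
        return missPenalty_;
    }

private:
    unsigned numSets_, ways_, missPenalty_;
    std::uint64_t clock_ = 0;
    std::vector<CacheWay> lines_;   // numSets_ * ways_ entries, set-major
};
```

For a two-way set associative configuration, each set holds two `CacheWay` entries, which matches the remark that two sets of tags and valid bits are needed. A call such as `cache.missCycles(tag, index)` would be the kind of lookup that the inserted cycle correction code performs for each cache analysis block.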
2.4     Dynamic correction of cycle prediction
   As previously described, the actual cycle count a processor needs            Cache analysis blocks
for executing a sequence of instructions cannot be predicted                    In the middle of Figure 3, the C source code that corresponds
correctly in all cases. This is the case if, for example, a conditional         to a basic block is divided into several smaller blocks, the so-called
branch at the end of a basic block produces a pipeline flush or if ad-           cache analysis blocks. These blocks are needed for the considera-
ditional delays occur due to cache misses in instruction caches. The            tion of the effects of instruction caches. Each one of these blocks
combination of static analysis and dynamic execution provides a                 contains that part of a basic block that fits into a single cache line.
well-suited solution for that problem, since statically unpredictable              As every machine language instruction in such a cache analysis
effects of branch and cache behavior can be determined during exe-              block has the same tag and the same cache index, the addresses of
cution. This is done by inserting appropriate function calls into the           the instructions can be used to determine how a basic block has to
translated basic blocks. These calls interact with the architectural            be divided into cache analysis blocks. This is because each address
model in order to determine the additional number of cycles caused              consists of tag information and cache index.
by mispredicted branch and cache behavior. At the end of each                      The cache index information (iStart to iEnd in Figure 2) is used
basic block the generation of previously calculated cycles (static              to determine at which cache position the instruction with this ad-
cycles plus correction cycles) takes place (Figure 2).                          dress is cached. The tag information is used to determine which
                                                                                address was cached, as there can be multiple addresses with the
2.4.1     Branch prediction                                                     same cache index. Therefore, a changed cache tag can be easily
   Conditional branches have different cycle times depending on                 determined during the traversal of the binary code with respect to
four different cases resulting from the combination of predicted and            the cache parameters. The block offset information is not needed
mispredicted branches, as well as taken and non-taken branches. A               for the cache simulation, as no real caching of data takes place.
correctly predicted branch needs less cycles for execution than a                  After the tag has been changed or at the end of a basic block, a
mispredicted one. Furthermore, additional cycles can be needed if               function call that handles the simulated cache and the calculation of
a correctly predicted branch is taken. This problem is solved by im-            the additional cycles of cache misses is added to this block. More
plementing a model of the branch prediction and by a comparison                 details about this function are described in the next section.
of the predicted branch behavior with the executed branch behav-
ior. If dynamic branch prediction is used, a model of the underlying            Cycle calculation code
state machine is implemented and its results are compared with the              As previously mentioned, each cache analysis block is character-
executed branch behavior. The cycle count of each possible case is              ized by a combination of tag and cache set index information. At
calculated and added to the cycle count for the entire basic block              the end of each basic block, a call to a function is included. Dur-
before the next basic block is entered.                                         ing run-time, this function should determine whether the different
                                                                                cache analysis blocks which the basic block consists of are in the
2.4.2     Instruction cache                                                     simulated cache or not. This way, cache misses are detected.
  Figure 2 shows that, for the simulation of the instruction cache,                The function is shown in Figure 4. It has the tag and the range of
every basic block of the translated program has to be divided into              cache set indices (iStart to iEnd) as parameters.
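
   As an illustration of how a basic block could be divided along cache line
boundaries, the following C sketch derives tag and index from instruction
addresses and groups consecutive lines that share one tag. The cache geometry
(LINE_BYTES, NUM_SETS), the type CacheCall and the function split_basic_block
are assumptions made for this example; they are not taken from the described
tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache geometry; the paper does not state these values. */
#define LINE_BYTES 32u   /* bytes per cache line           */
#define NUM_SETS   128u  /* number of cache sets (indices) */

/* One call to the cache model: a tag plus a range of set indices. */
typedef struct {
    uint32_t tag;
    uint32_t i_start;
    uint32_t i_end;
} CacheCall;

/* Address layout assumed here: [ tag | index | block offset ]. */
static uint32_t addr_index(uint32_t addr) { return (addr / LINE_BYTES) % NUM_SETS; }
static uint32_t addr_tag(uint32_t addr)   { return addr / (LINE_BYTES * NUM_SETS); }

/* Walk the cache lines covered by [start, start + len) and group them into
 * runs that share one tag. Returns the number of entries written to out.
 * Address wrap-around at the top of the address space is not handled. */
size_t split_basic_block(uint32_t start, uint32_t len,
                         CacheCall *out, size_t max_out)
{
    assert(len > 0 && out != NULL);
    size_t n = 0;
    uint32_t first_line = start / LINE_BYTES;
    uint32_t last_line  = (start + len - 1u) / LINE_BYTES;

    for (uint32_t line = first_line; line <= last_line; ++line) {
        uint32_t addr = line * LINE_BYTES;
        uint32_t tag  = addr_tag(addr);
        uint32_t idx  = addr_index(addr);
        if (n > 0 && out[n - 1].tag == tag) {
            out[n - 1].i_end = idx;      /* same tag: extend the index range */
        } else {
            if (n == max_out) break;     /* output buffer is full */
            out[n].tag     = tag;
            out[n].i_start = idx;
            out[n].i_end   = idx;
            ++n;
        }
    }
    return n;
}
```

The block offset is discarded on purpose, matching the statement above that it
is not needed for the cache simulation. Each resulting entry corresponds to
one inserted call with a tag and an index range.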


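   The run-time check that Figure 4 gives as pseudocode could be written in C
roughly as follows. This is a minimal sketch under stated assumptions: a
two-way set associative cache with one LRU field per set, a hypothetical miss
penalty MISS_PENALTY, and a hypothetical set count NUM_SETS. Unlike the
pseudocode, which returns inside the loop, this reading accumulates the miss
cycles over the whole index range; that choice is an assumption of the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS     128u  /* hypothetical number of sets            */
#define NUM_WAYS     2u    /* two-way set associative, as an example */
#define MISS_PENALTY 8     /* hypothetical extra cycles per miss     */

typedef struct {
    uint32_t tag[NUM_WAYS];
    bool     valid[NUM_WAYS];
    unsigned lru;          /* way that is replaced next (least recently used) */
} CacheSet;

static CacheSet icache[NUM_SETS];  /* zero-initialised: all lines invalid */

/* Returns the correction cycles for the lines i_start..i_end sharing tag. */
int cycle_calculation_icache(uint32_t tag, unsigned i_start, unsigned i_end)
{
    assert(i_start <= i_end && i_end < NUM_SETS);
    int extra = 0;

    for (unsigned index = i_start; index <= i_end; ++index) {
        CacheSet *set = &icache[index];
        bool hit = false;

        for (unsigned way = 0; way < NUM_WAYS; ++way) {
            if (set->valid[way] && set->tag[way] == tag) {
                set->lru = 1u - way;   /* cache hit: other way becomes LRU */
                hit = true;
                break;
            }
        }
        if (!hit) {
            unsigned victim = set->lru;  /* cache miss: overwrite LRU way */
            set->tag[victim]   = tag;
            set->valid[victim] = true;
            set->lru = 1u - victim;      /* valid for two ways only */
            extra += MISS_PENALTY;
        }
    }
    return extra;
}
```

No instruction data is stored, only tags, valid bits and LRU state, since the
model only has to decide between hit and miss.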


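   For the branch prediction model of Section 2.4.1, the comparison of
predicted and executed branch behavior might be sketched as below. The
two-bit saturating counter stands in for "the underlying state machine" and
is an assumption, as are the four cycle values and the names BranchPredictor
and branch_correction; actual values would have to come from the processor
documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical correction cycles for the four cases. */
enum {
    CYC_PRED_NOT_TAKEN    = 0,  /* predicted correctly, branch not taken */
    CYC_PRED_TAKEN        = 1,  /* predicted correctly, branch taken     */
    CYC_MISPRED_NOT_TAKEN = 2,  /* mispredicted, branch not taken        */
    CYC_MISPRED_TAKEN     = 3   /* mispredicted, branch taken            */
};

/* Two-bit saturating counter: states 0,1 predict not taken; 2,3 taken. */
typedef struct {
    unsigned state;
} BranchPredictor;

/* Compares the prediction with the executed behaviour, updates the state
 * machine and returns the cycles to add to the basic block's cycle count. */
int branch_correction(BranchPredictor *bp, bool taken)
{
    assert(bp != NULL && bp->state <= 3u);
    bool predicted_taken = bp->state >= 2u;
    bool correct = (predicted_taken == taken);

    /* Update the state machine with the executed behaviour. */
    if (taken && bp->state < 3u) {
        bp->state++;
    } else if (!taken && bp->state > 0u) {
        bp->state--;
    }

    if (correct) {
        return taken ? CYC_PRED_TAKEN : CYC_PRED_NOT_TAKEN;
    }
    return taken ? CYC_MISPRED_TAKEN : CYC_MISPRED_NOT_TAKEN;
}
```

A call of this kind would be placed at the end of a translated basic block
that ends in a conditional branch, with the branch outcome taken from the
executing C code.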
int cycleCalculationICache( tag, iStart, iEnd )
{
  for index = iStart to iEnd
   if tag is found in index and valid bit is set then
   { // cache hit
      renew lru information
      return 0
   }
   else
   { // cache miss
      use lru information to determine tag to overwrite
      write new tag
      set valid bit of written tag
      renew lru information
      return additional cycles needed for cache miss
   }
  end for
}

          Figure 4: Function for cache cycle correction

   To find out if there is a cache hit or a cache miss, the function checks
whether the tag of each cache analysis block can be found in the specified
set and whether the valid bit for the found tag is set.
   If the tag can be found and the valid bit is set, the block is already
cached (cache hit) and no additional cycles are needed. Only the LRU
information has to be renewed.
   In all other cases, the LRU information has to be used to determine which
tag has to be overwritten. After that, the new tag has to be written instead
of the found old one, and the valid bit for this tag has to be set. The LRU
information has to be renewed as well. In a last step, the additional cycles
are returned and added to the cycle correction counter.

2.4.3     Consideration of task switches
   In modern embedded systems, software performance simulation has to deal
with task switching and multiple interrupts. Cooperative task scheduling can
already be handled by the previously mentioned approach since the presented
cache model is able to cope with non-preemptive task switches. Interrupts as
well as cooperative and non-preemptive task scheduling can be handled
similarly because task preemption is usually implemented by using software
interrupts. Therefore, the incorporation of interrupts is discussed in the
following.
   Software interrupts had to be included in the SystemC model. This has
been achieved by automatic insertion of dedicated preemption points after
cycle calculation. This approach allows the integration of different
user-defined task scheduling policies; a task switch generates a software
interrupt. Since cycle calculation is completed before a task switch is
executed and a global cache and branch prediction model is used, no other
changes are necessary. A minor deviation of the cycle count of certain
processes can occur because the actual task switch is carried out with a
small delay caused by the projection of task preemption at binary code level
to C/C++ source code level. Nevertheless, the total cycle count is still
correct. The accuracy can be increased by inserting cycle calculation code
after each C/C++ statement.
   If the additional delay caused by the context switch itself has to be
included, the (binary) code of the context switch routine can be treated
like any other code.

3.    EXPERIMENTAL RESULTS
   In order to test the execution speed and the accuracy of the translated
code, a few examples were compiled using a C compiler into object code for
the Infineon TriCore processor [7]. This object code was also used to
generate annotated SystemC code from the C code as described in Section 2.1.
As reference, the execution speed and the cycle count of the TriCore code
have been measured on a TriCore TC10GP evaluation board and on a TriCore
ISS [8].
   The examples consist of two filters (fir, ellip) and two programs that
are part of audio decoding routines (dpcm, subband).

[Bar chart "Speed": million instructions per second for dpcm, fir, ellip and
subband; series: TriCore Evaluation Board, Annotated SystemC 1, Annotated
SystemC 2, TriCore ISS]

          Figure 5: Comparison of speed

   Figure 5 shows the comparison of the execution speed of the generated
code with the execution speed of the TriCore evaluation board and the ISS.
The execution speed in this figure is given in millions of TriCore
instructions per second. The Athlon 64 processor running the SystemC code
and the ISS had a clock rate of 2.4 GHz. The TriCore processor of the
evaluation board ran at 48 MHz.
   For the annotated SystemC code, two different types of annotation have
been used: the first one generates the cycles after the execution of each
basic block, the second one adds cycles to a cycle counter after each basic
block. With the second type, the cycles are only generated when necessary
(e.g. when communication with the hardware takes place). This is much more
efficient, as depicted in Figure 5.
   The execution speed of the TriCore processor ranges from 36.8 to 50.8
million instructions per second, whereas the execution speed of the
annotated SystemC models with immediate cycle generation ranges from 3.5 to
5.7 million simulated TriCore instructions per second. This means that the
SystemC model is only about ten times slower than a real processor. The
execution speed of the annotated SystemC code with on-demand cycle
generation ranges from 11.2 to 149.9 million TriCore instructions per
second.
   In order to compare the SystemC execution speed with the execution speed
of a conventional ISS, the same examples were run using the TriCore ISS. The
result was an execution speed ranging from 1.5 to 2.4 million instructions
per second. This means our approach delivers a speed increase of up to 91%.
   A comparison of the number of simulated cycles of the generated SystemC
code using branch prediction and cache simulation with the number of
executed cycles of the TriCore evaluation board is shown in Figure 6. The
deviation of the cycle counts of the translated programs (with branch
prediction and caches included) from the cycle count measured on the
evaluation board ranges from 4% for the program fir to 7% for the program
dpcm. This is in the same range as with a conventional ISS.
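
   The second annotation type, in which cycles are only accumulated after
each basic block and generated on demand, can be illustrated with a small C
sketch. The names add_cycles, generate_pending_cycles, pending_cycles,
simulated_time and sync_count are hypothetical, and simulated_time merely
stands in for the simulation clock; in a SystemC model the process would
wait for the corresponding time at this point instead.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t pending_cycles = 0;  /* accumulated, not yet generated     */
static uint64_t simulated_time = 0;  /* stand-in for the simulation clock  */
static unsigned sync_count     = 0;  /* how often time was really advanced */

/* Called at the end of every basic block: only a counter update. */
void add_cycles(uint32_t static_cycles, uint32_t correction_cycles)
{
    pending_cycles += (uint64_t)static_cycles + correction_cycles;
}

/* Called only when it is necessary, e.g. before communication with the
 * hardware model. Advances the simulated time by all pending cycles. */
void generate_pending_cycles(void)
{
    if (pending_cycles == 0) {
        return;                      /* nothing to generate */
    }
    assert(simulated_time + pending_cycles >= simulated_time); /* no overflow */
    simulated_time += pending_cycles;
    pending_cycles = 0;
    ++sync_count;
}
```

With immediate generation, the time would be advanced once per basic block;
here it is advanced once per synchronization point, which is one plausible
reason for the higher simulation speed of the on-demand variant.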




[Bar chart "Accuracy": cycles for dpcm, fir, ellip and subband; series:
TriCore Evaluation Board, Annotated SystemC 2, TriCore ISS]

          Figure 6: Comparison of cycle-accuracy

4.         OUTLOOK
   As clock frequencies cannot be increased as linearly as the number of
cores, modern processor architectures can consist of more than one core to
fulfill computational demands. The different cores can share architectural
resources such as data caches to speed up access to common data. Therefore,
access conflicts and coherency protocols have a potential impact on the
run-times of the tasks running on the cores.
   The incorporation of multiple cores is directly supported by our SystemC
approach. Parallel tasks can easily be assigned to different cores, and the
code instrumentation by cycle information can be carried out independently.
However, shared caches can have a significant impact on the number of
executed cycles. This can be solved by inclusion of a shared cache model
that executes global cache coherence protocols like the MESI protocol. A
cycle calculation after each C/C++ statement is strongly recommended here to
increase the accuracy.

5.         CONCLUSIONS
   This paper presented a hybrid approach for high-performance timing
simulation of embedded software. The approach was implemented in an
automated design flow. The methodology is based on the generation of SystemC
code out of the original C code and back-annotation of statically determined
cycle information into the generated code. Additionally, the impact of data
dependencies on the software run-time is analytically handled during
simulation. Promising experimental results from the application of the
implemented design flow were presented. These results show a high execution
performance of the timed embedded software model as well as good accuracy.
Furthermore, the created SystemC models representing the timed embedded
software could be easily integrated in SystemC virtual prototypes due to
generated TLM interfaces.

6.         REFERENCES
 [1] K. Albers, F. Bodmann, and F. Slomka. Hierarchical Event Streams
     and Event Dependency Graphs: A New Computational Model for
     Embedded Real-Time Systems. In Proceedings of the 18th Euromicro
     Conference on Real-Time Systems (ECRTS), pages 97–106, 2006.
 [2] C. Cifuentes. Reverse Compilation Techniques. PhD thesis,
     Queensland University of Technology, 19. Nov. 1994.
 [3] CoWare Inc. CoWare Processor Designer.
     http://www.coware.com/PDF/products/ProcessorDesigner.pdf.
 [4] A. Donlin. Transaction Level Modeling: Flows and Use Models. In
     Proceedings of the 2nd IEEE/ACM/IFIP International Conference on
     Hardware/Software Codesign and System Synthesis (CODES+ISSS),
     pages 75–80, 2004.
 [5] R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst.
     System Level Performance Analysis – the SymTA/S Approach. IEE
     Proceedings Computers and Digital Techniques, 152(2):148–166,
     March 2005.
 [6] IEEE Computer Society. IEEE Standard SystemC® Language
     Reference Manual, Mar. 2006.
 [7] Infineon Technologies AG. TC10GP Unified 32-bit
     Microcontroller-DSP – User’s Manual, 2000.
 [8] Infineon Technologies Corp. TriCore™ 32-bit Unified Processor
     Core – Volume 1: v1.3 Core Architecture, 2005.
 [9] S. Kraemer, L. Gao, J. Weinstock, R. Leupers, G. Ascheid, and
     H. Meyr. HySim: A Fast Simulation Framework for Embedded
     Software Development. In Proceedings of the 5th IEEE/ACM
     International Conference on Hardware/Software Codesign and
     System Synthesis (CODES+ISSS), pages 75–80, 2007.
[10] S. Künzli, F. Poletti, L. Benini, and L. Thiele. Combining Simulation
     and Formal Methods for System-Level Performance Analysis. In
     Proceedings of the Design, Automation and Test in Europe (DATE)
     Conference, pages 236–241, 2006.
[11] S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park,
     H. Shin, K. Park, S.-M. Moon, and C. S. Kim. An Accurate Worst
     Case Timing Analysis for RISC Processors. IEEE Transactions on
     Software Engineering, 21(7):593–604, 1995.
[12] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and
     A. Hoffmann. A Universal Technique for Fast and Flexible
     Instruction-Set Architecture Simulation. In Proceedings of the 39th
     Design Automation Conference (DAC), pages 22–27, 2002.
[13] OPNET Technologies, Inc. http://www.opnet.com.
[14] M. Oyamada, F. Wagner, M. Bonaciu, W. Cesário, and A. Jerraya.
     Software Performance Estimation in MPSoC Design. In Proceedings
     of the 12th Asia and South Pacific Design Automation Conference
     (ASP-DAC), pages 38–43, 2007.
[15] K. Richter, M. Jersak, and R. Ernst. A Formal Approach to MpSoC
     Performance Verification. Computer, 36(4):60–67, 2003.
[16] G. Schirner, A. Gerstlauer, and R. Dömer. Abstract, Multifaceted
     Modeling of Embedded Processors for System Level Design. In
     Proceedings of the 12th Asia and South Pacific Design Automation
     Conference (ASP-DAC), pages 384–389, 2007.
[17] J. Schnerr, O. Bringmann, and W. Rosenstiel. Cycle Accurate Binary
     Translation for Simulation Acceleration in Rapid Prototyping of
     SoCs. In Proceedings of the Design, Automation and Test in Europe
     (DATE) Conference, pages 792–797, 2005.
[18] A. Siebenborn, A. Viehl, O. Bringmann, and W. Rosenstiel.
     Control-Flow Aware Communication and Conflict Analysis of
     Parallel Processes. In Proceedings of the 12th Asia and South Pacific
     Design Automation Conference (ASP-DAC), pages 32–37, 2007.
[19] Synopsys, Inc. Synopsys Virtual Platforms.
     http://www.synopsys.com/products/designware/virtual_platforms.html.
[20] L. Thiele, S. Chakraborty, and M. Naedele. Real-time Calculus for
     Scheduling Hard Real-Time Systems. In IEEE International
     Symposium on Circuits and Systems (ISCAS), volume 4, pages
     101–104, 2000.
[21] VaST Systems Technology. CoMET®.
     http://www.vastsystems.com/docs/CoMET_mar2007.pdf.
[22] A. Viehl, M. Schwarz, O. Bringmann, and W. Rosenstiel.
     Probabilistic Performance Risk Analysis at System-Level. In
     Proceedings of the 5th IEEE/ACM International Conference on
     Hardware/Software Codesign and System Synthesis (CODES+ISSS),
     pages 185–190, 2007.
[23] A. Viehl, T. Schönwald, O. Bringmann, and W. Rosenstiel. Formal
     Performance Analysis and Simulation of UML/SysML Models for
     ESL Design. In Proceedings of the Design, Automation and Test in
     Europe (DATE) Conference, pages 242–247, 2006.
[24] T. Wild, A. Herkersdorf, and G.-Y. Lee. TAPES – Trace-Based
     Architecture Performance Evaluation with SystemC. Design
     Automation for Embedded Systems, 10(2–3):157–179, Sept. 2005.




LabVIEW - Teaching Aid for Process ControlIDES Editor
 
Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...
Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...
Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...IRJET Journal
 
Enterprise performance engineering solutions
Enterprise performance engineering solutionsEnterprise performance engineering solutions
Enterprise performance engineering solutionsInfosys
 
An Algorithm Based Simulation Modeling For Control of Production Systems
An Algorithm Based Simulation Modeling For Control of Production SystemsAn Algorithm Based Simulation Modeling For Control of Production Systems
An Algorithm Based Simulation Modeling For Control of Production SystemsIJMER
 
IRJET- Sketch-Verse: Sketch Image Inversion using DCNN
IRJET- Sketch-Verse: Sketch Image Inversion using DCNNIRJET- Sketch-Verse: Sketch Image Inversion using DCNN
IRJET- Sketch-Verse: Sketch Image Inversion using DCNNIRJET Journal
 
Giddings
GiddingsGiddings
Giddingsanesah
 
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...IDES Editor
 
SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...
SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...
SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...IJSEA
 
IRJET - Virtual Mechanisms
IRJET - Virtual MechanismsIRJET - Virtual Mechanisms
IRJET - Virtual MechanismsIRJET Journal
 
Software Process Models
 Software Process Models  Software Process Models
Software Process Models MohsinAli773
 
ScriptRock Robotics Testing
ScriptRock Robotics TestingScriptRock Robotics Testing
ScriptRock Robotics TestingCloudCheckr
 

Similar to High-Performance Timing Simulation of Embedded Software (20)

Software performance simulation strategies for high-level embedded system design
Software performance simulation strategies for high-level embedded system designSoftware performance simulation strategies for high-level embedded system design
Software performance simulation strategies for high-level embedded system design
 
Performancepredictionforsoftwarearchitectures 100810045752-phpapp02
Performancepredictionforsoftwarearchitectures 100810045752-phpapp02Performancepredictionforsoftwarearchitectures 100810045752-phpapp02
Performancepredictionforsoftwarearchitectures 100810045752-phpapp02
 
Server Emulator and Virtualizer for Next-Generation Rack Servers
Server Emulator and Virtualizer for Next-Generation Rack ServersServer Emulator and Virtualizer for Next-Generation Rack Servers
Server Emulator and Virtualizer for Next-Generation Rack Servers
 
SSM White Paper NOV-2010
SSM White Paper NOV-2010SSM White Paper NOV-2010
SSM White Paper NOV-2010
 
Performancetestingbasedontimecomplexityanalysisforembeddedsoftware 1008150404...
Performancetestingbasedontimecomplexityanalysisforembeddedsoftware 1008150404...Performancetestingbasedontimecomplexityanalysisforembeddedsoftware 1008150404...
Performancetestingbasedontimecomplexityanalysisforembeddedsoftware 1008150404...
 
The real time publisher subscriber inter-process communication model for dist...
The real time publisher subscriber inter-process communication model for dist...The real time publisher subscriber inter-process communication model for dist...
The real time publisher subscriber inter-process communication model for dist...
 
LabVIEW - Teaching Aid for Process Control
LabVIEW - Teaching Aid for Process ControlLabVIEW - Teaching Aid for Process Control
LabVIEW - Teaching Aid for Process Control
 
Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...
Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...
Predicting Machine Learning Pipeline Runtimes in the Context of Automated Mac...
 
Enterprise performance engineering solutions
Enterprise performance engineering solutionsEnterprise performance engineering solutions
Enterprise performance engineering solutions
 
An Algorithm Based Simulation Modeling For Control of Production Systems
An Algorithm Based Simulation Modeling For Control of Production SystemsAn Algorithm Based Simulation Modeling For Control of Production Systems
An Algorithm Based Simulation Modeling For Control of Production Systems
 
IRJET- Sketch-Verse: Sketch Image Inversion using DCNN
IRJET- Sketch-Verse: Sketch Image Inversion using DCNNIRJET- Sketch-Verse: Sketch Image Inversion using DCNN
IRJET- Sketch-Verse: Sketch Image Inversion using DCNN
 
Atva05
Atva05Atva05
Atva05
 
Giddings
GiddingsGiddings
Giddings
 
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
Multilevel Hybrid Cognitive Load Balancing Algorithm for Private/Public Cloud...
 
SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...
SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...
SIMULATION-BASED APPLICATION SOFTWARE DEVELOPMENT IN TIME-TRIGGERED COMMUNICA...
 
IRJET - Virtual Mechanisms
IRJET - Virtual MechanismsIRJET - Virtual Mechanisms
IRJET - Virtual Mechanisms
 
Tapp11 presentation
Tapp11 presentationTapp11 presentation
Tapp11 presentation
 
Software Process Models
 Software Process Models  Software Process Models
Software Process Models
 
p850-ries
p850-riesp850-ries
p850-ries
 
ScriptRock Robotics Testing
ScriptRock Robotics TestingScriptRock Robotics Testing
ScriptRock Robotics Testing
 

More from Mr. Chanuwan

High level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesHigh level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesMr. Chanuwan
 
Runtime performance evaluation of embedded software
Runtime performance evaluation of embedded softwareRuntime performance evaluation of embedded software
Runtime performance evaluation of embedded softwareMr. Chanuwan
 
Application scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system designApplication scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system designMr. Chanuwan
 
Software Architectural low energy
Software Architectural low energySoftware Architectural low energy
Software Architectural low energyMr. Chanuwan
 
A system for performance evaluation of embedded software
A system for performance evaluation of embedded softwareA system for performance evaluation of embedded software
A system for performance evaluation of embedded softwareMr. Chanuwan
 
Embedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded softwareEmbedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded softwareMr. Chanuwan
 
Model-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A SurveyModel-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A SurveyMr. Chanuwan
 

More from Mr. Chanuwan (7)

High level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesHigh level programming of embedded hard real-time devices
High level programming of embedded hard real-time devices
 
Runtime performance evaluation of embedded software
Runtime performance evaluation of embedded softwareRuntime performance evaluation of embedded software
Runtime performance evaluation of embedded software
 
Application scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system designApplication scenarios in streaming oriented embedded-system design
Application scenarios in streaming oriented embedded-system design
 
Software Architectural low energy
Software Architectural low energySoftware Architectural low energy
Software Architectural low energy
 
A system for performance evaluation of embedded software
A system for performance evaluation of embedded softwareA system for performance evaluation of embedded software
A system for performance evaluation of embedded software
 
Embedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded softwareEmbedded architect a tool for early performance evaluation of embedded software
Embedded architect a tool for early performance evaluation of embedded software
 
Model-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A SurveyModel-Based Performance Prediction in Software Development: A Survey
Model-Based Performance Prediction in Software Development: A Survey
 

Recently uploaded

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

High-Performance Timing Simulation of Embedded Software

17.1 High-Performance Timing Simulation of Embedded Software ∗

Jürgen Schnerr (2), Oliver Bringmann (1), Alexander Viehl (1) and Wolfgang Rosenstiel (1,2)
(1) FZI Forschungszentrum Informatik, Haid-und-Neu-Str. 10–14, 76131 Karlsruhe, Germany
(2) Universität Tübingen, Sand 13, 72076 Tübingen, Germany
{jschnerr, bringmann, viehl, rosenstiel}@fzi.de

∗ This work has been partially supported by the BMBF project VISION under grant number 01M3078B and by the DFG projects CaNoC under grant number BR 2321/1-1 and ASOC under grant number RO 1030/14-1 and 2.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 2008, June 8–13, 2008, Anaheim, California, USA
Copyright 2008 ACM 978-1-60558-115-6/08/0006...5.00

ABSTRACT
This paper presents an approach for cycle-accurate simulation of embedded software by integration in an abstract SystemC model. Compared to existing simulation-based approaches, we present a hybrid method that resolves performance issues by combining the advantages of simulation-based and analytical approaches. In a first step, cycle-accurate static execution time analysis is applied at each basic block of a cross-compiled binary program using static processor models. After that, the determined timing information is back-annotated into SystemC for fast simulation of all effects that cannot be resolved statically. This allows the consideration of data dependencies during run-time and the incorporation of branch prediction and cache models by efficient source code instrumentation. The major benefit of our approach is that the generated code can be executed very efficiently on the simulation host with approximately 90% of the speed of the untimed software without any code instrumentation.

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids; C.4 [Performance of Systems]: Measurement Techniques

General Terms
Design, Measurement, Performance

Keywords
Software Timing, Simulation Acceleration, Virtual Prototypes

1. INTRODUCTION
In the future, new system functionality will be realized less by the sum of single components, and more by cooperation, interconnection and distribution of these components, thereby leading to distributed embedded systems. Also, new applications and innovations arise more and more from a distribution of functionality as well as from the combination of previously independent functions. Therefore, in the future this distribution will play an important part in the increase of the product value.

The system responsibility of the supplier is also currently increasing. This means that the supplier will not only be responsible for the designed subsystem, but additionally for the integration of the subsystem in the context of the entire system. This integration is becoming more complex: today, requirements of single components are validated; in future, the requirement validation of the entire system has to be achieved with regard to the designed component.

What this means is that changes in the product area will lead to a paradigm change in design. Even in the design stage, the impact of a component on an entire system has to be considered. A comprehensive modeling of distributed systems, and an early analysis and simulation of the system integration, have to be considered.

Therefore, a methodical design process of distributed embedded systems must be applied. This can be implemented using a comprehensive modeling of distributed systems and using platform independent development of the application software (UML, Matlab/Simulink, C++).

What is also important is the early inclusion of the intended target platform in model based system design (UML), the mapping of function blocks on platform components, and the use of virtual prototypes for the abstract modeling of the target architecture.

An early evaluation of the target platform means that the application software can be evaluated with consideration to the target platform. Hence, an optimization of the target platform with consideration given to the application software, performance requirements, power dissipation and reliability can take place.

An early analysis of the system integration also means early verification based on virtual prototypes and exposure of integration faults using virtual prototypes. After that, a seamless transition to the real prototype can take place.

1.1 Performance Analysis of Distributed Embedded Systems
The main question of performance analysis of distributed embedded systems is: What is the timing behavior of a system and how can it be determined? The central issue is that computation has no timing behavior as long as the target platform is unknown because the target platform has a major effect on timing.

But nevertheless, the specification can contain global performance requirements. The fulfillment of global performance requirements depends on local timing behavior of system parts. A solution for this problem is an early inclusion of the target architecture.

Several analytical and simulative approaches for performance analysis have previously been proposed. In this paper a hybrid approach for performance analysis will be presented. This hybrid approach combines the advantages of analytical and simulative approaches.

Analytical approaches are based on a formal analysis of the corner cases based on a system model. The approaches can be divided into two categories: black-box approaches and white-box approaches. Furthermore, both approaches can be categorized depending on the level of system abstraction and with regard to the used model of computation.

Black-box approaches consider functional system components as black boxes and abstract from their internal behavior. This can be an abstract task model [15] with task activation and event streams on task level representing activation patterns. Using event stream propagation, the determination of a fix-point takes place. For this, no modification of the event streams has to be done. Examples for black-box approaches are the real-time calculus [20] and the system-level composition by event stream propagation as it is used in SymTA/S [5].

White-box approaches include an abstract control flow representation of each process into the system model. Then a global performance and communication analysis considering (data-dependent) control structures of all processes can take place. For this analysis, an extraction of the control flow from application software or UML models [23] needs to be done. Then, an environment modeling using event models or processes can take place. Examples for white-box approaches are the communication dependency analysis [18] and the control flow based extraction of hierarchical event streams [1].

Analytical approaches that only rely on best-case and worst-case create results that are often too pessimistic, such that risk estimation for real scenarios is difficult to carry out. Different analytic approaches try to tackle this issue by considering probabilities of timing quantities in white-box system analysis.

A hybrid approach to combining simulation and formal analysis for tightening bounds of system level performance analysis was presented in [10]. The objectives are to determine timing characteristics of not formally specified components by simulation and to integrate simulation results into a framework for formal performance analysis. In comparison to this approach, we focus on a fast timing simulation of embedded software. The results determined using our approach might be included in system level performance methodologies with the benefit of high accuracy and time savings in the simulation stage.

Analytic performance risk quantification based on profiled execution-times is presented in [22]. The model is derived from real implementations. Although it is able to represent the temporal behavior of communication, computation and synchronization, data-dependent timing effects cannot be detected thoroughly.

Simulative approaches perform a simulation of the entire communication infrastructure and the processing elements. If necessary, this simulation includes hardware IP.

A simulation of a network between communicating C/C++ processes can be done using a network simulator such as OPNET [13], Simulink or SystemC [6]. Timing annotation of such a network simulation is possible, but the exact timing behavior of the software is missing. To obtain this timing behavior, it is necessary to do a simulation of the software execution on the target processor. For this simulation, the binary code for the target platform component is required.

This binary code can run in an instruction set simulator (ISS). An ISS implements an abstract modeling of the instruction execution. It does not consider bus behavior modeling. Such an ISS can be implemented either as interpreter or as binary code translator. Binary code translation can be implemented in two different ways: either as static or as dynamic compilation, also called just-in-time (JIT) compilation [12]. An ISS is used in several commercial solutions, like the CoWare Processor Designer [3], CoMET from VaST Systems Technology [21] or Synopsys Virtual Platforms [19].

Furthermore, the binary code can be executed using a processor model that includes the complete processor (functional units, pipelines, caches, register, counter, etc.). Such a model can have several levels of accuracy. For example, it can be a transaction level model or a register transfer model.

In addition to the simulation of the processor, a simulation of peripheral components and custom hardware must be done, either as a co-simulation with HDL simulators or using SystemC.

An abstract processor model with an integrated RTOS model using task scheduling was presented in [16]. Additionally, a processor model using neural networks for execution cycle estimation was presented in [14]. A transaction level approach for the performance evaluation of SoC architectures was presented in [24].

A hybrid model for the fast simulation that allows switching between native code execution and ISS-based simulation was presented in [9]. Another approach using a hybrid model was shown in [17]. This approach is based on the translation of object code into annotated binary code for the target processor. For the cycle accurate execution of the annotated code on this processor, special hardware is needed.

2. PROPOSED HYBRID APPROACH
Here we will present a hybrid approach for performance simulation of embedded software. Hybrid approaches consist of a combination of analytic and simulative approaches with the objective of gaining simulation speed without remarkable loss of accuracy.

Static worst case/best case execution time (WCET/BCET) analysis abstracts the influence of data dependencies on the software execution time. Due to this, BCET/WCET analysis delivers very good results of entire basic blocks, but it is too pessimistic across basic block boundaries. Further, the effects of concurrent cache usage of different applications on multi-core architectures lead to even wider bounds. An analytic solution for this issue is still unknown. The objective of the presented approach is the reduction of pessimism that is contained in WCET/BCET boundaries.

Simulative techniques that consider an application with concrete input data and the target architecture can be used to determine the timing behavior of software on the underlying architecture. The proposed approach tries to prevent repeated time-consuming interpretation and repeated timing determination of all executed binary code instructions on the target architecture.

The hybrid approach provided in this paper applies back-annotation of WCET/BCET values. These values are determined statically on basic block level in binary code that was generated from C source code. Additionally, the timing impact of data-dependent architectural properties like branch prediction is also considered effectively. The tool that implements the proposed methodology generates SystemC code. This code can be compiled for any host machine to be used afterward for a target platform independent simulation.

Communication calls in the automatically created SystemC models are encapsulated in TLM [4] communication primitives. This way, a clean and standardized ability to integrate timed embedded software in virtual SystemC prototypes is given.

One major advantage of the presented methodology is in the area
of multi-core processors with shared caches. Whereas static analysis has no knowledge of concurrent cache usage of different applications and its impact on execution time, the presented methodology is able to handle these issues. How this is done will be described in more detail in Section 2.4.3.

Another possibility would be a translation of binary code into annotated SystemC code. However, this approach has some major disadvantages. One main drawback is, for example, that the same problems that have to be solved in static compilation (binary translation) have to be solved here (e.g. addresses of calculated branch targets have to be determined). Another disadvantage is that the automatically generated code is not easily read by humans.

2.1 Back-annotation of WCET/BCET values

In this section, we will describe our approach in more detail. Figure 1 shows an overview of this approach.

First, the C source code is translated by an ordinary C (cross-)compiler into binary code for the embedded processor (source processor). After that, our back-annotation tool reads the object file and a description of the source processor used. This description contains information about the instruction set, the pipelines, the caches and the branch prediction of the source processor. Using this description, the object code is decoded and translated into an intermediate representation consisting of a list of objects. Each of these objects represents one intermediate instruction.

In a next step, the basic blocks of this program are determined using the list containing the translated program. As a result, a list of basic blocks is built.

After that, the number of cycles each basic block would have taken on the source processor is calculated statically using the pipeline description. This calculation is described in more detail in Section 2.3.

Subsequently, the back-annotation correspondences between the C source code and the binary code are identified. Then, the back-annotation takes place. This is done by inserting code for the cycle generation and for the dynamic correction of the cycle generation. The structure and functionality of this code are described in Section 2.2.

Not every impact of the processor architecture on the number of cycles can be predicted statically. Therefore, if dynamic, data-dependent effects (e.g. branch prediction and caches) have to be taken into account, additional code needs to be added. Further details concerning this code are described in Section 2.4.

During the back-annotation, the C program is transformed into a cycle-accurate SystemC program that can be compiled on the target processor.

One advantage of this approach is the fast execution of the annotated code, as the C source code does not need major changes for the annotation. Also, the generated SystemC code can easily be used within a SystemC simulation environment. One difficulty in using this approach is that finding the parts of the C source code that correspond to parts of the binary code is hard if the compiler optimizes or changes the structure of the binary code too much. If this happens, recompilation techniques [2] have to be used to find the correspondences.

Figure 1: Back-annotation of WCET/BCET values
(Flow diagram. Labels: C Source Code, C Compiler, Binary Code, Processor Description; Back-annotation Tool with the steps Analysis Source Code, Analysis Binary Code (construction of intermediate representation, building of basic blocks, static cycle calculation), find correspondences between C source code and binary code, Back-annotation (insertion of cycle generation code, insertion of dynamic correction code); result: Annotated SystemC Program.)

2.2 Annotation of the SystemC code

On the left side of Figure 2 there is the annotation of a piece of C code that corresponds to a basic block. The right side of this figure shows the cache model and the branch prediction model that are used during run-time.

At the beginning of the annotated basic block, the annotation tool adds a cycle counter c that contains the statically determined number of cycles this basic block would use on the source processor. How this number is calculated is described in more detail in Section 2.3. In modern processor architectures, the impact of the processor architecture on the number of executed cycles cannot be completely predicted statically. Especially the branch prediction and the caches of a processor have a significant impact on the number of used cycles. Therefore, the statically determined number of cycles has to be corrected dynamically. The division of the basic block for the calculation of additional cycles of instruction cache misses, as shown in Figure 2, is explained in Section 2.4.2.

If there is a conditional branch at the end of a basic block, branch prediction has to be considered and possible correction cycles have to be added. This is described in more detail in Section 2.4.1.

As shown in Figure 2, the back-annotation tool adds a call to a function that performs cycle generation at the end of each basic block. This call generates the number of cycles this basic block would need on the source processor.

    Annotation of C code for a basic block:

        c = statically predicted number of cycles;
        <C code corresponding to the cache analysis blocks of the basic block>
        c = c + cycleCalculationICache(tag, iStart, iEnd);
        c = c + cycleCalculationForConditionalBranch();
        generateCycles(c);

    Architectural model: Cache Model, Branch Prediction Model

Figure 2: General principle for a basic block annotation

In order to guarantee both the fastest possible execution of the code and the highest possible accuracy, it is possible to
choose different accuracy levels of the generated code by parameterizing the annotation tool. The first and fastest one is a purely static prediction. The second one additionally includes modeling of the branch prediction. The third one also takes the dynamic inclusion of instruction caches into account.

The cycle calculation in these different detail levels will be discussed in more detail in the following text.

2.3 Static cycle calculation of a basic block

In modern architectures, pipeline effects, superscalarity and caches have an important impact on the execution time. Therefore, calculating the execution time of a basic block by summing up the execution or latency times of the single instructions of this block is too inaccurate.

For this reason, the incorporation of a pipeline model per basic block becomes necessary [11]. This model helps to predict pipeline effects and the effects of superscalarity statically. For the generation of this model, information about the instruction set and the pipelines of the used processor is needed. This information is contained in the processor description that is used by the annotation tool. Based on it, the tool can determine which instructions of the basic block will be executed in parallel on a superscalar processor and which combinations of instructions in the basic block will cause pipeline stalls.

With the information gained by the basic block modeling, a prediction is carried out. This prediction determines the number of cycles the basic block would have needed on the source processor.

The next section will show how this kind of prediction is improved during run-time, and how a cache model is included.

2.4 Dynamic correction of cycle prediction

As previously described, the actual cycle count a processor needs for executing a sequence of instructions cannot be predicted correctly in all cases. This is the case if, for example, a conditional branch at the end of a basic block produces a pipeline flush or if additional delays occur due to cache misses in instruction caches. The combination of static analysis and dynamic execution provides a well-suited solution for that problem, since statically unpredictable effects of branch and cache behavior can be determined during execution. This is done by inserting appropriate function calls into the translated basic blocks. These calls interact with the architectural model in order to determine the additional number of cycles caused by mispredicted branch and cache behavior. At the end of each basic block, the previously calculated cycles (static cycles plus correction cycles) are generated (Figure 2).

2.4.1 Branch prediction

Conditional branches have different cycle times depending on four different cases resulting from the combination of predicted and mispredicted branches, as well as taken and non-taken branches. A correctly predicted branch needs fewer cycles for execution than a mispredicted one. Furthermore, additional cycles can be needed if a correctly predicted branch is taken. This problem is solved by implementing a model of the branch prediction and by comparing the predicted branch behavior with the executed branch behavior. If dynamic branch prediction is used, a model of the underlying state machine is implemented and its results are compared with the executed branch behavior. The cycle count of each possible case is calculated and added to the cycle count for the entire basic block before the next basic block is entered.

2.4.2 Instruction cache

Figure 2 shows that, for the simulation of the instruction cache, every basic block of the translated program has to be divided into several cache analysis blocks. This has to be done until the tag changes or the basic block ends. After that, a function call to the cache handling model is added. This code uses a cache model to find out possible cache hits or misses.

The cache simulation will be explained in more detail in the next few paragraphs. This explanation will start with a description of the cache model.

Figure 3: Correspondence C – assembler – cache line
(Diagram with three columns. C Program: C_stmnt1 to C_stmnt4, followed by cycleCalcICache. Binary Code: asm_inst 1 to asm_inst l, asm_inst l+1 to asm_inst 2l, asm_inst 2l+1 to asm_inst n. Cache Model: fields v, tag, lru, data.)

The cache model

The cache model, as can be seen on the right side of Figure 3, contains data space that is used for the administration of the cache. In this space, the valid bit, the cache tag and the least recently used (lru) information (representing the replacement strategy) for each cache set are saved during run-time.

The number of cache tags and the corresponding number of valid bits that are needed depend on the associativity of the cache (e.g. for a two-way set associative cache, two sets of tags and valid bits are needed).

Cache analysis blocks

In the middle of Figure 3, the C source code corresponding to a basic block is divided into several smaller blocks, the so-called cache analysis blocks. These blocks are needed for the consideration of the effects of instruction caches. Each one of these blocks contains that part of a basic block that fits into a single cache line. As every machine language instruction in such a cache analysis block has the same tag and the same cache index, the addresses of the instructions can be used to determine how a basic block has to be divided into cache analysis blocks. This is because each address consists of tag information and cache index.

The cache index information (iStart to iEnd in Figure 2) is used to determine at which cache position the instruction with this address is cached. The tag information is used to determine which address was cached, as there can be multiple addresses with the same cache index. Therefore, a changed cache tag can easily be determined during the traversal of the binary code with respect to the cache parameters. The block offset information is not needed for the cache simulation, as no real caching of data takes place.

After the tag has changed or at the end of a basic block, a function call that handles the simulated cache and the calculation of the additional cycles of cache misses is added to this block. More details about this function are described in the next section.

Cycle calculation code

As previously mentioned, each cache analysis block is characterized by a combination of tag and cache set index information. At the end of each basic block, a call to a function is included. During run-time, this function should determine whether the different cache analysis blocks which the basic block consists of are in the simulated cache or not. This way, cache misses are detected.

The function is shown in Figure 4. It has the tag and the range of cache set indices (iStart to iEnd) as parameters.
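As a rough illustration of such a function, the following C++ sketch models a set-associative instruction cache with lru replacement. It is not the implementation used by the annotation tool: the names ICacheConfig, CacheWay and ICacheModel, the miss penalty value, and the use of a running counter as lru information are assumptions made for this example. The sketch also assumes that the miss penalties of all sets in the range iStart to iEnd are summed up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache parameters; a real tool would read them from the
// processor description.
struct ICacheConfig {
    std::size_t numSets;   // number of cache sets (cache indices)
    std::size_t ways;      // associativity
    int missPenalty;       // additional cycles per instruction cache miss
};

// Administrative data for one way of one set: valid bit, tag, lru stamp.
// No instruction data is stored, since only hits and misses are simulated.
struct CacheWay {
    bool valid = false;
    std::uint32_t tag = 0;
    std::uint64_t lastUse = 0;
};

class ICacheModel {
public:
    explicit ICacheModel(ICacheConfig cfg)
        : cfg_(cfg), sets_(cfg.numSets, std::vector<CacheWay>(cfg.ways)) {}

    // Returns the additional cycles caused by misses when the cache
    // analysis block (tag, iStart..iEnd) is fetched.
    int cycleCalculationICache(std::uint32_t tag, std::size_t iStart,
                               std::size_t iEnd) {
        int additionalCycles = 0;
        for (std::size_t index = iStart; index <= iEnd; ++index) {
            additionalCycles += access(sets_[index % cfg_.numSets], tag);
        }
        return additionalCycles;
    }

private:
    int access(std::vector<CacheWay>& set, std::uint32_t tag) {
        ++clock_;
        for (CacheWay& way : set) {
            if (way.valid && way.tag == tag) {  // cache hit
                way.lastUse = clock_;           // renew lru information
                return 0;
            }
        }
        // Cache miss: prefer an invalid way, otherwise the least recently used.
        CacheWay* victim = &set[0];
        for (CacheWay& way : set) {
            if (!way.valid) { victim = &way; break; }
            if (way.lastUse < victim->lastUse) victim = &way;
        }
        victim->valid = true;      // set valid bit of written tag
        victim->tag = tag;         // write new tag
        victim->lastUse = clock_;  // renew lru information
        return cfg_.missPenalty;
    }

    ICacheConfig cfg_;
    std::vector<std::vector<CacheWay>> sets_;
    std::uint64_t clock_ = 0;  // monotonically increasing access counter
};
```

In an annotated basic block, the returned value would be added to the cycle counter c before the cycles are generated, for example c = c + cache.cycleCalculationICache(tag, iStart, iEnd).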
    int cycleCalculationICache( tag, iStart, iEnd )
    {
        for index = iStart to iEnd
            if tag is found in index and valid bit is set then
            {   // cache hit
                renew lru information
                return 0
            }
            else
            {   // cache miss
                use lru information to determine tag to overwrite
                write new tag
                set valid bit of written tag
                renew lru information
                return additional cycles needed for cache miss
            }
        end for
    }

Figure 4: Function for cache cycle correction

To find out if there is a cache hit or a cache miss, the function checks whether the tag of each cache analysis block can be found in the specified set and whether the valid bit for the found tag is set.

If the tag can be found and the valid bit is set, the block is already cached (cache hit) and no additional cycles are needed. Only the lru information has to be renewed.

In all other cases, the lru information has to be used to determine which tag has to be overwritten. After that, the new tag has to be written instead of the old one, and the valid bit for this tag has to be set. The lru information has to be renewed as well. In a last step, the additional cycles are returned and added to the cycle correction counter.

2.4.3 Consideration of task switches

In modern embedded systems, software performance simulation has to deal with task switching and multiple interrupts. Cooperative task scheduling can already be handled by the previously mentioned approach since the presented cache model is able to cope with non-preemptive task switches. Interrupts, cooperative, and non-preemptive task scheduling can be handled similarly because task preemption is usually implemented by using software interrupts. Therefore, the incorporation of interrupts is discussed in the following.

Software interrupts had to be included in the SystemC model. This has been achieved by automatic insertion of dedicated preemption points after cycle calculation. This approach allows the integration of different user-defined task scheduling policies, where a task switch generates a software interrupt. Since cycle calculation is completed before a task switch is executed and a global cache and branch prediction model is used, no other changes are necessary. A minor deviation of the cycle count at certain processes can occur because the actual task switch is carried out with a small delay caused by the projection of task preemption at binary code level to C/C++ source code level. Nevertheless, the total cycle count is still correct. The accuracy can be increased by inserting cycle calculation code after each C/C++ statement.

If the additional delay caused by the context switch itself has to be included, the (binary) code of the context switch routine can be treated like any other code.

3. EXPERIMENTAL RESULTS

In order to test the execution speed and the accuracy of the translated code, a few examples were compiled using a C compiler into object code for the Infineon TriCore processor [7]. This object code was also used to generate annotated SystemC code from the C code as described in Section 2.1. As reference, the execution speed and the cycle count of the TriCore code have been measured on a TriCore TC10GP evaluation board and on a TriCore ISS [8].

The examples consist of two filters (fir, ellip) and two programs that are part of audio decoding routines (dpcm, subband).

Figure 5: Comparison of speed
(Bar chart. Y-axis: million instructions per second, 0 to 160. X-axis: dpcm, fir, ellip, subband. Series: TriCore evaluation board, Annotated SystemC 1, Annotated SystemC 2, TriCore ISS.)

Figure 5 shows the comparison of the execution speed of the generated code with the execution speed of the TriCore evaluation board and the ISS. The execution speed in this figure is given in million TriCore instructions per second. The Athlon 64 processor running the SystemC code and the ISS had a clock rate of 2.4 GHz. The TriCore processor of the evaluation board ran at 48 MHz.

For the annotated SystemC code, two different types of annotation have been used: the first one generates the cycles after the execution of each basic block, the second one adds cycles to a cycle counter after each basic block. With the second type, the cycles are only generated when it is necessary (e.g. when communication with the hardware takes place). This is much more efficient and is depicted in Figure 5.

The execution speed of the TriCore processor ranges from 36.8 to 50.8 million instructions per second, whereas the execution speed of the annotated SystemC models with immediate cycle generation ranges from 3.5 to 5.7 million simulated TriCore instructions per second. This means that the SystemC model is only about ten times slower than a real processor. The execution speed of the annotated SystemC code with on-demand cycle generation ranges from 11.2 to 149.9 million TriCore instructions per second.

In order to compare the SystemC execution speed with the execution speed of a conventional ISS, the same examples were run using the TriCore ISS. The result was an execution speed ranging from 1.5 to 2.4 million instructions per second. This means our approach delivers a speed increase by a factor of up to 91.

A comparison of the number of simulated cycles of the generated SystemC code using branch prediction and cache simulation with the number of executed cycles of the TriCore evaluation board is shown in Figure 6. The deviation of the cycle counts of the translated programs (with branch prediction and caches included) compared to the measured cycle count from the evaluation board ranges from 4% for the program fir to 7% for the program dpcm. This is in the same range as with a conventional ISS.
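Returning to the two annotation types compared in Figure 5, their difference can be sketched in plain C++ without a SystemC kernel. Everything in this sketch is hypothetical: SimKernelStub stands in for the simulation kernel (in a SystemC model, advance would presumably map to a wait on simulated time), and the class and function names as well as the cycle numbers are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the simulation kernel. Each call to advance()
// represents one (comparatively expensive) interaction with the simulator.
struct SimKernelStub {
    std::uint64_t simulatedCycles = 0;  // simulated time in source-processor cycles
    std::uint64_t kernelCalls = 0;      // number of times simulated time was advanced
    void advance(std::uint64_t cycles) {
        simulatedCycles += cycles;
        ++kernelCalls;
    }
};

// Annotation type 1: generate the cycles after every basic block.
class ImmediateCycleGeneration {
public:
    explicit ImmediateCycleGeneration(SimKernelStub& kernel) : kernel_(kernel) {}
    void endOfBasicBlock(std::uint64_t c) { kernel_.advance(c); }
    void beforeCommunication() {}  // nothing is pending in this variant
private:
    SimKernelStub& kernel_;
};

// Annotation type 2: only add to a counter after every basic block and
// generate the accumulated cycles when it becomes necessary.
class OnDemandCycleGeneration {
public:
    explicit OnDemandCycleGeneration(SimKernelStub& kernel) : kernel_(kernel) {}
    void endOfBasicBlock(std::uint64_t c) { pending_ += c; }
    void beforeCommunication() {
        if (pending_ > 0) {
            kernel_.advance(pending_);
            pending_ = 0;
        }
    }
private:
    SimKernelStub& kernel_;
    std::uint64_t pending_ = 0;
};

// Executes a sequence of basic blocks (given by their cycle counts),
// followed by one hardware access, and returns the kernel state.
template <typename Policy>
SimKernelStub runBlocks(const std::vector<std::uint64_t>& blockCycles) {
    SimKernelStub kernel;
    Policy policy(kernel);
    for (std::uint64_t c : blockCycles) {
        policy.endOfBasicBlock(c);
    }
    policy.beforeCommunication();
    return kernel;
}
```

Both variants end up with the same simulated cycle count at the communication point; the on-demand variant simply interacts with the kernel less often, which is one plausible reason for the speed difference reported above.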
Figure 6: Comparison of cycle-accuracy
(Bar chart. Y-axis: cycles, 0 to 37500. X-axis: dpcm, fir, ellip, subband. Series: TriCore evaluation board, Annotated SystemC 2, TriCore ISS.)

4. OUTLOOK

As clock frequencies cannot be increased as linearly as the number of cores, modern processor architectures can consist of more than one core for fulfillment of computational demands. The different cores can share architectural resources such as data caches to speed up access to common data. Therefore, access conflicts and coherency protocols have a potential impact on the run-times of the tasks running on the cores.

The incorporation of multiple cores is directly supported by our SystemC approach. Parallel tasks can easily be assigned to different cores, and the code instrumentation by cycle information can be carried out independently. However, shared caches can have a significant impact on the number of executed cycles. This can be solved by inclusion of a shared cache model that executes global cache coherence protocols like the MESI protocol. A clock calculation after each C/C++ statement is strongly recommended here to increase the accuracy.

5. CONCLUSIONS

This paper presented a hybrid approach for high-performance timing simulation of embedded software. The shown approach was implemented in an automated design flow. The methodology is based on the generation of SystemC code out of the original C code and back-annotation of statically determined cycle information into the generated code. Additionally, the impact of data dependencies on the software run-time is analytically handled during simulation. Promising experimental results from the application of the implemented design flow were presented. These results show a high execution performance of the timed embedded software model as well as good accuracy. Furthermore, the created SystemC models representing the timed embedded software could be easily integrated in SystemC virtual prototypes due to generated TLM interfaces.

6. REFERENCES

[1] K. Albers, F. Bodmann, and F. Slomka. Hierarchical Event Streams and Event Dependency Graphs: A New Computational Model for Embedded Real-Time Systems. In Proceedings of the 18th Euromicro Conference on Real-Time Systems (ECRTS), pages 97–106, 2006.
[2] C. Cifuentes. Reverse Compilation Techniques. PhD thesis, Queensland University of Technology, 19. Nov. 1994.
[3] CoWare Inc. CoWare Processor Designer. http://www.coware.com/PDF/products/ProcessorDesigner.pdf.
[4] A. Donlin. Transaction Level Modeling: Flows and Use Models. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 75–80, 2004.
[5] R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst. System Level Performance Analysis – the SymTA/S Approach. IEE Proceedings Computers and Digital Techniques, 152(2):148–166, March 2005.
[6] IEEE Computer Society. IEEE Standard SystemC® Language Reference Manual, Mar. 2006.
[7] Infineon Technologies AG. TC10GP Unified 32-bit Microcontroller-DSP – User's Manual, 2000.
[8] Infineon Technologies Corp. TriCore™ 32-bit Unified Processor Core – Volume 1: v1.3 Core Architecture, 2005.
[9] S. Kraemer, L. Gao, J. Weinstock, R. Leupers, G. Ascheid, and H. Meyr. HySim: A Fast Simulation Framework for Embedded Software Development. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 75–80, 2007.
[10] S. Künzli, F. Poletti, L. Benini, and L. Thiele. Combining Simulation and Formal Methods for System-Level Performance Analysis. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pages 236–241, 2006.
[11] S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, S.-M. Moon, and C. S. Kim. An Accurate Worst Case Timing Analysis for RISC Processors. IEEE Transactions on Software Engineering, 21(7):593–604, 1995.
[12] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann. A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation. In Proceedings of the 39th Design Automation Conference (DAC), pages 22–27, 2002.
[13] OPNET Technologies, Inc. http://www.opnet.com.
[14] M. Oyamada, F. Wagner, M. Bonaciu, W. Cesário, and A. Jerraya. Software Performance Estimation in MPSoC Design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 38–43, 2007.
[15] K. Richter, M. Jersak, and R. Ernst. A Formal Approach to MpSoC Performance Verification. Computer, 36(4):60–67, 2003.
[16] G. Schirner, A. Gerstlauer, and R. Dömer. Abstract, Multifaceted Modeling of Embedded Processors for System Level Design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 384–389, 2007.
[17] J. Schnerr, O. Bringmann, and W. Rosenstiel. Cycle Accurate Binary Translation for Simulation Acceleration in Rapid Prototyping of SoCs. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pages 792–797, 2005.
[18] A. Siebenborn, A. Viehl, O. Bringmann, and W. Rosenstiel. Control-Flow Aware Communication and Conflict Analysis of Parallel Processes. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 32–37, 2007.
[19] Synopsys, Inc. Synopsys Virtual Platforms. http://www.synopsys.com/products/designware/virtual_platforms.html.
[20] L. Thiele, S. Chakraborty, and M. Naedele. Real-time Calculus for Scheduling Hard Real-Time Systems. In IEEE International Symposium on Circuits and Systems (ISCAS), volume 4, pages 101–104, 2000.
[21] VaST Systems Technology. CoMET®. http://www.vastsystems.com/docs/CoMET_mar2007.pdf.
[22] A. Viehl, M. Schwarz, O. Bringmann, and W. Rosenstiel. Probabilistic Performance Risk Analysis at System-Level. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 185–190, 2007.
[23] A. Viehl, T. Schönwald, O. Bringmann, and W. Rosenstiel. Formal Performance Analysis and Simulation of UML/SysML Models for ESL Design. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pages 242–247, 2006.
[24] T. Wild, A. Herkersdorf, and G.-Y. Lee. TAPES – Trace-Based Architecture Performance Evaluation with SystemC. Design Automation for Embedded Systems, 10(2–3):157–179, Sept. 2005.