Performance and memory profiling for embedded system design



Heiko Hubert, Benno Stabernack, Kai-Immo Wels
Image Processing Department, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut
Einsteinufer 37, 10587 Berlin, Germany
{huebert, stabernack, wels}@hhi.fraunhofer.de

Abstract- The design of embedded hardware/software systems is often subject to strict requirements concerning various aspects, including real-time performance, power consumption and die area. Especially for data-intensive applications, such as multimedia systems, the number of memory accesses is a dominant factor for these aspects. In order to meet the requirements and design a well-adapted system, the software parts need to be optimized and an adequate hardware architecture needs to be designed. For complex applications this design space exploration can be rather difficult and requires in-depth analysis of the application and its implementation alternatives. Tools are required which aid the designer in the design, optimization and scheduling of hardware and software. We present a profiling tool for fast and accurate performance and memory access analysis of embedded systems and show how it can be applied within the design flow. This concept has been proven in the design of a mixed hardware/software system for H.264/AVC video decoding.

Keywords- profiling, embedded hardware/software systems, design space exploration, scheduling

I. INTRODUCTION

The design of an embedded system often starts from a software description of the system in the C language. For example, the designer writes an executable specification based on a reference implementation of the application, e.g. from standardization organizations or the open-source community. This software code is often not optimized in any way, because it mainly serves the purpose of functional and conformance testing. Therefore it has to be transformed into an efficient system, including hardware and software components. The design of the system requires the following steps: system architecture design, hardware/software partitioning, software optimization, design of hardware accelerators and system scheduling. All these steps require detailed information about the performance of the different parts of the application. Besides the arithmetic demands of the application, memory accesses can have a huge influence on performance and power consumption. This is especially the case for data-intensive applications, such as multimedia systems, due to the huge amount of data to be transferred in these applications. The problem becomes even worse if the available data bandwidth is not used efficiently.

In order to reduce the overall data traffic, those parts of the code which require a high amount of data transfers have to be identified and optimized. The above-mentioned applications contain up to 100,000 lines of source code. Therefore tools are required which help the designer identify the critical parts of the software. Several analysis tools exist, e.g. timing analysis is provided by gprof or VTune. Memory access analysis is part of the ATOMIUM [2] tool suite. However, all these tools provide only approximate results for either timing or memory accesses. A highly accurate memory analysis can be done with a hardware (HDL) simulator, if an HDL model of the processor is available. However, such an analysis implies a long simulation time.

In order to achieve a fast and accurate solution, we developed a specialized profiler, called Memtrace [3], for obtaining performance and memory access statistics. This paper describes the tool with all its features. We show how the provided profiling results can be used during the design and optimization of embedded hardware/software systems. As a case study, Memtrace is applied during the design of a mixed hardware/software system for H.264/AVC video decoding. Starting from a software implementation, it is shown how the software is optimized, an efficient hardware architecture is developed, and the system tasks are scheduled based on the profiling results.

II. MEMTRACE: A PERFORMANCE AND MEMORY PROFILER

A. Tool Architecture

Memtrace is a non-intrusive profiler, which analyzes the memory accesses and real-time performance of an application without the need for instrumentation code. The analysis is controlled by information about variables and functions in the user application, which is automatically extracted from the application. Furthermore, the user can specify the system parameters, e.g. the processor type and the memory system. During the analysis, Memtrace utilizes the instruction set simulator ARMulator [1] for executing the application. The ARMulator provides Memtrace with the information required for the analysis, e.g. the program counter, the clock cycle counter and the memory accesses. Memtrace creates detailed results on memory accesses and timing for each function and variable in the code.

1-4244-0840-7/07/$20.00 ©2007 IEEE. Authorized licensed use limited to: King Mongkuts Institute of Technology Ladkrabang. Downloaded on November 27, 2009 at 04:48 from IEEE Xplore. Restrictions apply.
Figure 1. Performance analysis tool: Memtrace profiles the performance and memory accesses of a user application.

B. Analysis Workflow

The performance analysis with Memtrace is carried out in three steps: the initialization, the performance analysis and the postprocessing of the results.

During initialization Memtrace extracts the names of all functions and variables of the application. During this process user variables and functions are separated from standard library functions, such as printf() or malloc(). This is achieved by comparing the symbol table of the executable with the ones of the user library and object files. The results are written to the analysis specification file. The specification file can be edited by the user, e.g. for adding user-defined memory areas, such as the stack and heap variables, for additional analysis. Furthermore the user can define a so-called "split function", which instructs Memtrace to produce snapshot results each time the "split function" is called. This can be used e.g. in video processing for generating separate profiling results for each processed frame. Additionally the user can control whether the analysis results of a function, e.g. clock cycles, should include the results of a called function (accumulated) or should only reflect the function's own results (self). Typically auxiliary functions, e.g. C library or simple arithmetic functions, are accumulated to the calling functions.

In the second step the performance analysis is carried out, based on the analysis specification and the system specification, as shown in Figure 1. The system specification includes the processor, cache and memory type definitions. The Memtrace backend connects to the instruction set simulator for the simulation of the user application and writes the analysis results of the functions and variables to files; see Section II.C for more details. If a "split function" has been specified, these files include tables for each call of the "split function". TABLE I shows exemplary results for function profiling. The output files serve as a database for the third step, where user-defined data is extracted from these tables.

TABLE I. EXEMPLARY RESULT TABLE FOR FUNCTIONS

f    ca   cyl   ls   ld   l8   st   s8   pm   cm   BI    BC    BD
f1   8    215   75   22   7    52   3    42   5    123   92    0
f2   2    295   39   35   3    14   9    17   9    55    153   87
f3   2    432   78   68   4    10   2    31   17   143   289   0

Abbreviations are: f: function; ca: calls; cyl: bus (clock) cycles; ls: all load/store accesses from the core; ld: all loads; l8: byte and half-word loads; st: all stores; s8: byte and half-word stores; pm: page misses; cm: cache misses; BI: bus idle cycles; BC: core bus cycles; BD: DMA bus cycles.

In the third step the results can be postprocessed. Memtrace allows the generation of user-defined tables, which contain specific results of the analysis, e.g. the load memory accesses for each function. Furthermore the results of several functions can be accumulated in groups for comparing the results of entire application modules. The user-defined tables are written to files in a tab-separated format. Thus they can be further processed, e.g. by spreadsheet programs for creating diagrams.

C. Tool Backend - Interface to the ISS

Memtrace communicates with the Instruction Set Simulator (ISS) via its backend, as depicted in Figure 2. The backend is implemented as a dynamic link library (DLL), which connects to the ISS. Currently only the ARM instruction set simulator ARMulator is supported. The backend is automatically called by the ISS during simulation. During the startup phase, the backend creates a list of all functions and marks the user and split functions found in the analysis specification file. For each function a data structure is created, which contains the function's start address and variables for collecting the analysis results. Finally two pointers, called currentFunction and evaluatedFunction, are initialized. The first pointer indicates the currently executed function. If this function should not be evaluated itself, the second pointer indicates the calling function, to which the results of the current function should be added.

Figure 2. Interface between Memtrace backend and the ISS

Each time the program counter changes, Memtrace checks whether the program execution has changed from one function to another. If so, the cycle count of the evaluatedFunction is recalculated and the call count of the currentFunction is incremented. Finally the pointers to the currentFunction and evaluatedFunction are updated. If currentFunction is a split function, the differential results from the last call of this function up to the current call are printed to the result files.
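The bookkeeping with the two pointers can be illustrated with a small model. The following Python sketch is not the Memtrace backend (which is a DLL attached to the ARMulator); it only mimics the described behaviour under stated assumptions. The class names, the `evaluated` flag, the rule that a call is counted when the program counter hits a function's start address, and the example addresses and cycle values are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class FunctionStats:
    name: str
    start: int               # start address of the function
    evaluated: bool = True   # False: results are accumulated to the caller
    calls: int = 0
    cycles: int = 0


class FunctionTracker:
    """Toy model of the currentFunction / evaluatedFunction bookkeeping."""

    def __init__(self, functions):
        self.functions = sorted(functions, key=lambda f: f.start)
        self.starts = [f.start for f in self.functions]
        self.current = None      # models currentFunction
        self.evaluated = None    # models evaluatedFunction
        self.last_cycle = 0

    def lookup(self, pc):
        # The function containing pc is the one with the largest start <= pc.
        idx = bisect_right(self.starts, pc) - 1
        return self.functions[idx] if idx >= 0 else None

    def _charge(self, cycle):
        # Add the cycles since the last update to the evaluated function.
        if self.evaluated is not None:
            self.evaluated.cycles += cycle - self.last_cycle
        self.last_cycle = cycle

    def on_pc_change(self, pc, cycle):
        func = self.lookup(pc)
        if func is None or func is self.current:
            return               # still inside the same function
        self._charge(cycle)
        if pc == func.start:     # assumption: entry at start address is a call
            func.calls += 1
        self.current = func
        if func.evaluated:
            self.evaluated = func
        # otherwise evaluatedFunction keeps pointing at the caller

    def finish(self, cycle):
        # Close the last interval at the end of the simulation.
        self._charge(cycle)


# Hypothetical trace of (program counter, cycle counter) pairs.
main = FunctionStats("main", 0x100)
helper = FunctionStats("helper", 0x200, evaluated=False)
filt = FunctionStats("filter", 0x300)
tracker = FunctionTracker([main, helper, filt])
for pc, cycle in [(0x100, 0), (0x104, 5), (0x200, 10),
                  (0x108, 25), (0x300, 30), (0x10C, 50)]:
    tracker.on_pc_change(pc, cycle)
tracker.finish(60)
```

In this trace the 15 cycles spent in `helper` are charged to `main`, because `helper` is marked as accumulated, while `filter` collects its own 20 cycles.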
For each access that occurs on the data bus (to the data cache or TCM), the memory access counters of the evaluatedFunction are incremented. Depending on the information provided by the ARMulator, it is decided whether a load or store access was performed, and which bit width (8/16 or 32 bit) was used. Furthermore the ARMulator indicates if a cache miss occurred. Page hits and misses are calculated by comparing the address of the current memory access with that of the previous one, taking the page structure of the memory into account.

For each bus cycle (on the external memory bus) Memtrace checks if it was an idle cycle, a core access or a DMA access and increments the appropriate counter of the evaluatedFunction.

At the end of the simulation the results of the last evaluatedFunction are updated, and the results of the last call of the split function and the accumulated results are printed to the result files.

D. Memtrace Frontend

Memtrace comes with two frontends, a command-line interface and a graphical user interface (GUI). The command-line interface is well suited for use in batch files, for example for profiling a set of system configurations or input data. The GUI version allows easy and fast access to all features of the tool. It is especially helpful for the quick generation of result diagrams.

Figure 3. Memtrace GUI frontend

E. Portability to other Processor Architectures

The current version of Memtrace is only targeted at the ARM processor family, as it uses the ISS from ARM (ARMulator). However, the interface of the profiler, as described before, is rather simple and could be ported to other processor architectures if an instruction set simulator is available which allows debugging access to its memory busses. Our plans for future work include Memtrace backends for other processor architectures.

As long as other backends are not available, the ARM-based profiling results may serve as a rough estimate for the results on other RISC processor architectures. Since all processors of the ARM family can be profiled, a wide variety of architectural features is covered, including variations of pipeline length, instruction bit-width, availability of DSP/SIMD instructions, MMUs, cache size and organization, tightly coupled memories, bus width and detailed memory timing options. For a profiling estimate of a non-ARM processor, an ARM processor with a similar feature set should be chosen. TABLE II lists common embedded processors which have similarities with ARM processors. They have a basic feature set in common, including a 32-bit Harvard architecture with caches, a 5- to 8-stage pipeline and a RISC instruction set. It has to be mentioned, however, that some of the processors provide specific features which may have a significant influence on the performance, for example the custom instruction extensions of the ARC and Tensilica Xtensa processors.

TABLE II. 32-BIT EMBEDDED RISC PROCESSORS

Processor              Pipeline    Registers(a)  Instr./Data Cache(a)  TCM(a)     Special Features
ARM9E                  5 stage     16            128k/128k             yes/yes    coprocessor interf.
ARM11                  8 stage     16            64k/64k               yes/yes    SIMD, branch pred., 64-bit bus, coprocessor interf.
ARC600                 5 stage     32 (-60)      32k/32k               512k/16k   custom instr., extend. reg. file
ARC700                 7 stage     32 (-60)      64k/64k               512k/256k  custom instr., branch pred., extend. reg. file, 64-bit bus
Tensilica Xtensa7      5 stage     64 or >       32k/32k               256k/256k  custom instr., windowed regs., up to 128-bit bus
Tensilica Diamond232L  5 stage     32            16k/16k               -          windowed regs.
LatticeMico32          6 stage     32            32k/32k               -          -
Altera NIOS II         5-6 stage   32            64k/64k               yes/yes    direct-map. cache, custom instr.
Xilinx MicroBlaze v5   5 stage     32            64k/64k               yes/yes    direct-map. cache, coprocessor interf.
MIPS 4KE               5 stage     32            64k/64k               yes/yes    coprocessor interf.
openRISC OR1200        5 stage     32            64k/64k               -          direct-map. cache, custom instr.
LEON3                  7 stage     520           1M/1M                 yes/yes    windowed regs., coprocessor interf.

(a) Many features are customizable; given is the maximum value.

III. MEMTRACE WITHIN THE DESIGN FLOW

This chapter describes how the profiler can be applied during the design of embedded systems. Figure 4 shows a typical design flow for such hardware/software systems. Starting from a functionally verified system description in software, this software is profiled with an initial system specification in order to measure the performance and see if the (real-time) requirements are met. If not, an iterative cycle of software and hardware partitioning, optimization and scheduling starts. In this process detailed profiling results are crucial for all steps in the design cycle.
Figure 4. Typical embedded system design flow

A. Hardware/Software Partitioning and Design Space Exploration

To define a starting point for the system architecture, an initial design space exploration should be performed. This step includes a variation of the following parameters:

* processor type
* cache size and organization
* tightly coupled memories
* bus timing
* external memory system and timing (DRAM, SRAM)
* hardware accelerators, DMA controller

Memtrace can be run in batch mode, so that different system configurations can be tested and profiled. Thus the influence of the system architecture on the performance can be evaluated. This initial profiling also reveals the hot spots of the software. The most time-consuming functions are good candidates for either software optimization or hardware acceleration. Computationally intensive functions in particular are well suited for hardware acceleration in a coprocessor. With the support of a DMA controller even the burden of data transfers can be taken from the processor. Control-intensive functions are better suited for software implementation, as a hardware implementation would lead to a complex state machine, which requires long design time and often does not allow parallelization. In order to get a first idea of the influence of hardware acceleration, a factor (an educated guess) can be defined for each hardware candidate function. Memtrace uses this factor to scale the original profiling results.

B. Software Profiling and Optimization

After a partitioning into hardware and software is found, the software part can be optimized. Numerous techniques exist that can be applied for optimizing software, such as loop unrolling, loop-invariant code motion, common subexpression elimination or constant folding and propagation. For computationally intensive parts, arithmetic optimizations or SIMD instructions can be applied, if such instructions are available in the processor. If the performance of the code is significantly influenced by memory accesses, as is mainly the case in video applications, the number of accesses has to be reduced or they have to be accelerated. The profiler gives a detailed overview of the memory accesses and thus allows their influence to be identified. One optimization mechanism is the conversion of byte (8-bit) to word (32-bit) memory accesses. This can be applied if adjacent bytes in memory are required concurrently or within a short time period, for example pixel data of an image during image processing. A further mechanism is the usage of tightly coupled memories (TCMs) for storing frequently used data. For finding the most frequently accessed data area, the memory access statistics of Memtrace can be used. In [1] these techniques are described in more detail.

C. Hardware/Software Profiling and Scheduling

Besides the software profiling and optimization, a system simulation including the hardware accelerators needs to be carried out in order to evaluate the overall performance. Usually hardware components are developed in a hardware description language (HDL) and tested with an HDL simulator. This task requires long development and simulation times. Therefore HDL modelling is not suitable for the early design cycles, where exhaustive testing of different design alternatives is important. Furthermore, if the system performance is data dependent, a large set of input data should also be tested to get reliable profiling results. Therefore, a simulation and profiling environment is required which allows short modification and simulation times.

For this purpose, we used the instruction set simulator and extended it with simulators for the hardware components of the system. The ARMulator provides an extension interface, which allows the definition of a system bus and peripheral bus components. It already comes with a bus simulator, which reflects the industry-standard AMBA bus, and a timing model for access times to memory-mapped bus components, such as memories and peripheral modules; see Figure 5.

Figure 5. Environment for hardware/software cosimulation and profiling
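To make the idea of a timing model for memory-mapped bus components more concrete, the following Python sketch maps address regions to access times. It is only an illustration and does not reproduce the ARMulator's configuration format or extension interface. The region names, base addresses, sizes and cycle counts are hypothetical, and the split into a first access and following sequential accesses is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    base: int
    size: int
    first_access: int   # cycles for the first (non-sequential) access
    sequential: int     # cycles for each following sequential access

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


# Hypothetical memory map; none of these values come from the paper.
MEMORY_MAP = [
    Region("on-chip SRAM", 0x0000_0000, 0x0001_0000, first_access=1, sequential=1),
    Region("external DRAM", 0x8000_0000, 0x0100_0000, first_access=6, sequential=1),
    Region("coprocessor registers", 0xC000_0000, 0x0000_1000, first_access=2, sequential=2),
]


def access_cycles(addr, burst_len=1):
    """Return the bus cycles needed for a burst of burst_len words at addr."""
    for region in MEMORY_MAP:
        if region.contains(addr):
            return region.first_access + (burst_len - 1) * region.sequential
    raise ValueError(f"address {addr:#x} is not mapped")


sram_single = access_cycles(0x0000_0010)
dram_single = access_cycles(0x8000_0000)
dram_burst4 = access_cycles(0x8000_0000, burst_len=4)
```

With such a table, changing the memory type of a region only means editing its two timing values, which is the kind of quick modification an early design space exploration relies on.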
1) Coprocessors

We supplemented this system with a simple template for coprocessors, including local registers and memories and cycle-accurate timing. The functionality of the coprocessor can be defined as standard C code; thus a software function can be simulated as a hardware accelerator by copying the software code to the coprocessor template. The timing parameter can be used to define the delay of the coprocessor between activation and result availability, i.e. the execution time of the task, as it would be in real hardware. This value can be obtained either from reference implementations found in the literature or from an educated guess by a hardware engineer. Furthermore, multiple hardware implementations of a task with different execution times (and hardware costs) are often possible. In the proposed profiling environment, a good trade-off between hardware cost and speed-up can be found quickly, simply by varying the timing parameter and viewing its influence on the overall performance.

2) DMA Controller

For data-intensive applications, data transfers have a tremendous influence on the overall performance. In order to efficiently outsource tasks to hardware accelerators, the burden of data transfer also has to be taken from the CPU. This job can be performed by a DMA controller. The Memtrace hardware profiling environment includes a highly efficient DMA controller with the following features:

* multi-channel (parameterizable number of channels)
* 1D and 2D transfers
* activation FIFO (non-blocking transfer, autonomous)
* internal memory for temporary storage between read and write
* burst transfer mode

Thus the designer is able to determine the influence of different DMA modes in order to find an appropriate trade-off between DMA controller complexity and required CPU activity.

3) Scheduling

After the software and hardware tasks have been defined, a scheduling of these tasks is required. To increase the overall performance, a high degree of parallelization between hardware and software tasks should be achieved. In order to find an appropriate scheduling for parallel tasks, the following information is required:

* dependencies between tasks
* the execution time of each task
* data transfer overhead

Especially for data-intensive applications the overhead for data transfers can have a huge influence on the performance. It might even happen that the speed-up of a hardware accelerator is cancelled out by the overhead for transferring data to and from the accelerator.

The overhead for data transfers to the coprocessors depends on the bus usage. Furthermore, side effects on other functions may occur if the bus is congested or when cache flushing is required in order to ensure cache coherency. In order to find these side effects, detailed profiling of the system performance and the bus usage is required. Memtrace provides these results; for example, Figure 6 shows the bus usage for each function depending on the access time of the memory.

Figure 6. Bus usage for each function, depending on the memory type

4) HDL Simulation

In a later design phase, when the hardware/software partitioning is fixed and an appropriate system architecture is found, the hardware components need to be developed in a hardware description language and tested using an HDL simulator, such as Modelsim. Finally, the entire system needs to be verified, including hardware and software components. For this purpose the instruction set simulator and the HDL simulator have to be connected. The codesign environment PeaCE [4] allows the connection of the Modelsim simulator and the ARMulator.

IV. APPLICATION EXAMPLE: H.264/AVC VIDEO DECODER FOR MOBILE TV TERMINALS

The proposed design methodology has been applied to the design of a video decoder as part of a mobile digital TV receiver. Starting from an executable specification of the video decoder, namely the (unoptimized) reference software, at first a pure optimized software implementation and then an ASIC incorporating hardware accelerators and a customized processor have been developed.

A. DVB-H and H.264/AVC Video Compression

The receiver is compliant with DVB-H, which is a new standard for broadcasting digital audio and video content to mobile devices. The content is encoded using highly efficient compression methods, namely AAC-HE for audio data and the H.264/AVC [5] codec for video content. DVB-H focuses on high mobility and low power consumption of the receivers. The most demanding part of the receiver in terms of computational requirements is the H.264/AVC video decoder.

The H.264/AVC video compression standard is similar to its predecessors; however, it adds various new coding features and refinements of existing mechanisms, which lead to a 2 to 3 times higher coding efficiency compared to MPEG-2. However, the computational demands and required data accesses have also increased significantly. Figure 7 depicts the block diagram of an H.264/AVC decoder.
Figure 7. Block diagram of an H.264/AVC decoder

The bitstream parsing and entropy decoding interpret the encoded symbols and are highly control-flow dominated. The symbols contain control information and data for the following components. The inter and intra prediction modes are used to predict image data from previous frames or neighboring blocks, respectively. Both methods require filtering operations, with the inter prediction being more computationally demanding. The residuals of the prediction are received as transformed and quantized coefficients. The applied transformation, which can be considered a simplified discrete cosine transformation (DCT), is based on integer arithmetic and is computationally demanding. The reconstructed image is postprocessed by a deblocking filter for reducing blocking artifacts at block edges. The deblocking filter includes the calculation of the filter strength, which is control-flow dominated, and the actual 3- to 5-tap filtering, which requires many arithmetic operations. Each of these components allows various modes of operation, which are chosen by symbols in the bitstream. This involves a high degree of control flow in the decoder.

The H.264/AVC baseline decoder has been profiled with Memtrace using a system specification typical for mobile embedded systems, comprising an ARM946E-S processor core, a data and an instruction cache (16 kB each) and an external DRAM as main memory. The execution time for each module of the decoder has been evaluated as depicted in Figure 8. The results show that the distribution over the modules differs significantly between I- and P-frames. Whereas in I-frames the deblocking has the most influence on the overall performance, in P-frames the motion compensation is the dominant part.

Figure 8. Profiling results for the H.264/AVC software decoder

B. Design and Optimizations

Based on the acquired profiling results, several software and hardware architectural optimizations are applied. Our first target is a pure software version of the video decoder for the implementation of a DVB-H terminal on a PDA. In a second step an embedded hardware/software system is developed.

1) Software Implementation and Optimizations

Following Amdahl's law, those parts of the software which take up most of the execution time should be considered for optimization first. Figure 8 shows that motion compensation, loop filter, inverse transformation and memory-related functions are those candidates. Exploring the results of the functions corresponding to the motion compensation, it can be seen that the function motionCompChroma() requires the most execution time. This function performs the motion compensation for the chrominance pixels, which is mainly based on bilinear interpolation. Focusing on the read memory accesses performed in motionCompChroma(), as given in the second column of TABLE III, it shows that more than 30% are byte or half-word accesses (third column). This is due to the fact that the pixel values have the size of one byte each.

Since the interpolation is applied iteratively on adjacent pixels, the source code can be optimized by reading 4 adjacent bytes at once. This leads to a reduction of the execution time of the function by almost 30%. The speed-up of the function leads to a reduction of the execution time for processing a P-frame by about 5%.

TABLE III. PROFILING RESULTS FOR MOTIONCOMPCHROMA() FUNCTION

                      Clock Cycles   All Load   Load 8/16
before optimization   13,149,109     309,368    104,784
after optimization    9,355,709      196,746    34,584

Further speed-up of the software could be achieved by applying well-known software optimization techniques and those proposed in [3] to the functions identified by the profiler. The resulting software decoder has been tested on an Intel PXA270-based PDA within the DVB scenario. The required processor clock frequency for H.264/AVC decoding is about 420 MHz (320x240 pixel resolution, 384 kBit/s).

Considering the dynamic power consumption of CMOS circuits, given in equation 1, the rather high system frequency leads to high power consumption.

$$P_{\text{dynamic}} = \sum_{k=1}^{M} C_k \, f_k \, V_{DD}^{2} \qquad (1)$$

To achieve lower power consumption, methods need to be applied which allow the reduction of the system frequency, which in turn also allows a lower supply voltage (voltage scaling). Hardware accelerators can be used for this purpose. However, their influence on the capacitance has to be considered and reduced by mechanisms like clock gating. Furthermore the memory architecture needs to be adapted (reduced) to the specific application requirements.
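The effect of frequency and voltage scaling on equation 1 can be checked with a short calculation. The Python sketch below evaluates the standard CMOS dynamic power sum for two operating points. Only the 420 MHz clock is taken from the text; the switched capacitance, the reduced clock of 210 MHz and both supply voltages are hypothetical values chosen for illustration, and the circuit is reduced to a single lumped node.

```python
def dynamic_power(nodes, vdd):
    """Equation 1: sum over nodes of C_k * f_k, multiplied by VDD squared.

    nodes: iterable of (capacitance in farad, switching frequency in hertz)
    vdd:   supply voltage in volt
    """
    return sum(c * f for c, f in nodes) * vdd ** 2


# Pure software decoder: 420 MHz (from the text), assumed 1 nF and 1.2 V.
p_software = dynamic_power([(1.0e-9, 420e6)], vdd=1.2)

# Hypothetical accelerated system: half the clock and a lower supply voltage.
p_accelerated = dynamic_power([(1.0e-9, 210e6)], vdd=1.0)

# Relative power of the accelerated system compared to the software system.
ratio = p_accelerated / p_software
```

Because the voltage enters quadratically, halving the frequency and lowering the supply from 1.2 V to 1.0 V in this example reduces the dynamic power to roughly a third, not just to a half. The sketch ignores the extra capacitance that accelerators add, which the text notes has to be limited, e.g. by clock gating.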
2) Memory System

Besides the processing power of the CPU, the memory and bus architecture determine the overall performance of the system. The most influential factors are the cache size and architecture, the speed and usage of a tightly coupled (on-chip) memory (TCM), the width of the memory bus, the bandwidth of the off-chip memory and a DMA controller. Adjusting these factors requires a trade-off between hardware cost, power consumption and performance. The H.264/AVC decoder has been simulated with different cache sizes in order to find an appropriate size for the DVB-H terminal scenario (QVGA image resolution). It has been evaluated how the required decoding time changes when either the instruction cache size or the data cache size is increased, see Figure 9.

Figure 9. Influence of the instruction (I) and data (D) cache sizes on the execution time of the H.264/AVC decoder (plotted series: I=4k:D=var and I=var:D=0k).

The results show that increasing the instruction cache size from 4 kByte up to 32 kByte has a minor influence on the overall performance. However, adding a data cache of 4 kByte to the system decreases the decoding time to less than 20% of the time required without a data cache. Further increasing the data cache size does not yield a dramatic performance increase. Therefore, a data and instruction cache size of 4 kByte each is a good trade-off between performance and die area. The data cache increases the performance by decreasing the number of accesses to the external memory. This is especially efficient for data areas with frequent accesses to the same memory location, e.g. the stack. However, for randomly accessed data areas, e.g. lookup tables, a fast on-chip memory (SRAM) is more appropriate. As the H.264/AVC decoder requires about 1.1 MByte of data memory (at QVGA video resolution), only small parts of the used data structures (less than 3% with 32 kByte of SRAM) can be stored in the on-chip memory. In order to find a useful partitioning of data areas between on-chip and off-chip memory, it is required to profile the accesses to each data area of the decoder. Since a data cache is instantiated, accesses to these memories only happen if cache misses occur. Therefore, the cache misses have been analyzed separately for each data area in the code, including global variables, heap variables and the stack. Data areas with many cache misses are stored in on-chip memory.

3) Hardware/Software Partitioning

In order to further increase the system efficiency and decrease power consumption and hardware costs, the CPU can be enhanced by coprocessors. Again, the hot spots in the software code should be considered, namely the loop filter, the motion compensation and the integer transformation. These are the foremost candidates for hardware implementation. All these components are demanding on an arithmetical level rather than on a control flow level. Therefore they are well suited for hardware implementation as coprocessors, which can be controlled by the main CPU. In order to ease the burden of providing the coprocessors with data, a DMA controller can be applied, allowing memory transfers concurrently to the processing of the CPU. The coprocessors should be equipped with local memory for storing the input and output data of at least one macroblock at a time, which prevents fragmented DMA transfers. As the video data is stored in the memory in a two-dimensional fashion, the DMA controller should feature 2-D memory transfers. The output of the video data to a display, which is required by a DVB-H terminal, further aggravates the problem of the high amount of data transfers.

4) Hardware/Software Interconnection and Scheduling

After the software optimization is performed and the hardware accelerators are developed, a scheduling of the entire system is required. The scheduling is static and controlled by the software. The hardware accelerators are introduced step by step to the system. Starting from the pure software implementation, at first the software functions are replaced by their hardware counterparts. This also requires the transfer of input data to and output data from the coprocessors. These data transfers are at first executed by load-store operations of the processor and in a next step replaced by DMA transfers. This might also require flushing the cache or cache lines, which may decrease the performance of other software functions. In a final step the parallelization of the hardware tasks and software tasks takes place. All decisions taken in these steps are based on detailed profiling results.

The following example shows how the hardware accelerator for the deblocking is inserted into the software decoder. The hardware accelerator only includes the filtering process of the deblocking stage; the filter strength calculation is performed in software, because it is rather control intensive and therefore more suitable for software implementation. The filter processes the luminance and chrominance data for one macroblock at a time. It requires the pixel data and filter parameters as an input and provides filtered image data as an output; this sums up to about 340 32-bit words of data transfer. Figure 10 shows the results for the pure software implementation, when using the filter accelerator with data transfer managed by the processor, and when additionally using the DMA controller. As can be seen, if data is transferred by the processor, the performance gain of the accelerator is nullified by the data transfers; only in conjunction with the DMA controller can the coprocessor be used efficiently.

Figure 10. Clock cycle comparison of different deblocking implementations (bars: SW, HW with CPU LD/ST, HW with DMA; axis unit: million clock cycles).
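The observation that CPU-managed transfers can cancel the gain of the deblocking accelerator can be made concrete with a deliberately simple cost model. The cycle counts for the software and hardware filter are the deblocking figures from Table IV, and the 340 words per macroblock come from the text; the cost of 10 cycles per word for a CPU load/store copy and the 50 cycles of DMA setup are hypothetical values chosen only for illustration. The model also assumes that a DMA transfer overlaps completely with other CPU work, which is an idealisation.

```python
def offload_cycles(hw_cycles, words, cycles_per_word, dma=False, dma_setup_cycles=0):
    """Cycles charged to the CPU for one macroblock handled by the accelerator."""
    if dma:
        # Assumption: the DMA transfer fully overlaps with other CPU work,
        # so only the descriptor setup is charged to the CPU.
        return hw_cycles + dma_setup_cycles
    # Without DMA the CPU moves every word itself using load/store instructions.
    return hw_cycles + words * cycles_per_word


SW_CYCLES = 3000        # lower bound of the software deblocking cost (Table IV)
HW_CYCLES = 232         # hardware deblocking cost (Table IV)
WORDS = 340             # 32-bit words transferred per macroblock (from the text)
CYCLES_PER_WORD = 10    # hypothetical cost of a CPU load/store copy per word
DMA_SETUP = 50          # hypothetical cost of programming the DMA controller

with_cpu_copy = offload_cycles(HW_CYCLES, WORDS, CYCLES_PER_WORD)
with_dma = offload_cycles(HW_CYCLES, WORDS, CYCLES_PER_WORD,
                          dma=True, dma_setup_cycles=DMA_SETUP)

print("software:", SW_CYCLES)
print("accelerator, CPU LD/ST:", with_cpu_copy)
print("accelerator, DMA:", with_dma)
```

With these assumed transfer costs the accelerator with CPU-driven copies is no faster than the cheapest software case, while the DMA variant is clearly faster, which is the qualitative behaviour reported for Figure 10. The actual values in the figure stem from profiling and are not reproduced by this sketch.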
C. Hardware/Software System Implementation

The profiling and implementation results of the previous sections lead to a mixed hardware/software implementation of the video decoder, which is given in Figure 11. An application processor is extended with a companion chip for acceleration of the video decoding. The companion chip contains the hardware accelerators for H.264/AVC decoding. Table IV shows a comparison of the required cycle times of the accelerators with their software counterparts.

TABLE IV. COMPARISON OF THE EXECUTION TIME IN HARDWARE AND SOFTWARE

Implementation   Deblocking         Pixel Interpolation   Inverse Transform
Software         3000-7000 cycles   100-700 cycles        320 cycles
Hardware         232 cycles         16-34 cycles          30 cycles

a. Memory transfers are not included in these cycle counts.

Furthermore, a so-called SIMD engine is available on the chip, which is a 32-bit RISC processor enhanced with special SIMD instructions. The 32-bit system bus connecting the processor core with the main memory and coprocessor components is augmented with a DMA controller, which supports the main processor by performing the memory transfers to the coprocessor units. A video output unit is implemented, directly driving a connected display or video DAC. To avoid a heavy load on the system bus due to transfers from a frame buffer to the video output interface, an extra frame buffer memory and the video output unit are connected by a separate video bus system. The data transfers between these bus systems are also performed by the DMA controller. The main control functionality of the decoder can be run either on the application processor or on the RISC core on the companion chip.

Figure 11. SOC architecture of the DVB-H/DMB companion chip

To fully evaluate the proposed concept, the complete SOC architecture has been implemented as an ASIC design using UMC's L180 1P6M GII logic technology, see Figure 12. The maximum clock frequency of the design is 120 MHz, whereas 50 MHz should be sufficient for the DVB-H scenario. An evaluation board for the chip is currently under development. It allows full functional verification and, furthermore, exhaustive performance testing and power measurements, separately for memory, core and IO supply voltages.

Figure 12. ASIC layout

V. CONCLUSIONS AND FUTURE WORK

The design of an efficient system for applications with high demands on real-time performance requires the selection of an appropriate system architecture and of the incorporated hardware and software components. For this decision, detailed knowledge of the computational demands of the application is mandatory. Furthermore, for data intensive applications the influence of memory accesses also has to be taken into account. We presented a profiling tool which provides this information and have shown how it can be integrated in the design flow. The tool aids the designer in taking the right decision during each step of the design, including the hardware/software partitioning, the optimization of the components and the system scheduling. We have applied this methodology for the development of a software solution and a hardware/software system for real-time video decoding.

Our future work includes the retargeting of the profiler backend to other processors. Many processor simulators already offer profiling capabilities, e.g. the LisaTek tool suite; however, their results are not as detailed as the Memtrace results. Furthermore, we plan to integrate power models for cache and memory accesses and instruction execution in order to allow power consumption estimation. These models will be based on existing power models of caches and memories and on measurement results of the presented ASIC design.

REFERENCES

[1] RealView ARMulator ISS User Guide, Version 1.4, Ref: DUI0207C, January 2004.
[2] J. Bormans, K. Denolf, S. Wuytack, L. Nachtergaele, and I. Bolsens, "Integrating system-level low power methodologies into a real-life design flow," The Ninth Int. Workshop Power and Timing Modeling, Optimization and Simulation, pp. 19-28, Oct. 1999, Kos Island, Greece.
[3] H. Hubert, B. Stabernack, and H. Richter, "Tool-Aided Performance Analysis and Optimization of an H.264 Decoder for Embedded Systems," The Eighth IEEE International Symposium on Consumer Electronics (ISCE 2004), Reading, England, Sept. 2004.
[4] S. Ha, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, "Hardware-software Codesign of Multimedia Embedded Systems: the PeaCE Approach," 12th IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications, Sydney, Australia, Vol. 1, pp. 207-214, Aug. 2006.
[5] International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, March 2003.