Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325 Energy-Efficient Design of Battery-Powered Embedded Systems V.Prasanna Kumar M.Tech, Asst.Professor, Dept Of E.C.E, L.N.B.C.I.E.T, SataraAbstract Energy-efficient design of battery- development and once the prototype is built in orderpowered systems demands optimizations in to get the best performance and energy consump-both hardware and software. We present a tion from the system. Embedded softwaremodular approach for enhancing instruction optimization requires tools for estimating the impactlevel simulators with cycle-accurate simulation of program transformations on energy consumptionof energy dissipation in embedded systems. and performance.Our methodology has tightly coupledcomponent models thus making our approach This work presents a complete solution formore accurate. Performance and energy all embedded system design issues discussed above.computed by our simulator are within a 5% The distinctive features of our approach are thetolerance of hardware measurements on the following: i) complete system-level and componentSmartBadge [2]. We show how the simulation energy consumption estimates as well as batterymethodology can be used for hardware design lifetime estimates; ii) ability to explore multipleexploration aimed at enhancing the SmartBadge architectural alternatives; and iii) easy estimation ofwith real-time MPEG video feature. In the impact of software changes both during andaddition, we present a profiler that relates after the architectural exploration. The tool set isenergy consumption to the source code. Using integrated within the instruction set simulatorthe profiler we can quickly and easily redesign provided by ARM Ltd. [1]. It consists of twothe MP3 audio decoder software to run in real components: a cycle-accurate system-level energytime on the SmartBadge with low energy consumption simulator with battery lifetimeconsumption. Performance increase of 92% and estimation and a system profiler that correlatesenergy consumption decrease of 77% over the both energy consumption and performance withoriginal executable specification have been the code. Our tools have been tested on a real-lifeachieved. industrial application, and have proven to be both accurate (within 5% of hardware measurements) andIndex Terms—Low-power design, performance highly effective in optimizing the energytradeoffs, power consumption model, system- consumption in embedded systems (energylevel. consump- tion reduced by 77%). In addition, they are very flexible and easy to adopt to differentI. INTRODUCTION systems. The tools contain general models for all NERGY consumption is a critical factor in typical embedded system components but thesystem-level design of embedded portable microprocessor. In order to adopt the tools to another Eappliances. In addition, low costs with fast time to processor, the ARM ISS needs to be replaced by themarket are crucial. As a result, typical portable ISS for the processor of interest.appliances are built of commodity components and The rest of this manuscript is organized ashave a microprocessor-based architecture. Full follows. We discuss related work in Section II.system evalua- tion is often done on prototype System model and the methodology for cycle-boards resulting in long design times. Field accurate simulation of energy dissi- pation areprogrammable gate array (FPGA) hardware emu- presented in Section III. Section IV shows that thelators are sometimes used for functional debugging simulation results of timing and energy dissipationbut cannot give accurate estimates of energy using the methodology presented are within 5% ofconsumption or performance. Performance can be the hardware mea- surements for the Dhrystone testevaluated using instruction-set simulators (e.g., [1]), case. Hardware architecture trade-offs forbut there is limited or no support for energy consump- SmartBadge’s real-time MPEG video decodetion evaluation. design are explored using cycle-accurate energy Ideally, when designing an embedded simulation in Section V. The profiling supportsystem built of com- modity components, a designer we have developed is presented in Section VI. Awould like to explore a limited number of full software design example of MP3 audio decoderarchitectural alternatives and test functionality, en- for the SmartBadge that uses our profiler is shownergy consumption, and performance without the need in Section build a prototype first. In addition, designers needto optimize software both during hardware 312 | P a g e
  2. 2. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325II. RELATED WORK components. In addition, it is slower than our As portable embedded systems have approach as it models at lower abstraction level.grown in importance in recent years, so has the An alternative approach for energy estimation usingneed for tools that enable energy consumption measure- ments as a basis for estimation is presentedestimation for such systems. CAE support for in PowerScope tool [9]. PowerScope requires twoembedded system design is still limited. computers to collect the measure- ment statistics,Commercial tools target mainly functional some changes to the operating system source code,verification and performance estima- tion [3]–[6], and a digital multimeter. Although this systembut provide no support for energy-related cost enables accurate code profiling of an existingmetrics. system, it would be very difficult to use it for bothProcessor energy consumption is generally hardware and software architecture ex- ploration weestimated by nstruction-level power analysis, first present in this paper, as in the early design stagesproposed by Tiwari et al. [24], [25]. This technique neither hardware nor operating systems or softwareestimates the energy consumed by a program by are avail- able for measurements.summing the energy consumed by the execution of Finally, previous approaches do not focuseach instruction. Instruction-by-instruction energy on battery life op- timization, the ultimate goal ofcosts, together with nonideal effects, are energy optimization for portable systems. In fact,precharacterized once for all for each target when the battery subsystem is not considered inprocessor. An approach proposed recently in [12] energy estimation significant errors can result [21].attempts to evaluate the effects of different cache Some an- alytical estimates of the tradeoff betweenand bus configurations using linear equations to battery capacity and delay in digital CMOS systemsrelate the main cache characteristics to system are presented in [18]. Battery capacity is stronglyperformance and energy consumption. This approach dependent on the discharge current as can be seendoes not account for highly nonlinear behavior in from any battery data sheet [22]. Hence, it iscache accesses for different cache configurations important to accurately model discharge current as athat are both data and architecture dependent. function of time in an embedded system.A few research prototype tools that estimate the In contrast to previous approaches, in thisenergy con- sumption of processor core, caches, and work memory models and processor instruction-main memory in SOC design have been proposed level simulator are tightly integrated together with[7], [10]. Memory energy consump- tion is estimated an accurate battery model into cycle- accurateusing cost-per-access models. Processor ex- ecution simulation engine. Estimation results obtained withtraces are used to drive memory models, thereby ne- our simulator are shown to be within 5% ofglecting the nonnegligible impact of a nonideal measured energy consumption in hardware. Inmemory system on program execution. The final addition, we accurately model battery dischargesystem energy is obtained by summing over the current. Since we develop only one simulationcontribution of each component. The main engine, there is no overhead in executing simulatorslimitation of the approaches presented in [7] and at different levels of abstraction, or in the interface[10] is that the interaction between memory system between them. Thus, our approach enables fast and(or I/O peripherals) and processor is not modeled. accurate architecture exploration for both energyA more recent approach presented in [11] combines consumption and performance.multiple power estimators into one simulation In an industrial environment, the degreesengine thus enabling de- tailed simulation of some of freedom in hardware design for embeddedcomponents, while using high-level models for portable appliances are often very limited, but forothers. This approach is able to account for interac- software a lot more freedom is available. As ation between memory, cache and processor at run result, a primary requirement for system-leveltime, but at the cost of potentially long run-times. design methodology is to effectively support codeLonger run-times are caused by different abstraction energy consumption optimization. Severallevels of various simulators and by the overhead in techniques for code optimization have beencommunication between different components. The presented in the past. A methodology thattechniques that enable significant simulation combines automated and manual softwarespeedup are pre- sented, but at the cost of the loss of optimizations focused on optimizing memorydetail in software design and in the input data trace. accesses has been presented in [17]. Tiwari et al.Cycle-accurate register-transfer level energy [24], [25] uses instruction-level energy models toestimation is pre- sented in [8]. This tool integrates develop compiler-driven energy optimizations suchRT level processor simulator with DineroIII cache as instruction reordering, reduction of memorysimulator and memory model. It is shown to be operands, operand swapping in Booth multipliers,within 15% of HSPICE simulations. Unfortunately, efficient usage of memory banks, and series ofthis ap- proach is not practical for component-based processor specific optimizations. Several other opti-designs such as the one presented in this paper, as it mizations have been suggested, such as energyrequires knowledge of the in- ternal design of system efficient register labeling during the compile 313 | P a g e
  3. 3. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325phase [19], procedure inlining and loop unrolling to illustrate our methodology and to obtain hard-[7], as well as instruction scheduling [27]. Work ware measurements. The SmartBadge, shown in Fig.presented in [20] applies a set of compiler 1, is an em- bedded system consisting of theoptimizations concurrently and evaluates the StrongARM-1100 processor, FLASH, SRAM,resulting energy consumption via simulation. sensors, and modem/audio analog front-end on aAll of the techniques discussed above focus on PCB board powered by the batteries through a DC–automated instruction-level optimizations driven by DC converter. The initial goal in designing thethe compiler. Unfor- tunately, currently available SmartBadge was to allow a computer or a humancommercial compilers have lim- ited capabilities. user to provide location and envi- ronmentalThe improvements gained when using stan- dard information to a location server through a heteroge-compiler optimizations are marginal compared to neous network. The SmartBadge could be used as awriting energy efficient source code [16]. The corporate ID card, attached (or built in) to deviceslargest energy savings were observed at the such as PDAs and mobile telephones, orinterprocedural level that compilers have not been incorporated in computing systems. The design goalable to exploit. for the SmartBadge has since been extended to Code optimization requires extensive combine lo- cation awareness and authenticationprogram execution analysis to identify energy- with audio and video sup- port. We will illustratecritical bottlenecks and to provide feedback on the how our methodology has been used for architectureimpact of transformations. Profiling is typically used exploration of the new SmartBadge that needed toto relate performance to the source code for CPU support real-time MPEG video decode feature. Inand L1 cache [1]. Leveraging our estimation engine, addition, we will show how our profiler and codewe implemented a code profiling tool that gives optimizations can be used to improve code for MP3percentages of time and energy spent in each audio decoder.procedure for every system component, not only The system we use in this work to illustrateCPU and L1 cache. Thanks to energy profiling, the our methodology, the SmartBadge, has an ARMprogrammer can easily identify the most energy- processor. As a result, we im- plemented the energycritical procedures, apply transformations, and models as extensions to the cycle-accu- rateestimate their impact not only on pro- cessor energy instruction-level simulator for the ARM processorconsumption, but also on memory hierarchy and family, called the ARMmulator [1]. The ARMulatorsystem busses. is normally used for functional and performance validation. Fig. 2 shows the sim- ulator architecture. The typical sequence of steps needed to set up system simulation can be summarized as follows: 1) The designer provides a simple functional model for each system component other than the processor; 2) The functional model is annotated with a cycle- accurate performance model; 3) Ap- plication software (written in C) is cross-compiled and loaded in specified locations of the system memory model; and 4) The simulator runs the code and the designer can analyze execution using a cross- debugger or collecting statistics. A designer inter- ested in using our methodology would only need toFig. 1. SmartBadge. addition- ally provide cycle-accurate energy models for each component during step 2) of the simulation Our approach enables complete system- setup. Thus, the designer can obtain power estimateslevel and component energy consumption estimates with little incremental well as battery lifetime es- timates. In addition, it We developed a methodology forprovides an ability to quickly explore multiple enhancing cycle-accurate simulators with energyarchitectural alternatives. Finally, it enables software models of typical components used in embeddedoptimization both during and after architectural system design. Each component is characterized withexploration using our energy profiling tool. In the equivalent capacitance for each of its power states.following section we present the cycle-accurate Energy spent per cycle is a function of equivalentenergy simulator architecture to- gether with energy capacitance, current voltage, and frequency. Theconsumption models for the components modeled. equivalent capacitance allows us to easily scale energy consumed for each component as frequencyIII. SYSTEM MODEL or voltage of operation change. Equivalent Typical portable embedded systems have capacitances are cal- culated given the informationprocessors, storage, and peripherals. We use provided in data sheets.SmartBadge [2] throughout this paper as a vehicle Internal operation of our simulator proceeds as 314 | P a g e
  4. 4. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325follows. On each cycle of execution the ARMulatorsends out the infor- mation about the state of theprocessor (―cycle type‖) and its address and databusses. Two main classes of processor cycle typesare processor active, where active power is consumed,and processor idle, where idle power is consumed.The processor idle state represents an off-chipmemory request. The number of cycles that theprocessor remains idle depends on L2 cache andmemory model access times. L2 cache, whenpresent, is always accessed before the main memoryand so is active on every memory access request. OnL2 cache miss, main memory is accessed. Memorymodel accounts for energy spent during the memoryaccess. The interconnect energy model calculatesenergy consumed by the interconnect and pinsbased on the number of lines switched during the Fig. 2. Simulator architecture.cycle on the data and ad- dress busses. The DC–DCconverter energy model sums all the currents the on-chip cache, and the state in whichconsumed each cycle by other system components, the processor is exe- cuting NOPs while waiting toac- counts for its efficiency loss, and gets the total fill the consumed from the battery. The battery Note that in the case of StrongARMmodel accounts for battery effi- ciency losses due to processor used in this work, the data sheet values forthe difference between the rated current and discharge current consumption correspond well to the measuredcurrent computed the current cycle. values. Wan [26] extended the StrongARM processorThe total energy consumed by the system per cycle model with base current costs for each the sum of energies consumed by the processor The average power consumption for most of theand L1 cache ( ), interconnect and pins ( instructions is 200 mW measured at 170 MHz. Load ), memory ( ), L2 cache ( ), the and store instructions re- quired 260 mW each.DC–DC converter ( ) and the efficiency losses Because the difference in energy per in- struction isin the battery ( ) minimal, it can be expected that the average power consumption value from the data sheets is on the same level of accuracy as the instruction-level The total energy consumed during the model. Thus we can use data sheet values to deriveexecution of the software on a given hardware equivalent capacitances for the Stron- gARM. Notearchitecture is the sum of the energies con- sumed that for other processors data sheet values wouldduring the each cycle. Models for energy need to be verified by measurement, as often dataconsumption and performance estimation of each sheet values report the maximum powersystem component are de- scribed in the following consumption, instead of typical.sections. When the processor is executing with the on-chip cache, it consumes the active powerA. Processor specified in the data sheet mea- sured at given The ARM simulator provides a cycle- voltage and frequency of operation . Totalaccurate, instruction- level model for ARM equivalent active capacitance within the processorprocessors and L1 on-chip cache. The model was is estimated asenhanced with energy consumption estimates basedon the information provided by the data sheets. Twopower states are considered: active state in whichprocessor is running with The amount of energy consumed by processor and L1-cache at specified processor cycle time and CPU core voltage is When there is an on-chip cache miss, the processor stalls and ex- ecutes NOP instructions which consume less power. 315 | P a g e
  5. 5. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325can be estimated from the power consumed during estimated from the ac- tive power specified in the dataexecution of sheet measured at voltage and frequencyNOPs at voltage and frequency Multibank memory can be represented as multipleThe energy consumed within processor core per cycle one-bank memorieswhile ex- ecuting NOPs is Idle state can be further subdivided into multiple states that describe modes of operation for different types of memories. For example, DRAM might have two idle states: refresh and sleep. The designer specifies the percentage of the timeB. Memory and L2 Cache memory spends in each idle state. Total idle energy The processor issues an off-chip memory per cycle for memory isaccess when there is an L1 cache miss. The cache-fill request will either be ser- viced by the L2 cacheif one is present in the design or directly from themain memory. On L2 cache miss, a request is issuedto the processor to fetch data from the main memory. where is power consumption in idleData sheets specify the memory and L2 cache access state . Both RAM and ROM are represented withtimes and energy con- sumed during active and idle the same memory model, but with differentstates of operation. parameters. Memory access time is scaled by the The L2 cache access time and energyprocessor cycle time to obtain the number of consumption are treated the same way as any othercycles the processor has to wait to serve a request memory. L2 cache organization is de- termined from (6). Wait cycles are defined for two different the number of banks, lines per bank, and words pertypes of memory accesses: sequential and line. Line replacement can follow any of the well-nonsequen- tial. Sequential access is at the address known replace- ment policies. Cache hit rate isimmediately following the address of the previous strongly dependent on its organi- zation, which in turnaccess. In burst type memory the se- quential access affects the total memory access time and the energyis normally a fraction of the first nonsequential consumption. Note that we are simulating details ofaccess the L2 cache access and thus know the exact L2 cache miss rate. C. Interconnect and Pins The interconnects on the PCB can Two energy consumption states are contribute a large portion of the off-chipdefined for each type of memory: active and idle. capacitance. Capacitance per unit length of theEnergy consumed per cycle while interconnect is a parameter in the energy model that can be ob- tained from the PCB manufacturer. The length of an intercon- nect can be estimated by the designer based on the approximate placement of the selected components on the PCB. Pin capaci- tance values are reported on the data sheets. For each component the average length of the clock line, data, and address buses between the processor and the component are provided as one of the input simulation parameters. Hence, the designer is free to use any wire-length estimate [14] or mea- surement. The interconnect lengths used in our simulation of SmartBadge come from the prototypeFig. 3. DC–DC converter efficiency board layout. memory is in active state operating at The total capacitance switched during one cycle issupply voltage is a function of equivalent shown in (10). It depends on the capacitance of oneactive capacitance, voltage of operation and number interconnect line and the pins attached to itof total access cycles ( +1) and the number of lines switched during the cycle Active memory capacitance can be 316 | P a g e
  6. 6. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325 The total energy consumed per cycle The discharge rate (or discharge current ratio) is is a function given byof the voltage swing on the lines that switchedtotal capac- itance switched and the total timeto access the memory where , the rated discharge current, is derived from the battery specification and is the average current drawn by the DC–DC converter. As a battery cannot respond to instantaneous changes in current, a first order time constant is defined to determine the short-term average current drawn from the battery [23]. Given , and processor cycle timeD. DC–DC Converter , we can compute , the number of cycles DC–DC converter losses can account for a over which average DC–DC current is calculated assignificant frac- tion of the total energy consumption.Fig. 3 from the datasheets shows the dependence ofefficiency on the DC–DC converter output current.Total current drawn from the DC–DC converter bythe system each cycle is a sum of the currentsdrawn by each system component. A componentcurrent is defined by where is the instantaneous current drawn from the bat- tery. With discharge current ratio, we estimate battery efficiency using battery efficiency plot such as the one shown in Fig. 4. The where is the energy consumed by thecomponent during cycle of length atoperating voltage .Total current drawn from the battery can becalculated as Efficiency can be estimated using linearinterpolation from the data points derived from theoutput current versus efficiency plot in the data Fig. 4. Battery efficiency.sheet. From our experience, a table with 20 pointsderived from the data sheets gives enough total energy loss of the battery per cycleinformation for accurate linear estimation of values is the product of energy drained from thenot directly represented in the table. battery by the system with the effi- ciency lossTotal energy DC–DC converter draws out of thebattery each cycle isThe energy consumed by the DC–DC converter is differ- Given the battery capacity model describedence between the energy provided by the battery above, battery es-timation is performed as follows. and First, the designer character- izes the battery with itsthe energy consumed from the DC–DC converter rated capacity, the time constant, and the table ofby all other components, points describing the discharge plot similar to the one shown in Fig. 4. During each simulation cycle discharge cur- rent ratio is computed from the rated battery current and average DC–DC currentE. Battery Model calculated from the last cycles. Efficiency is The main battery characteristic is its rated calculated using linear interpolation between thecapacity measured in megawatt hours. Since total points from the discharge plot. Total energy drawnavailable battery capacity varies with the discharge from the battery during the cycle is obtained fromrate, manufacturers specify plots in the datasheets (19). Lower efficiency means that less battery energywith discharge rate versus battery efficiency similar remains and thus the battery lifetime is propor-to the one shown below. tionally lower. For example, if battery efficiency is 60% and its rated capacity is 100 mAhr at 1 V, then 317 | P a g e
  7. 7. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325the battery would be drained in 12 min at average V. EMBEDDED MPEG DECODERDC–DC current of 300 mA. With efficiency of SYSTEM DESIGN EXPLORATION100% the battery would last 1 h. The primary motivation for the development of cycle-accu- rate energy consumption simulation methodology is to provideIV. VALIDATION OF THE SIMULATION an easy way for embedded system designers toMETHODOLOGY evaluate mul- tiple hardware and software We validated the cycle-accurate power architectures with respect to per- formance andsimulator by com- paring the computed energy energy consumption constraints. In this section weconsumption with measurements on the SmartBadge will present an application of the simulationprototype implementation. The SmartBadge methodology to embedded MPEG video decoderprototype consists of the StrongARM-1100 system design exploration. The MPEG decoderprocessor, DC–DC Converter, FLASH, and SRAM design consists of the processor, the off-chipon a PCB board. All the com- ponents except the memory, the DC–DC converter, output to the LCDCPU core are powered through the 3.3 V supply display, and the interface to the source of the MPEGline. CPU core runs on 1.5 V supply. DC–DC stream. The input and output portions of the MPEGconverter is powered by the 3.5 V supply. DC–DC decoder design will not be consid- ered at this point.converter efficiency table contains 22 points derived We focus on selection of memory hierarchy that isfrom the plot shown in Fig. 3. Stripline interconnect most energy efficient.model is used with 1.6 pF/cm capacitance calculated The characteristics of memorybased on the PCB board characteristics [13]. Table I components considered are shown in Table II.shows other system components. Average current Two different instruction memories wereconsumed by the processor’s power supply and the evaluated—low-power FLASH and power-hungrytotal current drawn from the battery are measured burst FLASH. We looked at three different datawith digital multimeters. Execution time is memories—low- power SRAM, faster burstmeasured using the processor timer. SRAM, and very power-hungry burst SDRAM.Industry standard Dhrystone benchmark is used as a Both instruction and data memories are 1 MB invehicle for methodology verification. Measurements size. We considered using L2 cache in addition to L1and simulations have been done for ten different cache. Unified L2 cache is 256 Kb, four-way setoperating frequencies of SA-1100 and SA-110 associative. The hardware configurations simulatedprocessors. Dhrystone test case is run 10 million are shown in Table III. The MPEG video decodetimes, 445 instructions per loop. Simulations ran sequence we used has 12 frames running at 30 frames/second, with two I, three P, and seven B-Table 1 frames. We found that the results we obtained with aDHRYSTONE TEST CASE SYSTEM DESIGN shorter video sequence matched well the results obtained with the longer trace. Fig. 6 shows the amount of time each system component is active during the MPEG decode and the amount of energy con- sumed. The original configuration is limited by the bandwidth of data memory. L2 cache is very fast but also consumes too much energy. Burst SDRAM design On HP Vectra PC with Pentium II MMX fully solves the memory bandwidth problem with300 MHz processor and 128 MB of memory. least energy consumption. InstructionHardware ran 450 times faster than the simulationswithout the energy models. Simulations with energymodels were slightly slower (about 7%). Fig. 5shows average processor core and battery currents.Average simulation current is obtained by dividingthe total energy consumed by the processor core orthe battery with their respective supply voltages andthe total execution time. Simulation results are within 5% of thehardware measure- ments for the same frequency ofoperation. The methodology presented in this paperfor cycle-accurate energy consumption simulation isvery accurate and thus can be used for architecturedesign exploration in embedded system designs. Anexample of such exploration is presented next. 318 | P a g e
  8. 8. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325Fig. 5. Average processor core and batterycurrents.TABLE IIMEMORY ARCHITECTURES FOR MPEGDESIGN Fig. 6. Performance and energy consumption for hardware architecturesTABLE III HARDWARE CONFIGURATIONS Fig. 7. Cycle-accurate energy plot The analysis of peak energy consumption and the fine tuning of the architectures can be done memory constitutes a very small portion of by studying the energy con- sumption and thethe total energy due to the relatively large L1 cache in memory access patterns over a period of time. Fig. 7comparison to the MPEG code size. The DC–DC shows the energy consumption over time of theconverter consumes a significant amount of total processor with burst FLASH and SRAM. Peakenergy and thus should be considered in system energy consumption can reach twice the averagesimula- tions. We conclude from this example that consumption, so the thermal character- istics of theusing faster and more power-hungry memory can be hardware design, the DC–DC converter, and the bat-energy efficient. tery have to be specified accordingly. For best battery utilization, it is important to match the current consumption of the embedded 319 | P a g e
  9. 9. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325system to the discharge charac- teristic of thebattery. On the other hand, the more capacity bat-tery has, the heavier and more expensive it will be.Fig. 8 shows that the instantaneous battery efficiencyvaries greatly over time with MPEG decode runningon the hardware described above. Lower capacity batteries have largerefficiency losses. Fig. 9 shows that the total decreasein battery lifetime when contin- ually runningMPEG algorithm on a battery with lower rateddischarge current can be as high as 16%. The battery’stime con- stant was set to ms. The design exploration example presentedin this section il- lustrates how the methodology forcycle-accurate energy con- sumption simulation canbe used to select and fine-tune hard- wareconfiguration that gives the best tradeoff between Fig. 9. Percent decrease in battery lifetime forperfor- mance and energy consumption. MPEG decoder The main limitation of the cycle-accurate sign can be optimized for both energyenergy simulator is that the impact of code consumption and perfor- mance based on theoptimizations is not easily evaluated. For example, in expected input data set.order to evaluate energy efficiency of two different The profiler operates as follows. Sourceimplementations of a particular portion of software, code is compiled using a compiler for a targetthe designer would need to obtain cycle-by-cycle processor (e.g., application or op- erating systemplots and then manually relate cycles to the software code). The output of the compiler is the exe- cutableportion of interest. The profiling methodology that the cycle-accurate simulator executespresented next addresses this limitation. (represented in this figure as assembly code that is input into the simulator) and a map of locations of each procedure in the executable that a profiler usesVI. PROFILING OF SOFTWARE to gather statistics (the map is correspondence ofENERGY CONSUMPTION assembly code blocks to procedures in ―C‖ source The profiler architecture is shown in Fig. 10. code). In order to increase the simulation speed, aThe shaded por- tion represents the extension we user-defined profiling interval is set so that themade to the cycle-accurate en- ergy simulator to profiler gathers statistics only at pre- determinedenable code profiling. Profiling for energy and time increments. Usually an interval of 1 s is suf-performance enables designers to identify those ficient. Note that longer intervals will give slightlyportions of their source code that need to be further faster ex- ecution time, with a loss of accuracy. Veryoptimized in order to either decrease energy short intervals (on the other of a few cycles) haveconsumption, increase performance, or both. Our larger calculation overhead. For example, energyprofiler enables designers to explore multiple dif- consumption calculation gives approximately 10%ferent hardware and software architectures, as well as overhead to standard cycle-accurate performanceto do sta- tistical analysis based on the input simula- tion. Profiling with an interval of 1 s givessamples. In this way the de- negligible overhead over energy simulation (less then 1%), with still accurate results. During each cycle of operation, the cycle-accurate energy consumption simulator calculates the current total executiontime and energy consumption of all system components as shown in (1). The profiler works concurrently with the cycle-accurate simulator. It periodically samples the simulation results (using sample interval specified by the user) and maps the energy and performance to the function executed using information gathered at the compile time. Once the simulation is complete, the results of profiling can be printed out by the total energy or time spent in each function. The main advantage of the profiler is that it allows designers to obtain energy consumption breakdown by procedures in their source code after running onlyFig. 8. Battery efficiency for MPEG decoder. one simulation. This information is of critical 320 | P a g e
  10. 10. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325importance when designing an embedded system, as uses the profiler to guide the implementation ofit enables designers to quickly identify and address the source code optimizations described earlier forthe areas in the source code that will provide largest the MP3 audio decoder running on the martBadge.overall energy sav- ings. A good example of profilerusage is shown in Table IV. The table shows a VII. OPTIMIZING MP3 AUDIO DECODERportion of energy profile for MP3 audio de- code. The block diagram of the MPEG Layer IIIThe first column gives the name of the top procedure, audio decoding algorithm (MP3) is shown in Fig. 11.fol- lowed by its children. The next column gives It consists of three blocks: frame unpacking,the total energy spent for that procedure. For reconstruction, and inverse mapping. The first stepexample, the total energy spent running the program in decoding is synchronizing the incoming bitstream( ) is 0.32 mWhr. The final column gives and the decoder. Huffman decoding of the subbandthe amount of energy spent only in that particular coefficients is performed before requantization.proce- dure. For example, under it is Stereo processing, if applicable, occurs before theclear that and its descendants spend inverse mapping which consists of an inversethe most energy, 0.0671 mWhr. Looking at the entry modified cosine transform (IMDCT) followed by afor , it is easy to see that the largest portion polyphase synthesis filterbank. We obtained theof energy is consumed by its child, . There- fore, original MP3 audio decoder software from thethe procedures to focus optimization on are International Organization for Standardization [28].and . Although in this example we showed Our design goal was to obtain real-time performancesource code profile of total battery energy with low energy consumption while keeping in fullconsumption, the pro- filer can report energy compliance with the MPEG standard.consumption for any system component, such as Given the limited compiler support available [16],SRAM or the interconnect. our ap- proach to code optimization is based on The profiler allows for fast and accurate manual code rewriting and optimization guided byevaluation of software and hardware architectures. our profiler. Code transformations are applied inMost importantly, it gives good guidance to the layers, starting from a high level of abstraction anddesigner during the design process without requiring moving down to very detailed and architecture-manual intervention. In addition, the profiler specific op- timization. In the next threeaccounts for all embedded system components, not subsections, we will describe in detail the threejust the processor and the L1 cache as most optimization layers, moving from high to lowgeneral-purpose profilers do. In the next section we abstraction. The results of optimizations applied topresent a real design example the MP3 de- coder will be presented in the last subsection. Note that all the optimizations presented in the following subsections were per- formed manually. A. Algorithmic Optimization The top layer in the optimization hierarchy targets algorithms. The original specification is first profiled to identify all compu- tational kernels, i.e., the procedures where most time and power are spent. Each computational kernel is then analyzed from a functional viewpoint. Then, the alternative algorithms for im- plementing the same functionality are considered and compared ˘ IMUNIC et al.: ENERGY-EFFICIENT DESIGN ´ OF BATTERY-POWERED EMBEDDED SYSTEMSFig. 10. Profiler architecture. 321 | P a g e
  11. 11. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325Table IV optimization, was unacceptably slow and power-SAMPLE ENERGY PROFILING consuming. Trying to reduce the precision of floating point computation, such as discussed in [31], would have helped only marginally as the processor would have to emulate in soft- ware all the floating point operations. To overcome this problem, we developed a fixed- precision library and we implemented all computational kernels of the algorithm using fixed precision numbers. The number of dec- imal digits can be set at compile time. The ARM architecture is designed to support computation with 32-bits integers with maximum efficiency. Hence, little can be gained by reducing data size below 32 bits. On the other hand, when multiplyingtwo 32-bit numbers, the result is a 64-bit number and directly truncating the result of a multiplication to 32 digits frequently leads to incorrect results because of overflow. To increase ro- bustness, 64-bit numbers have been used for fixed-point compu- tation. This data type is supported by the ARM compiler through the definition of a integer type. Computing with integers is less efficient than using 32-bit integers, but results are accurate and the risk of overflow is minimized. Data optimization produced significant energy savings and speedups for almost all computational kernels of MP3 without any perceivable degradation in quality. The fixed-point library developed for this with the original one. At this level of purpose contains macros for conversion from fixed-abstraction, we consider only high-level estimators point to floating point, accuracy adjustment,of algorithmic efficiency (such as number of basic elementary function computation.operations). Our approach to algorithmic optimization C. Instruction Flow Optimizationin MP3 decoding has been conservative. First, we Moving further down in abstraction level,focused on just one computa- tional kernel where a the third layer of optimizations targets low-levellarge fraction of run time (and power) was spent, instruction flow. After extensive profiling, the mostnamely the subband synthesis. Second, we did not critical loops are identified and carefully analyzed.try to develop new original algorithms but we used Source code is then rewritten to make computationpreviously pub- lished algorithmic enhancements more efficient. Well-known techniques such as loop[29], [30] that are still fully compliant to the MPEG merging, unrolling, software pipelining, loopstandard. The new algorithm incorpo- rates an invariant extraction, etc. [36], [35] have beeninteger implementation of the scaled Chen discrete applied. In the innermost loops, code can beco- sine transform (DCT) instead of a generic DCT in written directly as inline assembly, to better exploitthe polyphase synthesis filterbank. The use of a specialized instructions.scaled DCT reduces the DCT multiply count by28%. Instruction flow optimizations have been extensively applied in the MP3 decoder, obtainingB. Data Optimization significant speedup. We do not describe these At a lower level of abstraction than the optimizations in detail because they are commonalgorithmic level, we optimize code by changing the knowledge in the optimizing compilers literaturerepresentation of the data ma- nipulated by the [36], [35]. However, in our case most optimizationsalgorithms. The main objective is to match the were performed manually due to lack of support bycharacteristics of the target architecture with the the ARM compiler.processed data. In our case, the executable A simple example of this class ofspecification of the MPEG de- coder performed transformation is the use of the multiply-most computations using doubles, while the accumulate instruction ( ) available in the ARMprocessor SA-1100 has no hardware floating point SA-1100 core. The inner loops of subband synthesissupport. As a result, a direct implementation of the and inverse modified cosine transform (the two keydecoding algorithm, even after algorithmic computational kernels of MP3 decoder) contain 322 | P a g e
  12. 12. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325matrix multiplications which can be implemented Table VI ENERGY FOR MP3efficiently with multiply-accumulate. In this case, IMPLEMENTATIONSwe forced the ARM compiler to use the instruc-tion by inlining it in assembly.D. Results of MP3 Audio Decode Optimization Table V shows the top three functions inenergy consumption for each code revision weworked on. The original code has a very largeoverhead due to floating point emulation, about 80%of energy consumption. The next largest issue is theredesign of SubBandSynthesis function thatimplements the polyphase syn- thesis filterbank. Thedetails of each optimization type, namelyalgorithmic, data, and instruction-level optimizations,have been presented above. We will use the SubBandSynthesis functionredesign as a ve- hicle to illustrate the use of ourprofiler. In the initial stage, we transferred all criticaloperations to fixed-point from floating point. The the Table V. In parallel we had decided to try thetransfer resolved the issue with floating-point opera- algorithmic changes on the current code.tions but at the same time increased Profiling results in Table V show that theSubBandSynthesis fraction of total energy six times. algorithmic opti- mizations considerably reduced theNext we introduced a series of instruc- tion-level energy consumption of Sub- BandSynthesisoptimizations that resulted in a 30% decrease of Sub- function—it does not appear in the top three func-BandSynthesis fraction of total energy, to 34.32% as tions, and in fact it is only 3.2% of the total energyshown in consump- tion. The final step is to combine the algorithmic changes with the data and instruction- level changes, resulting in decrease of Sub- BandSynthesis fraction of energy consumption to 6% of total. System and component energy consumptions are shown in Table VI for different revisions of source code optimization. Positive percentages show energy decrease with respect to the original code. Table VII shows the sameFig. 11. MP3 audio decoder architecture. results but for performance measurements. Positive percentages show perfor- mance increase. AlthoughTable V the energy savings of algorithmic versus data andPROFILING FOR MP3 IMPLEMENTATIONS instruction-level optimizations as compared to original code are comparable, the performance improvement of data and instruction-level optimizations is significant. Note that the increase in energy consumption and the decrease in performance of Flash is due to the increase in code size with the algorithmic change in SubBandSynthesis procedure. The total improvement in system performance and energy consumption more than makes up for the degradation of Flash performance and energy consumption. Combined optimiza- tions give real- time performance for MP3 audio decode which is a primary constraint for this project. In addition, lower energy consumption enables longer battery life. Note that faster implementation that is also more energy efficient might imply higher power consumption, which can be an issue for thermal design of the device. In the case presented in this paper, it 323 | P a g e
  13. 13. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325Table VII selection of the best hardware configuration.PERFORMANCE FOR MP3 I have also developed a tool for profilingIMPLEMENTATIONS energy consump- tion of software in embedded systems. Profiling results enabled us to quickly and easily target the redesign the MP3 audio de- coder software. Our final MP3 audio decoder is fully compliant with the MPEG standard and runs in real time with low energy consumption. Using our design tools we have been able to in- crease performance by 92% while decreasing energy consump- tion by 77%. REFERENCES [1] Advanced RISC Machines Ltd (ARM), ARM Software Development Toolkit Version 2.11, 1996.Table VIII [2] G. Q. Maguire, M. Smith, and H. W. P. Beadle, ―SmartBadges: A wear- ableFIXED-POINT PRECISION AND computer and communication system,‖ inCOMPLIANCE Proc. 6th Int. Workshop Hardware/Software Codesign, 1998, Invited talk. [3] CoWare. CoWareN2c [Online]. Available: html [4] Mentor Graphics. [Online]. Available: [5] Synopsys. [Online]. Available: critical to get real-time performance with [6] Cadence. [Online]. Available:longer battery lifetime. The average and peak power constraints are met with our final [7] Y. Li and J. Henkel, ―A framework fordesign. estimating and minimizing energy The final MP3 audio decoder compliance to dissipation of embedded HW/SW systems,‖the MPEG stan- dard has been tested as a function in Proc. Design Automation Conf., 1998,of precision for fixed-point computation. We used pp. 188–193.the compliance test provided by the MPEG standard [8] N. Vijaykrishnan, M. Kandemir, M.[32], [33]. The range of RMS error between the Irwin, H. Kim, and W. Ye, ―Energy-samples defines the compliance level. Table VIII driven integrated hardware–softwareshows that results. Clearly, the larger number of optimizations using SimplePower,‖ inprecision bits results in better compliance. In our Proc. 27th Int. Symp. Computerfinal MP3 audio decoder we used 27 bits precision. Architecture, 2000, pp. 24–30.Using our design tools to guide the software [9] J. Flinn and M. Satyanarayanan,optimization process, we have been able to increase ―PowerScope:A tool for profiling theperformance by 92% while decreasing energy energy usage of mobile applications,‖ inconsumption by 77%, with full com- pliance to the Proc. 2nd IEEE Workshop Mo-bileMP3 audio decode standard. Computing Systems Applications, 1999, pp. 23–30.VIII. CONCLUSION [10] B. Kapoor, ―Low power memoryI developed a methodology for cycle-accurate architecutres for video applications,‖ insimulation of performance and energy consumption Proc. 8th Great Lakes Symp. VLSI, 1998,in embedded systems. Accuracy, modularity, and pp. 2–7.ease of integration with the instruc- tion-level [11] M. Lajolo, A. Raghunathan, and S. Dey,simulators widely used in industry make this method- ―Efficient power co-estimation techniquesology very applicable to the embedded system for SOC design,‖ in Proc. Design,hardware and soft- ware design exploration. Automation Test Europe Conf., 2000, pp.Simulation is found to be within 5% of the 27–34.hardware measurements for Dhrystone benchmark. [12] T. Givargis, F. Vahid, and J. Henkel, ―FastWe presented MPEG video decoder embedded system cache and bus power estima- tion fordesign explo- ration as an example of how our parameterized SOC design,‖ in Proc.methodology can be used in prac- tice to aid in the Design, Automation Test Europe Conf., 324 | P a g e
  14. 14. V.Prasanna Kumar / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 6, November- December 2012, pp.312-325 2000, pp. 333–339.[13] OZ Electronics Manufacturing. PCB Modeling Tools [Online]. Avail- able: url:[14] A. El Gamal and Z. A. Syed, ―A stochastic model for interconnections in custom integrated circuits,‖ IEEE Trans. Circuits Syst., vol. CAS-28, pp. 888–894, Sept. 1981.[15] T. Simunic, L. Benini, and G. De Micheli, ―Cycle-accurate simulation of energy consumption in embedded systems,‖ in Proc. Design Automation Conf., 1999, pp. 867–872.[16] T. Simunic, L. Benini, G. De Micheli, and M. Hans, ―Energy-efficient design of battery-powered embedded systems,‖ in Int. Symp. Low-Power Electronics Design, 1999, pp. 212–217.[17] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vanduoppelle, Custom Memory Management Methodology: Explo- ration of Memory Organization for Embedded Multimedia System De-sign. New York: Kluwer, 1998.[18] M. Pedram and Q. Wu, ―Battery-powered digital CMOS design,‖ in Proc. Design, Automation Test Europe Conf., 1999, pp. 17–23.[19] H. Mehta, R. M. Owens, M. J. Irvin, R. Chen, and D. Ghosh, ―Tech- niques for low energy software,‖ in Proc. Int. Symp. Low Power Elec- tronics Design, 1997, pp. 72–75.[20] M. Kandemir, N. Vijaykrishnan, M. Irwin, and W. Ye, ―Influence of compiler optimizations on system power,‖ in Proc. 27th Int. Symp. Com- puter Architecture, 2000, pp. 35–41. 325 | P a g e