HICPA: A Hybrid Low Power Adder for High-Performance Processors
Mohammad Hossein Hajkazemi and Amirali Baniasadi
Department of Electrical and Computer Engineering, University of Victoria, Victoria, Canada
email@example.com, firstname.lastname@example.org

Abstract: In this work we introduce a hybrid CLA-Ripple power-aware adder (or simply HICPA) for high-performance processors. HICPA is a multi-component adder that saves power by avoiding aggressive usage of the Carry Look-Ahead adder for add operations using small operands. Instead, for small operands, HICPA uses a small and power-efficient Ripple Adder. We evaluate HICPA using a subset of SPEC2K benchmarks and show that, after taking into account the associated power overhead, it is possible to reduce average ALU power dissipation by 15.7% without compromising performance.

I. INTRODUCTION

The adder is one of the most commonly used units in digital circuits. Many processor operations, including addition, subtraction and comparison, rely on the adder component to produce results. Moreover, almost all processor types, i.e., high-performance, embedded and DSP processors, use adders in their organization.

Our study on SPEC2K applications using the SimpleScalar 3.0 toolset [1] shows that, on average, 41% of the operations taking place in these applications use the adder unit. Accordingly, designing a power-aware adder can impact the overall power efficiency considerably.

Previous studies have introduced many adder implementations. The ripple carry adder (RCA) is among the most straightforward implementations of an adder. The RCA, while being simple, is very slow. Carry Look-Ahead (CLA) is a classic solution to this problem. The CLA adder relies on extra logic to produce carry signals fast and without the timing overhead associated with the ripple carry. This delay reduction, however, comes with extra hardware overhead, which in turn results in considerable power dissipation.

In this work we extend previous studies by introducing a low power hybrid adder. The proposed hybrid solution relies on a conventional CLA adder and a small power-efficient RCA. Our proposed solution takes into account application behavior and uses only one of the two underlying adders depending on the operand size. For small operands, a simple power-efficient RCA is as fast as a 32-bit CLA adder. Therefore, for this group of operands we save power without compromising performance by using the RCA instead of the CLA adder. We maintain the overall performance by exploiting the CLA adder for large operands.

We refer to our solution as the Hybrid CLA-Ripple Power-Aware Adder (HICPA). An accurate evaluation of HICPA requires taking into account the associated power and timing overhead. In this work we take into account this overhead and show that the benefits achieved outweigh the associated costs.

In summary, our contributions are:

• We introduce HICPA as a power-aware alternative for conventional adders. Our solution is built on the observation that add operations using small operands could be performed using a small power-efficient RCA without compromising performance.

• We show that by using HICPA we can reduce ALU power by 15.7% and 12.9% when 8-bit and 12-bit RCAs, respectively, are used in conjunction with a conventional CLA.

The rest of the paper is organized as follows. In Section 2 we discuss motivation. In Section 3 we introduce HICPA in more detail. In Section 4 we report methodology. In Section 5 we report results. In Section 6 we review related work. Finally, in Section 7 we offer concluding remarks.

II. MOTIVATION

HICPA relies on the observation that while modern processors rely on 32-bit adders to perform computations, not all add operations require 32 bits.

HICPA uses the above observation and saves power by using a small RCA for add operations using a small number of bits. As a result, for such operations the power-hungry CLA resources are not used. To provide better insight, in Fig. 1 we report the number of data bits necessary to use in add operations for the SPEC2K applications studied here (see the Methodology section for more details).

Figure 1. Bars from left to right report the frequency of add operations requiring six, eight, 12, 16 and 24 bits for the applications studied here (applu, crafty, galgel, gcc, mgrid and their average).
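The paper does not spell out how the number of bits required by an add operation is counted. A minimal sketch, assuming the same test that the Decoder of Section III applies (the upper m = 32 - n bits of each operand are all zero once the sign-extended bits of a negative operand are inverted), could look as follows. The names `fits_in`, `coverage` and the four-entry `trace` are hypothetical and only illustrate the idea; they are not taken from the authors' SimpleScalar setup.

```python
from collections import Counter

WIDTH = 32  # operand width assumed throughout the paper


def fits_in(value: int, n: int, width: int = WIDTH) -> bool:
    """True if the upper (width - n) bits of the operand carry no information.

    `value` is a raw `width`-bit pattern. For a negative number (MSB set)
    the bits are inverted first, so sign-extension ones count as zeros.
    """
    mask = (1 << width) - 1
    value &= mask
    if value >> (width - 1):      # negative: invert the sign-extended bits
        value = ~value & mask
    return (value >> n) == 0      # upper m = width - n bits are all zero


def coverage(adds, sizes=(6, 8, 12, 16, 24)):
    """Fraction of (a, b) add operations whose operands both fit in n bits."""
    hits = Counter()
    for a, b in adds:
        for n in sizes:
            if fits_in(a, n) and fits_in(b, n):
                hits[n] += 1
    return {n: hits[n] / len(adds) for n in sizes}


# Hypothetical trace of four add operations, written as 32-bit patterns.
trace = [(3, 5), (200, 17), (0xFFFFFFFF, 1), (70000, 2)]
print(coverage(trace))
# prints {6: 0.5, 8: 0.75, 12: 0.75, 16: 0.75, 24: 1.0}
```

Applied to a real instruction trace instead of the toy list, `coverage` would give per-size fractions of the kind plotted in Fig. 1, under the stated assumption about how operand size is measured.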
As reported, on average 25.9%, 85.9%, 94.1%, 96%, and 97.3% of add operations require less than six, eight, 12, 16 and 24 bits, respectively. Despite this observation, modern ALUs take a uniform approach and treat all add operations equally using a conventional 32-bit adder. This approach, while being efficient from the performance point of view, results in excessive power dissipation.

In Fig. 2(a) we report relative delay for different size RCAs compared to a CLA. As reported, the delay of the ripple adder increases with adder size. Meanwhile, ripple adders with less than 24 bits are faster than the conventional 32-bit CLA adder. The delays associated with RCAs using six, eight, 12, 24 and 32 bits are about 25%, 34%, 49%, 98% and 130% of the 32-bit CLA adder delay, respectively. Accordingly, replacing the CLA adder with ripple adders smaller than 24 bits would not compromise performance if the operation bit number does not exceed the ripple adder size.

In Fig. 2(b) we report relative power dissipation for a number of RCAs vs. the CLA adder. In the interest of space we only report the average power for the applications studied here. All RCAs show less power dissipation compared to the CLA adder. As reported, the power dissipation of RCAs using six, eight, 12, 24 and 32 bits is about 43%, 54%, 61%, 79% and 88% of that of a 32-bit CLA adder, respectively. We conclude that replacing a CLA adder with a smaller RCA can reduce power. Considering the results reported in Fig. 2(a), this could be done without compromising performance if the right RCA is picked according to the data operand size.

For example, Fig. 1 shows that for 94.1% of add operations (i.e., those using less than 12 bits) it is possible to save power by using a 12-bit ripple adder instead of a conventional CLA adder. This power reduction, in principle, does not impact performance, as the delay associated with a 12-bit ripple adder is less than that of a 32-bit CLA adder.

Figure 2. (a) Relative delay for different size RCAs compared to a conventional CLA adder. (b) Relative power dissipation for different size RCAs compared to a conventional CLA adder.

III. HICPA

HICPA performs an add operation using either the CLA adder or an n-bit RCA, where n is less than 24. This decision is made dynamically. For simplicity, we refer to an RCA using n bits as RCA-n.

As presented in Fig. 3, HICPA is composed of four main units: a 32-bit CLA adder, an n-bit RCA, a Decoder, and a Multiplexer.

Figure 3. HICPA.

HICPA aims at selecting the proper adder to perform the operation based on operand size. As soon as the appropriate adder is selected, the other adder is gated. This is done by vdd-gating [2] the idle adder. The Decoder selects the right adder. We present the two different Decoder implementations used in this work in Fig. 4.

The role of the Decoder is to find out if any of the upper m (where m = 32 - n) bits are one. Assuming positive numbers, a bit equal to one in the upper m bits means that the number is bigger than n bits and the conventional CLA adder should be used; otherwise we use the RCA. For negative numbers we invert the sign-extended bits before we make a decision.

Figure 4. (a) Sequential implementation. (b) Parallel implementation. While in the sequential implementation bits of operands enter the Decoder serially, in the parallel method they arrive in parallel.

In Fig. 4(a) we present the sequential implementation used for the Decoder unit. In this implementation the upper m bits of the operands enter the Decoder serially. As the CPU fetches
instructions we clear the Decoder (Latch) using the clear signal. Once the source operands are read, their upper m bits (2 * m bits) enter the Decoder serially. There are two possibilities for the first bit: a “0” or a “1”. Upon receiving a “0”, the output of the OR gate, the AND gate and then the Decoder will be “0”. Upon receiving a “1”, the outputs of both the OR and AND gates will be “1”. Consequently, the latch output becomes “1”. Any bit entering the Decoder after this point will result in a “1” at the output. This is due to the feedback connecting the output to the OR and AND gates. In other words, for each add operation, after we clear the Decoder output and so long as a “1” has not arrived at the Decoder input, the output remains “0”. As soon as a “1” enters the Decoder, the output toggles and does not change until the next add operation.

After receiving all 2 * m bits of both operands, the Decoder output is either “0” or “1”. An output equal to “0” indicates that none of the upper m bits of both operands is “1” and that we have a small add operation. Consequently, RCA-n is selected to execute the add operation. Otherwise we have at least one “1” in the upper m bits of an operand, indicating a long addition, and the 32-bit CLA is selected.

In Fig. 4(b) we present the parallel implementation, where all operand bits are available to the Decoder at the same time. The output of the Decoder is generated using an OR gate. The inputs to the OR gate include the upper m bits of the operands. Therefore, if one of the operands has a “1” in its upper m bits, the Decoder output is “1”. This indicates using the CLA adder.

The Decoder's output is connected to the enable signal of both adders and the select signal of the multiplexer. A Decoder output equal to “0” selects the RCA-n adder and transfers its result to the output. Otherwise, we select the CLA adder and its result.

As shown in Fig. 3, the multiplexer transfers either the result of the RCA or the CLA adder to the output. Since the size of each multiplexer input is 32 bits, the (32 - n) most significant bits of the second input (the output of RCA-n) are set to zero.

IV. METHODOLOGY

We use the SimpleScalar 3.0 toolset [1] with the ALPHA configuration to study application behavior and data operand size distribution. We use a subset of the SPEC2K benchmark suite [3] compiled for the ALPHA instruction set architecture using a modified GNU gcc compiler [4].

We use the Synopsys Design Compiler tool [5] to estimate adder delays and evaluate the power consumption of our designs. All our designs were implemented using Verilog HDL.

We use the CUB ASIC library in Mentor Graphics Leonardo to synthesize HICPA for the area overhead measurement. We report area in terms of gate count.

V. RESULTS

We report the power reduction achieved by using HICPA for the SPEC2K applications studied here and for both Decoder implementations. We simulated our design using TSMC 65-nm CMOS process technology.

A. Power Reduction

In Fig. 5(a) we report ALU power reduction for HICPA using RCA-6. As reported, on average ALU power reduction is about -8.7% and 1.1% using the parallel and sequential Decoder, respectively. Using RCA-6 does not seem to be a reasonable choice, as the number of add operations using 6-bit operands is too low to justify the overhead associated with HICPA. In Fig. 5(b) an RCA-8 is used under HICPA. We reduce ALU power by 7.5% and 15.7% compared to a CLA adder for the parallel and sequential Decoder, respectively. In Fig. 5(c) we use an RCA-12 in combination with the CLA adder under HICPA. ALU power reductions are about 6.3% and 12.9% for the parallel and sequential Decoder units, respectively, compared to the conventional CLA adder. Finally, we report ALU power reduction for RCA-24 combined with the CLA adder in Fig. 5(d). Power is increased by 3.9% and 1.9% when using the parallel (combinational) and sequential Decoders, respectively. Our study shows that the increase in the number of non-zero MUX inputs causes the negative savings for RCA-24.

Figure 5. HICPA ALU power reduction when the CLA adder is combined with (a) RCA-6, (b) RCA-8, (c) RCA-12 and (d) RCA-24. We report for both the parallel and sequential Decoder unit implementations.
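As a rough cross-check of these trends, the averages quoted in Section II can be combined into an idealized estimate. The sketch below is not the authors' evaluation method: it assumes the idle adder is perfectly gated, ignores the Decoder and multiplexer overhead, and considers only the adder rather than the whole ALU, so it can only overstate the savings. The function name `ideal_adder_power` is hypothetical.

```python
# Averages quoted in Section II, relative to a 32-bit CLA adder (= 1.0).
RCA_POWER = {6: 0.43, 8: 0.54, 12: 0.61, 24: 0.79}      # Fig. 2(b)
COVERAGE = {6: 0.259, 8: 0.859, 12: 0.941, 24: 0.973}   # Fig. 1 averages


def ideal_adder_power(n: int) -> float:
    """Expected adder power relative to the CLA, ignoring all HICPA overhead.

    Assumption: each add costs either the RCA-n power (operands fit in n
    bits) or the full CLA power (otherwise); the unused adder costs nothing.
    """
    f = COVERAGE[n]
    return f * RCA_POWER[n] + (1.0 - f) * 1.0


for n in sorted(RCA_POWER):
    saving = 1.0 - ideal_adder_power(n)
    print(f"RCA-{n}: ideal adder-only saving {saving:.1%}")
```

Even under these optimistic assumptions RCA-8 comes out ahead of RCA-12, which matches the ordering of the measured results. The idealized model still predicts a saving for RCA-6 and RCA-24, whereas the measurements show little or negative savings there; the difference is the overhead that the model leaves out.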
As Fig. 5 shows, using the sequential Decoder results in consistently better power reduction compared to using the parallel Decoder. Meanwhile, the parallel Decoder is a better candidate for situations where we face critical timing issues.

B. Overhead

HICPA comes with power and timing overhead. We take into account the power overhead in the results reported here. As for the timing overhead, we assume that the Decoder can select the proper underlying adder early enough and without imposing additional delay. This is a reasonable assumption, as source operands are available at decode time and at least two cycles before the ALU performs the operation.

HICPA comes with the area overhead of one decoder, one RCA-n and one multiplexer with two 32-bit inputs. Fig. 6 compares this area overhead to a 32-bit carry look-ahead adder. With fixed multiplexer size, increasing RCA size leads to a larger RCA and a smaller decoder. The area overhead of HICPA is 36%, 39% and 46% for 6-bit, 8-bit and 12-bit RCAs, respectively. For RCA-6 HICPA, 44% of the overhead belongs to the multiplexer.

Figure 6. Area overhead for different configurations of HICPA (RCA-6, RCA-8, RCA-12, RCA-16, RCA-24) relative to the baseline carry look-ahead adder (CLA-32), broken down into CLA, RCA, MUX and DEC.

VI. RELATED WORK

Having multiple processing units and selecting one while isolating the others has been used in many embedded processors, including the PowerPC 403GA [6].

Shalem introduced an energy-efficient full adder by adding a control gate to the full adder circuit [7]. In [8], a novel 1-bit full adder based on XOR, XNOR and pass-transistors is presented. This adder shows less power dissipation compared to conventional full adders for two reasons. First, it has no short-circuit power. Second, exploiting low circuit capacitance reduces dynamic power.

Wang proposed a power-aware PLA-style 8-bit CLA adder [9]. The introduced adder uses a simple data transition detection circuit to monitor and eliminate unnecessary input signals. Consequently, power is reduced.

As explained earlier in Section 1, the CLA adder was introduced to improve the performance of the RCA. Other high-performance adder implementations include the Ling adder [10], the Brent-Kung adder [11], the Kogge-Stone adder [12], the Naffziger adder [13] and the spanning-tree adder [14]. The last two adders utilize a hybrid of CLA and carry-select adders in their designs. While in this study we suggest combining the RCA with the CLA, our solution could be extended to other high-performance adders which use parallel prefix networks in their structures as well [15, 16]. A natural extension of our work is to use other high-performance adders combined with the power-efficient RCA.

VII. CONCLUSION

In this work we introduced HICPA as an alternative low power hybrid adder for high-performance adders. HICPA is a multi-component adder which selects the power-hungry CLA for large-operand add operations and a low power ripple adder for small-operand add operations. Our solution offers about 15% ALU power reduction without compromising performance.

ACKNOWLEDGMENT

The authors thank Mostafa Farahani and Ahmad Lashgar for their input.

REFERENCES

[1] D. Burger, T. M. Austin, S. Bennett, Evaluating Future Microprocessors: The SimpleScalar Tool Set. Technical Report, University of Wisconsin-Madison, New York (1996).
[2] M. D. Powell, S-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, “Gated-vdd: a circuit technique to reduce leakage in deep-submicron cache memories”, ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2000.
[3] Standard Performance Evaluation Corporation, http://www.spec.org.
[4] SimpleScalar compiler tools, the Active Pages project at UCDavis.
[5] Synopsys.com, http://www.synopsys.com/home.aspx.
[6] A. Correale Jr, H. T. Kung, “Overview of the Power Minimization Techniques Employed in the IBM PowerPC 4xx Embedded Controllers”, in Proceedings of the 1995 International Symposium on Low Power Design, pp. 75-80, 1995.
[7] R. Shalem, E. John, L. K. John, “A novel low power energy recovery full adder cell”, Proceedings of the Ninth Great Lakes Symposium on VLSI, 1999, pp. 380-383.
[8] M. Shams, M. A. Bayoumi, “A new full adder cell for low-power applications”, Proceedings of the 8th Great Lakes Symposium on VLSI, 1998, pp. 45-49.
[9] Ch. Ch. Wang, Ch. L. Lee, P. L. Liu, “Power-Aware Design of an 8-Bit Pipelining Asynchronous ANT-Based CLA Using Data Transition Detections”, Proceedings of the 2004 IEEE Asia-Pacific Conference on Circuits and Systems, Dec 2004, pp. 29-32.
[10] H. Ling, “High Speed Binary Parallel Adder”, IBM J., pp. 156-166, 1981.
[11] R. P. Brent, H. T. Kung, “A Regular Layout for Parallel Adder”, IEEE Transactions on Electronic Computers, pp. 260-264, 1982.
[12] P. M. Kogge, H. S. Stone, “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations”, IEEE Transactions on Electronic Computers, pp. 786-793.
[13] S. Naffziger, “A Sub-Nanosecond 0.5um 64b Adder Design”, Digest of Technical Papers, IEEE International Solid-State Circuits Conference, pp. 362-363, 1996.
[14] T. Lynch, E. Swartzlander, “A Spanning Tree Carry Lookahead Adder”, IEEE Transactions on Computers, pp. 931-939, 1992.
[15] D. Harris, “A taxonomy of parallel prefix adders”, Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, pp. 2213-2217.
[16] S. Knowles, “A family of Adders”, in Proceedings of the 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 30-34.