Ultra Low Energy vs Throughput Design Exploration of 65 nm Sub-VT CMOS Digital Filters

S. M. Yasser Sherazi, Joachim N. Rodrigues, Omer C. Akgun, Henrik Sjöland, and Peter Nilsson
Department of Electrical and Information Technology, Lund University
Box 118, SE-221 00 Lund, Sweden
Email: {yasser.sherazi, joachim.rodrigues, omercan.akgun, henrik.sjoland, peter.nilsson}@eit.lth.se

978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract—This paper presents an analysis of the energy dissipation of digital half-band filters operated in the sub-threshold (sub-VT) region under throughput constraints. The degradation of speed in the sub-VT domain is counteracted by unfolding the architectures. A filter is implemented as a basic 12-bit structure and in various unfolded structures. The designs are synthesized in a 65 nm low-leakage high-threshold CMOS technology. A sub-VT energy model is applied to characterize the designs in the sub-VT domain. The results from the energy model show that the unfolded-by-2 architecture is the most energy efficient, dissipating 22 % less energy than the original filter implementation at the energy minimum voltage. The unfolded-by-4 architecture, however, is the best for throughput requirements of around 120 ksamples/s to 1 Msamples/s, as it dissipates less energy than any other implementation in this speed range.

I. INTRODUCTION

Miniaturized devices are important in medicine, sensor networks, and many other applications. Engineers aim to develop ultra compact and low energy circuits that may be used in devices like hearing aids, medical implants, and remote sensors. There is currently a major interest in small wireless devices with ultra low energy dissipation targeting on-body applications or medical implants. In such devices, minimal energy dissipation in active and standby mode is of highest importance because it makes the battery last longer, and it is non-trivial to change or charge a battery in a medical implant. Devices like hearing aids that communicate between the two ears to improve binaural hearing may benefit from energy efficient wireless receivers. Another example is a neural sensor inside the body that communicates with a robotic arm or leg. If a radio is made sufficiently small and with minimal power consumption, there will be vast possibilities for new applications.

In the conducted project the design constraints are a power consumption of less than 1 mW in active mode and less than 1 µW in standby mode, the capacity to handle data rates up to 250 kbits/s, and realization on a single chip with an area of 1 mm^2 in 65 nm CMOS. A block diagram of the receiver system is shown in Fig. 1. It contains an RF front-end (2.5 GHz), an analog-to-digital converter, a digital baseband for demodulation and control, and finally, an analog decoder that processes the received data packets.

Fig. 1. Receiver system.

The main focus of this paper is on the digital baseband part of the receiver system. The first task of the digital baseband circuit is to re-sample data from 4 Msamples/s to 250 ksamples/s. Therefore, a chain of decimation filters needs to be applied. To achieve lower energy dissipation, we employ voltage scaling rigorously, so that the designed circuits run in the sub-threshold (sub-VT) domain [1]. When operating in the sub-VT domain, leakage currents have to be dealt with, as they are the source of energy dissipation in idle CMOS [2]. This current is an important design constraint, especially in implantable medical devices. Consequently, we need to optimize the circuits in terms of energy dissipation and throughput for sub-VT operation.

In Sec. II we briefly present the applied sub-VT energy model. In Sec. III we present a 12-bit architecture of a Half Band Digital (HBD) filter that is implemented as a direct-mapped structure and as various unfolded structures. In Sec. IV the results attained from the HBD filters are shown and discussed, and finally, the conclusions are presented in Sec. V.

II. SUB-VT ENERGY MODEL

The current of a MOS transistor does not drop to zero when the gate-to-source voltage V_GS is equal to or below the threshold voltage V_T, i.e., V_GS ≤ V_T. The remaining current is a leakage current, commonly referred to as sub-VT or weak inversion conduction [3]. This current is low in amperage, and in the sub-VT domain it is used as the operating switching current. The drawback of sub-VT circuits is a speed penalty. However, circuits that operate at sub-VT manage to satisfy ultra low energy requirements, since orders of magnitude less energy is dissipated compared to super-threshold circuits [3]. The total energy dissipation of static CMOS digital circuits is typically modelled as

$$E_{total} = \underbrace{\alpha C_{tot} V_{DD}^{2}}_{E_{dyn}} + \underbrace{I_{leak} V_{DD} T_{clk}}_{E_{leak}} + \underbrace{I_{peak} t_{sc} V_{DD}}_{E_{sc}}, \tag{1}$$

where E_dyn is the average switching energy and E_leak is the leakage energy dissipated during a clock cycle T_clk. The energy dissipation due to short-circuit currents (E_sc) in the sub-VT domain is known to be minor compared to the overall
energy dissipation, and is therefore neglected [1]. In (1), E_dyn during one clock period is proportional to the switching activity factor (α) and the total switched capacitance of the circuit (C_tot).

The model used to calculate energy dissipation delivers SPICE-accurate results [4]. This model calculates the total energy dissipation by (2), and the key parameters required are obtained during synthesis and high-level simulations.

$$E_{T} = C_{inv} V_{DD}^{2} \left[ \mu_{e} k_{cap} + k_{crit} k_{leak} e^{-V_{DD}/(n U_{t})} \right], \tag{2}$$

where k_leak is the average leakage scaling factor of the circuit, normalized to the average leakage current of a single inverter. The scaling factor k_cap is the total capacitance of the circuit, normalized to the capacitance of a single inverter. The coefficient k_crit measures the critical path delay of the circuit in terms of single inverter delays. The average switching activity of the circuit per N sample operations is µ_e. The process-dependent slope factor is n, and U_t is the thermal voltage, which is 26 mV at 300 K. For more details the reader is referred to [4].

III. FILTER ARCHITECTURES

Minimum energy dissipation combined with a medium to high throughput requirement puts stringent constraints on a design. Therefore, it is important to explore and analyse the architectures that best fulfill the requirements. This section presents the HBD filter and the architectural differences between the basic and unfolded versions.

A. Half Band Digital Filter

An optimized third-order filter structure is evaluated for minimum energy dissipation. The filter structure for the parallel implementation, see Fig. 2(a), is a parallel third-order bi-reciprocal lattice wave digital filter [5], considered highly suitable as a decimator or interpolator for sample rate conversions with a factor of two. The benefit of using this type of filter is that all filtering may be performed at the lower sample rate with low arithmetic complexity, yielding both low energy dissipation and a low chip area [6]. The transfer function of the proposed filter is

$$H(z) = \frac{1 + 2z^{-1} + 2z^{-2} + z^{-3}}{2 + z^{-2}}, \tag{3}$$

All the filter coefficients are 1 or 2 and may be implemented by simple shifts, thereby saving area and energy dissipation.

Fig. 2. Half Band Digital Filter. (a) single HBD filter (b) uf-2 HBD filter.

Fig. 3. Unfolded Architectures of the HBD filter. (a) uf-4 HBD filter (b) uf-8 HBD filter.

An initial analysis indicates that the required throughput would not be achieved by a single-sample implementation of this filter. Therefore, unfolding was applied. Unfolding is a transformation technique that calculates j samples per clock cycle, where j is the unfolding factor. Unfolding has the property of preserving the number of delays in a data flow graph (DFG) [7]. The basic HBD filter architecture was unfolded to get three more structures, i.e., unfolded by 2 (uf-2), unfolded by 4 (uf-4), and unfolded by 8 (uf-8). In all unfolded architectures the number of registers remains unchanged, whereas the number of adders scales with the unfolding factor. Fig. 2(b) shows the uf-2 version of the filter. The critical path of this circuit is equal to that of the original HBD filter structure. Fig. 3(a) shows an architecture that was unfolded by a factor of 4. The number of adders has increased according to the unfolding factor. The critical path has increased, since two of the feedback paths do not contain a register. Similarly, Fig. 3(b) shows the architecture of the uf-8 HBD filter, in which the number of adders has increased by a factor of 8 compared to the original HBD structure. The critical path increases further, since six of the feedback paths do not contain any register. However, more samples are processed per clock cycle in
the unfolded structures, and this gain in throughput outweighs the limited increase in the critical path [8].

B. Hardware Mapping

All the cells used for implementation are from a low-leakage high-threshold (LL-HVT) standard cell library. Tight synthesis constraints were set to get minimum area and a short critical path. The parameters for the energy model were retrieved by gate-level simulations with back-annotated toggle and timing information, which includes glitches. The parameters obtained were applied to the energy model to characterize the designs in the sub-VT domain.

IV. SIMULATION RESULT

In this section the architectures of the filter are evaluated with respect to energy and throughput. The parameters required for the energy model [4], extracted during synthesis and energy simulations as discussed in Sec. II, are presented in Table I. The values for k_leak follow the area cost, indicating that leakage is proportional to area. The k parameters for the unfolded implementations are not proportional to the unfolding factor j, since the number of internal registers remains unchanged from the basic implementation, although there is an increase in the number of input and output registers.

TABLE I
EXTRACTED PARAMETERS FOR THE SYNTHESIZED IMPLEMENTATIONS

Arch.   k_leak   k_cap    k_crit   µ_e     Area   t_p [ns]
par     1113.6   835.4    127.4    0.727   1124   2.84
uf-2    1695.5   1375.7   127.4    0.708   1836   2.84
uf-4    3172.5   2797.9   164.2    0.703   3275   3.66
uf-8    5924.5   5422.3   232.2    0.890   6170   5.22

Energy dissipation is calculated under the assumption that the designs operate at critical path speed, which gives an Energy Minimum Voltage (EMV) point [9]. The threshold voltage for this LL-HVT device is around 430 mV. The energy per clock cycle of the designs over a scaled supply voltage V_DD is presented in Fig. 4(a). It shows that the basic HBD filter implementation, denoted by par, dissipates the least energy per clock cycle of the four implementations. The reason is that the leakage of this circuit is lower than that of the other circuits, thanks to its smaller area. The energy minimum per clock cycle of the par implementation, 45.5 fJ, is reached at around 241 mV (indicated by the dot). This is lower than the energy minimum of any other architecture, which confirms that a smaller area contributes to less energy per clock cycle. However, it is crucial to investigate the energy spent on processing each sample of data, and the apparent benefit of the par structure is lost when the energy per operation or energy per sample is considered. Fig. 4(b) shows the energy dissipation per sample for the different structures.

The unfolded circuits perform two, four, and eight times as many operations per clock cycle, so their overall energy per sample is reduced compared to the single-sample implementation. Fig. 4(b) shows that the most efficient architecture is uf-2, as it dissipates 35.8 fJ per sample, which is roughly 22 % less than the energy dissipated by the par structure. We may also observe that the uf-8 architecture is less energy efficient than par, even in energy dissipation per sample, at lower voltages, and is almost equal to par near the threshold voltage. The reason for this behaviour is that uf-8 has a higher switching activity µ_e. The maximum frequency attainable with respect to V_DD is shown in Fig. 4(c). The maximum frequency of both par and uf-2 is always higher than that of their counterparts due to a shorter critical path, and uf-8 has the lowest maximum speed because of its longer critical path, see Table I. Fig. 4(d) shows the energy dissipation of all the structures with respect to throughput.

Table II presents the characteristics of all the presented architectures at EMV, showing the maximum frequencies attainable, the corresponding throughputs, and the energy dissipated per clock cycle as well as per sample. These simulations show that we benefit from the unfolding technique, both in energy per sample and in throughput.

TABLE II
CHARACTERIZATION OF THE IMPLEMENTATIONS AT EMV

Arch.   EMV [mV]   Freq. [kHz]   Throughput [ksamples/s]   E/Cyc [fJ]   E/smp [fJ]
par     241        23.6          23.6                      45           45
uf-2    238        23.6          47.2                      71           35
uf-4    247        22.0          88.0                      150          38
uf-8    251        15.4          123.4                     380          48

TABLE III
PERFORMANCES OF THE IMPLEMENTATIONS AT REQUIRED THROUGHPUTS

Throughput       Circuit   V_DD [mV]   E/Cyc [fJ]   E/smp [fJ]
2 Msamples/s     uf-8      390         656          82.2
1 Msamples/s     uf-8      368         586          73.3
                 uf-4      376         246          61.5
                 uf-2      400         136          68.3
500 ksamples/s   uf-8      344         525          65.2
                 uf-4      352         226          54.7
                 uf-2      368         116          58.4
                 par       400         85.2         85.2
250 ksamples/s   uf-8      300         434          55.0
                 uf-4      320         188          47.0
                 uf-2      344         126          51.8
                 par       368         72.9         72.9

In the project discussed in Sec. I, we need a chain of four HBD filters that reduces the data rate from the 4 Msamples/s delivered by the ADC to the actual data rate of 250 ksamples/s. The first HBD filter must process the input data stream at a rate of 2 Msamples/s. This throughput requirement is only fulfilled by the uf-8 HBD filter near 390 mV, as shown in Table III and Fig. 4(d). The throughput requirement of 1 Msamples/s for the second HBD filter is fulfilled by any of the three unfolded structures, uf-8, uf-4, and uf-2. The throughput requirement of 500 ksamples/s for the third HBD filter is fulfilled by all four structures, as shown in Table III and Fig. 4(d). The throughput requirement of 250 ksamples/s for the last HBD filter is again fulfilled by all structures. In Fig. 4(b),
the uf-2 structure appears to be the most energy efficient circuit. However, when stringent throughput requirements are in place, the uf-4 structure proves to be the best option, as shown in Fig. 4(d) and Table III. This analysis shows that it is crucial to identify the most suitable architectures for the given throughput and energy requirements. Furthermore, in [10] it is argued that low-leakage low-threshold cells are more beneficial at higher throughput rates in the sub-VT domain, which needs to be further investigated for these filter implementations.

Fig. 4. Simulation plots of the HBD filter architectures. (a) Energy vs V_DD per clock cycle. (b) Energy vs V_DD per sample. (c) Frequency vs V_DD. (d) Energy vs throughput.

In [1] it was shown that the supply voltage of sub-VT circuits may be reduced down to 50 mV. In practical terms, however, functional failures frequently occur at such low voltages due to process variations. It was found in [11] that the supply voltage which realizes operation with a failure rate below 0.001 for a 65 nm LL-HVT process is 250 mV. This value is taken as the minimum reliable operating voltage (ROV), indicated in Fig. 4(b) by a line at 250 mV. The simulations show that for the required throughputs we are operating safely above ROV, see Table III.

V. CONCLUSION

In this paper four HBD filter structures are evaluated for minimum energy dissipation in the sub-VT domain for a throughput-constrained system. All architectures, i.e., the unfolded by 2, 4, and 8 structures and the basic HBD filter, are implemented and simulated using 65 nm LL-HVT standard cells. The application of a sub-VT energy model reveals that it is beneficial to use an unfolded implementation to achieve low energy dissipation per sample at EMV, compared to the energy dissipated by a basic HBD filter implementation.

ACKNOWLEDGMENT

The authors would like to thank the Swedish Foundation for Strategic Research (SSF) for funding the Wireless Communication for Ultra Portable Devices projects at Lund University.

REFERENCES

[1] E. Vittoz, Low-Power Electronics Design. CRC Press, 2004, ch. 16.
[2] P. van der Meer, Low-Power Deep Sub-Micron CMOS Logic. Kluwer Academic Publishers, 2006.
[3] H. Soeleman et al., "Robust subthreshold logic for ultra-low power operation," IEEE T-VLSI Systems, vol. 9, pp. 90–99, Feb. 2001.
[4] O. C. Akgun and Y. Leblebici, "Energy efficiency comparison of asynchronous and synchronous circuits operating in the sub-threshold regime," Journal of Low Power Electronics, vol. 4, Oct. 2008.
[5] P. Nilsson and M. Torkelson, "Method to save silicon area by increasing the filter order," in Electronic Letters. ACM, NY, USA, 1995.
[6] H. Ohlsson et al., "Arithmetic transformations for increased maximal sample rate of bit-parallel bireciprocal lattice wave digital filters," in ISCAS, 2001.
[7] K. K. Parhi, VLSI Digital Signal Processing Systems, 1999, ch. 5.
[8] P. Åstrom, P. Nilsson, et al., "Power reduction in custom CMOS digital filter structures," AICSP Journal, vol. 18, pp. 97–105, 1998.
[9] J. Rodrigues et al., "A <1 pJ Sub-VT cardiac event detector in 65 nm LL-HVT CMOS," VLSI-SOC, 2010.
[10] D. Markovic, J. M. Rabaey, et al., "Ultralow-power design in near-threshold region," Proceedings of the IEEE, 2010.
[11] J. Rodrigues et al., "Energy dissipation reduction of a cardiac event detector in the sub-Vt domain by architectural folding," PATMOS, 2009.
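
As an illustration of the energy model of Sec. II, the following Python sketch evaluates Eq. (2) with the parameters of Table I and scans V_DD for the energy minimum voltage. The paper does not state the slope factor n or the inverter capacitance C_inv, so both are handled as assumptions here: n = 1.5 is an illustrative value only, and the energy is reported normalized to C_inv, which means that only the location of the minimum and ratios between architectures are meaningful. The names TABLE_I, normalized_energy and find_emv are hypothetical helpers, not part of the original work.

```python
import math

U_T = 0.026      # thermal voltage at 300 K [V], as stated in Sec. II
N_SLOPE = 1.5    # ASSUMPTION: slope factor n is not given in the paper

# Parameters copied from Table I; j is the unfolding factor (samples per cycle).
TABLE_I = {
    "par":  {"k_leak": 1113.6, "k_cap": 835.4,  "k_crit": 127.4, "mu_e": 0.727, "j": 1},
    "uf-2": {"k_leak": 1695.5, "k_cap": 1375.7, "k_crit": 127.4, "mu_e": 0.708, "j": 2},
    "uf-4": {"k_leak": 3172.5, "k_cap": 2797.9, "k_crit": 164.2, "mu_e": 0.703, "j": 4},
    "uf-8": {"k_leak": 5924.5, "k_cap": 5422.3, "k_crit": 232.2, "mu_e": 0.890, "j": 8},
}


def normalized_energy(vdd, p, n=N_SLOPE):
    """Energy per clock cycle from Eq. (2), divided by C_inv (unknown here)."""
    dynamic = p["mu_e"] * p["k_cap"]
    leakage = p["k_crit"] * p["k_leak"] * math.exp(-vdd / (n * U_T))
    return vdd ** 2 * (dynamic + leakage)


def find_emv(p, lo=0.10, hi=0.45, step=0.0005):
    """Grid search for the supply voltage [V] that minimizes Eq. (2)."""
    steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(steps + 1)]
    return min(grid, key=lambda v: normalized_energy(v, p))


# Energy minimum voltage and normalized energy per sample at that voltage.
EMV = {name: find_emv(p) for name, p in TABLE_I.items()}
E_PER_SAMPLE = {
    name: normalized_energy(EMV[name], p) / p["j"] for name, p in TABLE_I.items()
}

if __name__ == "__main__":
    for name in TABLE_I:
        print(f"{name}: EMV = {EMV[name] * 1e3:.1f} mV, "
              f"E/sample / C_inv = {E_PER_SAMPLE[name]:.1f}")
```

Because the minimum of Eq. (2) occurs at a fixed value of V_DD/(n U_t) for a given set of k parameters, a different choice of n shifts all computed minima proportionally; the ordering of the architectures does not depend on it.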
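
The half-band behaviour of the transfer function in Eq. (3), Sec. III-A, can be checked numerically. The sketch below evaluates H(z) on the unit circle at three frequencies; the function name hbd_response is a hypothetical helper, and the code only evaluates the published formula without modelling the 12-bit hardware.

```python
import cmath
import math


def hbd_response(omega):
    """Evaluate Eq. (3) at z = exp(j*omega), omega in rad/sample."""
    z_inv = cmath.exp(-1j * omega)
    numerator = 1 + 2 * z_inv + 2 * z_inv ** 2 + z_inv ** 3
    denominator = 2 + z_inv ** 2
    return numerator / denominator


DC_GAIN = abs(hbd_response(0.0))                # passband edge, omega = 0
QUARTER_GAIN = abs(hbd_response(math.pi / 2))   # a quarter of the sample rate
NYQUIST_GAIN = abs(hbd_response(math.pi))       # half of the sample rate

if __name__ == "__main__":
    print(f"|H| at DC:      {DC_GAIN:.4f}")
    print(f"|H| at fs/4:    {QUARTER_GAIN:.4f}")
    print(f"|H| at fs/2:    {NYQUIST_GAIN:.2e}")
    print(f"fs/4 rel. DC:   {20 * math.log10(QUARTER_GAIN / DC_GAIN):.2f} dB")
```

The magnitude is 2 at DC, drops by a factor of sqrt(2) (about 3 dB) at a quarter of the sample rate, and has a zero at half the sample rate, which is the response expected of a half-band low-pass filter used ahead of decimation by two.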
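
The architecture choice discussed in Sec. IV can be reproduced directly from Table III. The following sketch picks, for each required throughput, the listed implementation with the lowest energy per sample and derives the clock frequency it would need, assuming that an architecture unfolded by j processes j samples per clock cycle as described in Sec. III. The dictionary layout and the function best_architecture are hypothetical; the numbers are the E/smp column of Table III.

```python
# Energy per sample [fJ] from Table III, keyed by required throughput [samples/s].
TABLE_III = {
    2_000_000: {"uf-8": 82.2},
    1_000_000: {"uf-8": 73.3, "uf-4": 61.5, "uf-2": 68.3},
    500_000:   {"uf-8": 65.2, "uf-4": 54.7, "uf-2": 58.4, "par": 85.2},
    250_000:   {"uf-8": 55.0, "uf-4": 47.0, "uf-2": 51.8, "par": 72.9},
}

# Unfolding factor j: samples processed per clock cycle.
UNFOLDING = {"par": 1, "uf-2": 2, "uf-4": 4, "uf-8": 8}


def best_architecture(throughput):
    """Return (name, energy per sample [fJ], required clock [Hz])."""
    candidates = TABLE_III[throughput]
    name = min(candidates, key=candidates.get)
    clock_hz = throughput / UNFOLDING[name]
    return name, candidates[name], clock_hz


CHOICES = {rate: best_architecture(rate) for rate in TABLE_III}

if __name__ == "__main__":
    for rate, (name, e_smp, clock_hz) in CHOICES.items():
        print(f"{rate / 1e3:7.0f} ksamples/s -> {name} "
              f"({e_smp} fJ/sample at {clock_hz / 1e3:.1f} kHz clock)")
```

With the values of Table III, uf-8 is selected only at 2 Msamples/s, where it is the sole option, and uf-4 is selected at the three lower rates, in line with the conclusion that uf-4 is the best choice under stringent throughput requirements.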