IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 1547
Low-Swing Differential ...
1548 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012
Fig. 1. LS-DCCFF.
Fig. ...
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 1549
Fig. 3. Modification to ...
1550 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012
Fig. 7. delay versus se...
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 1551
chip. The authors would...
Upcoming SlideShare
Loading in …5
×

Low-Swing Differential Conditional Capturing Flip-Flop for LC Resonant Clock Distribution Networks

503 views
423 views

Published on

For more projects or your own idea contact us @ www.nanocdac.com

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
503
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Low-Swing Differential Conditional Capturing Flip-Flop for LC Resonant Clock Distribution Networks

  1. 1. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 1547 Low-Swing Differential Conditional Capturing Flip-Flop for LC Resonant Clock Distribution Networks Seyed E. Esmaeili, Asim J. Al-Kahlili, and Glenn E. R. Cowan Abstract—In this paper we introduce a new flip-flop for use in a low- swing LC resonant clocking scheme. The proposed low-swing differential conditional capturing flip-flop (LS-DCCFF) operates with a low-swing si- nusoidal clock through the utilization of reduced swing inverters at the clock port. The functionality of the proposed flip-flop was verified at ex- treme corners through simulations with parasitics extracted from layout. The LS-DCCFF enables 6.5% reduction in power compared to the full- swing flip-flop with 19% area overhead. In addition, a frequency depen- dent delay associated with driving pulsed flip-flops with a low-swing sinu- soidal clock has been characterized. The LS-DCCFF has 870 ps longer data to output delay as compared to the full-swing flip-flop at the same setup time for a 100 MHz sinusoidal clock. The functionality of the proposed flip-flop was tested and verified by using the LS-DCCFF in a dual-mode multiply and accumulate (MAC) unit fabricated in TSMC 90-nm CMOS technology. Low-swing resonant clocking achieved around 5.8% reduction in total power with 5.7% area overhead for the MAC. Index Terms—Delay, flip-flop, low-swing, power, resonant clock. I. INTRODUCTION The clock distribution network (CDN) in digital integrated circuits distributes the clock signal which acts as a timing reference control- ling data flow within the system. Since the clock signal has the highest capacitance and operates at high frequencies, the CDN consumes a large amount of total power in synchronous systems. Approximately 30%–50% of high performance processor power is consumed in the CDN [1]. The CDN and latches dissipate around 70% of the IBM POWER4 1.3 GHz microprocessor’s power [2]. Latest developments in integrated circuit design specifically in 3-D integration where multi- plane synchronization is required, lead us to believe that the power con- sumption of the CDN will remain at these high levels [3]. Resonant clocking enables the generation of clock signals with re- duced power consumption. The traditional approach for LC resonant CDNs is to use the LC tank to drive the global clock distribution while the local square clock is being delivered through conventional buffers. However, around 66% of clock power is being dissipated in the last buffer stage driving the flip-flops [4], leading to minor power savings in LC globally-resonant locally-square CDNs. In order to achieve max- imum power savings, the LC tank should drive the entire clock network (both global and local) without using intermediate buffers. This would require designing, modifying and understanding flip-flop performance with the sinusoidal clock signal generated in LC resonant networks. C. Kim et al. [5] demonstrated that a low-swing square-wave clock double-edge triggered flip-flop has enabled 78% power savings in the CDN. Low-swing clocking would normally require two voltage levels, VDD and VDD0Low. These voltage levels can be generated using one of two schemes: 1) dual-supply voltages and 2) regular power supply. The first scheme adds circuit and extra area complexity to the overall Manuscript received March 04, 2011; revised May 20, 2011; accepted May 28, 2011. Date of publication June 30, 2011; date of current version June 14, 2012. The authors are with the Department of Electrical and Computer Engineering, Concordia University, Montreal, QC H3G 1M8, Canada (e-mail: s_esma@ece. concordia.ca). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2011.2158613 chip design and layout. However, it leads to a reduction in the number of clock network transistors which improves power savings [6]. The second scheme uses circuit methods to achieve low-swing. However, the design of low-swing buffers becomes challenging in the absence of a second power supply [6]. We have followed a similar approach to the one proposed in [4] in which the clock buffers are removed to allow the global and local clock energy to resonate between the inductor and entire clock capac- itance including the receiving end flip-flops thus enabling maximum power savings. In addition, removing the clock buffers simplifies LC low-swing clocking since only reduced swing buffers are used at the flip-flop gate and not in intermediate levels within the clock tree [7]. In this paper, we introduce a low-swing differential conditional cap- turing flip-flop (LS-DCCFF) for use in low-swing LC resonant CDNs. As far as the authors know, this is the first application of low-swing clocking to LC resonant CDNs. In our approach, no additional power supply is required to achieve low-swing clocking. We have character- ized a frequency dependent delay associated with driving the pulsed flip-flop with a low-swing sinusoidal clock. We also provide measure- ment results from an integrated test chip fabricated in TSMC 90-nm CMOS technology. Furthermore, we report the power savings achiev- able through low-swing clocking. The remainder of this paper is organized as follows. A description of the proposed LS-DCCFF and a characterization of the delay associated with LC low-swing clocking are presented in Section II. Section III de- scribes the test chip. Section IV includes simulation and measurement results obtained. The conclusion of this paper is provided in Section V. II. LOW-SWING LC RESONANT CLOCKING A. LS-DCCFF Fig. 1 shows the proposed LS-DCCFF. Conditional capturing is used to minimize power at low data switching activities by eliminating re- dundant internal transitions [8]. As shown in Fig. 1, reduced swing in- verters similar to the one presented in [7] are used at the node fed by the low-swing sinusoidal clock signal. This is done to reduce short circuit power. The load pMOS transistor in the reduced swing inverters is al- ways in saturation since Vgs = Vds. It lowers the voltage at the source of the second pMOS in each inverter to approximately VDD 0 jVtpj thus turning it off when the low-swing sinusoidal clock signal reaches its peak voltage. The peak voltage for the low-swing clock was chosen to be equal to 0.65 V since VDD =1 V and the threshold voltage of the pMOS transistor is approximately 00.34 V. From here on and for simplicity the term LS and FS refer to low- swing and full-swing, respectively. B. Delay Associated With Low-Swing LC Resonant Clocking Vpull down presented in Fig. 2 is the voltage level at which transistor MN1 with the resonant clock signal applied to its gate (see Fig. 1) is able to pull down node SET/RESET to the low voltage level required to trigger the NAND latch. Due to the time difference between the low- and full-swing sinusoidal clock signals to reach Vpull down, the low-swing flip-flop experiences longer data to output delay (TDQ) as compared to the full-swing flip-flop for the same setup time (TDCLK). In the following, an analysis is conducted to estimate the delay in reaching Vpull down for the low-swing resonant clock signal. Let the full- and low-swing clock signals be given by the following equations: v(t)full swing = 1 2VDD sin 2ft 0 2 + 1 2VDD (1) v(t)low swing = 0:65 2 VDD sin 2ft 0 2 + 0:65 2 VDD (2) 1063-8210/$26.00 © 2011 IEEE
  2. 2. 1548 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 Fig. 1. LS-DCCFF. Fig. 2. Delay between the low- and full-swing resonant clock signals to reach . where f is the clock frequency, VDD and 0:65VDD are the peak voltage for the full- and low-swing sinusoidal clock signals, respectively. Depending on the input state, either node SET or RESET is pulled down to trigger the NAND latch when the clock signal reaches Vpull down. Substituting this value in (1) and referring to Fig. 2 Vpull down = 1 2 VDD sin 2fT1 0 2 + 1 2 VDD (3) from which T1 = 1 2f sin01 2Vpull down VDD 01 + 2 : (4) Using the same approach for the low-swing clock signal T2 = 1 2f sin01 2Vpull down 0:65VDD 01 + 2 : (5) The time difference between the two clock signals to reach Vpull down, which defines the TDQ delay between the low- and full-swing flip-flops, is given by TDQ = T2 0T1 = 1 2f sin01 2Vpull down 0:65VDD 01 0sin01 2Vpull down VDD 01 : (6) Equation (6) gives the delay between the full- and low-swing flip- flops. It illustrates that this delay is inversely proportional to clock fre- quency, i.e., at higher frequencies, the delay decreases. C. Power Following the approach proposed by the authors in [9] and [10], the power dissipation of the resonant clock network is given by the fol- lowing equation: Presonant clock = Rclk 2 (fVpeak(Cclk + N CFF ))2 (7) where Rclk and Cclk are the clock capacitance and resistance as seen by the driver, f and Vpeak are the frequency and peak voltage of the generated clock signal, CFF is the loading capacitance of the flip-flop, N is the number of flip-flops, and is the factor by which the loading capacitance of the flip-flop connected at the clock leaves is reflected to the driver side. Equation (7) illustrates that generating a low-swing clock signal with Vpreak = 0:65VDD results in around 58% power reduction in the clock network. III. TEST CHIP To demonstrate the correct operation of the proposed LS-DCCFF and to highlight potential power savings enabled through low-swing clocking, a test chip with a MAC unit designed using the proposed flip-flop under low-swing sinusoidal clocking was fabricated in TSMC 90-nm CMOS technology. Since the 16 216-bit multiplier itself was not pipelined, a clock frequency of 100 MHz was chosen for the test chip. Due to the large inductor needed for clock generation and the limited area available, the clock generator was not implemented on-chip. The sinusoidal clock signal is fed by an external source through an analog pad. Furthermore, the DCCFF was modified to enable dual-mode operation of the MAC unit under full- and low-swing clocking without significant area overhead. As illustrated in Fig. 3, the LS-DCCFF presented in Fig. 1 was modified at node X to allow the operation under full- and low-swing clocking. When signal FULL_SWING is high, full-swing clocking is enabled and the inverted clock output of the normal inverters CLKD_FS is feeding transistor MN1. Whereas low-swing clocking is enabled when signal FULL_SWING is low and the output of the reduced voltage swing inverters CLKD_LS feeds transistor MN1. A simplified floorplan and a die photo of our chip are shown in Figs. 4 and 5, respectively. The chip covers an area of 1 mm 2 1 mm. Two separate instances of the full- and low-swing DCCFF were im- plemented at the lower portion of the chip for testing. Due to the large capacitance associated with the pads, all outputs were connected to the pads through a buffer stage consisting of four progressively sized in- verters (see Fig. 4). IV. TEST CHIP EXTRACTED SIMULATION AND MEASUREMENTS Fig. 6 demonstrates the correct operation of the LS-DCCFF at a supply voltage of 1 V with an operating frequency of 100 MHz and
  3. 3. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 1549 Fig. 3. Modification to enable full- and low-swing flip-flop clocking. Fig. 4. Simplified floorplan of the test chip. Fig. 5. Die photopgraph of the test chip. a low-swing sinusoidal clock. This figure shows the low-swing sinu- soidal clock signal (channel 1, first signal from top), the inverted clock Fig. 6. Measurement waveforms of the LS-DCCFF at 100 MHz. signal (channel 2, second from top), the input D(channel 3, third from top), and the output Q (channel 4). HSPICE simulation on extracted circuits verifies correct function- ality of both flip-flops under best conditions given by the Fast-Fast (FF) corner at a low temperature of 025 C, normal conditions given by the Typical-Typical (TT) corner at room temperature, and worst conditions given by the Slow-Slow (SS) corner at a high temperature of 125 C. An average reduction in the TDQ delay of 130 ps was observed in the FF corner whereas the SS corner resulted in 76 ps increase in delay compared to the TT corner. Furthermore, correct functionality of both flip-flops under low- and full-swing sinusoidal clocking with 610% variation in the supply voltage was verified through measurements. A 610% variation in low-swing clock results in 99 ps reduction and 300 ps increase in Hold time. Hold time reduces with higher clock swing due to a reduction in the delay of reduced swing inverters. The oppo- site is true for lower clock swing. The negligible variation in TDQ delay with clock swing is basically due to the change in the time required for the clock signal to reach Vpull down. Post-layout-simulation results presented in Fig. 7 illustrate that for the same setup time, the difference between the TDQ delays for the full- and low-swing flip-flops is approximately 870 ps. This confirms the accuracy of (6) with an error of 4% compared to simulation results for Vpull down =500 mV. The measurement results presented in the figure (limited by our experimental setup)1 are within close proximity to post- layout-simulation. The extra delay associated with measurements can be related to the extra capacitance of the pads, package, wires, and test fixture.The response presented in Fig. 7 was obtained at the 0:5VDD voltage level for the data D and output Qwaveforms and at half of the clock peak for the sinusoidal clock signals, i.e., at 0.5 and 0.325 V for the full- and low-swing clock signals, respectively. The behavior of the current flowing through node X, i.e., I_FS for the full-swing flip-flop and I_LS for the low-swing flip-flop in Fig. 1, as well as the voltage level of nodes SET, Q, and QB for the two flip-flops for the same setup time of 950 ps is illustrated in Fig. 8. As shown in the figure, the maximum current flowing from node X to ground in the full- and low-swing flip-flops occurs when the full swing clock CLK_FS and low-swing clock signal CLK_LS at the gate of transistor MN1 reaches Vpull down = 500 mV. At this point, node SET is pulled down and the output Q of the NAND latch is pulled up to VDD. When 1The function generator used for testing is AFG3101 which can generate si- nusoidal and square signals with a maximum frequency of 100 and 50 MHz, respectively. Since the generator has only one channel, the Reference Out of one generator was connected to the Reference In of the second generator for synchronization.
  4. 4. 1550 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 Fig. 7. delay versus setup time for the full- and low-swing flip-flops: (a) FS-DCCFF and (b) LS-DCCFF. Fig. 8. for the full- and low-swing flip-flops at the same setup time—ex- tracted simulation. QB is grounded, transistor MN3 turns off, thus cutting the flow of the current. As illustrated in Fig. 7, TDQ becomes independent from TDCLK when data is applied on or after the point where the clock signal reaches or exceeds Vpull down since at this point transistors MN1/MN2 are completely switched on and are able to directly sink node SET or RESET. This occurs in the full-swing case for a setup time less than or equal to 0 ps, i.e., when input D is applied at or after point T1 in (4). In the low-swing case, TDQ becomes independent from TDCLK when input D is applied at or after point T2 in (5), i.e., at a setup time TABLE I AREA AND POWER COMPARISON BETWEEN FULL- AND LOW-SWING CLOCKING less than or equal to -905 ps which is the time difference between the 0:325VDD and Vpull down for the low-swing sinusoidal clock signal. Fig. 7 also shows that the low-swing flip-flop can operate at a negative setup time of approximately -2000 ps whereas the full-swing flip-flop can only operate at a negative setup time of approximately 0950 ps. This is because the reduced swing inverters in the low-swing flip-flop experience more delay than the normal inverters used in the full-swing flip-flop. Table I gives area and power overhead of low-swing as compared to full-swing resonant clocking. The area and power were estimated at the gate level. As shown in the table, the LS-DCCFF experiences 6.5% reduction in power as compared to the full-swing flip-flop with an area overhead of 19%. Static power consumption in the full-and low-swing flip-flops was assumed to be equal since the flip-flops have exactly the same transistor size except for the load pMOS in the reduced swing inverters. The table also illustrates that the application of low-swing clocking with the LS-DCCFFs causes a 5.7% increase in total area and 5.8% reduction in total power consumption. The clock distribution network capacitance was estimated by using Cadence’s Calibre PEX extractor and then simulating the extracted netlist in Cadence’s HSPICE simulator. Simulation results on the ex- tracted network show that the clock net has a total capacitance of 8.48 pF. The inductor needed to resonate the clock network at 100 MHz is approximately 0.32 H. Such a large inductor would normally be connected off-chip. In [9], the authors proposed a detailed analytical approach to determine required driver strength in the resonant clock based on the clock capacitance, resistance, output swing, and the pulse width of the applied reference signals. Using the approach proposed in [9], we illustrate that in order to resonate the clock tree at 100 MHz with a full-swing sinusoidal clock, the width of the pMOS transistor in the driver would be approximately 3.97 m. To generate the low-swing clock signal with a reduced peak voltage of 0:65VDD, the size of the transistor in the clock generator can be reduced by 66%. The scheme proposed in [11] in which a power management system based on auto- matic amplitude control can be used to insure the integrity of the gen- erated clock with the desired peak voltage. V. CONCLUSION We have proposed a low-swing sinusoidally clocked flip-flop to ob- tain further power reduction in LC resonant CDNs. Low-swing reso- nant clocking in pulsed flip-flops results in a delayed flip-flop response. Theoretical analysis has been performed and the delay associated with low-swing sinusoidal clocking was characterized. The functionality of the proposed flip-flop has been investigated through HSPICE simulation on an extracted circuit layout at ex- treme corners and tested through on-chip measurements. A MAC unit designed using the proposed flip-flop was tested on-chip where low-swing resonant clocking achieves around 5.8% reduction in total power with a 5.7% area overhead. ACKNOWLEDGMENT The authors are grateful to CMC for facilitating chip fabrication and to Prof. Y. R. Shayan for his help in using the Wireless Lab to test the
  5. 5. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 8, AUGUST 2012 1551 chip. The authors would also like to thank Eng. T. S. Obuchowicz for his help in running the TSMC 90-nm design kit before the submission deadline. REFERENCES [1] S. D. Naffziger and G. Hammond, “The implementation of the next- generation 64 b Itanium™ microprocessor,” in Dig. Tech. Papers, IEEE Int. Solid-State Circuits Conf., 2002, pp. 344–472. [2] C. J. Anderson, J. Petrovick, J. M. Keaty, J. Warnock, G. Nussbaum, J. M. Tendier, C. Carter, S. Chu, J. Clabes, J. Dilullo, P. Dudley, P. Harvey, B. Krauter, J. LeBlanc, L. Pong-Fei, B. McCredie, G. Plum, P. J. Restle, S. Runyon, M. Scheuermann, S. Schmidt, J. Wagoner, R. Weiss, S. Weitzel, and B. Zoric, “Physical design of a fourth-generation POWER GHz microprocessor,” in Dig. Tech. Papers, IEEE Int. Solid- State Circuits Conf., 2001, pp. 232–233. [3] V. F. Pavlidis, I. Savidis, and E. G. Friedman, “Clock distribution net- works in 3-D integrated systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 10.1109/TVLSI.2010.2073724 . [4] A. J. Drake, K. J. Nowka, T. Y. Nguyen, J. L. Burns, and R. B. Brown, “Resonant clocking using distributed parasitic capacitance,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1520–1528, Sep. 2004. [5] C. Kim and S. M. Kang, “A low-swing clock double-edge triggered flip-flop,” in Proc. Symp. VLSI Circuits, 2001, pp. 183–186. [6] F. H. A. Asgari and M. Sachdev, “A low-power reduced swing global clocking methodology,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 538–545, May 2004. [7] J. Pangjun and S. S. Sapatnekar, “Low-power clock distribution using multiple voltages and reduced swings,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 3, pp. 309–318, Jun. 2002. [8] H. Mahmoodi, V. Tirumalashetty, M. Cooke, and K. Roy, “Ultra low- power clocking scheme using energy recovery and clock gating,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 1, pp. 33–44, Jan. 2009. [9] S. E. Esmaeili, A. J. Al-Khalili, and G. E. R. Cowan, “Estimating re- quired driver strength in the resonant clock generator,” in Proc. IEEE Asia Pacific Conf. Circuits Syst., 2010, pp. 927–930. [10] S. E. Esmaeili, A. J. Al-Khalili, and G. E. R. Cowan, “Dual-edge triggered sense amplifier flip-flop for resonant clock distribution networks,” IET Comput. Digit. Tech., vol. 4, no. 6, pp. 499–514, Nov. 2010. [11] Z. Xu and K. L. Shepard, “Design and analysis of actively-deskewed resonant clock network,” IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 558–568, Feb. 2009.

×