The document describes an ultra-low power asynchronous logic in-situ self-adaptive VDD system for wireless sensor networks. The proposed system uses quasi-delay-insensitive asynchronous logic implemented with pre-charged static logic circuits. It features a self-adaptive VDD scaling system that dynamically adjusts the supply voltage based on processing requirements to minimize power consumption while operating robustly in the sub-threshold voltage region. The system design includes an asynchronous filter bank module powered by the adjustable VDD rail and a power management module that monitors circuit delays to determine the optimal VDD setting.
Ultra-Low Power Asynchronous Logic Wireless Sensor Network
1. Prepared and Presented By:
Hossam Hassan
MSIS LAB, CBNU
An Ultra-Low Power Asynchronous-Logic In-Situ Self-Adaptive VDD System for Wireless Sensor Networks
Authors: Tong Lin, Kwen-Siong Chong, Joseph S. Chang, and Bah-Hwee Gwee
Journal: IEEE Journal of Solid-State Circuits, vol. 48, no. 2, 2013
2. Outline
• Preliminaries
• Wireless Sensor Network
• Node Architecture
• Proposed Idea for Low Power Design
• Self-Adaptive VDD System for Wireless Sensor Networks
• Adaptive Vdd Scaling Systems
• System Design
• Results And Benchmarking
3. Preliminaries
• What Is Asynchronous Logic?
• The traditional way of sequencing computation is to use a
global time reference (“the clock”)
• Can we compute without a clock?
• Yes!: “asynchronous” or “clockless” logic
• Also “self-timed” or “speed-independent”
• Asynchronous system: collection of modules communicating
by handshake protocols
• Can we compute without a clock and without delay
assumptions?
• Quasi-delay-insensitive (QDI) logic
Adopted from:
Alain J. Martin, California Institute of Technology
4. Preliminaries
• Why Asynchronous and QDI Logic?
• No clock
• Up to 50% of clock power recovered
• Automatic shut-off of idle parts
• Perfect clock gating
• No glitches (spurious transitions)
• Up to 50% of power in combinational circuits
• Automatic adaptation to parameter variations
• Voltage scaling: Perfect exchange of delay against energy through voltage scaling
• Flexibility of asynchronous interfaces:
• Better use of concurrency
• Robustness to PVT Variations: Variations of physical parameters all affect timing.
Adopted from:
Alain J. Martin, California Institute of Technology
5. Preliminaries
• Disadvantages of Async
• Size overhead (more transistors) (i.e. Handshaking)
• Poorly understood and rarely taught
• No industrial CAD tools (yet) (i.e. Custom Design)
• No well-developed testing procedure (yet) (i.e. Custom Design)
7. Preliminaries
• NULL Convention Logic
• NCL is a delay-insensitive (DI) asynchronous (i.e., clockless) paradigm, which means that NCL
circuits will operate correctly regardless of when circuit inputs become available; NCL circuits
are therefore said to be correct-by-construction (i.e., no timing analysis is necessary for
correct operation). NCL circuits utilize dual-rail or quad-rail logic to achieve
delay-insensitivity.
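The dual-rail encoding mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the common convention in which each bit is carried on a (true rail, false rail) pair and the all-zero pair means "no data yet" (NULL); the function names `encode` and `decode` are hypothetical and not taken from the NCL literature.

```python
def encode(bit):
    """Encode True/False (or None for 'no data') as a (true_rail, false_rail) pair."""
    if bit is None:
        return (0, 0)            # NULL: no data present yet
    return (1, 0) if bit else (0, 1)


def decode(rails):
    """Decode a dual-rail pair; returns None while the pair is still NULL."""
    t, f = rails
    if (t, f) == (1, 1):
        raise ValueError("illegal dual-rail state: both rails asserted")
    if (t, f) == (0, 0):
        return None
    return bool(t)


if __name__ == "__main__":
    for value in (None, False, True):
        print(value, "->", encode(value))
```

Because a receiver can tell NULL from valid data by looking at the rails alone, it does not need a clock edge to know when an input has arrived.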
8. Preliminaries
• Pre-Charge Static Logic (PCSL):
• It is an asynchronous-logic Quasi-Delay-Insensitive architecture
based on Static-Logic, featuring full-range Dynamic Voltage Scaling
including robust operation in the sub-threshold voltage regime,
while simultaneously offering low hardware overheads, high speed and
low power dissipation.
• The PCSL logic circuit achieves this by integration of the Request
sub-circuit into the Static-Logic cell.
• During the initial phase, the output of Static-Logic cell (within the
PCSL logic circuit) is pre-charged.
• During the evaluate phase, the Static-Logic cell evaluates its inputs
and the PCSL logic circuit outputs the result.
[Figure: Pre-Charged Static-Logic (PCSL) architecture. Labels: circuit enable; state retention (i.e., stores the logic output value)]
9. Preliminaries
• Muller C-elements:
• It is a small digital block widely used in design of asynchronous circuits and systems.
• In a synchronous circuit, the role of the clock is to define points in time where signals are stable and valid.
Between clock ticks, signals may exhibit hazards and may make multiple transitions as the combinational circuit
stabilizes.
• In an asynchronous system, the situation is different. The absence of a clock means signals must be valid all the time:
every transition has a meaning, and consequently hazards and races must be avoided.
[Figure: Muller C-element and corresponding CMOS implementation; truth table for the Muller C-element]
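The behavior summarized by the C-element truth table can be sketched as a small state-holding model. This is a behavioral illustration only (not the CMOS implementation): the output follows the inputs when they agree and holds its previous value when they differ. The class name `MullerC` is chosen here for illustration.

```python
class MullerC:
    """Behavioral model of a 2-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # Output changes only when both inputs agree; otherwise it is held.
        if a == b:
            self.out = a
        return self.out


if __name__ == "__main__":
    c = MullerC()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(f"a={a} b={b} -> out={c.update(a, b)}")
```

The hold behavior is what lets a C-element signal "both events have happened", which a plain AND or OR gate cannot do on both rising and falling transitions.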
10. Preliminaries
• Filter bank
• In signal processing, a filter bank is an array of band-pass filters that separates the input
signal into multiple components, each one carrying a single frequency sub-band of the
original signal.
• The process of decomposition performed by the filter bank is called analysis (meaning
analysis of the signal in terms of its components in each sub-band); the output of analysis is
referred to as a sub-band signal with as many sub-bands as there are filters in the filter bank.
• The reconstruction process is called synthesis, meaning reconstitution of a complete signal
resulting from the filtering process.
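As a toy illustration of analysis and synthesis, the sketch below splits a signal into two sub-bands and reconstructs it. It is an assumption-laden simplification: it uses a two-tap averaging (low-pass) and differencing (high-pass) pair rather than an array of band-pass filters, it does not decimate, and it is unrelated to the FRM filter bank of the paper.

```python
def analysis(x):
    """Split x into a low-frequency and a high-frequency sub-band signal."""
    low, high = [], []
    prev = 0.0
    for sample in x:
        low.append((sample + prev) / 2)    # two-tap average: low-pass
        high.append((sample - prev) / 2)   # two-tap difference: high-pass
        prev = sample
    return low, high


def synthesis(low, high):
    """Reconstruct the signal by summing the sub-band signals."""
    return [l + h for l, h in zip(low, high)]


if __name__ == "__main__":
    signal = [1, 2, 3, 4]
    low, high = analysis(signal)
    print(low, high, synthesis(low, high))
```

For this particular filter pair the two sub-bands sum back to the input sample by sample, so synthesis recovers the original signal.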
11. Preliminaries
• Frequency Response Masking (FRM):
• Frequency-response masking is a technique for designing sharp low-pass, high-pass,
bandpass and band-stop filters with arbitrary passband bandwidth.
• Furthermore, the resulting filters are linear-phase FIR filters, which offer
guaranteed stability and freedom from phase distortion.
• However, the problem with FIR filters is their high complexity when the filter is sharp.
• With the frequency-response masking technique, the resulting filter has very sparse
coefficients.
• Since only a very small fraction of its coefficient values are nonzero, its complexity is very
much lower than that of the infinite word-length minimax optimum filter.
• With an additional multiplier-less design method, the complexity is reduced further.
• In linear-phase FIR filters, phase is a linear function of frequency.
• They have a symmetric impulse response.
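The sparsity and symmetry noted above can be seen in a small sketch. In FRM and IFIR designs, a prototype filter H(z) is replaced by H(z^M), which inserts M-1 zeros between consecutive taps. The coefficient values and the factor M below are hypothetical, chosen only to show the effect.

```python
def interpolate_coefficients(h, m):
    """Insert m-1 zeros between consecutive taps, i.e. H(z) -> H(z^m)."""
    out = []
    for i, coeff in enumerate(h):
        out.append(coeff)
        if i < len(h) - 1:
            out.extend([0.0] * (m - 1))
    return out


def is_symmetric(h):
    """A symmetric impulse response implies linear phase for an FIR filter."""
    return h == h[::-1]


if __name__ == "__main__":
    prototype = [1.0, 2.0, 3.0, 2.0, 1.0]      # hypothetical symmetric prototype
    sparse = interpolate_coefficients(prototype, 4)
    nonzero = sum(1 for c in sparse if c != 0)
    print(len(sparse), "taps,", nonzero, "nonzero, symmetric:", is_symmetric(sparse))
```

The interpolated filter is much longer (and has a correspondingly sharper transition band) while needing no more multiplications than the prototype, and it keeps the symmetric impulse response.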
12. Preliminaries
• Dynamic frequency scaling
• It is a technique in computer architecture whereby the frequency of a microprocessor can be
automatically adjusted "on the fly", either to conserve power or to reduce the amount of
heat generated by the chip.
• It is commonly used in laptops and other mobile devices, where energy comes from a battery
and thus is limited.
• Dynamic voltage scaling:
• It is another power conservation technique that is often used in conjunction with frequency
scaling, as the frequency that a chip may run at is related to the operating voltage.
• Since higher power use may raise the temperature, increases in voltage or frequency
may raise system power demands further than the voltage or frequency change alone would suggest.
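The benefit of voltage scaling can be made concrete with the standard CMOS switching-power relation, P = a·C·V²·f. The sketch below is a first-order illustration only: it ignores leakage, which matters in sub-threshold operation, and the capacitance, frequency and activity values are hypothetical.

```python
def switching_power(c_farads, vdd_volts, freq_hz, activity=1.0):
    """First-order CMOS switching power: activity * C * VDD^2 * f (leakage ignored)."""
    return activity * c_farads * vdd_volts ** 2 * freq_hz


if __name__ == "__main__":
    c, f = 1e-12, 1e6                          # hypothetical 1 pF switched at 1 MHz
    p_nominal = switching_power(c, 1.2, f)
    p_half = switching_power(c, 0.6, f)
    print(f"ratio at half VDD: {p_half / p_nominal:.2f}")
```

Because the voltage term is squared, halving VDD at a fixed frequency cuts switching power to about a quarter, which is why lowering the supply is so effective when the required speed is modest.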
14. Wireless Sensor Network
• Spatially distributed autonomous sensors
• Monitor physical or environmental conditions
• Temperature, sound, etc.
• Pass their data through the network to a main location
• Modern networks are bi-directional, also enabling control of sensor activity
• Applications
• Battlefield surveillance
• Industrial process monitoring
15. Wireless Sensor Network
• The WSN is built of “nodes”
• a few to several hundreds or even thousands
• each node is connected to one (or sometimes several) sensors
• Each such sensor network node has typically several parts
• a radio transceiver
• a microcontroller
• an electronic circuit for interfacing with the sensors
• an energy source, usually a battery
• As the WSN is typically designed for a multiple-year operational life-span,
power is carefully budgeted and, where pertinent, modules are energized only when
required, such that the overall average power is typically 10–100 µW.
• Achieve the lowest possible power operation for the prevailing throughput
and circuit conditions—VDD adjusted to within 50 mV of the minimum
voltage, yet high operational robustness with minimal overheads for a WSN.
17. Proposed Idea for Low Power Design
• Signal processor accounts for ~50% of total power
consumption
• ‘Sub-threshold Self-Adaptive VDD Scaling’ (SSAVS)
• Circuits work in sub-threshold region
• Supply voltage is adjusted dynamically depending on
the processing speed required by external environment
• Adoption of the Quasi-Delay-Insensitive (QDI)
asynchronous-logic protocols, where the circuits
therein are self-timed
• Embodiment of Subthreshold Pre-Charged-Static-
Logic (PCSL) design approach.
• Async SSAVS system has been benchmarked against
its conventional sync DVFS system counterpart.
18. Proposed Idea for Low Power Design
• Asynchronous logic implementation
• Pre-charged Static Logic (PCSL)
• Superior to existing asynchronous logic styles in energy, delay and chip area.
19. Self-Adaptive VDD System for Wireless
Sensor Networks
• As the WSN is typically designed for a multiple-year operational life-span, power is carefully
budgeted and, where pertinent, modules are energized only when required, such that the overall average
power is typically 10–100 µW.
• In our WSN depicted in Fig. 1, the overall active/passive operation ratio is approximately 20/80. In
the passive mode, only the Sensor Front-End module is continuously energized. The Sensor and
the Conditioning Circuits therein are powered directly by the VDD_BAT (2.8 V) battery, via a
Low-Dropout (LDO) Regulator.
• The Simple Processor is powered by VDD_NOM (1.2 V) via a power-efficient Buck DC-DC
Converter.
• The Simple Processor ascertains if the input is possibly useful, and if it is, the WSN goes into
active mode where it signals the Power Management module to energize the Signal Processor
module via VDD_ADJ .
20. Self-Adaptive VDD System for Wireless
Sensor Networks
• The voltage of VDD_ADJ, typically in the sub-threshold voltage (sub-Vt) range, is self-adjusted
such that the lowest possible voltage is used—to enable ultra-low power operation.
• Signal Processor Module:
• The Signal Processor module buffers (via a FIFO) the output of the Simple Processor and filters this
signal before final computation by the Microcontroller Unit (MCU).
• When the MCU ascertains that the filtered signal is useful, the Wireless Transceiver is energized and the
processed signal is subsequently transmitted wirelessly.
• With the wireless transmission expected to be 0.01% active and with a 20/80 WSN active/passive
operation, 50% of the overall power is attributed to the Signal Processor module, which is of interest in
terms of power dissipation.
21. Self-Adaptive VDD System for Wireless
Sensor Networks
• The approaches taken to minimize power involve all levels of the design space including
algorithmic design and at the hardware level.
• Frequency Response Masking (FRM) technique
• In the algorithmic design, the filtering in the Signal Processor module embodies the Frequency
Response Masking (FRM) technique.
• This involves the Interpolated Finite Impulse Response (IFIR) Filter and the FRM Filter Bank (FB), and is
computationally more efficient than the usual FIR and IIR filter approaches.
• Among ultra-low power design techniques at the hardware level, operation in the sub-Vt region is one
of the most effective.
• This is particularly applicable because the speed of the digital circuits in the Signal Processor is
modest—the clocking speed ranges from 1.4 kHz to 1.4 MHz for a sampling rate range from 0.1
kSamples/s (kS/s) to 100 kS/s.
22. Self-Adaptive VDD System for Wireless
Sensor Networks
• Despite the potential advantages of sub-Vt operation, this region of operation is challenging here
for several reasons.
• First, the WSN is designed to work in a wide range of conditions, including extreme environments
(−55 °C to +125 °C).
• Second, Process, Voltage and Temperature (PVT) variations for fine-dimensioned CMOS processes
increase dramatically in sub-Vt operation, and the ensuing delay variations are very severe, possibly
intractable. Typically, a very large delay safety margin (for synchronous-logic (sync) circuits) would need
to be allowed for.
• Third, the input signal to the Signal Processor module is variable. From a robust operation perspective,
the circuits would need to be designed to meet the worst-case conditions— the fastest input rate and
extreme temperatures.
• To design the WSN for ultra-low power operation, a self-adjusting VDD approach operating
in the sub-Vt region is adopted. It is termed ‘Sub-threshold Self-Adaptive VDD Scaling’ (SSAVS), and
VDD is dynamically self-adjusted in situ.
23. Self-Adaptive VDD System for Wireless
Sensor Networks
• The operation involves ‘dialing up’ VDD when the need for computation increases or when the
operating conditions are less favorable, and VDD is ‘dialed-down’ when the conditions are the
converse.
• Put simply, the lowest VDD is used where possible because in general the lower the VDD, the lower is
the power dissipation due to dynamic and leakage currents.
• The novel self-adjustment is obtained very simply—by exploiting (and comparing) the existing
Request and Acknowledge signals of the QDI protocol signaling, and thereafter adjusting the
VDD_ADJ accordingly. The ensuing overhead is hence very low.
24. Adaptive Vdd Scaling Systems
• The general modality of adaptive VDD scaling systems to reduce power is to adaptively adjust VDD to as
low as possible (with appropriate timing margin) to meet the throughput requirement for the
prevailing operating conditions (including PVT variations).
• This largely requires the pertinent circuit delay variations to be tracked, observed, or inferred.
• There are many reported techniques, but it can be argued that these reported tracked, observed
and inferred techniques are inadequate in terms of robustness, particularly in sub-Vt operation.
Further, the hardware/computation overheads are considerable, including the need to scale VDD
with the scaling of the clock frequency, i.e. Dynamic Voltage Frequency Scaling (DVFS).
• The proposed idea is to directly measure the delay and compare it against the throughput for the
prevailing conditions; VDD is thereafter adjusted accordingly.
• To enable this, self-timed async QDI logic is adopted, where the dual-rail encoding includes the
Request signal, which indicates that the input sample is ready, and the Acknowledge signal, which indicates
the completion of the computation.
25. Adaptive Vdd Scaling Systems
• By counting the number of Requests against Acknowledges within a given period, we ascertain if
the delay of the circuit is excessive, or otherwise, with respect to the throughput for the
prevailing conditions.
• VDD is thereafter adjusted accordingly such that the delay is just slightly less than the delay between
input samples, thereby satisfying the throughput.
• Further, as is inherent in QDI async protocols, the computation is uninterrupted
while VDD is transitioning during its self-adjustment; in reported adaptive scaling systems, circuit
operation typically ceases when VDD is transitioning.
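The counting rule described above can be sketched as a single decision function. This is a minimal illustration, not the authors' controller: the function name is hypothetical, and treating "more Acknowledges than Requests" (the circuit catching up on buffered inputs) the same as "equal counts" is an assumption of this sketch.

```python
def vdd_adjustment(requests, acknowledges):
    """Return +1 to raise VDD by one step, -1 to lower it by one step.

    requests:      input samples that arrived during the period
    acknowledges:  computations completed during the period
    """
    if requests > acknowledges:
        return +1    # circuit fell behind the input rate: delay is excessive
    return -1        # circuit kept up (or is catching up): try a lower VDD


if __name__ == "__main__":
    print(vdd_adjustment(100, 100))   # kept up
    print(vdd_adjustment(100, 97))    # fell behind
```

Since the Request and Acknowledge signals already exist for handshaking, the only extra hardware this rule implies is a pair of counters and a comparator.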
26. System Design
• Fig. 2 depicts the proposed SSAVS system
within the Power Management module
embodying the SSAVS Controller and its
associated adjustable VDD means (a Buck
DC-DC Converter), and the PCSL-based 8x8-
Bit Quad-Channel Async QDI FRM FB within
the FRM FB.
• There are two voltage rails in the overall
proposed SSAVS system: a fixed VDD_NOM
and a variable VDD_ADJ whose sub-Vt
voltage typically ranges from 150 mV to
400 mV.
• For ease of illustration, the specific VDD rail
is shown in parenthesis for the supply rails
and for signals of the various modules.
27. System Design
• In Fig. 2, the voltages of the Input and Request signals are first adjusted from VDD_NOM = 1.2 V to
VDD_ADJ by the Step-Down Level Converter, and the signals are thereafter buffered by the Async FIFO
Buffer (depth of 50) before input (Input_FB and Req_FB) to the async FRM FB.
• The FB outputs (Output 1–4) and their associated Acknowledges (combined from Ack 1–4 via the
Completion Detection Circuit) are output to the MCU for further processing.
• Acknowledge is also fed back to the Async FIFO Buffer.
• The Request and Acknowledge signals are input to the Power Management module, and
Acknowledge is stepped up from VDD_ADJ to VDD_NOM.
• The SSAVS Controller within the Power Management module monitors the number of Request
and Acknowledge signals in each period (a 10 Hz clock generated by the Update VDD Clock
Generator for a target throughput of 1 kS/s).
• The VDD_Code is a 5-bit code that sets one of 24 voltage levels (in the Buck DC-DC Converter)
ranging from ‘00000’=50 mV to ‘10111’=1.2 V (in 50 mV steps) for VDD_ADJ.
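The code-to-voltage mapping stated above (24 levels, 50 mV steps, ‘00000’ = 50 mV up to ‘10111’ = 1.2 V) can be written as a one-line formula. The function names below are hypothetical; only the end points and step size come from the slide.

```python
MAX_VDD_CODE = 0b10111   # 23, the highest of the 24 valid codes
STEP_MV = 50


def vdd_code_to_mv(code):
    """Map a 5-bit VDD_Code to the VDD_ADJ level in millivolts."""
    if not 0 <= code <= MAX_VDD_CODE:
        raise ValueError(f"VDD_Code out of range: {code}")
    return STEP_MV * (code + 1)


def mv_to_vdd_code(millivolts):
    """Inverse mapping for voltages that lie exactly on a 50 mV step."""
    if millivolts % STEP_MV != 0:
        raise ValueError("voltage is not a multiple of 50 mV")
    code = millivolts // STEP_MV - 1
    if not 0 <= code <= MAX_VDD_CODE:
        raise ValueError(f"voltage out of range: {millivolts} mV")
    return code


if __name__ == "__main__":
    for code in (0b00000, 0b00010, 0b00011, 0b10110, 0b10111):
        print(f"{code:05b} -> {vdd_code_to_mv(code)} mV")
```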
28. System Design
• Fig. 3 graphically depicts an example of the self-adjustment of VDD_ADJ.
• When the WSN is first initiated, the SSAVS Controller outputs VDD_Code = ‘10111’, equivalently
VDD_ADJ = 1.2 V, and the speed of the FB would far exceed the required computation.
• The voltage of VDD_ADJ of the FB is in-situ adaptively self-adjusted to be as low as possible
(within 50 mV) to meet the throughput for the prevailing operating conditions, and on average,
the voltage of VDD_ADJ is slightly higher than the actual required minimum.
• Hence, the FB is ultra-low power and highly power-efficient.
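The self-adjustment can be illustrated with a toy simulation. Everything about the circuit model here is an assumption made for illustration: the filter bank is taken to keep up only when VDD_Code is at or above a hypothetical `min_code`, it falls short by a fixed hypothetical number of samples otherwise, and any backlog is drained in the first period that is fast enough. The 100 samples per period follow from a 1 kS/s throughput checked at 10 Hz.

```python
def simulate(min_code=3, periods=30, samples_per_period=100, shortfall=5):
    """Return the VDD_Code used in each update period of a toy SSAVS loop."""
    code = 0b10111            # start-up value, i.e. 1.2 V
    backlog = 0               # unconsumed inputs waiting in the FIFO
    trace = []
    for _ in range(periods):
        requests = samples_per_period
        if code >= min_code:
            acks = requests + backlog        # fast enough: also drains the FIFO
            backlog = 0
        else:
            acks = requests - shortfall      # too slow: some inputs stay buffered
            backlog += shortfall
        trace.append(code)
        if requests > acks:
            code = min(code + 1, 0b10111)    # fell behind: raise VDD one step
        else:
            code = max(code - 1, 0)          # kept up: lower VDD one step
    return trace


if __name__ == "__main__":
    print(simulate())
```

In this model the code ramps down from 23 and then alternates between the lowest adequate code and the one just below it, so the average supply sits within one 50 mV step of the minimum.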
29. System Design
• In view of the need for sub-Vt operation, it is imperative to adopt circuits based on the static-logic
family to mitigate the effects of critical transistor sizing; dynamic- and pass-logic families are
inappropriate.
• ‘Pre-Charged Static-Logic’ (PCSL).
• The basic architecture comprises an Inverting Static-Logic Cell, three transistors (for output pre-charging
during the reset phase/evaluation during the computation phase), and two inverters (for output
buffering). The outputs are Q.T (Output True) and Q.F (Output False).
The basic architecture of the proposed async cells, coined ‘Pre-Charged Static-Logic’ (PCSL).
30. System Design
• In PCSL cells, when Request is ‘0’, both outputs are ‘0’. On the other hand, when Request is ‘1’
(indicating that an operation is ready) and when the input signals are valid, the operation
commences and an ensuing output is obtained.
• The architecture of the PCSL cell involves an integration of the sub-circuit associated with the
Request signal and a buffer (to each output) into the standard static-logic library cell (redesigned for
dual-rail async), thereby sharing (common) transistors.
• This reduces the number of transistors, resulting in simultaneous lower power/energy dissipation,
faster speed and smaller IC area.
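The input/output behavior of a PCSL cell described above can be sketched for a 2-input AND/NAND gate. This is a behavioral model only, not the transistor-level cell: inputs are assumed to be dual-rail (true, false) pairs, and waiting for both inputs to be valid before producing an output is a simplification adopted for this sketch.

```python
def pcsl_and_nand(request, a, b):
    """Behavioral model of a dual-rail AND/NAND cell; returns (Q.T, Q.F)."""
    if request == 0:
        return (0, 0)                      # reset phase: both outputs are '0'
    a_t, a_f = a
    b_t, b_f = b
    inputs_valid = (a_t ^ a_f) and (b_t ^ b_f)
    if not inputs_valid:
        return (0, 0)                      # wait until both inputs carry data
    q_t = a_t & b_t                        # AND result on the true rail
    q_f = a_f | b_f                        # NAND result on the false rail
    return (q_t, q_f)


if __name__ == "__main__":
    one, zero = (1, 0), (0, 1)
    print(pcsl_and_nand(0, one, one))
    print(pcsl_and_nand(1, one, one))
    print(pcsl_and_nand(1, one, zero))
```

Exactly one of Q.T and Q.F is asserted once a result is available, so downstream logic can detect completion from the outputs themselves.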
31. System Design
• To depict the hardware advantage of the proposed PCSL approach, the 2-input AND/NAND gate in
Fig. 4(b) can be compared to the same gate realized by three reported static-logic QDI approaches:
a) Delay-Insensitive Minterm-Synthesis (DIMS) approach,
b) NULL Convention Logic (NCL) with complex gates (denoted NCL1), and
c) NCL with fast-reset complex gates (denoted NCL2).
32. System Design
• On the basis of simulations (130 nm CMOS), Table II benchmarks the energy per operation (Eper), delay and IC area
of six basic cells of the various approaches. The competing cells are normalized to the PCSL cells whose actual
values are shown within parentheses. The average attributes are tabulated in the last row.
• Cells embodying the proposed PCSL approach simultaneously exhibit the lowest Eper, shortest delay
and smallest IC area.
33. System Design
• With the proposed PCSL QDI realization approach, an 8x8-Bit Quad-Channel Async QDI FRM FB is
designed.
• A semicustom design flow is adopted.
• Each FB channel comprises an Async Read/Write Controller, an 8x8-Bit Coefficient Memory, an 8x8-Bit
Data Memory, an 8-Bit PCSL Multiplier, and a 20-Bit PCSL Adder.
• To preserve the QDI protocol and proper async handshaking, Datapath Completion Detection (DCD) and
Latch Completion Detection (LCD) circuits are included with Muller C-elements (denoted by a ‘C’).
[Figure labels: Latch Completion Detection (LCD); Datapath Completion Detection (DCD)]
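A common way to build completion detection for dual-rail data can be sketched as follows. This is a generic textbook-style model, not the DCD or LCD circuits of the paper: each bit is reported valid when either rail is asserted, and the per-bit results are combined with C-element behavior so that the done signal rises only when every bit is valid and falls only when every bit has returned to NULL.

```python
class CompletionDetector:
    """Behavioral completion detector for a word of dual-rail (true, false) bits."""

    def __init__(self):
        self.done = 0

    def update(self, word):
        valid = [t | f for t, f in word]   # per-bit validity: OR of the two rails
        if all(valid):
            self.done = 1                  # every bit carries data
        elif not any(valid):
            self.done = 0                  # every bit has returned to NULL
        # otherwise hold the previous value (C-element behavior)
        return self.done


if __name__ == "__main__":
    cd = CompletionDetector()
    for word in ([(1, 0), (0, 0)], [(1, 0), (0, 1)], [(0, 0), (0, 1)], [(0, 0), (0, 0)]):
        print(word, "->", cd.update(word))
```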
34. Results And Benchmarking
• In Scenario 1, the sync DVFS system embodies a temperature sensor and, on the basis of the measured temperature and pre-characterization of the sync filter, the clocking frequency is selected accordingly.
35. Results And Benchmarking
• In Scenario 2, the sync DVFS system is much simpler: the clocking frequency is fixed (to the worst case) to accommodate all conditions.
36. Results And Benchmarking
• In Scenario 1, no specific FB is particularly advantageous: the sync DVFS FB and async SSAVS FB are
advantageous in different conditions.
• Nevertheless, the sync FB may be disadvantageous if the temperature sensor overheads
associated with DVFS for Scenario 1 are considered.
• In Scenario 2, the async FB is advantageous in terms of reduced delay with respect to VDD and usually
lower Eper with respect to VDD; in terms of power dissipation, it is advantageous in some
conditions (while the sync FB is advantageous in other conditions).
• Further, in the context of continuous circuit operation and overheads associated with DVS, the
proposed SSAVS is advantageous over the conventional DVFS in terms of uninterrupted circuit
operation and not requiring external intervention (such as changing clock rate,
pre-characterization, etc.).
Editor's Notes
There is another class of logic gates which relies on the use of a clock signal. This class of circuit is known as dynamic circuits. The clock signal is used to divide the gate operation into two halves. In the first half, the output node is pre-charged to a high or low logic state. In the second half of a clock cycle, the circuit evaluates the correct output state. When Ø is low, Z is charged to high. When Ø is high, n logic block evaluates input, and conditionally discharges Z. This circuit adds series resistance to the pull-down n-channel transistor, therefore the fall time is increased slightly. This circuit is dynamic because during evaluation, the output high level at Z is maintained by the stray capacitance at the output node. If Ø stays high (i.e. evaluation period) for a long time, Z may eventually discharge to a low logic level.
In a synchronous circuit, the role of the clock is to define points in time where signals are stable and valid. Between clock ticks, signals may exhibit hazards and may make multiple transitions as the combinational circuit stabilizes. In an asynchronous system, the situation is different. The absence of a clock means signals must be valid all the time: every transition has a meaning, and consequently hazards and races must be avoided. In the synchronous world, an OR gate's output only indicates when both inputs are LOW; when the output is HIGH, it does not indicate which input made a transition. Similarly, an AND gate's output only indicates when both inputs are HIGH; when the output is LOW, it does not indicate which input went LOW. Knowing which transition occurred is very important for asynchronous circuits, as unobserved transitions may have an adverse impact (a hazard or race condition) and should be avoided. A better circuit in this respect is the Muller C-element shown in Figure 2.
Dynamic voltage scaling is a power management technique in computer architecture, where the voltage used in a component is increased or decreased, depending upon circumstances. Dynamic voltage scaling to increase voltage is known as overvolting; dynamic voltage scaling to decrease voltage is known as undervolting. Undervolting is done in order to conserve power, particularly in laptops and other mobile devices,[1] where energy comes from a battery and thus is limited, or in rare cases, to increase reliability. Overvolting is done in order to increase computer performance.
The approaches taken to minimize power involve all levels of the design space including algorithmic design and at the hardware level. In the former, the filtering in the Signal Processor module embodies the Frequency Response Masking (FRM) technique [4]. This involves the Interpolated Finite Impulse Response (IFIR) Filter and the FRM Filter Bank (FB), and is computationally more efficient than the usual FIR and IIR filter approaches. Ultra-low power design techniques in the latter are extensively reported in literature [5]–[15] and of these, operation in the sub-Vt region is one of the most effective. This is particularly applicable here because the speed of the digital circuits in the Signal Processor is modest: the clocking speed ranges from 1.4 kHz to 1.4 MHz for a sampling rate range from 0.1 kSamples/s (kS/s) to 100 kS/s.
The modus operandi involves ‘dialing up’ VDD when the need for computation increases or when the operating conditions are less favorable, and VDD is ‘dialed-down’ when the conditions are the converse. Put simply, the lowest VDD is used where possible because in general the lower the VDD, the lower is the power dissipation due to dynamic and leakage currents. In this paper, we describe an SSAVS system for the Signal Processor module in a WSN based on a proposed methodology within the Quasi-Delay-Insensitive (QDI) asynchronous-logic (async) approach [6], [12], [14], [16], and with a novel in-situ self-adjusting means. The proposed design methodology, coined ‘Pre-Charged Static-Logic’ (PCSL) [17], is essentially a static-logic library cell architecture that exploits the fast reset feature and is appropriate for full-range Dynamic Voltage Scaling (DVS) [18], for VDD ranging from nominal voltage to deep sub-Vt. The proposed SSAVS system for the WSN is demonstrated by means of application to the FRM FB. The novel self-adjustment is obtained very simply, by exploiting (and comparing) the existing Request and Acknowledge signals of the QDI protocol signaling, and thereafter adjusting VDD_ADJ accordingly (see Section III later). The ensuing overhead is hence very low. This paper is organized as follows. Section II reviews adaptive VDD scaling systems. Section III presents the design of the proposed system. Section IV presents the measurement results of prototype ICs and benchmarking thereof. Finally, conclusions are drawn in Section V.
The general modality of adaptive VDD scaling systems to reduce power is to adaptively adjust VDD to as low as possible (with appropriate timing margin) to meet the throughput requirement for the prevailing operating conditions (including PVT variations).
This largely requires the pertinent circuit delay variations to be tracked, observed, or inferred. A reported delay tracking technique is based on a Look-Up Table [19], [20] comprising tabulated pre-characterized throughput versus VDD data according to critical path circuit delay(s) under worst-case PVT conditions for the given throughput. To avoid excessive timing margins, Statistical Static Timing Analysis [19] may be employed, mostly to account for local (within-die) variations. Another reported technique [21] attempts to track real-time variations by adding PVT sensors. However, in sub-Vt operation, because of the exponential relationship of sub-Vt delay with PVT, even small errors in these sensor readings could lead to large circuit delay uncertainties, and the overheads associated with the sensors may defeat any advantage. The reported critical path delay matching [22]–[26] involves a ring oscillator matched to the critical path delay to set the clock frequency, and VDD is subsequently adjusted. For improved matching, the entire logic of the critical path may be replicated at high hardware cost [24]. Although this may be able to mitigate the delay uncertainty issues associated with global PVT variations, it may not comprehensively account for local variations, particularly in sub-Vt operation. Another reported technique employs timing error detection/correction [27]–[30], where VDD is reduced until the ensuing computation is erroneous. VDD is thereafter increased and the computation repeated. The applicability of this technique is arguably limited due to the severe/intractable PVT variations in sub-Vt operation, to possibly severe meta-stability issues due to the lack of timing margin, and to the need for re-computations. Another reported technique [31], [32] attempts to ascertain the circuit delay indirectly by measuring the variations in the supply current drawn to infer the ‘duration’ of the computation, and VDD is subsequently adjusted. This technique is likely to be ambiguous in sub-Vt operation where the ratio of the current during computation to idle is small.
On the basis of the aforesaid review, it can be argued that these reported tracked, observed and inferred techniques are inadequate in terms of robustness, particularly in sub-Vt operation. Further, the hardware/computation overheads are considerable, including the need to scale VDD with the scaling of the clock frequency, i.e. Dynamic Voltage Frequency Scaling (DVFS).
We instead propose a definitive means by directly measuring the delay and comparing it against the throughput for the prevailing conditions, and VDD is thereafter adjusted accordingly. To enable this, we adopt the self-timed async QDI (vis-à-vis the conventional sync) where its dual-rail encoding includes the Request signal which indicates that the input sample is ready and the Acknowledge signal that indicates the completion of the computation.
By counting the number of Requests against Acknowledges within a given period, we ascertain if the delay of the circuit is excessive, or otherwise, with respect to the throughput for the prevailing conditions. VDD is thereafter adjusted accordingly such that the delay is just slightly less than the delay between input samples, thereby satisfying the throughput. Further, as is inherent in QDI async protocols, the computation is uninterrupted while VDD is transitioning during its self-adjustment; in reported adaptive scaling systems, circuit operation typically ceases when VDD is transitioning [20]. Of specific interest, note that the delay is definitive because the delay is that ascertained for the prevailing operating conditions, and we will show later that the associated hardware to adjust VDD is very modest. At this juncture, to the best of our knowledge, ultra-low power QDI circuits with self-adaptive VDD, operating in the sub-Vt region and in extreme environments (hence requiring extremely high reliability), have yet to be reported or demonstrated. Further it would be interesting to compare their attributes, including IC area, delay, energy/operation and power dissipation, against their conventional sync DVFS counterpart and under various conditions (see Section IV later).
FB within the FRM FB. There are two voltage rails in the overall proposed SSAVS system: a fixed VDD_NOM and a variable VDD_ADJ whose sub-Vt voltage typically ranges from 150 mV to 400 mV. For ease of illustration, the specific VDD rail is shown in parentheses for the supply rails and for signals of the various modules. In Fig. 2, the voltages of the Input and Request signals are first adjusted from VDD_NOM to VDD_ADJ by the Step-Down Level Converter, and the signals are thereafter buffered by the Async FIFO Buffer (depth of 50) before input (Input_FB and Req_FB) to the async FRM FB. The FB outputs (Output 1–4) and their associated Acknowledges (combined from Ack 1–4 via the Completion Detection Circuit) are output to the MCU for further processing. Acknowledge is also fed back to the Async FIFO Buffer. The Request and Acknowledge signals are input to the Power Management module, and Acknowledge is stepped up from VDD_ADJ to VDD_NOM. The SSAVS Controller within the Power Management module monitors the number of Request and Acknowledge signals in each period (a 10 Hz clock generated by the Update VDD Clock Generator for a target throughput of 1 kS/s). The VDD_Code is a 5-bit code that sets one of 24 voltage levels (in the Buck DC-DC Converter) ranging from ‘00000’ = 50 mV to ‘10111’ = 1.2 V (in 50 mV steps) for VDD_ADJ.
and the speed of the FB would far exceed the required computation. In this scenario, the number of Acknowledge clocks from the FB will be equal to the number of Request clocks in each period. In the next period, the SSAVS Controller will subsequently decrement VDD_Code by 1 to ‘10110’ and correspondingly reduce VDD_ADJ by 50 mV to 1.15 V. The process continues, with VDD_Code continuously decremented and the voltage of VDD_ADJ commensurably reduced. Eventually, at a later period in Fig. 3, VDD_Code is decremented to ‘00010’, equivalently VDD_ADJ = 150 mV. This is the juncture where the speed of the FRM FB is just slightly slower than the data rate for the prevailing conditions; the number of Request clocks hence exceeds the number of Acknowledge clocks in one period. Although the speed of the FRM FB is slightly too slow, no error occurs because the unconsumed inputs are stored in the Async FIFO Buffer (Fig. 2). In the next period, the SSAVS Controller reacts accordingly by incrementing VDD_Code by 1 to ‘00011’, with VDD_ADJ correspondingly increased by 50 mV to 200 mV. With VDD_ADJ increased, the speed of the FRM FB now slightly exceeds the required computation and the unconsumed inputs stored in the FIFO buffer are in turn computed at a slightly faster rate than the data rate. Consequently, the number of Request clocks is now less than the number of Acknowledge clocks and at the end of this
In the PCSL cells, when the control signal is ‘0’, both outputs are ‘0’. On the other hand, when the control signal is ‘1’ (indicating that an operation is ready) and the input signals are valid, the operation commences and an output ensues. The architecture of the PCSL cell integrates the subcircuit associated with the control signal and a buffer (for each output) into the standard static-logic library cell (redesigned for dual-rail async), thereby sharing common transistors. This reduces the number of transistors, simultaneously yielding lower power/energy dissipation, faster speed and smaller IC area (see Table II later). On the basis of this architecture, Figs. 4(b)–(g) depict the schematics of six basic PCSL cells (all with a 3-transistor limit in any stack).
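The cell behavior described above can be illustrated with a purely behavioral model of a dual-rail 2-input AND/NAND gate. This is a sketch under assumptions, not the transistor-level PCSL circuit: the name `enable` stands in for the control signal, each dual-rail value is modeled as a `(true_rail, false_rail)` pair, and the model assumes the gate waits until both inputs are valid before producing an output.

```python
from typing import Tuple

DualRail = Tuple[int, int]   # (true_rail, false_rail); (0, 0) means "no data"


def is_valid(x: DualRail) -> bool:
    """A dual-rail value is valid when exactly one rail is high."""
    return x in ((1, 0), (0, 1))


def pcsl_and_nand(enable: int, a: DualRail, b: DualRail) -> DualRail:
    """Behavioral model of a dual-rail AND/NAND cell.

    enable == 0           -> both outputs are 0
    enable == 1 and valid -> true rail = AND, false rail = NAND
    Assumption: output stays (0, 0) until both inputs are valid.
    """
    if not enable or not (is_valid(a) and is_valid(b)):
        return (0, 0)
    a_t, a_f = a
    b_t, b_f = b
    return (a_t & b_t, a_f | b_f)


if __name__ == "__main__":
    one, zero, empty = (1, 0), (0, 1), (0, 0)
    print(pcsl_and_nand(0, one, one))     # control low: both outputs 0
    print(pcsl_and_nand(1, one, one))     # logic 1 on the true rail
    print(pcsl_and_nand(1, one, zero))    # logic 0 on the false rail
    print(pcsl_and_nand(1, empty, one))   # input not yet valid
```

Whenever the model produces an output, exactly one rail is high, which is what lets a downstream completion detector recognize that the data has arrived.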
To depict the hardware advantage of the proposed PCSL approach, the 2-input AND/NAND gate in Fig. 4(b) can be compared to the same gate realized by three reported static-logic QDI approaches in Figs. 5(a)–(c): (a) the Delay-Insensitive Minterm Synthesis (DIMS) approach [33], (b) NULL Convention Logic (NCL) with complex gates [34] (denoted NCL1), and (c) NCL with fast-reset complex gates [35] (denoted NCL2). On the basis of simulations (130 nm CMOS), Table II benchmarks the power/energy dissipation, delay and IC area of the aforesaid six basic cells for the various approaches. The competing cells are normalized to the PCSL cells, whose actual values are shown within parentheses. The average attributes are tabulated in the last row.
It is apparent from Table II that the cells embodying the proposed PCSL approach feature the lowest power/energy dissipation, save the simple AND/NAND and OR/NOR gates of NCL1. On average, the dissipation of cells embodying the reported DIMS, NCL1, and NCL2 approaches is significantly higher: 4.0×, 1.6×, and 1.9× respectively. It is also apparent that the cells embodying the proposed PCSL approach feature the shortest delay (the sum of the computation-phase and reset-phase delays, averaged over all input combinations), save the simple AND/NAND and OR/NOR gates of NCL1.
On average, the reported DIMS, NCL1, and NCL2 cells are significantly slower: 4.1×, 1.8×, and 1.9× respectively. It is also apparent that the cells embodying the proposed PCSL approach require the smallest IC area; the layouts are based on the standard-cell approach where the cell height is fixed at 4 μm and the cell width is in multiples of 0.4 μm. On average, the IC area required for cells embodying the reported DIMS, NCL1, and NCL2 approaches is significantly larger: 4.7×, 2.6×, and 2.7× respectively. When comparing dual-rail async with (single-rail) sync circuits, the smaller IC area is worthwhile because it somewhat mitigates the IC area overhead of the former. In short, cells embodying the proposed PCSL approach simultaneously exhibit the lowest power/energy dissipation, shortest delay and smallest IC area.
With the proposed PCSL QDI realization approach, an 8 × 8-bit quad-channel async QDI FRM FB is designed. A semicustom design flow is adopted, where the front-end is designed using an assortment of in-house design tools and commercial synthesis tools based on a flow similar to NCL-X [34]. The back-end implementation, on the other hand, is based on commercial EDA tools with our customized library cells (including the proposed PCSL cells).
For Scenario 1, we will use a (delay) point along the 3σ plot of the pertinent temperature and adjust that point for 10% VDD variation; the 10% VDD variation is congruous with the International Technology Roadmap for Semiconductors.