VLSI IEEE TRANSACTION – 2017-18
PROJECT TITLES FOR VLSI
LOW POWER
VLSI19_LP01 Title: A 2.5-ps Bin Size and 6.7-ps Resolution FPGA Time-to-Digital Converter Based on Delay
Wrapping and Averaging
Abstract: A high-resolution time-to-digital converter (TDC) implemented with field
programmable gate array (FPGA) based on delay wrapping and averaging is presented. The
fundamental idea is to pass a single clock through a series of delay elements to generate multiple
reference clocks with different phases for input time quantization. Due to periodicity, those
phases will be equivalently wrapped within one reference clock period to achieve the required
fine resolution. In practice, a hybrid delay matrix is created to significantly reduce the required
number of delay cells. Multiple TDC cores are constructed for parallel measurements and then
exquisite routing control and averaging are applied to smooth out the large quantization errors
caused by the inhomogeneity of the TDC delay lines for both linearity and single-shot precision
enhancement. To reduce the impact of temperature sensitivity, a cancellation circuit is created
to substantially reduce the offset and confine the output difference within 2 LSB for the same
input interval over the full operation temperature range of FPGA. With such a fine resolution of
2.5 ps, the integral nonlinearity is measured to be from merely −2.98 to 3.23 LSB and the
corresponding rms resolution is 4.99–6.72 ps. The proposed TDC is tested to be fully functional
over 0 °C–50 °C ambient temperature range with extremely low resolution variation. Its
performance is even superior to many full-custom-designed TDCs.
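The delay-wrapping idea in this abstract can be illustrated numerically. The sketch below is not the authors' design: the clock period, the cell delay, and the function names are hypothetical, and the delay cells are assumed ideal and uniform. It shows how phases from a long delay chain, taken modulo one clock period, interleave into bins much finer than a single cell delay.

```python
def wrapped_phases(period_ps, cell_delay_ps, n_cells):
    """Phases (in ps) of n_cells delayed clocks, wrapped into one period and sorted."""
    return sorted((k * cell_delay_ps) % period_ps for k in range(n_cells))


def bin_sizes(phases, period_ps):
    """Widths of the quantization bins between consecutive wrapped phases."""
    # The last bin wraps around to the first phase of the next period.
    following = phases[1:] + [phases[0] + period_ps]
    return [b - a for a, b in zip(phases, following)]


# Hypothetical numbers: a 100 ps reference clock and 37 ps per delay cell.
# Because gcd(37, 100) = 1, 100 cells hit every 1 ps slot of the period once.
phases = wrapped_phases(period_ps=100, cell_delay_ps=37, n_cells=100)
coarsest_bin = max(bin_sizes(phases, 100))  # 1 ps, versus 37 ps for one cell
```

Real FPGA delay lines are not uniform, which is why the abstract adds routing control and averaging across multiple TDC cores on top of the wrapping.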
VLSI21_LP02 Title: Adaptive Multi-bit Crosstalk-Aware Error Control Coding Scheme for On-Chip
Communication
Abstract: The presence of different noise sources and continuous increase in crosstalk in the
deep submicrometer technology raised concerns for on-chip communication reliability, leading
to the incorporation of crosstalk avoidance techniques in error control coding schemes. This brief
proposes joint crosstalk avoidance with adaptive error control scheme to reduce the power
consumption by providing appropriate communication resiliency based on runtime noise level.
By switching between shielding and duplication as the crosstalk avoidance technique and
between hybrid automatic repeat request and forward error correction as the error control
policies, three modes of error resiliencies are provided. The results show that, in reduced mode,
the scheme achieves up to 25.3% power savings at 3-mm wire length as compared to the original
non-adaptive scheme at the cost of only 3.4% power overhead in high protection mode.
VLSI22_LP03 Title: Coordinate Rotation-Based Low Complexity K-Means Clustering Architecture
Abstract: In this brief, we propose a low-complexity architectural implementation of the K-
means-based clustering algorithm used widely in mobile health monitoring applications for
unsupervised and supervised learning. The iterative nature of the algorithm computing the
distance of each data point from a respective centroid for a successful cluster formation until
convergence presents a significant challenge to map it onto a low-power architecture. This has
been addressed by the use of a 2-D Coordinate Rotation Digital Computer-based low-complexity
engine for computing the n-dimensional Euclidean distance involved during clustering. The
proposed clustering engine was synthesized using the TSMC 130-nm technology library, and a
place and route was performed following which the core area and power were estimated as 0.36
mm² and 9.21 mW at 100 MHz, respectively, making the design applicable for low-power real-
time operations within a sensor node.
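As an illustration of how a 2-D CORDIC engine can produce an n-dimensional Euclidean distance, the sketch below uses vectoring-mode CORDIC to get the magnitude of a 2-D vector and then cascades it across dimensions. The abstract does not describe the actual architecture, so the cascading scheme, the function names, and the iteration count are assumptions. Floating point stands in for the fixed-point shift-and-add datapath that hardware would use.

```python
import math


def cordic_magnitude(x, y, iterations=16):
    """Approximate sqrt(x*x + y*y) with vectoring-mode CORDIC micro-rotations."""
    x, y = abs(x), abs(y)
    gain = 1.0  # accumulated CORDIC gain; a precomputed constant in hardware
    for i in range(iterations):
        shift = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        if y > 0:
            x, y = x + y * shift, y - x * shift  # rotate clockwise
        else:
            x, y = x - y * shift, y + x * shift  # rotate counter-clockwise
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain  # y has been driven close to zero, x holds gain * magnitude


def cordic_distance(p, q, iterations=16):
    """n-dimensional Euclidean distance by cascading 2-D magnitude computations."""
    acc = 0.0
    for a, b in zip(p, q):
        acc = cordic_magnitude(acc, a - b, iterations)
    return acc
```

In a K-means iteration, `cordic_distance(point, centroid)` would be evaluated for each centroid and the point assigned to the nearest one.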
VLSI34_LP04 Title: Low-Power Scan-Based Built-In Self-Test Based on Weighted Pseudorandom Test Pattern
Generation and Reseeding
Abstract: A new low-power (LP) scan-based built-in self-test (BIST) technique is proposed based
on weighted pseudorandom test pattern generation and reseeding. A new LP scan architecture is
proposed, which supports both pseudorandom testing and deterministic BIST. During the
pseudorandom testing phase, an LP weighted random test pattern generation scheme is
proposed by disabling a part of scan chains. During the deterministic BIST phase, the design-for-
testability architecture is modified slightly while the linear-feedback shift register is kept short. In
both the cases, only a small number of scan chains are activated in a single cycle. Sufficient
experimental results are presented to demonstrate the performance of the proposed LP BIST
approach.
VLSI40_LP05 Title: A Way-Filtering-Based Dynamic Logical–Associative Cache Architecture for Low-Energy
Consumption
Abstract: Last-level caches (LLCs) help improve performance but suffer from energy overhead
because of their large sizes. An effective solution to this problem is to selectively power down
several cache ways, which, however, reduces cache associativity and performance and thus limits
its effectiveness in reducing energy consumption. To overcome this limitation, we propose a new
cache architecture that can logically increase cache associativity of way-powered-down LLCs. Our
proposed scheme is designed to be dynamic in activating an appropriate number of cache ways
in order to eliminate the need for static profiling to determine an energy-optimized cache
configuration. The experimental results show that our proposed dynamic scheme reduces the
energy consumption of LLCs by 34% and 40% on single- and dual-core systems, respectively,
compared with the best performing conventional static cache configuration. The overall system
energy consumption including CPU, L2 cache, and DRAM is reduced by 9.2% on quad-core
systems.
VLSI41_LP06 Title: Resource-Efficient SRAM-based Ternary Content Addressable Memory
Abstract: Static random access memory (SRAM)-based ternary content addressable memory
(TCAM) offers TCAM functionality by emulating it with SRAM. However, this emulation suffers
from reduced memory efficiency while mapping the TCAM table on SRAM units. This is due to
the limited capacity of the physical addresses in the SRAM unit. This brief offers a novel memory
architecture called a resource-efficient SRAM-based TCAM (REST), which emulates TCAM
functionality using optimal resources. The SRAM unit is divided into multiple virtual blocks to
store the address information presented in the TCAM table. This approach virtually increases the
overall address space of the SRAM unit, mapping a greater portion of the TCAM table in SRAM
and increasing the overall emulated TCAM bits/SRAM at the cost of reduced throughput. A
72×28-bit REST consumes only one 36-kbit SRAM and a few distributed RAMs via implementation
on a Xilinx Kintex-7 field-programmable gate array. It uses only 3.5% of the memory resources
compared with a conventional SRAM-based TCAM (hybrid-partitioned TCAM).
VLSI42_LP07 Title: Write-Amount-Aware Management Policies for STT-RAM Caches
Abstract: Spin-transfer torque random access memory (STT-RAM) technology has emerged as
one of the most promising memory technologies owing to its non-volatility, high density, and
low-leakage power characteristics. However, STT-RAM has certain drawbacks such as high write
energy consumption and limits to the number of write cycles. To enable the adoption of STT-
RAM in the implementation of cache memories, new cache hierarchy management policies are
required to overcome such drawbacks. In this brief, we evaluated several cache hierarchy
management policies in the context of static random access memory L1 caches and an STT-RAM
L2 cache. We found that a nonexclusive policy is superior to non-inclusive and exclusive policies
in terms of energy consumption and endurance. We also propose a subblock-based management
policy because the write energy consumption and endurance are proportional and inversely
proportional to the amount of written data, respectively. A combination of the proposed policy
with a nonexclusive policy reduces the L2 cache energy consumption by 33.3% (31.5%) and
improves the lifetime by 56.3% (56.8%) in a single-core (quad-core) system.
VLSI45_LP08 Title: Fault Diagnosis Schemes for Low-Energy Block Cipher Midori Benchmarked on FPGA
Abstract: Achieving secure high-performance implementations for constrained applications such
as implantable and wearable medical devices is a priority in efficient block ciphers. However,
security of these algorithms is not guaranteed in the presence of malicious and natural faults.
Recently, a new lightweight block cipher, Midori, has been proposed that optimizes the energy
consumption besides having low latency and hardware complexity. In this paper, fault diagnosis
schemes for variants of Midori are proposed. To the best of the authors’ knowledge, there has
been no fault diagnosis scheme presented in the literature for Midori to date. The fault diagnosis
schemes are provided for the nonlinear S-box layer and for the round structures with both 64-bit
and 128-bit Midori symmetric key ciphers. The proposed schemes are benchmarked on a field
programmable gate array and their error coverage is assessed with fault-injection simulations.
These proposed error detection architectures make the implementations of this new low-energy
lightweight block cipher more reliable.
VLSI50_LP09 Title: High-Throughput and Energy-Efficient Belief Propagation Polar Code Decoder
Abstract: Owing to their capacity-achieving performance and low encoding and decoding
complexity, polar codes have received significant attention recently. Successive cancellation
decoding (SCD) and belief propagation decoding (BPD) are two popular approaches for decoding
polar codes. SCD, despite having less computational complexity when compared with BPD,
suffers from long latency due to the serial nature of the SC algorithm. BPD, on the other hand, is
parallel in nature and is more attractive for low-latency applications. However, due to the
iterative nature of BPD, the required latency and energy dissipation increase linearly with the
number of iterations. In this paper, we propose a novel scheme based on sub-factor graph
freezing to reduce the average number of computations as well as the average number of
iterations required by BPD, which directly translates into lower latency and energy dissipation.
Simulation results show that the proposed scheme has no performance degradation and
achieves significant reduction in computation complexity over the existing methods. Moreover,
the hardware architecture for the proposed scheme is developed and compared with the state-
of-the-art BPD implementations for (1024, 512) polar codes. A decoding throughput of 13.9 Gb/s
is achieved along with a 60%–73% improvement in energy reduction and two times increase in
hardware efficiency when compared with the existing BPD implementations.
VLSI60_LP10 Title: High-Speed Parallel LFSR Architectures Based on Improved State-Space Transformations
Abstract: Linear feedback shift register (LFSR) has been widely applied in BCH and CRC
encoding. In order to increase the system throughput, the parallelization of LFSR is usually
needed. Previously, a technique named state-space transformation was presented to reduce the
complexity of parallel LFSR architectures. Exhaustive searches are performed to find good
transformation matrix candidates. This brief proposes a new technique for construction of the
transformation matrix together with a more efficient searching algorithm. The realization results
indicate that the proposed architecture outperforms the prior arts, improving the hardware
efficiency by around 35% and the corresponding searching algorithm finds the desirable
transformation matrix much faster.
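To make the parallelization step concrete, the sketch below shows plain look-ahead parallelization of a CRC-style LFSR: because the serial update is linear over GF(2), the state after p input bits is an XOR of precomputed columns. This is only the baseline that state-space transformation builds on, not the transformation or search algorithm proposed in the brief. The generator polynomial x^4 + x + 1, the parallelism p = 4, and all function names are illustrative assumptions.

```python
def serial_step(state, bit, poly_low, k):
    """One serial LFSR (CRC) step for a degree-k generator; poly_low omits the x^k term."""
    feedback = ((state >> (k - 1)) & 1) ^ bit
    state = (state << 1) & ((1 << k) - 1)
    return state ^ poly_low if feedback else state


def serial_run(state, bits, poly_low, k):
    """Feed a sequence of input bits through the serial LFSR."""
    for b in bits:
        state = serial_step(state, b, poly_low, k)
    return state


def build_parallel(poly_low, k, p):
    """Precompute the p-step response to each state bit and to each input bit."""
    state_cols = [serial_run(1 << j, [0] * p, poly_low, k) for j in range(k)]
    input_cols = [
        serial_run(0, [1 if i == j else 0 for i in range(p)], poly_low, k)
        for j in range(p)
    ]
    return state_cols, input_cols


def parallel_step(state, chunk, state_cols, input_cols):
    """Advance the LFSR by p bits at once using superposition over GF(2)."""
    nxt = 0
    for j, col in enumerate(state_cols):
        if (state >> j) & 1:
            nxt ^= col
    for j, col in enumerate(input_cols):
        if chunk[j]:
            nxt ^= col
    return nxt


# Illustrative parameters: g(x) = x^4 + x + 1, four bits per step.
POLY_LOW, K, P = 0b0011, 4, 4
message = [1, 0, 1, 1, 0, 0, 1, 0]
state_cols, input_cols = build_parallel(POLY_LOW, K, P)
parallel_result = 0
for i in range(0, len(message), P):
    parallel_result = parallel_step(parallel_result, message[i:i + P], state_cols, input_cols)
serial_result = serial_run(0, message, POLY_LOW, K)  # equals parallel_result
```

In hardware, the XOR network implied by `state_cols` sits inside the feedback loop and limits the clock rate; reducing the complexity of that loop is what the transformation matrix in the abstract targets.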
VLSI62_LP11 Title: Scalable Approach for Power Droop Reduction During Scan-Based Logic BIST
Abstract: The generation of significant power droop (PD) during at-speed test performed by
Logic Built-In Self Test (LBIST) is a serious concern for modern ICs. In fact, the PD originated
during test may delay signal transitions of the circuit under test (CUT): an effect that may be
erroneously recognized as delay faults, with consequent erroneous generation of test fails and
increase in yield loss. In this paper, we propose a novel scalable approach to reduce the PD
during at-speed test of sequential circuits with scan-based LBIST using the launch-on-capture
scheme. This is achieved by reducing the activity factor of the CUT, by proper modification of the
test vectors generated by the LBIST of sequential ICs. Our scalable solution allows us to reduce
PD to a value similar to that occurring during the CUT in field operation, without increasing the
number of test vectors required to achieve target fault coverage (FC). We present a hardware
implementation of our approach that requires limited area overhead. Finally, we show that,
compared with recent alternative solutions providing a similar PD reduction, our approach
enables a significant reduction of the number of test vectors (by more than 50%), thus the test
time, to achieve a target FC.
VLSI61_LP12 Title: Stochastic Implementation and Analysis of Dynamical Systems Similar to the Logistic Map
Abstract: Stochastic computing (SC) is a digital computation approach that operates on random
bit streams to perform complex tasks with much smaller hardware footprints compared with
conventional binary radix approaches. SC works based on the assumption that input bit streams
are independent random sequences of 1s and 0s. Previous SC efforts have avoided implementing
functions that have feedback, because doing so has the potential for creating highly correlated
inputs. We propose a number of solutions to overcome the challenges of implementing feedback
in stochastic logic. We use a family of dynamical system functions that are similar to the well-
known logistic map x→µx(1−x) as case studies. We show that complex behaviors, such as period
doubling and chaos, do indeed occur in digital logic with only a few gates operating on a few 0s
and 1s. Our energy consumption is between 21% and 31% of the conventional binary approach.
In order to verify our design methodology, we have measured the mean switching rate between
the basins of attraction of two coexisting fixed points and the peak width of the steady-state
distribution of the output using a logistic-map-like function as an example. Theoretical results
match well with our numerical experiments.
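The independence assumption mentioned in this abstract can be seen in a small simulation. In unipolar stochastic computing, a value x in [0, 1] is the probability of a 1 in a bit stream, an AND gate multiplies two independent streams, and a NOT gate computes 1 − x. The sketch below estimates x(1 − x), the core of the logistic map without the µ scaling, first from two independent streams and then from a single reused stream. The stream length, seed, and function names are assumptions for illustration; this is not the authors' circuit.

```python
import random


def bitstream(p, n, rng):
    """Random bit stream of length n whose bits are 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def estimate(stream):
    """Decode a unipolar stochastic stream back to a number in [0, 1]."""
    return sum(stream) / len(stream)


def x_times_one_minus_x(stream_a, stream_b):
    """AND of stream_a with NOT stream_b, bit by bit."""
    return [a & (1 - b) for a, b in zip(stream_a, stream_b)]


rng = random.Random(0)
x, n = 0.3, 100_000
a = bitstream(x, n, rng)
b = bitstream(x, n, rng)  # a second, independent stream encoding the same x

# Independent inputs: a statistical estimate that should land near 0.3 * 0.7 = 0.21.
independent = estimate(x_times_one_minus_x(a, b))

# Fully correlated inputs (the same stream twice): a AND NOT a is always 0.
correlated = estimate(x_times_one_minus_x(a, a))
```

Feeding an output stream back to both inputs produces exactly this kind of correlation, which is the problem the abstract says earlier stochastic computing work avoided.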
HIGH SPEED DATA TRANSMISSION
VLSI13_HS01 Title: Efficient Designs of Multi-ported Memory on FPGA
Abstract: The utilization of block RAMs (BRAMs) is a critical performance factor for multi-ported
memory designs on field programmable gate arrays (FPGAs). Not only does the excessive
demand on BRAMs block the usage of BRAMs from other parts of a design, but the complex
routing between BRAMs and logic also limits the operating frequency. This paper first introduces
a brand new perspective and a more efficient way of using a conventional two reads one write
(2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this
paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than
the previous approach of duplicating the 2R1W module. Memories with more read/write ports
can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory.
Compared with previous XOR-based and live value table-based approaches, the proposed designs
can, respectively, reduce up to 53% and 69% of BRAM usage for 4R2W memory designs with 8K-
depth. For complex multi-ported designs, the proposed BRAM-efficient approaches can achieve
higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with
8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by
20%.
VLSI14_HS02 Title: High-Speed and Low-Latency ECC Processor Implementation Over GF(2^m) on FPGA
Abstract: In this paper, a novel high-speed elliptic curve cryptography (ECC) processor
implementation for point multiplication (PM) on field-programmable gate array (FPGA) is
proposed. A new segmented pipelined full-precision multiplier is used to reduce the latency, and
the Lopez-Dahab Montgomery PM algorithm is modified for careful scheduling to avoid data
dependency resulting in a drastic reduction in the number of clock cycles (CCs) required. The
proposed ECC architecture has been implemented on Xilinx FPGAs’ Virtex4, Virtex5, and Virtex7
families. To the best of our knowledge, our single- and three-multiplier-based designs show the
fastest performance to date when compared with reported works individually. Our one-
multiplier-based ECC processor also achieves the highest reported speed together with the best
reported area-time performance on Virtex4 (5.32 µs at 210 MHz), on Virtex5 (4.91 µs at 228
MHz), and on the more advanced Virtex7 (3.18 µs at 352 MHz). Finally, the proposed three-
multiplier-based ECC implementation is the first work reporting the lowest number of CCs and
the fastest ECC processor design on FPGA (450 CCs to get 2.83 µs on Virtex7).
VLSI26_HS03 Title: An On-Chip Monitoring Circuit for Signal-Integrity Analysis of 8-Gb/s Chip-to-Chip
Interfaces With Source-Synchronous Clock
Abstract: This paper presents an on-chip monitoring circuit (OCMC) for analyzing the signal
integrity of high-speed signals for a chip-to-chip interface with a source-synchronous clocking
scheme. The proposed OCMC consists of a fractional-N phase-locked loop (PLL)-based frequency
synthesizer, a high-bandwidth track-and-hold circuit, and a 10-bit analog-to-digital converter
(ADC) to implement a subsampling scheme. The proposed fractional-N PLL-based frequency
synthesizer improves the time jitter accumulated in a voltage-controlled oscillator using a
fractional frequency divider operated by an eight-phase clock. The bandwidth of the track-and-
hold circuit is designed to be 6 GHz, using inductive peaking realized through a source follower.
The OCMC samples 49 points over two unit intervals of a high-speed input signal when the
frequency multiplication of the frequency synthesizer is 6.125/6. The 10-bit ADC uses the
architecture of a pipelined successive approximation register ADC to reduce the power
consumption and chip area. The proposed OCMC is implemented with 65-nm CMOS technology
and a 1.2 V supply. The 8-Gb/s chip-to-chip interface signal is reconstructed with time and
voltage resolutions of 5.1 ps and 1.17 mV, respectively.
VLSI35_HS04 Title: A 2.4–3.6-GHz Wideband Sub-harmonically Injection-Locked PLL with Adaptive Injection
Timing Alignment Technique
Abstract: This paper proposes a wideband sub-harmonically injection-locked PLL (SILPLL) with
adaptive injection timing alignment technique. The SILPLL includes three main circuit blocks: one-
oscillator-period constant-delay (OOPCD) divider, timing-adjusted phase detector (TPD), and
pulse generator (PG). The proposed injection timing alignment technique can align the injection
timing adaptively in a wide range of the output clock frequency using the two blocks (OOPCD and
TPD) and a falling edge locking scheme of pulses. It can avoid the risk that SILPLL may lock to the
wrong frequency or even fail to lock. The PG block is used for half-integral injection to relax the
tradeoff between the phase noise of SILPLL and the output frequency resolution. The OOPCD
circuit occupies a negligible area. After the injection timing alignment is finished, the OOPCD is
powered off so that no extra power is consumed. The SILPLL is implemented in the 65-nm 1P9M
CMOS process. It consumes 8.6 mW at 1.2 V supply and occupies an active core area of
1×0.6 mm². The measured output frequency range is 2.4–3.6 GHz with an output frequency
resolution of 200 MHz and the phase noise is −127.6 dBc/Hz at an offset of 1 MHz from a carrier
frequency of 3.4 GHz. The rms jitter integrated from 1 kHz to 30 MHz is less than 112 fs for all the
covered frequency points. Under the supply voltage range from 1.1 to 1.3 V and the temperature
range from −20 °C to 70 °C, the rms jitter variation of all the covered frequency points is less than
27 fs, which shows good robustness over environmental variation.
VLSI39_HS05 Title: Hardware-Efficient Built-In Redundancy Analysis for Memory With Various Spares
Abstract: Memory capacity continues to increase, and many semiconductor manufacturing
companies are trying to stack memory dice for larger memory capacities. Therefore, built-in
redundancy analysis (BIRA) is of utmost importance because the probability of fault occurrence
increases with a larger memory capacity. A traditional spare structure that consists of simple
rows and columns is somewhat inadequate for multiple memory blocks BIRA because the
hardware overhead and spare allocation efficiency are degraded. The proposed BIRA uses
various types of spares and can achieve a higher yield than a simple row and column spare
structure. Herein, we propose a BIRA that can achieve an optimal repair rate using various spare
types. The proposed analyzer can exhaustively search not only row and column spare types but
also global and local spare types. In addition, this paper proposes a fault-storing content-
addressable memory (CAM) structure. The proposed CAM is small and collects faults efficiently.
The experimental results show a high repair rate with a small hardware overhead and a short
analysis time.
VLSI43_HS06 Title: Fast Automatic Frequency Calibrator Using an Adaptive Frequency Search
Algorithm
Abstract: A new adaptive frequency search algorithm (A-FSA) is presented for a fast
automatic frequency calibrator in wideband phase-locked loops (PLLs). The proposed A-
FSA optimizes the number of clock counts for each frequency comparison cycle,
depending on the difference between the target frequency and the PLL output
frequency, as opposed to a binary frequency search algorithm (B-FSA), where the
frequency search time per cycle is fixed. This eliminates unnecessary clocking times
during the frequency comparison process, and thus reduces the total PLL lock time. The
additional circuitry needed for A-FSA is only a simple counter controller, thus minimizing
hardware overhead. To verify the effectiveness of the proposed algorithm, two
wideband PLLs are designed and simulated using a 65-nm CMOS technology: one with B-
FSA, and the other with A-FSA. The latter achieves a lock time faster than the former by
at least a factor of 2, even under worst case conditions.
VLSI49_HS07 Title: A High-Efficiency 6.78-MHz Full Active Rectifier with Adaptive Time Delay Control for
Wireless Power Transmission
Abstract: This paper presents a full active rectifier consisting of GaN devices and a CMOS
controller designed for wireless power transmission in high-power consumer devices. An
adaptive time delay control circuit is developed to maximize the conduction interval of the GaN
switch, which can significantly reduce the power loss caused by the forward voltage imposed by
the diode. The proposed control algorithm also eliminates the reverse leakage current of the
rectifier, and thus further improves its power transfer efficiency. The controller implemented
based on a high voltage 0.18-µm CMOS process and the power stage consisting of four GaN
transistors are assembled on the same printed circuit board (PCB). The proposed rectifier
provides a maximum output current of 3 A at 5 V, with a 6.78-MHz ac input voltage.

VLSI24_HS08 Title: Scalable Device Array for Statistical Characterization of BTI-Related Parameters
Abstract: A device array circuit, scalable in terms of the number of transistors used, is proposed.
The proposed array facilitates accurate and simultaneous bias voltage application to a large
number of devices, making it suitable for the measurement-based statistical characterization of
device degradation, known as bias temperature instability. Using the proposed array, the
degradation measurement of thousands of transistors is made possible in a practical amount of
time. The experimental results show that the defect-centric model can approximate the
statistical variation in magnitudes of threshold voltage shifts (delta-VTH) and that the variance of
delta-VTH bears an inverse relationship to the channel areas of transistors. The degradation
variability under ac stress conditions is also presented for the first time.
AREA EFFICIENT/ TIMING & DELAY REDUCTION
VLSI04_AE01 Title: VLSI Design of 64bit × 64bit High Performance Multiplier with Redundant Binary Encoding
Abstract: For multiplier dominated applications such as digital signal processing, wireless
communications, and computer applications, high-speed multiplier design has always been a
primary requisite. In this paper, a high-performance 16x16 bit redundant binary (RB) multiplier
has been designed by using a recently proposed redundant binary encoding approach to
eliminate the error correcting word and a delay efficient parallel prefix Ling adder for final
redundant binary to normal binary (RB-NB) conversion. Since redundant binary (RB)
representation allows carry-free addition and adaptability, it has been used in 16x16 bit high-
performance RB multiplier design for summation of partial product terms. The design of
multiplier also reduces the redundant partial product accumulation stage when eliminating the error
correcting word, which improves the complexity and the critical path delay. The performance of the
RB multiplier design is compared with the conventional RB modified Booth encoding multiplier
(CRBMBE). The comparison is based on synthesis result obtained by synthesizing both multiplier
architectures targeting a Xilinx FPGA in terms of area and delay analysis.
VLSI05_AE02 Title: A Method to Design Single Error Correction Codes with Fast Decoding for a Subset of
Critical Bits
Abstract: Single error correction (SEC) codes are widely used to protect data stored in memories
and registers. In some applications, such as networking, a few control bits are added to the data
to facilitate their processing. For example, flags to mark the start or the end of a packet are
widely used. Therefore, it is important to have SEC codes that protect both the data and the
associated control bits. It is attractive for these codes to provide fast decoding of the control bits,
as these are used to determine the processing of the data and are commonly on the critical
timing path. In this brief, a method to extend SEC codes to support a few additional control bits is
presented. The derived codes support fast decoding of the additional control bits and are
therefore suitable for networking applications.
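For readers unfamiliar with SEC codes, the sketch below shows the baseline mechanism with a textbook Hamming(7,4) code: parity bits sit at positions 1, 2, and 4, and the syndrome (the XOR of the positions of all 1 bits) points directly at a single flipped bit. This is a generic example, not the extended code construction proposed in the brief, and the function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    c = [0] * 8  # index 0 is unused so that list indices match bit positions
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]  # parity over positions with bit 0 of the index set
    c[2] = c[3] ^ c[6] ^ c[7]  # parity over positions with bit 1 of the index set
    c[4] = c[5] ^ c[6] ^ c[7]  # parity over positions with bit 2 of the index set
    return c[1:]


def hamming74_decode(word):
    """Correct up to one bit error; return (data bits, syndrome)."""
    c = [0] + list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # a valid codeword XORs to zero
    if syndrome:
        c[syndrome] ^= 1  # a nonzero syndrome is the position of the flipped bit
    return [c[3], c[5], c[6], c[7]], syndrome


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1  # flip position 5 (list index 4)
recovered, syndrome = hamming74_decode(corrupted)  # recovered == [1, 0, 1, 1], syndrome == 5
```

Decoding any bit here requires the full syndrome to be computed first; the abstract's goal is a code in which a few control bits can be decoded faster than that because they lie on the critical timing path.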
VLSI07_AE03 Title: ENFIRE: A Spatio-Temporal Fine-Grained Reconfigurable Hardware
Abstract: Field programmable gate arrays (FPGAs) are well-established as fine-grained
reconfigurable computing platforms. However, FPGAs demonstrate poor scalability in advanced
technology nodes due to the large negative impact of the elaborate programmable interconnects
(PIs). The need for such vast PIs arises from two key factors: 1) fine-grained bit-level data
manipulation in the configurable logic blocks and 2) the purely spatial computing model followed
in the FPGAs. In this paper, we propose ENFIRE, a novel memory-based spatio-temporal
framework designed to provide the flexibility of reconfigurable bit-level information processing
while improving scalability and energy efficiency. Dense 2-D memory arrays serve as the main
computing elements storing not only the data to be processed but also the functional behavior of
the application mapped into lookup tables. Computing elements are spatially distributed,
communicating as needed over a hierarchical bus interconnect, while the functions are evaluated
temporally inside each computing element. A custom software framework facilitates application
mapping to the framework. By leveraging both spatial and temporal computing, ENFIRE
significantly reduces the interconnect overhead when compared with FPGA. Simulation results
show an improvement of 7.6× in energy, 1.6× in energy efficiency, 1.1× in leakage, and 5.3× in
unified energy efficiency, a metric that considers energy and area together, compared with
comparable FPGA implementations.
VLSI08_AE04 Title: Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and
Throughput Tradeoffs
Abstract: Hybrid floating-point (FP) implementations improve software FP performance without
incurring the area overhead of full hardware FP units. The proposed implementations are
synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like
architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a
processor to augment an existing instruction set architecture and increase FP throughput with
little area overhead. The hybrid implementations with USL support increase software FP
throughput per core by 2.18× for addition/subtraction, 1.29× for multiplication, 3.07–4.05× for
division, and 3.11–3.81× for square root, and use 90.7–94.6% less area than dedicated fused
multiply–add (FMA) hardware. Hybrid implementations with custom FP-specific hardware
increase throughput per core over a fixed-point software kernel by 3.69–7.28× for
addition/subtraction, 1.22–2.03× for multiplication, 14.4× for division, and 31.9× for square root,
and use 77.3–97.0% less area than dedicated FMA hardware. The circuit area and throughput are
found for 38 multiply–add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square
root designs. Thirty-three multiply–add implementations are presented, which improve
throughput per core versus a fixed-point software implementation by 1.11–15.9× and use 38.2–
95.3% less area than dedicated FMA hardware.
VLSI11_AE05 Title: Efficient Soft Cancelation Decoder Architectures for Polar Codes
Abstract: The flooding belief propagation (FO-BP) and the soft-cancelation (SCAN) algorithms are
the two most popular soft-output BP algorithms for the decoding of capacity-achieving polar
codes. The FO-BP algorithm has high throughput at the cost of performance degradation in high
signal-to-noise ratio (SNR) region or with large block length. The SCAN algorithm has much better
decoding performance while suffering from long decoding latency and low throughput. In this
paper, an improved BP algorithm, named reduced complexity soft cancelation (RCSC) algorithm,
is proposed. Compared with the SCAN algorithm, the number of memory entries required by the
RCSC algorithm is reduced by more than 50% in general, while achieving comparable or even
better (e.g., when block size N = 2^15) decoding performance. When block size is large (e.g., N ≥ 2^15),
the proposed RCSC algorithm reduces the required memory entries by more than 23% compared
with the state-of-the-art FO-BP algorithm. The numerical results show that the error
performance improvement of the RCSC algorithm is more significant when the SNR increases. For
a different tradeoff, a reduced latency soft-cancelation (RLSC) algorithm is proposed to reduce
the decoding latency and increase the throughput of the RCSC algorithm while slightly sacrificing
decoding performance. Finally, the optimized VLSI architectures are presented for the RCSC and
RLSC algorithms, respectively. The synthesis results demonstrate the efficiency of the proposed
algorithms and architectures.

VLSI12_AE06 Title: Low-Complexity Digit-Serial Multiplier Over GF(2^m) Based on Efficient Toeplitz Block
Toeplitz Matrix–Vector Product Decomposition
Abstract: In this paper, we have shown that a regular Toeplitz matrix-vector product (TMVP) can
be transformed into a Toeplitz block TMVP (TBTMVP) using a suitable permutation matrix. Based
on the TBTMVP representation, we have proposed a new (a,b)-way TBTMVP decomposition
algorithm for implementing a digit-serial multiplication. Moreover, it is shown that, based on
iterative block recombination, we can improve the space complexity of the proposed TBTMVP
decomposition. From the synthesis results, we have shown that the proposed TBTMVP-based
multiplier involves less area, less area–delay product, and higher throughput compared with the
existing digit-serial multipliers.
VLSI15_AE07 Title: Hybrid LUT Multiplexer FPGA Logic Architectures
Abstract: Hybrid configurable logic block architectures for field-programmable gate arrays that
contain a mixture of look up tables and hardened multiplexers are evaluated toward the goal of
higher logic density and area reduction. Multiple hybrid configurable logic block architectures,
both nonfracturable and fracturable with varying MUX:LUT logic element ratios, are evaluated
across two benchmark suites using a custom tool flow consisting of LegUp-HLS, Odin-II front-end
synthesis, ABC logic synthesis and technology mapping, and VPR for packing, placement, routing,
and architecture exploration. Technology mapping optimizations that target the proposed
architectures are also implemented within ABC. Experimentally, we show that for nonfracturable
architectures, without any mapper optimizations, we naturally save up to ∼8% area
post-place-and-route, accounting for both complex logic block and routing area, while maintaining
mapping depth. With architecture-aware technology mapper optimizations in ABC, additional area is
saved post-place-and-route. For fracturable architectures, experiments show that only marginal
gains, up to ∼2%, are seen after place-and-route.
VLSI16_AE08 Title: Sign-Magnitude Encoding for Efficient VLSI Realization of Decimal Multiplication
Abstract: Decimal X×Y multiplication is a complex operation, where intermediate partial products
(IPPs) are commonly selected from a set of precomputed radix-10 X multiples. Some works
require only [0, 5] × X via recoding digits of Y to one-hot representation of signed digits in [−5, 5].
This reduces the selection logic at the cost of one extra IPP. Two’s complement signed-digit
(TCSD) encoding is often used to represent IPPs, where dynamic negation (via one XOR per bit of X
multiples) is required for the recoded digits of Y in [−5, −1]. In this paper, despite generation of 17
IPPs, for 16-digit operands, we manage to start the partial product reduction (PPR) with 16 IPPs
that enhance the VLSI regularity. Moreover, we save 75% of negating XORs via representing
precomputed multiples by sign-magnitude signed-digit (SMSD) encoding. For the first-level PPR,
we devise an efficient adder, with two SMSD input numbers, whose sum is represented with
TCSD encoding. Thereafter, multilevel TCSD 2:1 reduction leads to two TCSD accumulated partial
products, which collectively undergo a special early initiated conversion scheme to get at the
final binary-coded decimal product. As such, a VLSI implementation of 16×16-digit parallel
decimal multiplier is synthesized, where evaluations show some performance improvement over
previous relevant designs.
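
The digit recoding mentioned in the abstract can be illustrated with a short software sketch. The code below is only an assumption-laden model, not the paper's circuit: it recodes the digits of Y sequentially with a carry (hardware recoders are normally carry-free), and the function names `recode_signed_digits` and `multiply_with_recoding` are hypothetical. It shows why only the multiples 0X to 5X are needed and why an n-digit Y yields n + 1 partial products.

```python
def recode_signed_digits(y_digits):
    """Recode little-endian decimal digits (0..9) into signed digits in [-5, 5].

    Sequential sketch with a carry; a hardware recoder would usually be carry-free.
    """
    out = []
    carry = 0
    for d in y_digits:
        t = d + carry
        if t > 5:
            out.append(t - 10)  # use a negative digit and pass a carry upward
            carry = 1
        else:
            out.append(t)
            carry = 0
    out.append(carry)  # extra most-significant digit, hence one extra partial product
    return out


def multiply_with_recoding(x, y, n_digits):
    """Multiply x by an n_digits-digit y using only the multiples 0X..5X."""
    y_digits = [(y // 10**i) % 10 for i in range(n_digits)]
    signed = recode_signed_digits(y_digits)
    multiples = [k * x for k in range(6)]  # precomputed 0X, 1X, ..., 5X
    product = 0
    for i, s in enumerate(signed):
        ipp = multiples[abs(s)]
        if s < 0:
            ipp = -ipp  # negation is needed only for negative recoded digits
        product += ipp * 10**i
    return product, signed
```

For a 16-digit Y this sketch produces 17 signed digits, which matches the 17 IPPs the abstract mentions; how the paper then starts reduction with 16 IPPs is not modeled here.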
VLSI20_AE09 Title: FPGA Realization of Low Register Systolic All-One-Polynomial Multipliers over GF(2^m)
and Their Applications in Trinomial Multipliers
Abstract: Systolic all-one-polynomial (AOP) multipliers usually suffer from the problem of high
register complexity, especially in field-programmable gate array (FPGA) platforms where the
register resources are not that abundant. In this paper, we have shown that the AOP-based
systolic multipliers can easily achieve low register-complexity implementations and the proposed
architectures can be employed as computation cores to derive efficient implementations of
systolic Montgomery multipliers based on trinomials. First, we propose a novel data broadcasting
scheme in which the register complexity involved within existing AOP-based systolic multipliers is
significantly reduced. We have found out that the modified AOP-based structure can be packed
as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm
that can fully employ the proposed AOP-based computation core. The proposed Montgomery
algorithm employs a novel precomputed modular operation, and the systolic structures based on
this algorithm fully inherit the advantages brought from the AOP-based core (low register
complexity, low critical-path delay, and low latency) except some marginal hardware overhead
brought by a precomputation unit. The proposed architectures are then implemented by Xilinx
ISE 14.1 and it is shown that compared with the existing designs, the proposed designs achieve at
least 61.8% and 47.6% less area-delay product and power-delay product than the best of
competing designs, respectively.
VLSI48_AE10 Title: Low-Complexity Transformed Encoder Architectures for Quasi-Cyclic Non-binary LDPC
Codes Over Subfields
Abstract: Quasi-cyclic low-density parity-check (QC-LDPC) codes are adopted in many digital
communication and storage systems. The encoding of these codes is traditionally done by
multiplying the message vector with a generator matrix consisting of dense circulant
submatrices. To reduce the encoder complexity, this paper introduces two schemes making use
of finite Fourier transform. We focus on QC-LDPC codes whose circulant submatrices are of
dimension (2^r − 1) × (2^r − 1) and the entries are elements of GF(2^p), where p divides r, and hence,
GF(2^p) is a subfield of GF(2^r). These cover a broad range of codes, and binary LDPC codes are a
special case. Making use of conjugacy constraints, low-complexity architectures are developed
for finite Fourier and inverse transforms over subfields in this paper. In addition, composite field
arithmetic is exploited to eliminate the computations associated with message mapping and
reduce the complexity of Fourier transform. For a (2016, 1074) nonbinary QC-LDPC code whose
generator matrix consists of circulants of dimension 63×63 with GF(2^2) entries, the proposed
encoders achieve 22% area reduction compared with the conventional encoders without
sacrificing the throughput.
VLSI58_AE11 Title: Antiwear Leveling Design for SSDs With Hybrid ECC Capability
Abstract: With the joint considerations of reliability and performance, hybrid error correction
code (ECC) becomes an option in the designs of solid-state drives (SSDs). Unfortunately, wear
leveling (WL) might result in the early performance degradation to SSDs, which is common with a
limited number of P/E cycles, due to the efforts to delay the bit-error-rate growth. In this paper,
an anti-WL design is proposed to avoid such a performance problem so that the performance of
SSDs with hybrid ECC capability can be improved without sacrificing their reliability. The
capability of the proposed design was evaluated by a series of experiments, for which it was
shown that the proposed design could greatly improve the read and write performance of SSDs
up to 50% without affecting the endurance of the investigated SSDs, compared with traditional
approaches.
VLSI33_AE12 Title: Energy-Efficient VLSI Realization of Binary64 Division with Redundant Number Systems
Abstract: VLSI realizations of digit-recurrence binary division usually use redundant
representation of partial remainders and quotient digits. The former allows for fast carry-free
computation of the next partial remainder, and the latter leads to fewer required
divisor multiples. In studying the previous relevant works, we have noted that the binary
carry-save (CS) number system is prevalent in the representation of partial remainders, and redundant
high radix representation of quotient digits is popular in order to reduce the cycle count. In this
paper, we explore a design space containing four division architectures. These are based on
binary CS or radix-16 signed digit (SD) representations of partial remainders. On the other hand,
they use full or partial precomputation of divisor multiples. The latter uses a smaller multiplexer at
the cost of two extra adders, where one of the operands is constant within all cycles. The quotient
digits are represented by radix-16 [−9, 9] SDs. Our synthesis-based evaluation of VLSI realizations
of the best previous relevant work and the four proposed designs shows reduced power and
energy figures in the proposed designs at the cost of more silicon area and delay measures.
However, our energy-delay product is 26%–35% less than that of the reference work.
Audio, Image and Video Processing
VLSI02_IM01 Title: A Dual-Clock VLSI Design of H.265 Sample Adaptive Offset Estimation for 8k Ultra-HD TV
Encoding
Abstract: Sample adaptive offset (SAO) is a newly introduced in-loop filtering component in
H.265/High Efficiency Video Coding (HEVC). While SAO contributes to a notable coding efficiency
improvement, the estimation of SAO parameters dominates the complexity of in-loop filtering in
HEVC encoding. This paper presents an efficient VLSI design for SAO estimation. Our design
features a dual-clock architecture that processes statistics collection (SC) and parameter decision
(PD), the two main functional blocks of SAO estimation, at high- and low-speed clocks,
respectively. Such a strategy reduces the overall area by 56% by addressing the heterogeneous
data flows of SC and PD. To further improve the area and power efficiency, algorithm-
architecture co-optimizations are applied, including a coarse range selection (CRS) and an
accumulator bit width reduction (ABR). CRS shrinks the range of fine processed bands for the
band offset estimation. ABR further reduces the area by narrowing the accumulators of SC. They
together achieve another 25% area reduction. The proposed VLSI design is capable of processing
8k at 120-frames/s encoding. It occupies 51k logic gates, only one-third of the circuit area of the
state-of-the-art implementations.
VLSI03_IM02 Title: RoBA Multiplier: A Rounding-Based Approximate Multiplier for High-Speed yet Energy-
Efficient Digital Signal Processing
Abstract: In this paper, we propose an approximate multiplier that is high speed yet energy
efficient. The approach is to round the operands to the nearest exponent of two. This way the
computationally intensive part of the multiplication is omitted, improving speed and energy
consumption at the price of a small error. The proposed approach is applicable to both signed
and unsigned multiplications. We propose three hardware implementations of the approximate
multiplier, one for the unsigned and two for the signed operations. The efficiency of
the proposed multiplier is evaluated by comparing its performance with those of some
approximate and accurate multipliers using different design parameters. In addition, the efficacy
of the proposed approximate multiplier is studied in two image processing applications, i.e.,
image sharpening and smoothing.
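
A minimal software sketch of the rounding idea follows. The abstract states only that the operands are rounded to the nearest power of two and that the expensive part of the multiplication is omitted. The decomposition used below, A×B = Ar×B + Br×A − Ar×Br + (Ar − A)(Br − B) with the last term dropped, is an assumption about how this can be done with shifts and adds, and the names `round_to_power_of_two` and `roba_multiply` are hypothetical.

```python
def round_to_power_of_two(v):
    """Round a positive integer to the nearest power of two (ties round up)."""
    if v <= 0:
        raise ValueError("v must be positive")
    lower = 1 << (v.bit_length() - 1)
    upper = lower << 1
    return lower if (v - lower) < (upper - v) else upper


def roba_multiply(a, b):
    """Approximate a*b using only shifts, adds, and one subtraction."""
    if a == 0 or b == 0:
        return 0
    sign = -1 if (a < 0) != (b < 0) else 1
    a, b = abs(a), abs(b)
    shift_a = round_to_power_of_two(a).bit_length() - 1  # log2(Ar)
    shift_b = round_to_power_of_two(b).bit_length() - 1  # log2(Br)
    # Ar*B + Br*A - Ar*Br; the omitted term is (Ar - A)*(Br - B)
    approx = (b << shift_a) + (a << shift_b) - (1 << (shift_a + shift_b))
    return sign * approx
```

In this model the result is exact whenever either operand is already a power of two, because the dropped term is then zero.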
VLSI06_IM03 Title: Energy-Efficient Reduce-and-Rank Using Input-Adaptive Approximations
Abstract: Approximate computing is an emerging design paradigm that exploits the intrinsic
ability of applications to produce acceptable outputs even when their computations are executed
approximately. In this paper, we explore approximate computing for a key computation pattern,
reduce-and-rank (RnR), which is prevalent in a wide range of workloads, including video
processing, recognition, search, and data mining. An RnR kernel performs a reduction operation
(e.g., distance computation, dot product, and L1-norm) between an input vector and each of a
set of reference vectors, and ranks the reduction outputs to select the top reference vectors for
the current input. We propose three complementary approximation strategies for the RnR
computation pattern. The first is interleaved reduction-and-ranking, wherein the vector
reductions are decomposed into multiple partial reductions and interleaved with the rank
computation. Leveraging this transformation, we propose the use of intermediate reduction
results and ranks to identify future computations that are likely to have a low impact on the
output, and can, hence, be approximated. The second strategy, input similarity-based
approximation, exploits the spatial or temporal correlation of inputs (e.g., pixels of an image or
frames of a video) to identify computations that are amenable to approximation. The third
strategy, reference vector reordering, rearranges the order in which the reference vectors are
processed such that vectors that are relatively more critical in evaluating the correct output, are
processed at the beginning of RnR operation. The number of these critical reference vectors is
usually small, which renders a substantial portion of the total computation to be amenable to
approximation. These strategies address a key challenge in approximate computing—
identification of which computations to approximate—and may be used to drive any
approximation mechanism, such as computation skipping or precision scaling to realize
performance and energy improvements. A second key challenge in approximate computing is
that the extent to which computations can be approximated varies significantly from application
to application, and across inputs for even a single application. Hence, input-adaptive
approximation, or the ability to automatically modulate the degree of approximation based on
the nature of each individual input, is essential for obtaining optimal energy savings. In addition,
to enable quality configurability in RnR kernels, we propose a kernel-level quality metric that
correlates well to application-level quality, and identify key parameters that can be used to tune
the proposed approximation strategies dynamically. We develop a runtime framework that
modulates the identified parameters during the execution of RnR kernels to minimize their
energy while meeting a given target quality. To evaluate the proposed concepts, we designed
quality-configurable hardware implementations of six RnR-based applications from the
recognition, mining, search, and video processing application domains in 45-nm technology.
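
To make the RnR pattern concrete, the sketch below shows a plain reduce-and-rank kernel and an interleaved variant that ranks between partial reductions. It is an illustration under stated assumptions: the L1 distance is chosen as the reduction, the function names are hypothetical, and the skip rule used here is a conservative one that never changes the result. The paper's actual approximation criteria are not given in the abstract and are not reproduced.

```python
def l1_distance(a, b):
    """Reduction step: L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def reduce_and_rank(query, references, top_k):
    """Baseline RnR: reduce against every reference, then rank."""
    scores = [(l1_distance(query, ref), idx) for idx, ref in enumerate(references)]
    return [idx for _, idx in sorted(scores)[:top_k]]


def interleaved_reduce_and_rank(query, references, top_k, chunk):
    """Interleave partial reductions with ranking and cut off hopeless references.

    A reference is abandoned once its partial distance already exceeds the
    current top_k-th best full distance (safe because L1 partial sums only grow).
    """
    best = []      # sorted (distance, index) pairs, at most top_k long
    cut_off = 0    # number of references whose reduction was stopped early
    for idx, ref in enumerate(references):
        partial = 0
        pruned = False
        for start in range(0, len(query), chunk):
            partial += l1_distance(query[start:start + chunk], ref[start:start + chunk])
            if len(best) == top_k and partial > best[-1][0]:
                pruned = True
                cut_off += 1
                break
        if not pruned:
            best.append((partial, idx))
            best.sort()
            best = best[:top_k]
    return [idx for _, idx in best], cut_off
```

An approximate version would relax the cut-off test (for example, by stopping earlier or lowering precision), trading output quality for energy as the abstract describes.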
VLSI23_IM04 Title: Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers
Abstract: In this paper, we propose four 4:2 compressors, which have the flexibility of switching
between the exact and approximate operating modes. In the approximate mode, these
dual-quality compressors provide higher speeds and lower power consumptions at the cost of lower
accuracy. Each of these compressors has its own level of accuracy in the approximate mode as
well as different delays and power dissipations in the approximate and exact modes. Using these
compressors in the structures of parallel multipliers provides configurable multipliers whose
accuracies (as well as their powers and speeds) may change dynamically during the runtime. The
efficiencies of these compressors in a 32-bit Dadda multiplier are evaluated in a 45-nm standard
CMOS technology by comparing their parameters with those of the state-of-the-art approximate
multipliers. The results of comparison indicate, on average, 46% and 68% lower delay and power
consumption in the approximate mode. Also, the effectiveness of these compressors is assessed
in some image processing applications.
VLSI47_IM05 Title: An FPGA-Based Hardware Accelerator for Traffic Sign Detection
Abstract: Traffic sign detection plays an important role in a number of practical applications,
such as intelligent driver assistance and roadway inventory management. In order to process the
large amount of data from either real-time videos or large off-line databases, a high-throughput
traffic sign detection system is required. In this paper, we propose an FPGA-based hardware
accelerator for traffic sign detection based on cascade classifiers. To maximize the throughput
and power efficiency, we propose several novel ideas, including: 1) rearranged numerical
operations; 2) shared image storage; 3) adaptive workload distribution; and 4) fast image block
integration. The proposed design is evaluated on a Xilinx ZC706 board. When processing high-
definition (1080p) video, it achieves the throughput of 126 frames/s and the energy efficiency of
0.041 J/frame.

VLSI63_IM06 Title: Soft Error Rate Reduction of Combinational Circuits Using Gate Sizing in the Presence of
Process Variations
Abstract: Soft errors in combinational logic circuits are emerging as a significant reliability
concern for nanoscale VLSI designs. This paper presents a novel sensitivity-based gate sizing
methodology to reduce the soft error rate (SER) of combinational circuits in the presence of
process variations. The proposed method is based on modeling the statistics of SER of the circuit
gates as a random variable to formulate a statistical optimization problem. A backward-traversing
algorithm with capability for incremental analysis is developed for computing the distribution of
circuit gates of SER random variables. We present a gate resizing algorithm in which the gates
with the most contribution to the circuit SER are selected in a candidate set using a statistical
ordering approach. The proposed algorithm trades off SER reduction and area overheads. The
experimental results show that using the proposed methodology, the circuit statistical SER can be
reduced by up to 56.4% compared with the 14.8% SER reduction of a circuit obtained using the
worst-case methodology at the expense of 10% area overhead under 10% process variation ratio.
The results also show that the proposed method achieves about 40% more SER reduction
compared with that obtained using closed-form analysis for statistical soft error rate estimation
(CASSER), the most recently published similar work, in the same experimental conditions.
Comparing the runtime of the proposed optimization algorithm with the optimization based on
CASSER, it is observed that the proposed method is two orders of magnitude faster than CASSER
due to its incremental analysis property.
VLSI30_IM07 Title: Time-Encoded Values for Highly Efficient Stochastic Circuits
Abstract: Stochastic computing (SC) is a promising technique for applications that require low
area overhead and fault tolerance, but can tolerate relatively high latency. In the SC paradigm,
logical computation is performed on randomized bit streams. In prior work, streams were
generated with linear feedback shift registers; these contributed heavily to the hardware cost
and consumed a significant amount of power. This paper introduces a new approach for
encoding signal values: computation is performed on analog periodic pulse signals. Exploiting
pulse width modulation, time-encoded signals corresponding to specific values are generated by
adjusting the frequency and duty cycles of pulse width modulated (PWM) signals. With this
approach, the latency, area, and energy consumption are all greatly reduced. Experimental
results on image processing applications show up to 99% performance speedup, 98% saving in
energy dissipation, and 40% area reduction compared to prior stochastic approaches. Circuits
synthesized with the proposed approach can work as fast and energy-efficiently as a
conventional binary design while retaining the fault-tolerance and low cost advantages of
conventional stochastic designs.
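
The idea of encoding values in the duty cycle of periodic pulses can be sketched in discrete time. The model below is an assumption, not the paper's analog circuit: signals are sampled as 0/1 lists, an AND gate is used as the multiplier (as in conventional unipolar stochastic computing), and the two periods are required to be coprime so that the result over one common period is exact. The function names are hypothetical.

```python
from fractions import Fraction
from math import gcd


def pwm_signal(high_steps, period, n_steps):
    """Sampled PWM waveform: high for the first `high_steps` steps of each period."""
    return [1 if (t % period) < high_steps else 0 for t in range(n_steps)]


def time_encoded_multiply(high_x, period_x, high_y, period_y):
    """Multiply two duty-cycle-encoded values by AND-ing their PWM waveforms."""
    if gcd(period_x, period_y) != 1:
        raise ValueError("this sketch assumes coprime periods")
    n_steps = period_x * period_y  # one full common period
    x = pwm_signal(high_x, period_x, n_steps)
    y = pwm_signal(high_y, period_y, n_steps)
    ones = sum(a & b for a, b in zip(x, y))
    return Fraction(ones, n_steps)  # fraction of time the AND output is high
```

For example, a 3/4 duty cycle with period 4 and a 2/5 duty cycle with period 5 give an output that is high for 6 of 20 steps, which equals the product 3/4 × 2/5. No random number generator is involved, which is the source of the hardware savings the abstract claims over LFSR-based streams.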
VLSI28_IM08 Title: Design of Power and Area Efficient Approximate Multipliers
Abstract: Approximate computing can decrease the design complexity with an increase in
performance and power efficiency for error resilient applications. This brief deals with a new
design approach for approximation of multipliers. The partial products of the multiplier are
altered to introduce varying probability terms. Logic complexity of approximation is varied for
the accumulation of altered partial products based on their probability. The proposed
approximation is utilized in two variants of 16-bit multipliers. Synthesis results reveal that two
proposed multipliers achieve power savings of 72% and 38%, respectively, compared to an exact
multiplier. They have better precision when compared to existing approximate multipliers. Mean
relative error figures are as low as 7.6% and 0.02% for the proposed approximate multipliers,
which are better than the previous works. Performance of the proposed multipliers is evaluated
with an image processing application, where one of the proposed models achieves the highest
peak signal to noise ratio.

VERIFICATION
VLSI31_VE01 Title: COMEDI: Combinatorial Election of Diagnostic Vectors From Detection Test Sets for Logic
Circuits
Abstract: Although the modern automatic test pattern generation (ATPG) tools can efficiently
produce near-optimal test sets with high fault-coverage for a circuit-under-test, a diagnostic test
set (DTS), which is needed for fault localization, is much more challenging to construct. The DTS is
used to analyze the responses of failing chips during manufacturing test for the purpose of
identifying the root cause of observed errors. In this paper, a novel technique for selecting a
powerful DTS for stuck-at faults from a pool of ATPG detection vectors is proposed. Unlike
existing methods, this technique does not use any diagnostic test generation, circuit
modification, or miter-based approach. It constructs a combinatorial cover of the pool to
determine a test set with high diagnostic coverage (DC). Two variants of the covering algorithm
are proposed based on this technique. The experimental results on several combinational and
scan-based benchmark circuits demonstrate the effectiveness of our method in terms of the size
of the DTS, DC, and CPU time.
VLSI44_VE02 Title: Reordering Tests for Efficient Fail Data Collection and Tester Time Reduction
Abstract: During fail data collection, a tester collects information that is useful for defect
diagnosis. If fail data collection can be terminated early, the tester time as well as the volume of
fail data will be reduced. Test reordering can enhance the ability to terminate the process early
without affecting the quality of diagnosis. In this paper, test reordering targets logic defects
based on information that is derived during defect diagnosis. The defect diagnosis procedure is
enhanced to identify tests that are useful for defect diagnosis across a sample of faulty instances
of a circuit. Tests that are determined to be useful for more faulty instances of a circuit are
placed earlier in the test set based on the expectation that the same tests will be useful for other
faulty instances of the circuit. The experimental results for logic defects in benchmark circuits
support the effectiveness of this approach and indicate that test reordering helps to terminate
fail data collection early without impacting the diagnosis quality.
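
The ordering rule described above can be sketched in a few lines. This is a simplified illustration: it assumes the diagnosis procedure has already produced, for each sampled faulty instance, the list of tests found useful, and the function name `reorder_tests` and the test labels are hypothetical.

```python
from collections import Counter


def reorder_tests(test_order, useful_tests_per_instance):
    """Place tests that were useful for more faulty instances earlier.

    test_order: original list of test identifiers.
    useful_tests_per_instance: one iterable of useful tests per sampled instance.
    """
    usefulness = Counter()
    for useful in useful_tests_per_instance:
        usefulness.update(set(useful))  # count each test once per instance
    # sorted() is stable, so equally useful tests keep their original order
    return sorted(test_order, key=lambda t: -usefulness[t])
```

With the sample `[["t2", "t3"], ["t2"], ["t3", "t1"], ["t2"]]`, the order `t0, t1, t2, t3` becomes `t2, t3, t1, t0`.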
NETWORKING
VLSI51_NOC01 Title: Multicast-Aware High-Performance Wireless Network-on-Chip Architectures
Abstract: Today’s multiprocessor platforms employ the network-on-chip (NoC) architecture as
the preferable communication backbone. Conventional NoCs are designed predominantly for
unicast data exchanges. In such NoCs, the multicast traffic is generally handled by converting
each multicast message to multiple unicast transmissions. Hence, applications dominated by
multicast traffic experience high queuing latencies and significant performance penalties when
running on systems designed with unicast-based NoC architectures. Various multicast
mechanisms such as XY-tree multicast and path multicast have already been proposed to
enhance the performance of the traditional wireline mesh NoC incorporating multicast traffic.
However, even with such added features, the multihop nature of the wireline mesh NoC leads to
high network latencies and thus limits the achievable system performance. In this paper, to
sustain the high-bandwidth and high-throughput requirements of emerging applications, we
propose the design of a wireless NoC (WiNoC) architecture incorporating necessary multicast
support. By integrating congestion-aware multicast routing with network coding, the WiNoC is
able to efficiently handle heavy multicast injections.

VLSI - BACK END PROJECT - TANNER(nm) / HSPICE(nm) / DSCH3 - MICROWIND(um)
VLSI01_BE01 Title: Temporarily Fine-Grained Sleep Technique for Near- and Sub-threshold Parallel
Architectures
Abstract: This paper presents a design approach for improving energy-efficiency and throughput
of parallel architectures in near- and sub-threshold voltage circuits. The focus is to suppress
leakage energy dissipation of the idle portions of circuits during active modes, which can allow us
to wholly transform the throughput improvement from parallel architectures into energy savings
via deep voltage scaling. We begin by investigating the efficacy of parallel and pipeline
architectures in the near- and sub-threshold circuits. The investigation reveals that active energy
dissipation largely undermines the ability of deep voltage scaling to transform excessive
throughput into energy savings. Techniques, such as power-gating switches (PGSs), can mitigate
active-leakage power dissipation; however, the overhead for entering and exiting sleep modes
can offset the energy savings provided by sleep mode, particularly if sleep time is fine grained for
suppressing active leakage. Therefore, in this paper, we propose a PGS design technique, inspired
by the so-called zigzag super cutoff CMOS, in order to optimize the overheads of mode
transitions of PGS in near- and sub-threshold circuits. The proposed technique allows
circuits to be in sleep mode for as short as a single clock cycle with a negligible amount of energy and
delay overheads. We apply our proposed design to parallel multiplier-based test circuits
operating at near- and sub-threshold voltages. Simulations show a significant improvement in
energy efficiency over baselines at the same throughput.
VLSI10_BE02 Title: Low-Power Design for a Digit-Serial Polynomial Basis Finite Field Multiplier Using
Factoring Technique
Abstract: In CMOS-based application-specific integrated circuit (ASIC) designs, total power
consumption is dominated by dynamic power, where dynamic power consists of two major
components, namely, switching power and internal power. In this paper, we present a low-power
design for a digit-serial finite field multiplier in GF(2^m). In the proposed design, a factoring
technique is used to minimize switching power. To the best of our knowledge, factoring method
has not been reported in the literature being used in the design of a finite field multiplier at an
architectural level. Logic gate substitution is also utilized to reduce internal power. Our proposed
design along with several existing similar works have been realized for GF(2^233) on ASIC platform,
and a comparison is made between them. The synthesis results show that the proposed
multiplier design consumes at least 27.8% lower total power than any previous work in
comparison.
VLSI17_BE03 Title: Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator
Abstract: The need for ultralow-power, area efficient, and high speed analog-to-digital
converters is pushing toward the use of dynamic regenerative comparators to maximize speed
and power efficiency. In this paper, an analysis on the delay of the dynamic comparators will be
presented and analytical expressions are derived. From the analytical expressions, designers can
obtain an intuition about the main contributors to the comparator delay and fully explore the
tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic
comparator is proposed, where the circuit of a conventional double-tail comparator is modified
for low-power and fast operation even in small supply voltages. Without complicating the design
and by adding few transistors, the positive feedback during the regeneration is strengthened,
which results in remarkably reduced delay time. Post-layout simulation results in a 0.18-µm
CMOS technology confirm the analysis results. It is shown that in the proposed dynamic
comparator both the power consumption and delay time are significantly reduced.

VLSI27_BE04 Title: 10T SRAM Using Half-VDD Precharge and Row-Wise Dynamically Powered Read Port for
Low Switching Power and Ultralow RBL Leakage
Abstract: We present, in this paper, a new 10T static random access memory cell having a
single-ended decoupled read-bitline (RBL) with a 4T read port for low power operation and leakage
reduction. The RBL is precharged at half the cell’s supply voltage, and is allowed to charge and
discharge according to the stored data bit. An inverter, driven by the complementary data node
(QB), connects the RBL to the virtual power rails through a transmission gate during the read
operation. RBL increases toward the VDD level for a read-1, and discharges toward the ground
level for a read-0. Virtual power rails have the same value of the RBL precharging level during the
write and the hold mode, and are connected to true supply levels only during the read operation.
Dynamic control of virtual rails substantially reduces the RBL leakage. The proposed 10T cell in a
commercial 65 nm technology is 2.47× the size of 6T with β = 2, provides 2.3× read static noise
margin, and reduces the read power dissipation by 50% compared with that of 6T. The value of RBL leakage
is reduced by more than 3 orders of magnitude and (ION/IOFF) is greatly improved compared
with the 6T BL leakage. The overall leakage characteristics of 6T and 10T are similar, and
competitive performance is achieved.
VLSI54_BE05 Title: Delay Analysis for Current Mode Threshold Logic Gate Designs
Abstract: Current mode is a popular CMOS-based implementation of threshold logic functions,
where the gate delay depends on the sensor size. This paper presents a new implementation of
current mode threshold functions for improved gate delay and switching energy. An analytical
method is also proposed in order to identify quickly the sensor size that minimizes the gate
delay. Simulation results on different gates implemented using the optimum sensor size indicate
that the proposed current mode implementation method outperforms consistently the existing
implementations in delay as well as switching energy.
VLSI55_BE06 Title: Area and Energy-Efficient Complementary Dual-Modular Redundancy Dynamic Memory
for Space Applications
Abstract: The limited size and power budgets of space-bound systems often contradict the
requirements for reliable circuit operation within high-radiation environments. In this paper, we
propose the smallest solution for soft-error tolerant embedded memory yet to be presented. The
proposed complementary dual-modular redundancy (CDMR) memory is based on a four-
transistor dynamic memory core that internally stores complementary data values to provide an
inherent per-bit error detection capability. By adding simple, low-overhead parity, an error-
correction capability is added to the memory architecture for robust soft-error protection. The
proposed memory was implemented in a 65-nm CMOS technology, displaying as much as a
3.5× smaller silicon footprint than other radiation-hardened bit cells. In addition, the CDMR
memory consumes between 48% and 87% less standby power than other considered solutions
across the entire operating region.
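
The detection-plus-parity scheme described above can be modeled in software. The sketch below rests on several assumptions that the abstract does not state: a soft error flips exactly one node of a complementary pair (so the two stored values become equal), at most one bit per word is hit, and the parity bit itself is error-free. The names `cdmr_write` and `cdmr_read` are hypothetical.

```python
def cdmr_write(bits):
    """Store each bit together with its complement, plus one even-parity bit."""
    cells = [(b, 1 - b) for b in bits]
    parity = sum(bits) % 2
    return cells, parity


def cdmr_read(cells, parity):
    """Read a word, using complementary storage to locate an error and parity to fix it."""
    bits = []
    suspects = []
    for i, (true_node, comp_node) in enumerate(cells):
        if true_node != comp_node:
            bits.append(true_node)   # pair is still complementary: trust it
        else:
            bits.append(None)        # pair disagrees with itself: error detected here
            suspects.append(i)
    if len(suspects) > 1:
        raise ValueError("more than one corrupted bit; not correctable in this sketch")
    if suspects:
        known = sum(b for b in bits if b is not None) % 2
        bits[suspects[0]] = known ^ parity  # parity determines the missing bit
    return bits
```

Because the complementary pair already tells the reader which bit is wrong, a single parity bit is enough to recover its value in this model.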
VLSI56_BE07 Title: Probability-Driven Multi-bit Flip-Flop Integration With Clock Gating
Abstract: Data-driven clock gated (DDCG) and multi-bit flip-flops (MBFFs) are two low-power
design techniques that are usually treated separately. Combining these techniques into a single
grouping algorithm and design flow enables further power savings. We study MBFF multiplicity
and its synergy with FF data-to-clock toggling probabilities. A probabilistic model is implemented
to maximize the expected energy savings by grouping FFs in increasing order of their
data-to-clock toggling probabilities. We present a front-end design flow, guided by physical layout
considerations for a 65-nm 32-bit MIPS and a 28-nm industrial network processor. It is shown to
achieve the power savings of 23% and 17%, respectively, compared with designs with ordinary
FFs. About half of the savings was due to integrating the DDCG into the MBFFs.
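
The grouping rule can be illustrated with a small sketch. It assumes that a k-bit MBFF's shared clock can be gated only in cycles where none of its FFs toggles, and that toggling is independent across FFs. Both are modeling assumptions made here for illustration, and the function names and probabilities are hypothetical.

```python
from math import prod


def group_flip_flops(toggle_prob, k):
    """Group FFs into k-bit MBFFs in increasing order of toggling probability."""
    ordered = sorted(toggle_prob, key=toggle_prob.get)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]


def expected_gated_fraction(groups, toggle_prob):
    """Average fraction of cycles in which a group's clock can be gated.

    Assumes independent toggling: a group is gated when no member toggles.
    """
    return sum(prod(1 - toggle_prob[ff] for ff in g) for g in groups) / len(groups)
```

With probabilities `{"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2}` and k = 2, sorted grouping pairs the two quiet FFs (b, d) and the two busy FFs (c, a), which gives a higher gated fraction in this model than mixing quiet and busy FFs in the same group.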

VLSI59_BE08 Title: A High-Speed and Power-Efficient Voltage Level Shifter for Dual-Supply Applications
Abstract: This brief presents a fast and power-efficient voltage level shifting circuit capable of
converting extremely low levels of input voltages into high output voltage levels. The efficiency
of the proposed circuit is due to the fact that not only the strength of the pull-up device is
significantly reduced when the pull-down device is pulling down the output node, but the
strength of the pull-down device is also increased using a low-power auxiliary circuit. Post-layout
simulation results of the proposed circuit in a 0.18-µm technology demonstrate a total energy
per transition of 157 fJ, a static power dissipation of 0.3 nW, and a propagation delay of 30 ns for
input frequency of 1 MHz, low supply voltage level of VDDL=0.4V, and high supply voltage level
of VDDH=1.8V.
VLSI32_BE09 Title: A 0.1–2-GHz Quadrature Correction Loop for Digital Multiphase Clock Generation Circuits
in 130-nm CMOS
Abstract: A 100-MHz–2-GHz closed-loop analog in-phase/quadrature correction circuit for
digital clocks is presented. The proposed circuit consists of a phase-locked loop-type architecture
for quadrature error correction. The circuit corrects the phase
error to within 1.5° up to 1 GHz and to within 3° at 2 GHz. It consumes 5.4 mA from a 1.2 V supply at 2
GHz. The circuit was designed in UMC 0.13-µm mixed-mode CMOS with an active area of
102 µm × 95 µm. The impact of duty cycle distortion has been analyzed. High-frequency quadrature
measurement related issues have been discussed. The proposed circuit was used in two different
applications for which the functionality has been verified.
VLSI09_BE10 Title: Conditional-Boosting Flip-Flop for Near-Threshold Voltage Application
Abstract: A conditional-boosting flip-flop is proposed for ultra-low voltage application where the
supply voltage is scaled down to the near-threshold region. The proposed flip-flop adopts voltage
boosting to provide low latency with reduced performance variability in the near threshold
voltage region. It also adopts conditional capture to minimize the switching power consumption
by eliminating redundant boosting operations. Experimental results in a 65-nm CMOS process
indicated that the proposed flip-flop provided up to 72% lower latency with 75% less
performance variability due to process variation, and up to 67% improved energy-delay product
at 25% switching activity compared with conventional pre-charged differential flip-flops.
VLSI36_BE11 Title: An All-MOSFET Sub-1-V Voltage Reference With a −51-dB PSR up to 60 MHz
Abstract: This paper presents a voltage reference (VR) with a power supply rejection (PSR)
better than 50 dB for frequencies of up to 60 MHz, and uses MOSFETs in strong inversion.
Another innovation is a compact MOSFET low-pass filter, which was developed along with a
feedback technique for a wide-bandwidth PSR not achieved in previous works. The proposed all-
MOSFET VR was fabricated using a standard 0.18-µm CMOS process.
VLSI46_BE12 Title: A 65-nm CMOS Constant Current Source with Reduced PVT Variation
Abstract: This paper presents a new nanometer-based low-power constant current reference
that attains a small value in the total process–voltage–temperature variation. The circuit
architecture is based on the embodiment of a process-tolerant bias current circuit and a scaled
process-tracking bias voltage source for the dedicated temperature-compensated voltage-to-
current conversion in a pre-regulator loop. Fabricated in a UMC 65-nm CMOS process, it
consumes 7.18 µW with a 1.4 V supply. The measured results indicate that the current reference
achieves an average temperature coefficient of 119 ppm/°C over 12 samples in a temperature
range from −30 °C to 90 °C without any calibration. Besides, a low line sensitivity of 180 ppm/V is
obtained. This paper offers a better sensitivity figure of merit with respect to the reported
representative counterparts.

VLSI57_BE13 Title: A Fault Tolerance Technique for Combinational Circuits Based on Selective-Transistor
Redundancy
Abstract: With fabrication technology reaching nano-levels, systems are becoming more prone
to manufacturing defects with higher susceptibility to soft errors. This paper is focused on
designing combinational circuits for soft error tolerance with minimal area overhead. The idea is
based on analyzing random pattern testability of faults in a circuit and protecting sensitive
transistors, whose soft error detection probability is relatively high, until desired circuit reliability
is achieved or a given area overhead constraint is met. Transistors are protected based on
duplicating and sizing a subset of transistors necessary for providing the protection. In addition
to that, a novel gate-level reliability evaluation technique is proposed that provides similar
results to reliability evaluation at the transistor level (using SPICE) with orders of magnitude
reduction in CPU time. LGSynth’91 benchmark circuits are used to evaluate the proposed
algorithm. Simulation results show that the proposed algorithm achieves better reliability than
other transistor sizing-based techniques and the triple modular redundancy technique with
significantly lower area overhead for 130-nm process technology at a ground level.
VLSI52_BE14 Title: Temporarily Fine-Grained Sleep Technique for Near- and Sub-threshold Parallel
Architectures
Abstract: This paper presents a design approach for improving energy-efficiency and
throughput of parallel architectures in near- and sub-threshold voltage circuits. The focus is to
suppress leakage energy dissipation of the idle portions of circuits during active modes, which
can allow us to wholly transform the throughput improvement from parallel architectures into
energy savings via deep voltage scaling. We begin by investigating the efficacy of parallel and
pipeline architectures in the near- and sub-threshold circuits. The investigation reveals that
active energy dissipation largely undermines the ability of deep voltage scaling to transform
excessive throughput into energy savings. Techniques, such as power-gating switches (PGSs), can
mitigate active-leakage power dissipation; however, the overhead for entering and exiting sleep
modes can offset the energy savings provided by sleep mode, particularly if sleep time is fine
grained for suppressing active leakage. Therefore, in this paper, we propose a PGS design
technique, inspired by the so-called zigzag super cutoff CMOS, in order to optimize the
overheads of mode transitions of PGS in near- and sub-threshold circuits. The proposed
technique allows circuits to stay in sleep mode for as short as a single clock cycle with a
negligible amount of energy and delay overheads. We apply our proposed design to parallel
multiplier-based test circuits operating at near- and sub-threshold voltages. Simulations show a
significant improvement in energy efficiency over baselines at the same throughput.
VLSI53_BE15 Title: A 100-mA, 99.11% Current Efficiency, 2-mVpp Ripple Digitally Controlled LDO with Active
Ripple Suppression
Abstract: Digital low-dropout (DLDO) regulators are gaining attention due to their design
scalability for distributed multiple voltage domain applications required in state-of-the-art
system on-chips. Due to the discrete nature of the output current and the discrete-time control
loop, the steady-state response of the DLDO has inherent output voltage ripple. A hybrid DLDO
(HD-LDO) with fast response and stable operation across a wide load range while reducing the
output voltage ripple is proposed. In the HD-LDO, a DLDO and a low current analog ripple
cancelation amplifier (RCA) work in parallel. The output dc of the RCA is sensed by a 2-bit analog-
to-digital converter, and the digitized linear stage current is fed into the DLDO as an error signal.
During load transients, a gear-shift controller enables fast transient response using dynamic load
estimation. The DLDO suppresses the output dc of the RCA within its current resolution. With this
arrangement, a majority of the dc load current is provided by the DLDO and the RCA supplies
ripple cancelation current. The HD-LDO is designed and fabricated in a 180-nm CMOS technology,
and occupies 0.697 mm² of the die area. The HD-LDO operates with an input voltage range of
1.43–2.0 V and an output voltage range of 1.0–1.57 V. At 100-mA load current, the HD-LDO
achieves a current peak efficiency of 99.11% and a settling time of 15 clock periods with a 0.5-
MHz clock for a current switching between 10 and 90 mA. The RCA suppresses fundamental,
second, and third harmonics of the switching frequency by 13.7, 13.3, and 14.1 dB, respectively.
VLSI18_BE16 Title: Sense Amplifier Half-Buffer (SAHB): A Low-Power High-Performance Asynchronous Logic
QDI Cell Template
Abstract: We propose a novel asynchronous logic (async) quasi-delay-insensitive (QDI) sense-
amplifier half-buffer (SAHB) cell design approach, with emphases on high operational robustness,
high speed, and low power dissipation. There are five key features of our proposed SAHB. First,
the SAHB cell embodies the async QDI 4-phase (4φ) signaling protocol to accommodate process–
voltage–temperature variations. Second, the sense amplifier (SA) block in SAHB cells embodies a
cross-coupled latch with a positive feedback mechanism to speed up the output evaluation.
Third, the evaluation block in the SAHB comprises both nMOS pull-up and pull-down networks
with minimum transistor sizing to reduce the parasitic capacitance. Fourth, both the evaluation
block and SA block are tightly coupled to reduce redundant internal switching nodes. Fifth, the
SAHB cell is designed in CMOS static logic and hence appropriate for full range dynamic voltage
scaling operation for VDD ranging from nominal voltage (1 V) to subthreshold voltage (∼0.3 V).
When six library cells embodying our proposed SAHB are compared with those embodying the
conventional async QDI pre-charged half buffer (PCHB) approach, the proposed SAHB cells
collectively feature simultaneous ∼64% lower power, ∼21% faster, and ∼6% smaller IC area; the
PCHB cell is inappropriate for subthreshold operation. A prototype 64-bit Kogge–Stone pipeline
adder based on the SAHB approach (at 65 nm CMOS) is designed. For a 1-GHz throughput and at
nominal VDD, the design based on the SAHB approach simultaneously features ∼56% lower
energy and ∼24% lower transistor count advantages than its PCHB counterpart. When
benchmarked against the ubiquitous synchronous logic counterpart, our SAHB dissipates ∼39%
lower energy at the 1-GHz throughput.
VLSI38_BE17 Title: On Micro-architectural Mechanisms for Cache Wear out Reduction
Abstract: Hot carrier injection (HCI) and bias temperature instability (BTI) are two of the main
deleterious effects that increase a transistor’s threshold voltage over the lifetime of a
microprocessor. This voltage degradation causes slower transistor switching and eventually can
result in faulty operation. HCI manifests itself when transistors switch from logic “0” to “1” and
vice versa, whereas BTI is the result of a transistor maintaining the same logic value for an
extended period of time. These failure mechanisms are especially acute in those transistors used
to implement the SRAM cells of first-level (L1) caches, which are frequently accessed, so they are
critical to performance, and they are continuously aging. This paper focuses on
microarchitectural solutions to reduce transistor aging effects induced by both HCI and BTI in the data
array of L1 data caches. First, we show that the majority of cell flips are concentrated in a small
number of specific bits within each data word. In addition, we also build upon the previous
studies, showing that logic “0” is the most frequently written value in a cache by identifying
which cells hold a given logic value for a significant amount of time. Based on these observations,
this paper introduces a number of architectural techniques that spread the number of flips
evenly across memory cells and reduce the amount of time that logic “0” values are stored in the
cells by switching OFF specific data bytes. Experimental results show that the threshold voltage
degradation savings range from 21.8% to 44.3% depending on the application.

VLSI29_BE18 Title: Energy-Efficient TCAM Search Engine Design Using Priority-Decision in Memory
Technology
Abstract: Ternary content-addressable memory (TCAM)-based search engines generally need a
priority encoder (PE) to select the highest priority match entry for resolving the multiple match
problem due to the don’t care (X) features of TCAM. In contemporary network security, TCAM-
based search engines are widely used in regular expression matching across multiple packets to
protect against attacks, such as by viruses and spam. However, the use of PE results in increased
energy consumption for pattern updates and search operations. Instead of using PEs to
determine the match, our solution is a three-phase search operation that utilizes the length
information of the matched patterns to decide the longest pattern match data. This paper
proposes a promising memory technology called priority-decision in memory (PDM), which
eliminates the need for PEs and removes restrictions on ordering, implying that patterns can be
stored in an arbitrary order without sorting their lengths. Moreover, we present a sequential
input-state (SIS) scheme to disable the mass of redundant search operations in state segments
on the basis of an analysis of the distribution of hex signatures in a virus database. Experimental results
demonstrate that the PDM-based technology can improve update energy consumption of
nonvolatile TCAM (nvTCAM) search engines by 36%–67%, because most of the energy in these
search engines is used to reorder. By adopting the SIS-based method to avoid unnecessary search
operations in a TCAM array, the search energy reduction is around 64% of nvTCAM search
engines.
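The idea of resolving multiple matches by stored length rather than by entry order can be sketched in software as follows. This is only a behavioral illustration, not the paper's three-phase hardware operation; the `Entry` layout and the function names are assumptions.

```python
from typing import List, Optional, Tuple

# (ternary pattern using '0', '1', 'X'; stored length of the pattern)
Entry = Tuple[str, int]


def matches(pattern: str, key: str) -> bool:
    """Ternary compare: 'X' in the pattern matches either key symbol."""
    return len(pattern) == len(key) and all(
        p in ("X", k) for p, k in zip(pattern, key)
    )


def longest_match(entries: List[Entry], key: str) -> Optional[Entry]:
    """Pick the matching entry with the largest stored length,
    regardless of where it sits in the table."""
    hits = [e for e in entries if matches(e[0], key)]
    return max(hits, key=lambda e: e[1]) if hits else None
```

Because the winner is chosen from the stored length field, entries in this sketch can be appended in any order, which is the property the abstract credits with removing the reordering cost of updates.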
VLSI37_BE19 Title: A 92-dB DR, 24.3-mW, 1.25-MHz BW Sigma–Delta Modulator Using Dynamically Biased
Op Amp Sharing
Abstract: A 2–2 cascaded switched-capacitor sigma-delta modulator is presented for design of
low-voltage, low-power, broadband analog-to-digital conversion. To reduce power dissipation in
both analog and digital circuits and ensure low-voltage operation, a half-sample delayed-input
feed forward architecture is employed in combination with 4-bit quantization, which results in
reduced integrator output swings and relaxed timing constraint in the feedback path. The
integrator power is further reduced by sharing an op amp in the two integrators in each stage
and periodically changing the op amp bias condition between a high-current and a low-current
mode using a fast low-power high-precision charge pump circuit. Implemented in a 0.18-μm
CMOS technology, the experimental prototype achieves a 92-dB dynamic range, a 91-dB peak
signal-to-noise ratio, and an 84-dB peak signal-to-noise plus distortion ratio, respectively, for a
signal bandwidth of 1.25 MHz. Operated at a 40-MHz sampling rate, the modulator dissipates
24.3 mW from a 1 V supply.
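As a quick check on the figures quoted above, the oversampling ratio implied by the sampling rate and the signal bandwidth, using the standard definition of OSR for a low-pass modulator, works out as:

```latex
\mathrm{OSR} = \frac{f_s}{2 f_B}
             = \frac{40\ \text{MHz}}{2 \times 1.25\ \text{MHz}}
             = 16
```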
VLSI25_BE20 Title: A 0.45 V 147–375 nW ECG Compression Processor With Wavelet Shrinkage and Adaptive
Temporal Decimation Architectures
Abstract: This paper presents a real-time electrocardiogram (ECG) data compression processor
with improved energy efficiency while maintaining high accuracy and real-time operation.
Wavelet shrinkage is exploited to filter the noise and achieve sparse ECG signal representation.
Adaptive temporal decimation is proposed to achieve configurable processing to adaptively
reduce the data amount and computational activities for further power reduction. Modified
Huffman and run-length wavelet source coding (MHRLC) is also designed to represent wavelet
coefficients with optimized average code length and reduced memory requirement. Fabricated in
0.18-µm CMOS, the ECG processor is implemented with customized near-threshold digital logics
for minimum energy operation. The prototype was fully validated with the MIT-BIH Arrhythmia
database.
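Wavelet shrinkage in general can be sketched as a transform followed by thresholding of the small detail coefficients. The example below assumes a one-level Haar transform and soft thresholding, neither of which is stated in the abstract; the function names and the threshold value are illustrative only.

```python
from math import sqrt
from typing import List, Tuple


def haar_step(x: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar transform (len(x) must be even)."""
    s = sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def soft_threshold(coeffs: List[float], t: float) -> List[float]:
    """Zero coefficients with magnitude <= t and shrink the rest toward zero."""
    return [
        0.0 if abs(c) <= t else (c - t if c > 0 else c + t)
        for c in coeffs
    ]


# Small details (likely noise) vanish, which makes the output sparse.
approx, detail = haar_step([4.0, 4.2, 8.0, 2.0])
sparse_detail = soft_threshold(detail, 0.5)
```

The zeroed coefficients are what make a subsequent run-length or Huffman stage effective, since long runs of zeros compress well.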
