VLSI IEEE Transaction 2018 - IEEE Transaction

NXFEE INNOVATION
SEMICONDUCTOR IP & PRODUCT DEVELOPMENT COMPANY
NXFEE Innovation (Semiconductor IP & VLSI IEEE Transaction & Product Development)
#45, Vivekananda street, Dhevan kandappa Mudaliarnagar, Nainarmandapam, Pondicherry-4
Web: www.nxfee.com Email: nxfee.innovation@gmail.com Ph: +91 9789443203, +91 9677783735.
NXFEE - VLSI IEEE TRANSACTION - 2018
PROJECT TITLES FOR VLSI
LOW POWER
VLSI_IEEE_01
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 12000/-
TOPIC : A 128-Tap Highly Tunable CMOS IF Finite Impulse Response Filter for Pulsed
Radar Applications
Abstract : A configurable-bandwidth (BW) filter is presented in this paper for pulsed
radar applications. To eliminate dispersion effects in the received waveform, a finite
impulse response (FIR) topology is proposed, which has a measured standard deviation
of an in-band group delay of 11 ns that is primarily dominated by the inherent, fully
predictable delay introduced by the sample-and-hold. The filter operates at an IF of 20
MHz, and is tunable in BW from 1.5 to 15 MHz, which makes it optimal to be used with
varying pulse widths in the radar. Employing a total of 128 taps, the FIR filter provides
greater than 50-dB sharp attenuation in the stop band in order to minimize all out-of-
band noise in the low signal-to-noise received radar signal. Fabricated in a 0.18-µm
silicon on insulator CMOS process, the proposed filter consumes approximately 3.5
mW/tap with a 1.8-V supply. A 20-MHz two-tone measurement with 200-kHz tone
separation shows IIP3 greater than 8.5 dBm.
VLSI_IEEE_02
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : A Closed-Form Expression for Minimum Operating Voltage of CMOS D Flip-Flop
Abstract : In this paper, a closed-form expression for estimating the minimum operating
voltage (VDDmin) of D flip-flops (FFs) is proposed. VDDmin is defined as the minimum
supply voltage at which the FFs are functional without errors. The proposed expression
indicates that VDDmin of FFs is a linear function of the square root of logarithm of the
number of FFs, and its slope depends on the within-die variation of the threshold
voltage (VTH) and its intercept depends on the balance between nMOS and pMOS,
which is mainly due to the die-to-die VTH variation. The proposed expression of VDDmin
is validated by the simulation results as well as the silicon measurements. Finally, we
discuss the dependence of VDDmin on the device parameters.
VLSI_IEEE_05
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : Design of Temperature-Aware Low-Voltage 8T SRAM in SOI Technology for High-
Temperature Operation (25 °C–300 °C)
Abstract : A temperature-aware low-voltage 8T static random access memory (SRAM)
for high-temperature operations is presented. A dedicated read port with virtual ground
and optimal body bias improves sensing margin under very high temperature (up to 300
°C). Bit line offset voltage for data “0” caused by the virtual ground scheme is also
compensated by a replica bit line. The independent body bias control feature of the
employed silicon-on-insulator (SOI) technology allows the write margin to be enhanced
significantly without using any write-assist circuitry. Test chips were fabricated in a 1-µm
SOI technology with tungsten interconnect for reliability at high temperature and lesser
process variation. Measurement results demonstrate that the proposed SRAM operates
successfully up to 300 °C with the supply voltage range of 2–5 V. At the minimum
performance variation point (VDD = 2.5 V), the SRAM consumes 1.48 mW and shows the
access time of 156 ns and the maximum clock frequency of 14.38 MHz at 300 °C.
VLSI_IEEE_09
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : Design of an Area-Efficient Million-Bit Integer Multiplier Using Double Modulus
NTT
Abstract : This brief proposes a double modulus number theoretical transform (NTT)
method for million-bit integer multiplication in fully homomorphic encryption. In our
method, each NTT point is processed simultaneously under two moduli, and the final
result is generated through the Chinese remainder theorem. The employment of double
modulus enlarges the permitted NTT sample size from 24 to 32 bits and thus improves
the transform efficiency. Based on the proposed double modulus method, we
accomplish a VLSI design of million-bit integer multiplier with the Schönhage–Strassen
algorithm. Implementation results on Altera Stratix-V FPGA show that this brief is able
to compute a product of two 1024k-bit integers every 4.9 ms at the cost of only 7.9k
ALUTs and 3.6k registers, which is more area-efficient when compared with the current
competitors.
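The recombination step described in this abstract can be illustrated with a minimal Python sketch. It multiplies two small integers under two coprime moduli and recovers the full product with the Chinese remainder theorem. The moduli M1 and M2, the helper name crt_pair, and the operand values are assumptions chosen for illustration; they are not the parameters used in the brief, and the NTT itself is not shown.

```python
def crt_pair(r1, m1, r2, m2):
    """Return x in [0, m1*m2) with x = r1 (mod m1) and x = r2 (mod m2).

    Assumes m1 and m2 are coprime. pow(m1, -1, m2) needs Python 3.8+.
    """
    inv = pow(m1, -1, m2)                      # modular inverse of m1 modulo m2
    return r1 + m1 * (((r2 - r1) * inv) % m2)

# Illustrative coprime moduli (both prime); hypothetical values.
M1, M2 = 65537, 65521

a, b = 12345, 54321

# Each residue channel works independently with small numbers.
p1 = (a % M1) * (b % M1) % M1
p2 = (a % M2) * (b % M2) % M2

# The true product is recovered because a*b < M1*M2.
product = crt_pair(p1, M1, p2, M2)
```

The point of the sketch is that the two residue channels only ever handle values smaller than their own modulus, while the combined range M1*M2 is what bounds the recoverable result.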
VLSI_IEEE_10
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : A Fast and Low-Complexity Operator for the Computation of the Arctangent of a
Complex Number
Abstract : The computation of the arctangent of a complex number, i.e., the atan2
function, is frequently needed in hardware systems that could profit from an optimized
operator. In this brief, we present a novel method to compute the atan2 function and a
hardware architecture for its implementation. The method is based on a first stage that
performs a coarse approximation of the atan2 function and a second stage that
improves the output accuracy by means of a lookup table. We present results for fixed-
point implementations in a field-programmable gate array device, all of them
guaranteeing last-bit accuracy, which provide an advantage in latency, speed, and use of
resources, when compared with well-established fixed-point options.
VLSI_IEEE_11
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : A Reconfigurable LDPC Decoder Optimized for 802.11n/ac Applications
Abstract : This paper presents a high data-rate low-density parity-check (LDPC) decoder,
suitable for the 802.11n/ac (WiFi) standard. The innovative features of the proposed
decoder relate to the decoding algorithms and the interconnection between the
processing elements. The reduction of the hardware complexity of decoders based on
the min-sum (MS) algorithms comes at the cost of performance degradation, especially
at high-noise regions. We introduce more accurate approximations of the log sum-
product algorithm that also operate well for low signal-to-noise ratio values.
Telecommunication standards, including WiFi, support more than one quasi-cyclic LDPC
codes of different characteristics, such as codeword length and code rate. A proposed
design technique derives networks, capable of supporting a variety of codes and
efficiently realizing connectivity between a variable number of processing units, with a
relatively small hardware overhead over the single-code case. As a demonstration of the
proposed technique, we implemented a reconfigurable network based on barrel
rotators, suitable for LDPC decoders compatible with WiFi standard. Our approach
achieves low complexity and high clock frequency, compared with related prior works. A
90-nm application-specific integrated circuit implementation of the proposed high-
parallel WiFi decoder occupies 4.88 mm² and achieves an information throughput rate
of 4.5 Gbit/s at a clock frequency of 555 MHz.
VLSI_IEEE_13
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Approximate Sum-of-Products Designs Based on Distributed Arithmetic
Abstract : Approximate circuits provide high performance and require low power. Sum-
of-products (SOP) units are key elements in many digital signal processing applications.
In this brief, three approximate SOP (ASOP) models which are based on the distributed
arithmetic are proposed. They are designed for different levels of accuracy. First model
of ASOP achieves an improvement up to 64% on area and 70% on power, when
compared with conventional unit. Other two models provide an improvement of 32%
and 48% on area and 54% and 58% on power, respectively, with a reduced error rate
compared with the first model. Third model achieves the mean relative error and
normalized error distance as low as 0.05% and 0.009%, respectively. Performance of
approximate units is evaluated with a noisy image smoothing application, where the
proposed models are capable of achieving higher peak signal-to-noise ratio than the
existing state-of-the-art techniques. It is shown that the proposed approximate models
achieve higher processing accuracy than existing works but with significant
improvements in power and performance.
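For readers unfamiliar with distributed arithmetic, the following Python sketch shows the exact (non-approximate) form of the technique: a sum of products with fixed coefficients is computed bit-serially from a lookup table of coefficient sums, with no multiplier. The function names, the restriction to unsigned inputs, and the example values are assumptions for illustration. The three approximate models of the brief are not reproduced here.

```python
def build_lut(coeffs):
    """Precompute every possible sum of a subset of the coefficients.

    Entry `addr` holds the sum of coeffs[i] for each bit i set in addr.
    """
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]

def da_sop(coeffs, xs, width):
    """Compute sum(coeffs[i] * xs[i]) for unsigned xs[i] < 2**width.

    One lookup and one shifted accumulation per input bit position.
    """
    lut = build_lut(coeffs)
    acc = 0
    for bit in range(width):
        # Gather bit `bit` of every input into a LUT address.
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> bit) & 1) << i
        acc += lut[addr] << bit   # weight the partial sum by 2**bit
    return acc

# Hypothetical 4-term example.
coeffs = [3, -1, 4, 2]
xs = [5, 7, 2, 9]
result = da_sop(coeffs, xs, width=4)
```

The table grows as 2**k with the number of terms k, which is one reason hardware designs usually split long sums into several smaller tables.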
VLSI_IEEE_25
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Vector Processing-Aware Advanced Clock-Gating Techniques for Low-Power
Fused Multiply-Add
Abstract : The need for power efficiency is driving a rethink of design decisions in
processor architectures. While vector processors succeeded in the high-performance
market in the past, they need a retailoring for the mobile market that they are entering
now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high
power consumption, deserves special attention. Although clock gating is a well-known
method to reduce switching power in synchronous designs, there are unexplored
opportunities for its application to vector processors, especially when considering active
operating mode. In this research, we comprehensively identify, propose, and evaluate
the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques
ensure power savings without jeopardizing the timing. We evaluate the proposed
techniques using both synthetic and “real-world” application-based benchmarking.
Using vector masking and vector multilane-aware clock gating, we report power
reductions of up to 52%, assuming active VFU operating at the peak performance.
Among other findings, we observe that vector instruction-based clock-gating techniques
achieve power savings for all vector FP instructions. Finally, when evaluating all
techniques together, using “real-world” benchmarking, the power reductions are up to
80%. Additionally, in accordance with processor design trends, we perform this research
in a fully parameterizable and automated fashion.
VLSI_IEEE_29
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : A Flexible Wildcard-Pattern Matching Accelerator via Simultaneous Discrete
Finite Automata
Abstract : Regular expression matching has become an indispensable element of Internet of
Things network security. However, traditional ternary content addressable memory
(TCAM) search engine is unable to handle patterns with wildcards, as it precisely tracks
only one active state with single transition. This paper proposes a promising
simultaneous pattern matching methodology for wildcard patterns by two separated
engines to represent discrete finite automata. A key preprocessing to encode possible
postfix pattern by a unique key ensures that follow-up patterns can accurately traverse
all possible matches with limited hardware resources. This approach is practical and
scalable for achieving good performance and low space consumption in network
security, and it can be applicable to any regular expressions even with multi-wildcard
patterns. The experimental results demonstrate that this scheme can efficiently and
accurately recognize wildcard patterns by simultaneously tracking only two active
states. By adopting SRAM TCAM in the proposed architecture, the energy consumption
is reduced to around 39%, compared with the energy consumption using a computing
system that contains a large memory lookup and comparison overhead.
VLSI_IEEE_30
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : Low-Power and Fast Full Adder by Exploring New XOR and XNOR Gates
Abstract : In this paper, novel circuits for XOR/XNOR and simultaneous XOR–XNOR
functions are proposed. The proposed circuits are highly optimized in terms of the
power consumption and delay, which are due to low output capacitance and low short-
circuit power dissipation. We also propose six new hybrid 1-bit full-adder (FA) circuits
based on the novel full-swing XOR–XNOR or XOR/XNOR gates. Each of the proposed
circuits has its own merits in terms of speed, power consumption, power-delay product
(PDP), driving ability, and so on. To investigate the performance of the proposed
designs, extensive HSPICE and Cadence Virtuoso simulations are performed. The
simulation results, based on the 65-nm CMOS process technology model, indicate that
the proposed designs have superior speed and power against other FA designs. A new
transistor sizing method is presented to optimize the PDP of the circuits. In the
proposed method, the numerical computation particle swarm optimization algorithm is
used to achieve the desired value for optimum PDP with fewer iterations. The proposed
circuits are investigated in terms of variations of the supply and threshold voltages,
output capacitance, input noise immunity, and the size of transistors.

VLSI_IEEE_31
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : A 0.9-V 12-bit 100-MS/s 14.6-fJ/Conversion-Step SAR ADC in 40-nm CMOS
Abstract : This paper presents a low-power 12-bit 100-MS/s asynchronous successive
approximation register analog-to-digital converter (SAR ADC). Several techniques are
developed to enhance the ADC performance. The nonbinary capacitor array with small
digital-to-analog converter (DAC) capacitors (total 394 fF) allows for reducing DAC
settling time and power consumption while maintaining extremely high hardware
utilization. The proposed nonlinear capacitance correction method solves the nonlinear
capacitance problems of the comparator when the small unit capacitor is used. The
latch output glitch removal method ensures the speed and accuracy of the comparator
at the low supply voltage. Furthermore, the proposed high-speed SAR logic and timing
sequence improved SAR logic’s operating speed by 75% compared with traditional SAR
logic. The prototype was fabricated using a 40-nm CMOS technology. At a 0.9-V supply
and 100-MS/s sampling rate, the ADC achieves a signal-to-noise distortion ratio of 67.3
dB and consumes 2.6 mW, resulting in a figure of merit of 14.6 fJ/conversion-step. The
ADC core occupies an active area of only 50 × 280 µm².
VLSI_IEEE_36
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : SRAM Circuits for True Random Number Generation Using Intrinsic Bit Instability
Abstract : This paper describes a novel approach to a true random number generator
(TRNG) using SRAM circuits. The principles of operation are described in the context of
past work on integrated circuit TRNGs. The required modifications to standard SRAM
arrays are minor and have little impact on the area. Experimental results from large 1-
Mbit SRAM arrays fabricated on a 55-nm process using the foundry supplied SRAM cell
layouts show good results. Simple helper functions, suitable for very small hardware
implementation, allow improvement, including the ability for the resulting binary strings
to pass all of the National Institute of Standards and Technology (NIST) randomness tests. We describe the
circuits, their principle of operation and statistical behavior, as well as the underlying
physical mechanisms providing the entropy.
VLSI_IEEE_44
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Improving Error Correction Codes for Multiple-Cell Upsets in Space Applications
Abstract : Currently, faults suffered by SRAM memory systems have increased due to
the aggressive CMOS integration density. Thus, the probability of occurrence of single-
cell upsets (SCUs) or multiple-cell upsets (MCUs) augments. One of the main causes of
MCUs in space applications is cosmic radiation. A common solution is the use of error
correction codes (ECCs). Nevertheless, when using ECCs in space applications, they must
achieve a good balance between error coverage and redundancy, and their
encoding/decoding circuits must be efficient in terms of area, power, and delay.
Different codes have been proposed to tolerate MCUs. For instance, Matrix codes use
Hamming codes and parity checks in a bi-dimensional layout to correct and detect some
patterns of MCUs. Recently presented, column–line–code (CLC) has been designed to
tolerate MCUs in space applications. CLC is a modified Matrix code, based on extended
Hamming codes and parity checks. Nevertheless, a common property of these codes is
the high redundancy introduced. In this paper, we present a series of new low
redundant ECCs able to correct MCUs with reduced area, power, and delay overheads.
Also, these new codes maintain, or even improve, memory error coverage with respect
to Matrix and CLC codes.
HIGH SPEED DATA TRANSMISSION
VLSI_IEEE_04
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Approximate Error Detection With Stochastic Checkers
Abstract : Designing reliable systems, while eschewing the high overheads of
conventional fault tolerance techniques, is a critical challenge in the deeply scaled
CMOS and post CMOS era. To address this challenge, we leverage the intrinsic resilience
of application domains such as multimedia, recognition, mining, search, and analytics
where acceptable outputs are produced despite occasional approximate computations.
We propose stochastic checkers (checkers designed using stochastic logic) as a new
approach to performing error checking in an approximate manner at greatly reduced
overheads. Stochastic checkers are inherently inaccurate and require long latencies for
computation. To limit the loss in error coverage, as well as false positives (correct
outputs flagged as erroneous), caused due to the approximate nature of stochastic
checkers, we propose input permuted partial replicas of stochastic logic, which improves
their accuracy with minimal increase in overheads. To address the challenge of long
error detection latency, we propose progressive checking policies that provide an early
decision based on a prefix of the checker’s output bit stream. This technique is further
enhanced by employing progressively accurate binary-to-stochastic converters. Across a
suite of error-resilient applications, we observe that stochastic checkers lead to greatly
reduced overheads (29.5% area and 21.5% power, on average) compared with
traditional fault tolerance techniques while maintaining high coverage and very low
false positives.
VLSI_IEEE_06
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : A 0.65-V, 500-MHz Integrated Dynamic and Static RAM for Error Tolerant
Applications
Abstract : The diminishing returns provided by voltage scaling have led to a recent
paradigm shift toward so-called “approximate computing,” where computation accuracy
is traded off for cost in error-tolerant applications. In this paper, a novel approach to
achieving the power–performance–area versus data integrity tradeoff is proposed by
integrating robust static memory cells and error-prone dynamic cells within a single
array. In addition, the resulting integrated dynamic and static random access memory
(iD-SRAM) provides the ability to trade off power consumption and accuracy on-the-fly
according to the current conditions and operating mode. A 4-kB iD-SRAM array was
implemented in a low-power, 65-nm CMOS technology, providing as much as an 80%
power reduction and a 20% area reduction as compared with standard approaches,
when applied to a video decoder at 500 MHz.

VLSI_IEEE_07
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : Efficient FPGA Mapping of Pipeline SDF FFT Cores
Abstract : In this paper, an efficient mapping of the pipeline single-path delay feedback
(SDF) fast Fourier transform (FFT) architecture to field-programmable gate arrays
(FPGAs) is proposed. By considering the architectural features of the target FPGA,
significantly better implementation results are obtained. This is illustrated by mapping
an R2²SDF 1024-point FFT core toward both Xilinx Virtex-4 and Virtex-6 devices. The
optimized FPGA mapping is explored in detail. Algorithmic transformations that allow a
better mapping are proposed, resulting in implementation achievements that by far
outperform earlier published work. For Virtex-4, the results show a 350% increase in
throughput per slice and 25% reduction in block RAM (BRAM) use, with the same
amount of DSP48 resources, compared with the best earlier published result. The
resulting Virtex-6 design sees even larger increases in throughput per slice compared
with Xilinx FFT IP core, using half as many DSP48E1 blocks and less BRAM resources. The
results clearly show that the FPGA mapping is crucial, not only the architecture and
algorithm choices.
VLSI_IEEE_08
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Algorithm and Architecture Design of Adaptive Filters With Error Nonlinearities
Abstract : This paper presents a framework based on the logarithmic number system to
implement adaptive filters with error nonlinearities in hardware. The framework is
demonstrated through pipelined implementations of two recently proposed adaptive
filtering algorithms based on logarithmic cost, namely, least mean logarithmic square
(LMLS) and least logarithmic absolute difference (LLAD). To the best of our knowledge,
the proposed architectures are the first attempts to implement both LMLS and LLAD
algorithms in hardware. We derive error computing algorithms to realize the nonlinear
error functions for LMLS and LLAD and map them onto hardware. We also propose a
novel variable-α scheme to enhance the original LMLS algorithm and prove its
robustness and suitability for VLSI implementations in practical applications. Detailed bit
width and error analysis are carried out for the proposed VLSI fixed point
implementations. Post layout implementation results show that with an additional
multiplier over conventional least mean square (LMS), 7-dB improvement in steady-
state mean square deviation performance can be achieved and with the proposed
variable-α scheme, 12-dB improvement can be achieved without compromising the
convergence. We will show that LMLS can potentially replace LMS in practical
applications, by demonstrating a proof-of-concept by extending the framework to
transform domain adaptive filters.

VLSI_IEEE_17
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 16000/-
TOPIC : Design and FPGA Implementation of a Reconfigurable Digital Down Converter for
Wideband Applications
Abstract : This brief presents a field-programmable gate array-based implementation
of a reconfigurable digital down converter (DDC) that can process input bandwidth of
up to 3.6 GHz and provide a flexible down-converted output. The proposed DDC
consists of a mixer and a re-sampling filter. The re-sampling filter can work at much
higher clock rate. The reason is that all the single-cycle recursive loops in the
re-sampling filter are pipelined by using either real/imaginary part-time multiplexing or
parallel processing technique. With features like arbitrary sampling rate conversion,
and dynamic configuration, the proposed design is highly flexible, so that it can
generate a down-converted output with sampling rate, selectable within the range of
1 kS/s–225 MS/s. Moreover, the flexibility is further improved by being able to
specify the output sampling rate and center frequency to a resolution of less than 1
S/s. The experimental results show that the proposed design can achieve the same
functionality as the existing work but with fewer hardware resources.
VLSI_IEEE_18
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 16000/-
TOPIC : The Implementation of the Improved OMP for AIC Reconstruction Based on
Parallel Index Selection
Abstract : Sparse signal recovery becomes extremely challenging for a variety of real-
time applications. In this paper, we improve the orthogonal matching pursuit (OMP)
algorithm based on parallel correlation indices selection mechanism in each iteration
and Goldschmidt algorithm. Simulation results show that the improved OMP algorithm
with a reduced number of iterations and low hardware complexity of matrix operations
has higher success rate and recovery signal-to-noise-ratio (RSNR) for sparse signal
recovery. This paper presents an efficient complex valued system hardware architecture
of the recovery algorithm for analog-to-information structure based on compressive
sensing. The proposed architecture is implemented and validated on the Xilinx Virtex6
field-programmable gate array (FPGA) for signal reconstruction with N = 1024, K = 36,
and M = 256. The implementation results showed that the improved OMP algorithm
achieved a higher RSNR of 31.04 dB compared with the original OMP algorithm. This
synthesized design consumes a few percent of the hardware resources of the FPGA
chip with the clock frequency of 135.4 MHz and reconstruction time of 170 µs, which is
faster than the existing design.

VLSI_IEEE_20
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : Approximate Hybrid High Radix Encoding for Energy-Efficient Inexact Multipliers
Abstract : Approximate computing forms a design alternative that exploits the intrinsic
error resilience of various applications and produces energy-efficient circuits with small
accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for
generating the partial products in signed multiplications that encodes the most
significant bits with the accurate radix-4 encoding and the least significant bits with an
approximate higher radix encoding. The approximations are performed by rounding the
high radix values to their nearest power of two. The proposed technique can be
configured to achieve the desired energy–accuracy tradeoffs. Compared with the
accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55%
area savings, when operating at the same frequency, while the imposed error is
bounded by a Gaussian distribution with near-zero average. Moreover, the proposed
multipliers are compared with state-of-the-art inexact multipliers, outperforming them
by up to 40% in energy consumption, for similar error values. Finally, we demonstrate
the scalability of our technique.
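The idea of rounding a low-order digit to its nearest power of two can be sketched in Python as follows. This is a simplified illustration only: the split of the multiplier into an exact high part and one signed low digit of k bits, the tie-breaking rule (ties round up in magnitude), and the names round_pow2 and approx_multiply are all assumptions, not the encoding proposed in the paper. It does show why the approximation is cheap: multiplying by a power of two is a shift.

```python
def round_pow2(v):
    """Round integer v to the nearest signed power of two (0 stays 0).

    Ties (e.g. 3, which is equally far from 2 and 4) round up in
    magnitude; this tie rule is an assumption of the sketch.
    """
    if v == 0:
        return 0
    sign = -1 if v < 0 else 1
    m = abs(v)
    lo = 1 << (m.bit_length() - 1)   # largest power of two <= m
    hi = lo << 1                     # smallest power of two > m
    return sign * (lo if m - lo < hi - m else hi)

def approx_multiply(a, b, k):
    """Approximate a*b by rounding the signed low k-bit digit of b."""
    low = b & ((1 << k) - 1)
    if low >= 1 << (k - 1):          # recode the low digit as signed
        low -= 1 << k
    high = (b - low) >> k            # exact: b == (high << k) + low
    # The high part is multiplied exactly; the low digit becomes a shift.
    return ((a * high) << k) + a * round_pow2(low)

# Hypothetical operands.
a, b, k = 100, 1003, 6
approx = approx_multiply(a, b, k)
exact = a * b
```

Raising k moves more of the multiplier into the approximate digit, which is the kind of energy versus accuracy knob the abstract refers to.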
VLSI_IEEE_23
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : Low Phase Noise Ku-Band VCO With Optimal Switched-Capacitor Bank Design
Abstract : In this brief, a low phase noise Ku-band voltage-controlled oscillator (VCO)
fabricated in a 130-nm BiCMOS process is presented. The phase noise mechanism of the
switched-capacitor bank is analyzed, an optimum bank design to reduce phase noise is
proposed, and a tradeoff with tuning range is discussed. The prototype 12.2–13.1-GHz
VCO achieves a measured phase noise of −120.6 dBc/Hz at 1-MHz offset when running
at 12.67 GHz. The VCO core consumes a power of 17.7 mW and attains a figure of merit
of 190.
VLSI_IEEE_24
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : A High-Accuracy Programmable Pulse Generator With a 10-ps Timing Resolution
Abstract : Automatic test equipment must have high-precision and low-power pulse
generators (PGs) for testing memory and device-under-test ICs. This paper describes a
high-accuracy and wide-data-rate-range PG with a 10-ps time resolution. The PG
comprises an edge combiner (EC) and a multiphase clock generator (MPCG). The EC can
produce an arbitrary waveform through 32 phase outputs of the MPCG. The EC adopts a
one/zero detector and phase selection logic to define an operational data rate range
and a timing resolution, respectively. Therefore, the EC uses the phase selection logic to
combine the period window of the one/zero detector with the MPCG output phases.
The EC also uses a countdown counter for a wide operational range. In the MPCG, a
multiphase oscillator (MPO) adopts a ring oscillator scheme with sub feedback loops to
extend its maximum operational frequency. The MPO also uses a phase error corrector
to reduce the output phase error resulting from process and layout mismatches. Thus,
the PG can obtain high accuracy waveforms owing to small phase errors. The test chip
was implemented using a 0.13-µm CMOS process. The core area and power
consumption of the PG were measured to be 250 × 300 µm² and 18.7 mW, respectively.
The data rate range of the PG was determined to be from 3.2 kHz to 893 MHz. The time
resolution and average accuracy of the PG were measured to be 10 ps and ±0.3 LSB,
respectively.
VLSI_IEEE_32
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 20000/-
TOPIC : A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition
Abstract : Fast Fourier transform (FFT) is the kernel and the most time-consuming
algorithm in the domain of digital signal processing, and the FFT sizes of different
applications are very different. Therefore, this paper proposes a variable-size FFT
hardware accelerator, which fully supports the IEEE-754 single-precision floating-point
standard and the FFT calculation with a wide size range from 2 to 2²⁰ points. First, a
parallel Cooley–Tukey FFT algorithm based on matrix transposition (MT) is proposed,
which can efficiently divide a large size FFT into several small size FFTs that can be
executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is
designed, and several FFT performance optimization techniques such as hybrid twiddle
factor generation, multibank data memory, block MT, and token-based task scheduling
are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1
GHz with the area of 2.4 mm² and the power consumption of 91.3 mW at 25 °C, 0.9 V.
Finally, several experiments are carried out to evaluate the proposal’s performance in
terms of FFT execution time, resource utilization, and power consumption. Comparative
experiments show that our FFT hardware accelerator achieves at most 18.89× speedups
in comparison to two software-only solutions and two hardware dedicated solutions.
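The general idea of splitting a large FFT into small FFTs with a transposition in between can be sketched in Python using the textbook Cooley-Tukey factorization N = n1 × n2 (often called the four-step FFT). This is an assumption about the family of algorithm involved, not the paper's exact scheme; the function names are hypothetical, and a naive DFT stands in for the small FFT kernels.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used here as the small-transform kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

def four_step_fft(x, n1, n2):
    """DFT of length n1*n2 built from n2 DFTs of size n1 and n1 of size n2."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: view x as an n1-by-n2 matrix and transform each column.
    # cols[j2][k1] is the size-n1 DFT over j1 of x[j1*n2 + j2].
    cols = [dft([x[j1 * n2 + j2] for j1 in range(n1)]) for j2 in range(n2)]
    # Steps 2 and 3: transpose and multiply by the twiddle factors.
    rows = [
        [cols[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
         for j2 in range(n2)]
        for k1 in range(n1)
    ]
    # Step 4: transform each row (size n2); these are independent too.
    rows = [dft(r) for r in rows]
    # Output index k = k1 + n1*k2, a transposed read-out.
    return [rows[k % n1][k // n1] for k in range(n)]

# Hypothetical 12-point input split as 3 x 4.
x = [complex(i, -i) for i in range(12)]
spectrum = four_step_fft(x, 3, 4)
```

The column transforms in step 1 and the row transforms in step 4 have no data dependencies among themselves, which is what makes this decomposition attractive for parallel hardware.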
VLSI_IEEE_33
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 8000/-
TOPIC : A 12-bit 40-MS/s SAR ADC With a Fast-Binary-Window DAC Switching Scheme
Abstract : This paper presents a 12-bit 40-MS/s successive approximation register
analog-to-digital converter (ADC) for ultrasound imaging systems. By incorporating a
fast binary window digital-to-analog converter (DAC) switching technique, the
problematic most significant bit transition glitch was removed to improve linearity
without increasing the input capacitance or using a calibration scheme. A hybrid DAC
was also developed to overcome the yield problem that occurs when a tiny unit
capacitance is used in the DAC. Moreover, a reference buffer was used to accelerate the
DAC settling to achieve high speed conversion. The prototype ADC was fabricated using
a 130-nm CMOS technology. The ADC core occupied an active area of 0.1 mm² and
consumed a total power of 1.32 mW when a 1.2-V supply was used at a conversion rate
of 40 MS/s. The measured peak signal-to-noise-and-distortion ratio and spurious-free
dynamic range were 64 and 77.5 dB, respectively. The peak effective number of bits was
10.33, which is equivalent to a Walden figure-of-merit of 25.6 fJ/conversion step.
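The figure of merit quoted above can be checked with the standard Walden formula, FoM = P / (2^ENOB × fs), together with the usual relation ENOB = (SNDR − 1.76) / 6.02. The short Python sketch below uses only the numbers stated in the abstract; the function names are chosen for illustration.

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR in dB (standard relation)."""
    return (sndr_db - 1.76) / 6.02

def walden_fom(power_w, enob, fs_hz):
    """Walden figure of merit in joules per conversion-step."""
    return power_w / (2 ** enob * fs_hz)

# Values stated in the abstract.
power_w = 1.32e-3      # 1.32 mW
fs_hz = 40e6           # 40 MS/s
enob = 10.33           # peak ENOB
sndr_db = 64.0         # peak SNDR

fom_fj = walden_fom(power_w, enob, fs_hz) * 1e15   # convert J to fJ
enob_check = enob_from_sndr(sndr_db)
```

With these inputs the result comes out at roughly 25.6 fJ/conversion-step, and the ENOB derived from the 64-dB SNDR is close to the quoted 10.33 bits.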

VLSI_IEEE_34
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Combating Data Leakage Trojans in Commercial and ASIC Applications With
Time-Division Multiplexing and Random Encoding
Abstract : Globalization of microchip fabrication opens the possibility for an attacker to
insert hardware Trojans into a chip during the manufacturing process. While most
defensive methods focus on detection or prevention, a recent method, called
Randomized Encoding of Combinational Logic for Resistance to Data Leakage (RECORD),
uses data randomization to prevent hardware Trojans from leaking meaningful
information even when the entire design is known to the attacker. Both RECORD and its
sequential variant require significant area and power overhead. In this paper, a Time-
Division Multiplexed version of the RECORD design process is proposed which reduces
area overhead by 63% and power by 56%. This time-division multiplexing (TDM) concept
is further refined to allow commercial off-the-shelf (COTS) products and IP cores to be
safely operated from a separate chip. These new methods trade off latency (5.3× for
TDM and 3.9× for COTS) and energy use to accomplish area and power savings and
achieve greater security than the original RECORD process.
VLSI_IEEE_35
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : A 3.2-GHz Supply Noise-Insensitive PLL Using a Gate-Voltage-Boosted Source-
Follower Regulator and Residual Noise Cancellation
Abstract : In this brief, we propose a supply noise-insensitive charge pump phase-
locked loop (PLL) using a source-follower (SF) regulator and noise cancellation. In order
to minimize the voltage drop of the SF regulator while improving supply rejection, a
gate-voltage-boosting technique and the body-controlled noise cancellation are
proposed. To suppress the phase noise from the ring oscillator, a reference multiplier is
employed to maximize the PLL loop bandwidth. Implemented in 65-nm CMOS, a
prototype PLL at 3.2 GHz achieves supply noise spur of less than −33 dBc for a 50-mVpp
supply noise around the loop bandwidth while consuming 3.12 mW from a 1-V supply.
VLSI_IEEE_37
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : Low-Complexity VLSI Design of Large Integer Multipliers for Fully Homomorphic
Encryption
Abstract : Large integer multiplication has been widely used in fully homomorphic
encryption (FHE). Implementing feasible large integer multiplication hardware is thus
critical for accelerating the FHE evaluation process. In this paper, a novel and efficient
operand reduction scheme is proposed to reduce the area requirement of radix-r
butterfly units. We also extend the single-port, merged-bank memory structure to the
design of number theoretic transform (NTT) and inverse NTT (INTT) for further area
minimization. In addition, an efficient memory addressing scheme is developed to
support both the NTT/INTT and carry-resolution computations. Experimental results reveal
that significant area reductions can be achieved for the targeted 786,432- and
1,179,648-bit NTT-based multipliers designed using the proposed schemes in comparison with
the related works. Moreover, the two multiplications can be accomplished in 0.196 and
2.21 ms, respectively, based on 90-nm CMOS technology. The low-complexity feature of
the proposed large integer multiplier designs is thus obtained without sacrificing the
time performance.
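As a rough software illustration of NTT-based large integer multiplication followed by carry resolution, the sketch below convolves the digit sequences of two integers with an NTT and then propagates carries. The modulus 998244353, the primitive root 3, the 8-bit digit size, and all function names are assumptions chosen for a compact example; they are not the parameters or the architecture of the paper.

```python
MOD = 998244353   # NTT-friendly prime 119 * 2**23 + 1 (assumed for this sketch)
ROOT = 3          # primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative radix-2 number theoretic transform over Z_MOD."""
    a = list(values)
    n = len(a)
    j = 0
    for i in range(1, n):                    # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                       # butterfly stages
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u, v = a[k], a[k + half] * w % MOD
                a[k], a[k + half] = (u + v) % MOD, (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def to_digits(x, bits):
    """Little-endian digits of x in base 2**bits."""
    mask = (1 << bits) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= bits
    return digits or [0]


def multiply(x, y, bits=8):
    """Multiply non-negative integers by NTT convolution of their digits."""
    mask = (1 << bits) - 1
    a, b = to_digits(x, bits), to_digits(y, bits)
    # Convolution coefficients must stay below MOD to be recovered exactly.
    assert min(len(a), len(b)) * mask * mask < MOD
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    fa = ntt(a + [0] * (n - len(a)))
    fb = ntt(b + [0] * (n - len(b)))
    coeffs = ntt([p * q % MOD for p, q in zip(fa, fb)], invert=True)
    result, carry = 0, 0
    for i, c in enumerate(coeffs):           # resolve carries digit by digit
        carry += c
        result |= (carry & mask) << (bits * i)
        carry >>= bits
    return result


assert multiply(123456789, 987654321) == 123456789 * 987654321
```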
VLSI_IEEE_38
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Algorithm and VLSI Architecture Design of Proportionate-Type LMS Adaptive
Filters for Sparse System Identification
Abstract : The proportionate-type normalized LMS (Pt-NLMS) family of adaptive filtering
algorithms for sparse system identification poses significant implementation challenges
due to its high computational complexity, especially for real-time applications like
network echo cancelation. In this paper, we make the first attempt to implement Pt-NLMS
algorithms in hardware. Several reformulations are proposed to simplify the
original Pt-NLMS algorithms, thereby making them amenable to real-time VLSI
implementations. The reformulated algorithms, referred to as the delayed µ-law
proportionate LMS (DMPLMS) algorithm for white input and the delayed wavelet MPLMS
(DWMPLMS) algorithm for colored input, are then implemented in hardware. Simulation studies
demonstrate that the performance loss is very small for the proposed reformulations.
We implemented the proposed designs considering 16-bit fixed point representation in
hardware, and synthesis results show that the DMPLMS architecture with ≈30% increase
in hardware over the state-of-the-art conventional delayed LMS architecture achieves
3× improvement in convergence rate for white input and the DWMPLMS architecture
with ≈70% increase in hardware achieves 10× improvement in convergence rate for
correlated input conditions.
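To give a feel for what a proportionate-type update does, here is a minimal floating-point sketch of a textbook µ-law proportionate NLMS identifying a sparse system from white input. It is not the delayed, fixed-point reformulation described in the abstract; the function name, the parameter values, and the toy impulse response are all assumptions made for illustration.

```python
import math
import random


def mpnlms_identify(h_true, n_samples=4000, step=0.5, mu=1000.0,
                    rho=0.01, delta=1e-6, seed=0):
    """Identify a sparse FIR system with a mu-law proportionate NLMS update."""
    rng = random.Random(seed)
    n = len(h_true)
    w = [0.0] * n                  # adaptive filter weights
    x = [0.0] * n                  # input delay line, newest sample first
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0)] + x[:-1]               # white input
        d = sum(h * xi for h, xi in zip(h_true, x))      # noise-free desired signal
        e = d - sum(wi * xi for wi, xi in zip(w, x))     # a priori error
        # mu-law compression of the weight magnitudes sets the per-tap gains
        f = [math.log(1.0 + mu * abs(wi)) for wi in w]
        floor = rho * max(0.01, max(f))                  # keeps inactive taps adapting
        g = [max(fi, floor) for fi in f]
        g_sum = sum(g)
        g = [gi / g_sum for gi in g]                     # gains sum to 1
        norm = sum(gi * xi * xi for gi, xi in zip(g, x)) + delta
        w = [wi + step * gi * xi * e / norm for wi, gi, xi in zip(w, g, x)]
    return w


h = [0.0, 0.0, 1.0, 0.0, 0.0, -0.5, 0.0, 0.0]   # made-up sparse impulse response
w = mpnlms_identify(h)
print(max(abs(a - b) for a, b in zip(h, w)))     # worst-case tap error
```

Taps with larger magnitude receive larger gains, which is why proportionate algorithms tend to converge faster than plain NLMS when only a few taps are active.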
VLSI_IEEE_41
(BACK-END)
SOFTWARE :
TANNER EDA
STUDENT COST
MRP:
RS. 10000/-
TOPIC : A Fast-Locking, Low-Jitter Pulse width Control Loop for High-Speed ADC
Abstract : A fast-locking, high-precision, and low-jitter pulse width control loop (PWCL)
for high-speed, high-resolution analog-to-digital converters is presented. By adjusting
the duty cycle only through the delay of the rising edge, the clock jitter can be
greatly suppressed. An improved charge pump with a follower circuit and self-biased
loop was designed to decrease the voltage ripples for higher accuracy and lower jitter. A
startup circuit was adopted to enable the pulse width control loop to lock rapidly. With the
SMIC 0.18 µm 3.3 V CMOS process, the simulation and measured results show that
within 180 ns the PWCL can lock the clock duty cycle to an accuracy of 50 ± 1% with
10%∼90% input duty cycle from 50 to 550 MHz. The rms jitter is 73 fs at 250 MHz. The
active area is about 0.023 mm².

AREA EFFICIENT/ TIMING & DELAY REDUCTION
VLSI_IEEE_03
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : A Residue-to-Binary Converter for the Extended Four-Moduli Set {2^n − 1, 2^n + 1,
2^(2n) + 1, 2^(2n+p)}
Abstract : This brief presents a residue-to-binary converter for the moduli set {2^n − 1,
2^n + 1, 2^(2n) + 1, 2^(2n+p)}, where n is a positive integer and 0 ≤ p ≤ n − 2. The converter
consists of three simplified 4n-bit carry-save adders (CSAs) along with a modulo (2^(4n) − 1)
adder. The main contribution of this brief is reducing the requirements of the proposed
CSA network, which has impacted the area, delay, power and energy. Compared with
four-moduli and five-moduli sets that have the dynamic range 2^v(2^(4n) − 1), where v = n
or 2n, the proposed converter resulted in the average area, delay, power, and energy
reductions of 22.7%, 9.2%, 17.8%, and 24.5%, respectively. Moreover, the throughput
rate per unit area has been improved by an average of 48.7%.
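For readers unfamiliar with residue-to-binary conversion, the following sketch builds the four-moduli set for small n and p and recovers an integer from its residues with the textbook Chinese remainder theorem. This is a software reference model only, not the CSA-based converter of the brief; the function names and the example values n = 4, p = 2 are assumptions.

```python
from math import prod


def moduli_set(n, p):
    """Moduli {2^n - 1, 2^n + 1, 2^(2n) + 1, 2^(2n+p)} for 0 <= p <= n - 2."""
    if not 0 <= p <= n - 2:
        raise ValueError("p must satisfy 0 <= p <= n - 2")
    return [2**n - 1, 2**n + 1, 2**(2 * n) + 1, 2**(2 * n + p)]


def to_residues(x, moduli):
    """Forward conversion: one residue per modulus."""
    return [x % m for m in moduli]


def crt_to_binary(residues, moduli):
    """Textbook CRT reverse conversion (not the paper's CSA datapath)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = big_m // m
        total += r * m_hat * pow(m_hat, -1, m)   # modular inverse, Python 3.8+
    return total % big_m


moduli = moduli_set(n=4, p=2)                    # [15, 17, 257, 1024]
x = 12_345_678                                   # must be below the dynamic range
assert crt_to_binary(to_residues(x, moduli), moduli) == x
```

The product of the moduli equals 2^(2n+p)(2^(4n) − 1), which is the dynamic range available to the converter.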
VLSI_IEEE_12
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : An Efficient Fault-Tolerance Design for Integer Parallel Matrix–Vector
Multiplications
Abstract : Parallel matrix processing is a typical operation in many systems, and in
particular matrix–vector multiplication (MVM) is one of the most common operations in
the modern digital signal processing and digital communication systems. This paper
proposes a fault-tolerant design for integer parallel MVMs. The scheme combines ideas
from error correction codes with the self-checking capability of MVM. Field-
programmable gate array evaluation shows that the proposed scheme can significantly
reduce the overheads compared to the protection of each MVM on its own. Therefore,
the proposed technique can be used to reduce the cost of providing fault tolerance in
practical implementations.
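One common way to give an integer MVM a self-check is a checksum: the sum of the output vector must equal the dot product of the column sums of the matrix with the input vector. The sketch below shows that check detecting an injected error. It is an assumed illustration of the general idea, not the specific scheme of the paper, and the function names and the `inject_fault` parameter are hypothetical.

```python
def matvec(matrix, vector):
    """Plain integer matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def checked_matvec(matrix, vector, inject_fault=None):
    """MVM with a checksum test: sum(A x) must equal (column sums of A) . x."""
    y = matvec(matrix, vector)
    if inject_fault is not None:            # (index, delta) emulates a soft error
        index, delta = inject_fault
        y[index] += delta
    checksum_row = [sum(col) for col in zip(*matrix)]
    expected = sum(c * x for c, x in zip(checksum_row, vector))
    return y, sum(y) == expected


A = [[1, 2], [3, 4]]
x = [5, 6]
print(checked_matvec(A, x))                       # fault-free run passes the check
print(checked_matvec(A, x, inject_fault=(0, 1)))  # corrupted output fails the check
```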
VLSI_IEEE_15
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Extending 3-bit Burst Error-Correction Codes With Quadruple Adjacent Error
Correction
Abstract : The use of error-correction codes (ECCs) with advanced correction capability
is a common system-level strategy to harden the memory against multiple bit upsets
(MBUs). Therefore, the construction of ECCs with advanced error correction and low
redundancy has become an important problem, especially for adjacent ECCs. Existing
codes for mitigating MBUs mainly focus on the correction of up to 3-bit burst errors. As
the technology scales and cell interval distance decreases, the number of affected bits
can easily extend to more than 3 bits. The previous methods are therefore not enough to
satisfy the reliability requirement of the applications in harsh environments. In this
paper, a technique to extend 3-bit burst error-correction (BEC) codes with quadruple
adjacent error correction (QAEC) is presented. First, the design rules are specified and
then a searching algorithm is developed to find the codes that comply with those rules.
The H matrices of the 3-bit BEC with QAEC obtained are presented. They do not require
additional parity check bits compared with a 3-bit BEC code. By applying the new
algorithm to previous 3-bit BEC codes, the performance of 3-bit BEC is also remarkably
improved. The encoding and decoding procedure of the proposed codes is illustrated
with an example. Then, the encoders and decoders are implemented using a 65-nm
library and the results show that our codes have moderate total area and delay
overhead to achieve the correction ability extension.
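A basic requirement behind any such design rule is that every correctable error pattern must map to a distinct, nonzero syndrome. The sketch below enumerates burst errors of up to 3 bits plus quadruple adjacent errors and checks that property for an H matrix given as a list of integer columns. The helper names are hypothetical and the identity-like matrix is only a toy input; the paper's actual rules, search algorithm, and H matrices are not reproduced here.

```python
def burst_patterns(n, max_burst):
    """Error masks over n bits whose errors span at most max_burst adjacent bits."""
    patterns = set()
    for length in range(1, max_burst + 1):
        for start in range(n - length + 1):
            for inner in range(1 << max(length - 2, 0)):
                if length == 1:
                    mask = 1
                else:                         # first and last bit of the window flipped
                    mask = 1 | (inner << 1) | (1 << (length - 1))
                patterns.add(mask << start)
    return patterns


def quad_adjacent_patterns(n):
    """Error masks with four adjacent bits flipped."""
    return {0b1111 << start for start in range(n - 3)}


def syndrome(h_columns, error_mask):
    """XOR of the H-matrix columns selected by the error mask."""
    s = 0
    for bit, column in enumerate(h_columns):
        if (error_mask >> bit) & 1:
            s ^= column
    return s


def corrects_all(h_columns, patterns):
    """True if every pattern has a nonzero syndrome shared with no other pattern."""
    seen = set()
    for p in patterns:
        s = syndrome(h_columns, p)
        if s == 0 or s in seen:
            return False
        seen.add(s)
    return True


n = 6
patterns = burst_patterns(n, 3) | quad_adjacent_patterns(n)
toy_h = [1 << i for i in range(n)]           # toy matrix: syndrome equals the error mask
print(corrects_all(toy_h, patterns))
```

A search procedure would call a check like `corrects_all` on candidate matrices and keep only those that pass.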
VLSI_IEEE_19
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : A 588-Gb/s LDPC Decoder Based on Finite-Alphabet Message Passing
Abstract : An ultrahigh throughput low-density parity-check (LDPC) decoder with an
unrolled full-parallel architecture is proposed, which achieves the highest decoding
throughput compared to previously reported LDPC decoders in the literature. The
decoder benefits from a serial message-transfer approach between the decoding stages
to alleviate the well-known routing congestion problem in parallel LDPC decoders.
Furthermore, a finite-alphabet message passing algorithm is employed to replace the
VN update rule of the standard min-sum (MS) decoder with lookup tables, which are
designed in a way that maximizes the mutual information between decoding messages.
The proposed algorithm results in an architecture with reduced bit-width messages,
leading to a significantly higher decoding throughput and to a lower area compared to
an MS decoder when serial message transfer is used. The architecture is placed and
routed for the standard MS reference decoder and for the proposed finite-alphabet
decoder using a custom pseudo-hierarchical backend design strategy to further alleviate
routing congestion and to handle the large design. Post-layout results show that the
finite-alphabet decoder with the serial message transfer architecture achieves a
throughput as large as 588 Gb/s with an area of 16.2 mm² and dissipates an average
of 22.7 pJ per decoded bit in a 28-nm fully depleted silicon-on-insulator library.
Compared to the reference MS decoder, this corresponds to 3.1 times smaller area and
2 times better energy efficiency.
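For reference, the standard min-sum update rules that serve as the baseline can be sketched in a few lines. The variable-node (VN) rule shown here is the arithmetic that the abstract says is replaced by lookup tables in the finite-alphabet decoder; the lookup tables themselves are not reproduced. Function names and the integer message values are assumptions for illustration.

```python
def check_node_update(incoming):
    """Min-sum check node: sign product and minimum magnitude of the other edges."""
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign
        outgoing.append(sign * min(abs(m) for m in others))
    return outgoing


def variable_node_update(channel_llr, incoming):
    """Min-sum variable node: channel value plus all other incoming messages."""
    total = channel_llr + sum(incoming)
    return [total - m for m in incoming]


print(check_node_update([2, -3, 4]))         # one message per edge
print(variable_node_update(1, [2, -3, 4]))
```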
VLSI_IEEE_21
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Basic-Set Trellis Min–Max Decoder Architecture for Nonbinary LDPC Codes With
High-Order Galois Fields
Abstract : Nonbinary low-density parity-check (NB-LDPC) codes outperform their
binary counterparts in terms of error correction performance. However, the drawback
of NB-LDPC decoders is high complexity, especially for the check node unit (CNU), and
the complexity increases considerably when increasing the Galois-field (GF) order. In this
paper, a novel basic-set trellis min–max algorithm is proposed to greatly reduce not only
the CNU complexity but also the number of messages exchanged between the check
node and the variable node compared with previous studies, which is highly efficient for
higher order GFs. In addition, the proposed CNU is designed to compute the messages in
a parallel way. Layered decoder architectures based on the proposed algorithm were
implemented for the (837, 726) NB-LDPC code over GF(32) and the (1512, 1323) code
over GF(64) using 90-nm CMOS technology, and obtained a reduction in the complexity
by 30% and 37% for the CNU, and 40% and 37.4% for the whole decoder, respectively.
Moreover, the proposed decoder achieves a higher throughput at 1.67 Gbit/s and 1.4
Gbit/s compared with the other state-of-the-art high-rate NB-LDPC decoders with high-
order GFs.

VLSI_IEEE_22
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : Analysis and Design of Cost-Effective, High-Throughput LDPC Decoders
Abstract : This paper introduces a new approach to cost-effective, high-throughput
hardware designs for low-density parity-check (LDPC) decoders. The proposed
approach, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), exploits
the robustness of message-passing LDPC decoders to inaccuracies in the calculation of
exchanged messages, and it is shown to provide a unified framework for several designs
previously proposed in the literature. NS-FAIDs are optimized by density evolution for
regular and irregular LDPC codes, and are shown to provide different tradeoffs between
hardware complexity and decoding performance. Two hardware architectures targeting
high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID
decoding kernels. ASIC post-synthesis implementation results on 65-nm CMOS
technology show that NS-FAIDs yield significant improvements in the throughput to area
ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly
degraded error correction performance.
VLSI_IEEE_26
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : ULV-Turbo Cache for an Instantaneous Performance Boost on Asymmetric
Architectures
Abstract : An asymmetric architecture is commonly used in modern embedded systems
to reduce energy consumption. The systems tend to execute more applications in the
energy-efficient core, which typically employs ultralow voltage (ULV) to save energy.
However, caches become a reliability and performance barrier that limits the minimum
operating voltage and blocks system performance in the ULV environment. The poor
performance of an ultralow-voltage core causes most workload requirements to awaken
and then execute on the host core, leading to high energy consumption. In this paper,
we propose a ULV-Turbo cache based on a ULV-selective-ally 8T static random access
memory (SRAM) that is able to perform reliable ultralow-voltage operation and provide
the speedup function of SRAM rows ally. The system is able to speed up the ULV core
instantaneously and execute more applications with the ULV-Turbo cache. In our
system-wide evaluation based on a real attitude and heading reference system
workload on an asymmetric wearable system, the ULV-Turbo cache reduces the energy
consumption of the entire system by approximately 36%.

VLSI_IEEE_27
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : Low-Complexity Methodology for Complex Square-Root Computation
Abstract : In this brief, we propose a low-complexity methodology to compute a
complex square root using only a circular coordinate rotation digital computer (CORDIC)
as opposed to the state-of-the-art techniques that need both circular as well as
hyperbolic CORDICs. Subsequently, an architecture has been designed based on the
proposed methodology and implemented on the ASIC platform using the UMC 180-nm
technology node with 1.0 V at 5 MHz. Field programmable gate array (FPGA) prototyping
using a Xilinx Virtex-6 (XC6VLX240T) has also been carried out. After thorough theoretical
analysis and experimental validation, it can be inferred that the proposed methodology
reduces slice lookup table usage by 21.15% (on the FPGA platform), and saves 20.25% of
silicon area and 19% of power consumption (on the ASIC platform), when compared
with the state-of-the-art method without compromising the computational speed,
throughput, and accuracy.
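The two circular CORDIC modes involved can be sketched as follows: vectoring mode extracts magnitude and angle, and rotation mode rotates a vector by half that angle, giving the polar-form square root sqrt(|z|)·(cos(θ/2) + j·sin(θ/2)). This floating-point sketch is an assumption-laden illustration rather than the paper's methodology: it handles only inputs with a positive real part, and it uses `math.sqrt` as a stand-in for the real square root of the magnitude, which the paper obtains differently.

```python
import math

ITERATIONS = 32
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))


def cordic_vectoring(x, y):
    """Circular CORDIC, vectoring mode: (magnitude, angle) of (x, y) for x > 0."""
    z = 0.0
    for i in range(ITERATIONS):
        d = -1.0 if y > 0 else 1.0           # rotate toward the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, z


def cordic_rotation(x, y, angle):
    """Circular CORDIC, rotation mode: rotate (x, y) by angle (|angle| < ~1.74 rad)."""
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0          # drive the residual angle to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


def complex_sqrt(re, im):
    """Principal sqrt(re + j*im) for re > 0 via magnitude and half angle."""
    magnitude, angle = cordic_vectoring(re, im)
    # math.sqrt is a placeholder for the real square root (assumption).
    return cordic_rotation(math.sqrt(magnitude), 0.0, angle / 2.0)


print(complex_sqrt(3.0, 4.0))                # should be close to (2.0, 1.0)
```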
VLSI_IEEE_28
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Securing the PRESENT Block Cipher Against Combined Side-Channel Analysis and
Fault Attacks
Abstract : In this paper, we present and evaluate a hardware implementation of the
PRESENT block cipher secured against both side-channel analysis and fault attacks (FAs).
The side-channel security is provided by the first-order threshold implementation
masking scheme of the serialized PRESENT proposed by Poschmann et al. For the FA
resistance, we employ the Private Circuits II countermeasure presented by Ishai et al. at
Eurocrypt 2006, which we tailor to resist arbitrary 1-bit faults. We perform a side-
channel evaluation using the state-of-the-art leakage detection tests, quantify the
resource overhead of the Private Circuits II countermeasure, subject the
implementation to established differential FAs against the PRESENT block cipher, and
reflect on the structural resistance of the countermeasure. This paper provides
detailed instructions on how to successfully achieve a secure Private Circuits II
implementation for the data path as well as the control logic.
VLSI_IEEE_39
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Multilevel Half-Rate Phase Detector for Clock and Data Recovery Circuits
Abstract : In this brief, a half-rate (HR) bang-bang (BB) phase detector (PD) with multiple
decision levels is proposed for clock and data recovery (CDR) circuits. The combination
allows the oscillator to run at half the input data rate while providing information about
the sign and magnitude of the phase shift between the PD inputs. This allows a finer
control of the frequency of the oscillator in the phase-locked loop (PLL) of the CDR
circuit, which results in up to 30% less output clock jitter than with a conventional
two-level HR BB PD. Thanks to this, the bit error rate can be decreased by up to 5× in a
5-Gb/s CDR circuit. The proposed topology was implemented in a 28-nm FDSOI CMOS
technology providing average power consumption below 76 µW with a supply voltage of
1 V. Although multilevel (ML) BB PDs have already been proposed in some PLL-based
CDR with very interesting results, a specific design of the PD has to be implemented for
an HR system. This brief provides the first ML-HR-BBPD.

VLSI_IEEE_40
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 12000/-
TOPIC : Fast Neural Network Training on FPGA Using Quasi-Newton Optimization
Method
Abstract : In this brief, a customized and pipelined hardware implementation of the
quasi-Newton (QN) method on a field-programmable gate array (FPGA) is proposed for
fast onsite training of artificial neural networks, targeting embedded applications.
The architecture is scalable to cope with different neural network sizes while it supports
batch-mode training. Experimental results demonstrate the superior performance and
power efficiency of the proposed implementation over CPU, graphics processing unit,
and FPGA QN implementations.
VLSI_IEEE_42
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : Feedback-Based Low-Power Soft-Error-Tolerant Design for Dual-Modular
Redundancy
Abstract : Triple-modular redundancy (TMR), which consists of three identical modules
and a voting circuit, is a common architecture for soft-error tolerance. However, the
original TMR suffers from two major drawbacks: the large area overhead and the
vulnerability of the voter. In order to overcome these drawbacks, we propose a new
complementary dual-modular redundancy (CDMR) scheme for mitigating the effect of
soft errors. Inspired by the Markov random field (MRF) theory, a two-stage voting
system is implemented in CDMR, including a first-stage optimal MRF structure and a
second-stage high-performance merging unit. The CDMR scheme can reduce the voting
circuit area by 20% while saving the area of one redundant module, achieving at least
26% error-rate reduction at an ultralow supply voltage of 0.25 V with 8.33% faster
timing compared to previous voter designs.
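The baseline TMR voter mentioned at the start of the abstract reduces to a bitwise 2-of-3 majority function. The sketch below models only that conventional voter, to show what the CDMR scheme is being compared against; the function name and the example bit patterns are assumptions, and the MRF-based two-stage voter itself is not modeled.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    return (a & b) | (b & c) | (a & c)


good = 0b1010
upset = good ^ 0b0100                        # one module suffers a single-bit soft error
print(bin(majority_vote(good, good, upset)))
```

A single faulty module is outvoted, but an error inside the voter itself is not masked, which is the vulnerability the abstract points out.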
VLSI_IEEE_43
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 10000/-
TOPIC : A Simple Yet Efficient Accuracy Configurable Adder Design
Abstract : Approximate computing is a promising approach for low-power IC design and
has recently received considerable research attention. To accommodate dynamic levels
of approximation, a few accuracy-configurable adder (ACA) designs have been
developed in the past. However, these designs tend to incur large area overheads as
they rely on either redundant computing or complicated carry prediction. Some of these
designs include error detection and correction circuitry, which further increase the area.
In this paper, we investigate a simple ACA design that contains no redundancy or error
detection/correction circuitry and uses very simple carry prediction. The simulation
results show that our design dominates the latest previous work on accuracy-delay-power
tradeoff while using 39% lower area. In the best case, the iso-delay power of our
design is only 16% of that of an accurate adder, regardless of degradation in accuracy. One variant
of this design provides finer-grained and larger tunability than that of the previous
works. Moreover, we propose a delay adaptive self-configuration technique to further
improve the accuracy-delay-power tradeoff. The advantages of our method are
confirmed by the applications in multiplication and discrete cosine transform
computing.
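To make the idea of simple carry prediction concrete, here is a generic block-based approximate adder in which the carry into each block is predicted from the previous block alone, so long carry chains are cut. It is a common textbook-style construction used here as an assumption; it is not the ACA design evaluated in the paper, and the function name and the `block` parameter are hypothetical.

```python
def approximate_add(a, b, width=16, block=4):
    """Block-wise adder with carry predicted from the previous block only."""
    mask = (1 << block) - 1
    result = 0
    predicted_carry = 0
    for shift in range(0, width, block):
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        result |= ((a_blk + b_blk + predicted_carry) & mask) << shift
        # The prediction for the next block ignores this block's own carry-in.
        predicted_carry = (a_blk + b_blk) >> block
    return result


print(hex(approximate_add(0x0F0F, 0x0101)))            # short carry chains: exact
print(hex(approximate_add(0x00FF, 0x0001)))            # long carry chain: approximate
print(hex(approximate_add(0x00FF, 0x0001, block=16)))  # one block: exact again
```

Widening `block` lengthens the carry chain that is handled exactly, trading delay for accuracy, which is the kind of knob an accuracy-configurable adder exposes.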

Audio, Image and Video Processing
VLSI_IEEE_14
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 20000/-
TOPIC : An Energy-Efficient Programmable Manycore Accelerator for Personalized
Biomedical Applications
Abstract : Wearable personalized health monitoring systems can offer a cost-effective
solution for human health care. These systems must constantly monitor patients’
physiological signals and provide highly accurate and quick processing and delivery of
the vast amount of data within a limited power and area footprint. These personalized
biomedical applications require sampling and processing multiple streams of
physiological signals with a varying number of channels and sampling rates. The
processing typically consists of feature extraction, data fusion, and classification stages
that require a large number of digital signal processing (DSP) and machine learning (ML)
kernels. In response to these requirements, in this paper, a tiny, energy-efficient, and
domain-specific manycore accelerator referred to as power-efficient nano clusters
(PENC) is proposed to map and execute the kernels of these applications. Simulation
results show that the PENC is able to reduce energy consumption by up to 80% and 25%
for DSP and ML kernels, respectively, when optimally parallelized. In addition, we fully
implemented three compute-intensive personalized biomedical applications, namely,
multichannel seizure detection, multiphysiological stress detection, and standalone
tongue drive system (sTDS), to evaluate the proposed manycore performance relative
to commodity embedded CPU, graphical processing unit (GPU), and field-programmable
gate array (FPGA)-based implementations. For these three case studies, the energy
consumption and the performance of the proposed PENC manycore, when acting as an
accelerator along with an Intel Atom processor as a host, are compared with the existing
commercial off-the-shelf general purpose, customizable, and programmable embedded
platforms, including Intel Atom, Xilinx Artix-7 FPGA, and NVIDIA TK1 ARM-A15
and K1 GPU system on a chip. For these applications, the PENC manycore
is able to significantly improve throughput and energy efficiency by up to 1872× and
276×, respectively. For the most computationally intensive application of seizure
detection, the PENC manycore is able to achieve a throughput of 15.22 giga-operations-
per-second (GOPs), which is a 14× improvement in throughput over a custom FPGA
solution. For stress detection, the PENC achieves a throughput of 21.36 GOPs and an
energy efficiency of 4.23 GOP/J, which is 14.87× and 2.28× better than the FPGA
implementation, respectively. For the sTDS application, the PENC improves throughput
by 5.45× and energy efficiency by 2.37× over the FPGA implementation.

VLSI_IEEE_16
(FRONT-END)
SOFTWARE:
MODELSIM
&
XILINX
STUDENT COST
MRP:
RS. 18000/-
TOPIC : VLSI Design of an ML-Based Power-Efficient Motion Estimation Controller for
Intelligent Mobile Systems
Abstract : In this paper, a machine learning (ML)-based power-efficient motion
estimation (ME) controller algorithm and VLSI architecture incorporating coding
bandwidth and rate distortion (R-D) cost using convex optimization are proposed to
effectuate a smart and bandwidth-efficient ME design for intelligent mobile systems. To
be smart and adapt to time-varying coding bandwidth using intelligent power-
management techniques in modern application processor systems, we first propose an
ML-based bandwidth-on-demand ME controller algorithm based on the convex
optimization method to resolve the lack of awareness of coding bandwidth in prior
ME designs. Then, a hardware-friendly and power-efficient VLSI architecture is
developed to implement an intelligent, high-performance, and low-power ME controller
design that can be combined with prior ME designs to satisfy the bandwidth-efficient
ME design target under bandwidth constraints. The final implementation results show
that the proposed smart ME controller architecture using our proposed bandwidth
control scheme requires 0.816K gates, consumes 0.873 mW of power at a working
frequency of 1.1 GHz with Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm
CMOS technology, and achieves an average bandwidth reduction of 56.08% compared
with previous non-bandwidth-on-demand ME designs for high-definition (HD) videos.
