Triple-modular redundancy (TMR), which consists of three identical modules and a voting circuit, is a common architecture for soft-error tolerance. However, the original TMR suffers from two major drawbacks: the large area overhead and the vulnerability of the voter. To overcome these drawbacks, we propose a new complementary dual-modular redundancy (CDMR) scheme for mitigating the effect of soft errors. Inspired by Markov random field (MRF) theory, CDMR implements a two-stage voting system consisting of a first-stage optimal MRF structure and a second-stage high-performance merging unit. The CDMR scheme can reduce the voting circuit area by 20% while saving the area of one redundant module, achieving at least 26% error-rate reduction at an ultralow supply voltage of 0.25 V with 8.33% faster timing compared to previous voter designs.
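For reference, the conventional TMR voter that the abstract uses as its baseline can be sketched as a bitwise majority function. This is a minimal Python illustration of plain TMR voting only, not the CDMR or MRF-based voter described above; the function name `tmr_vote` is hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs (classic TMR voter)."""
    # Each output bit is 1 when at least two of the three inputs have a 1 there.
    return (a & b) | (b & c) | (a & c)


# One module suffers a single-bit upset; the majority vote masks it.
golden = 0b10110010
faulty = golden ^ 0b00001000  # flip bit 3 in one copy only
assert tmr_vote(golden, faulty, golden) == golden
```

The sketch also shows the weakness the abstract mentions: the voter itself is a single unprotected function, so an upset inside it is not masked.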
A closed-form expression for minimum operating voltage of CMOS D flip-flop | Nxfee Innovation
In this paper, a closed-form expression for estimating the minimum operating voltage (VDDmin) of D flip-flops (FFs) is proposed. VDDmin is defined as the minimum supply voltage at which the FFs are functional without errors. The proposed expression indicates that the VDDmin of FFs is a linear function of the square root of the logarithm of the number of FFs. Its slope depends on the within-die variation of the threshold voltage (VTH), and its intercept depends on the balance between PMOS and NMOS, which is mainly due to the die-to-die VTH variation. The proposed expression of VDDmin is validated by simulation results as well as silicon measurements. Finally, we discuss the dependence of VDDmin on the device parameters.
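The relationship stated in the abstract can be written schematically as follows. The symbols $a$, $b$, and $N$ are placeholders introduced here for illustration; the abstract gives neither the actual coefficients nor the base of the logarithm.

```latex
V_{DD\min}(N) \approx a\,\sqrt{\log N} + b
```

Here $N$ is the number of FFs, the slope $a$ stands for the term that depends on within-die VTH variation, and the intercept $b$ stands for the term set by the PMOS/NMOS balance (die-to-die VTH variation).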
Fast neural network training on FPGA using quasi-Newton optimization method | Nxfee Innovation
In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on a field-programmable gate array (FPGA) is proposed for fast on-site training of artificial neural networks, targeting embedded applications. The architecture is scalable to cope with different neural network sizes, and it supports batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit, and FPGA QN implementations.
A fast and low-complexity operator for the computation of the arctangent of a... | Nxfee Innovation
The computation of the arctangent of a complex number, i.e., the atan2 function, is frequently needed in hardware systems that could profit from an optimized operator. In this brief, we present a novel method to compute the atan2 function and a hardware architecture for its implementation. The method is based on a first stage that performs a coarse approximation of the atan2 function and a second stage that improves the output accuracy by means of a lookup table. We present results for fixed-point implementations in a field-programmable gate array device, all of them guaranteeing last-bit accuracy, which provide an advantage in latency, speed, and use of resources, when compared with well-established fixed-point options.
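To make the idea of a coarse first-stage approximation concrete, here is a minimal Python sketch that uses a well-known low-order polynomial for atan on [0, 1] together with octant folding. This is a generic textbook-style approximation chosen for illustration; it is not the method of the brief, and the lookup-table refinement stage is not modeled. The name `atan2_coarse` is hypothetical.

```python
import math


def atan2_coarse(y: float, x: float) -> float:
    """Coarse atan2 via octant folding and a low-order polynomial."""
    if x == 0.0 and y == 0.0:
        return 0.0  # convention chosen for this sketch
    ax, ay = abs(x), abs(y)
    swap = ay > ax
    z = ax / ay if swap else ay / ax        # fold into z in [0, 1]
    # atan(z) ~= z * (pi/4 + 0.273 * (1 - z)) on [0, 1]
    angle = z * (math.pi / 4 + 0.273 * (1.0 - z))
    if swap:
        angle = math.pi / 2 - angle         # undo the octant swap
    if x < 0:
        angle = math.pi - angle             # left half-plane
    if y < 0:
        angle = -angle                      # lower half-plane
    return angle


# The coarse stage stays within a few milliradians of the reference.
for yy, xx in [(1, 1), (1, -2), (-3, 0.5), (-1, -1), (0.2, 5)]:
    assert abs(atan2_coarse(yy, xx) - math.atan2(yy, xx)) < 5e-3
```

In a two-stage design such as the one the abstract describes, a residual error of this kind is what a second, table-based stage would be responsible for correcting.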
An efficient fault tolerance design for integer parallel matrix vector | Nxfee Innovation
Parallel matrix processing is a typical operation in many systems, and in particular matrix–vector multiplication (MVM) is one of the most common operations in the modern digital signal processing and digital communication systems. This paper proposes a fault tolerant design for integer parallel MVMs. The scheme combines ideas from error correction codes with the self-checking capability of MVM. Field-programmable gate array evaluation shows that the proposed scheme can significantly reduce the overheads compared to the protection of each MVM on its own. Therefore, the proposed technique can be used to reduce the cost of providing fault tolerance in practical implementations.
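The self-checking property of integer MVM mentioned above can be illustrated with a simple checksum test: the sum of the output vector must equal the dot product of the matrix column sums with the input vector. The Python sketch below shows only this generic checksum idea under that assumption; it does not reproduce the paper's scheme, which additionally draws on error correction codes across parallel MVMs. The names `matvec` and `checksum_ok` are hypothetical.

```python
def matvec(A, x):
    """Integer matrix-vector product y = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def checksum_ok(A, x, y):
    """Check sum(y) against (column sums of A) . x, which is exact for integers."""
    col_sums = [sum(col) for col in zip(*A)]
    expected = sum(c * xi for c, xi in zip(col_sums, x))
    return sum(y) == expected


A = [[1, 2], [3, 4]]
x = [5, 6]
y = matvec(A, x)
assert checksum_ok(A, x, y)          # fault-free result passes

y_faulty = [y[0] ^ 1, y[1]]          # single-bit error in one output
assert not checksum_ok(A, x, y_faulty)
```

A check like this detects an error but does not locate it, which is why a complete fault-tolerant design needs additional redundancy to correct the faulty output.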
Multilevel half-rate phase detector for clock and data recovery circuits | Nxfee Innovation
In this brief, a half-rate (HR) bang-bang (BB) phase detector (PD) with multiple decision levels is proposed for clock and data recovery (CDR) circuits. The combination allows the oscillator to run at half the input data rate while providing information about the sign and magnitude of the phase shift between the PD inputs. This allows finer control of the oscillator frequency in the phase-locked loop (PLL) of the CDR circuit, which results in up to 30% less output clock jitter than with a conventional two-level HR BB PD. As a result, the bit error rate can be decreased by up to 5× in a 5-Gb/s CDR circuit. The proposed topology was implemented in a 28-nm FDSOI CMOS technology, providing average power consumption below 76 µW with a supply voltage of 1 V. Although multilevel (ML) BB PDs have already been proposed in some PLL-based CDR circuits with very interesting results, a specific design of the PD has to be implemented for an HR system. This brief provides the first ML-HR-BBPD.
Design and FPGA implementation of a reconfigurable digital down converter for... | Nxfee Innovation
This brief presents a field-programmable gate array-based implementation of a reconfigurable digital down converter (DDC) that can process an input bandwidth of up to 3.6 GHz and provide a flexible down-converted output. The proposed DDC consists of a mixer and a resampling filter. The resampling filter can work at a much higher clock rate because all the single-cycle recursive loops in the resampling filter are pipelined by using either real/imaginary part-time multiplexing or a parallel processing technique. With features such as arbitrary sampling rate conversion and dynamic configuration, the proposed design is highly flexible, so that it can generate a down-converted output with a sampling rate selectable within the range of 1 kS/s–225 MS/s. Moreover, the flexibility is further improved by being able to specify the output sampling rate and center frequency to a resolution of less than 1 S/s. The experimental results show that the proposed design can achieve the same functionality as the existing work but with fewer hardware resources.
A reconfigurable LDPC decoder optimized for 802.11n/ac applications | Nxfee Innovation
This paper presents a high data-rate low-density parity-check (LDPC) decoder, suitable for the 802.11n/ac (WiFi) standard. The innovative features of the proposed decoder relate to the decoding algorithms and the interconnection between the processing elements. The reduction of the hardware complexity of decoders based on the min-sum (MS) algorithms comes at the cost of performance degradation, especially in high-noise regions. We introduce more accurate approximations of the log sum-product algorithm that also operate well for low signal-to-noise ratio values. Telecommunication standards, including WiFi, support more than one quasi-cyclic LDPC code, with different characteristics such as codeword length and code rate. A proposed design technique derives networks capable of supporting a variety of codes and efficiently realizing connectivity between a variable number of processing units, with a relatively small hardware overhead over the single-code case. As a demonstration of the proposed technique, we implemented a reconfigurable network based on barrel rotators, suitable for LDPC decoders compatible with the WiFi standard. Our approach achieves low complexity and high clock frequency compared with related prior works. A 90-nm application-specific integrated circuit implementation of the proposed high-parallel WiFi decoder occupies 4.88 mm² and achieves an information throughput rate of 4.5 Gbit/s at a clock frequency of 555 MHz.
A high-accuracy programmable pulse generator with a 10-ps timing resolution | Nxfee Innovation
Automatic test equipment must have high-precision and low-power pulse generators (PGs) for testing memory and device-under-test ICs. This paper describes a high-accuracy and wide-data-rate-range PG with a 10-ps time resolution. The PG comprises an edge combiner (EC) and a multiphase clock generator (MPCG). The EC can produce an arbitrary waveform through 32 phase outputs of the MPCG. The EC adopts a one/zero detector and phase selection logic to define an operational data rate range and a timing resolution, respectively. The EC therefore uses the phase selection logic to combine the period window of the one/zero detector with the MPCG output phases. The EC also uses a countdown counter for a wide operational range. In the MPCG, a multiphase oscillator (MPO) adopts a ring oscillator scheme with subfeedback loops to extend its maximum operational frequency. The MPO also uses a phase error corrector to reduce the output phase error resulting from process and layout mismatches. Thus, the PG can obtain high-accuracy waveforms owing to small phase errors. The test chip was implemented using a 0.13-µm CMOS process. The core area and power consumption of the PG were measured to be 250 × 300 µm² and 18.7 mW, respectively. The data rate range of the PG was determined to be from 3.2 kHz to 893 MHz. The time resolution and average accuracy of the PG were measured to be 10 ps and ±0.3 LSB, respectively.
Approximate hybrid high radix encoding for energy-efficient inexact multipliers | Nxfee Innovation
Approximate computing forms a design alternative that exploits the intrinsic error resilience of various applications and produces energy-efficient circuits with small accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding. The approximations are performed by rounding the high radix values to their nearest power of two. The proposed technique can be configured to achieve the desired energy–accuracy tradeoffs. Compared with the accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55% area savings, when operating at the same frequency, while the imposed error is bounded by a Gaussian distribution with near-zero average. Moreover, the proposed multipliers are compared with state-of-the-art inexact multipliers, outperforming them by up to 40% in energy consumption, for similar error values. Finally, we demonstrate the scalability of our technique.
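The rounding step named in the abstract, mapping a value to its nearest power of two, can be sketched in Python as below. This shows only the rounding primitive; how the paper applies it inside the high radix encoder is not given in the abstract. The tie-breaking rule (round up when equidistant) and the name `round_to_pow2` are assumptions made for this sketch.

```python
def round_to_pow2(v: int) -> int:
    """Round a signed integer to the nearest power of two (ties go up)."""
    if v == 0:
        return 0
    sign = -1 if v < 0 else 1
    m = abs(v)
    lower = 1 << (m.bit_length() - 1)   # largest power of two <= m
    upper = lower << 1                  # next power of two above it
    nearest = lower if (m - lower) < (upper - m) else upper
    return sign * nearest


assert round_to_pow2(5) == 4
assert round_to_pow2(7) == 8
assert round_to_pow2(-3) == -4   # tie between 2 and 4 resolves upward
```

Because every rounded value is a power of two, multiplying by it reduces to a shift, which is the usual reason such approximations save area and energy in partial-product generation.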
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
Combating data leakage Trojans in commercial and ASIC applications with time ... | Nxfee Innovation
Globalization of microchip fabrication opens the possibility for an attacker to insert hardware Trojans into a chip during the manufacturing process. While most defensive methods focus on detection or prevention, a recent method, called Randomized Encoding of Combinational Logic for Resistance to Data Leakage (RECORD), uses data randomization to prevent hardware Trojans from leaking meaningful information even when the entire design is known to the attacker. Both RECORD and its sequential variant require significant area and power overhead. In this paper, a time-division multiplexed version of the RECORD design process is proposed, which reduces area overhead by 63% and power by 56%. This time-division multiplexing (TDM) concept is further refined to allow commercial off-the-shelf (COTS) products and IP cores to be safely operated from a separate chip. These new methods trade off latency (5.3× for TDM and 3.9× for COTS) and energy use to accomplish area and power savings and achieve greater security than the original RECORD process.
Design of an area-efficient million-bit integer multiplier using double modul... | Nxfee Innovation
This brief proposes a double modulus number theoretic transform (NTT) method for million-bit integer multiplication in fully homomorphic encryption. In our method, each NTT point is processed simultaneously under two moduli, and the final result is generated through the Chinese remainder theorem. The use of a double modulus enlarges the permitted NTT sample size from 24 to 32 bits and thus improves the transform efficiency. Based on the proposed double modulus method, we accomplish a VLSI design of a million-bit integer multiplier with the Schönhage–Strassen algorithm. Implementation results on an Altera Stratix-V FPGA show that the proposed design is able to compute a product of two 1024k-bit integers every 4.9 ms at the cost of only 7.9k ALUTs and 3.6k registers, which is more area-efficient than current competitors.
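The final recombination step, recovering one value from its residues under two moduli, can be sketched with the two-modulus form of the Chinese remainder theorem. The moduli 97 and 101 below are small illustrative primes, not the ones used in the brief, and `crt_pair` is a hypothetical name. The sketch assumes coprime moduli and Python 3.8 or later (for the modular inverse via `pow`).

```python
def crt_pair(r1: int, m1: int, r2: int, m2: int) -> int:
    """Return x in [0, m1*m2) with x % m1 == r1 and x % m2 == r2."""
    inv = pow(m1, -1, m2)                # inverse of m1 modulo m2 (needs gcd == 1)
    k = ((r2 - r1) * inv) % m2           # how many multiples of m1 to add to r1
    return r1 + m1 * k


m1, m2 = 97, 101                         # illustrative coprime moduli
x = 4321                                 # any value below m1 * m2 = 9797
assert crt_pair(x % m1, m1, x % m2, m2) == x
```

Working under two moduli at once means the recoverable value can be as large as their product, which is the mechanism behind the larger permitted sample size the abstract reports.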
Algorithm and VLSI architecture design of proportionate-type LMS adaptive fil... | Nxfee Innovation
The proportionate-type normalized LMS (Pt-NLMS) family of adaptive filtering algorithms for sparse system identification poses significant implementation challenges because of its high computational complexity, especially for real-time applications such as network echo cancelation. In this paper, we make the first attempt to implement Pt-NLMS algorithms in hardware. Several reformulations are proposed to simplify the original Pt-NLMS algorithms, thereby making them amenable to real-time VLSI implementation. The reformulated algorithms, referred to as the delayed µ-law proportionate LMS (DMPLMS) algorithm for white input and the delayed wavelet MPLMS (DWMPLMS) algorithm for colored input, are then implemented in hardware. Simulation studies demonstrate that the performance loss is very small for the proposed reformulations. We implemented the proposed designs using a 16-bit fixed-point representation in hardware. Synthesis results show that the DMPLMS architecture, with a ≈30% increase in hardware over the state-of-the-art conventional delayed LMS architecture, achieves a 3× improvement in convergence rate for white input, and the DWMPLMS architecture, with a ≈70% increase in hardware, achieves a 10× improvement in convergence rate for correlated input conditions.
Analysis and design of cost-effective, high-throughput LDPC decoders | Nxfee Innovation
This paper introduces a new approach to cost effective, high-throughput hardware designs for low-density parity-check (LDPC) decoders. The proposed approach, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages, and it is shown to provide a unified framework for several designs previously proposed in the literature. NS-FAIDs are optimized by density evolution for regular and irregular LDPC codes, and are shown to provide different tradeoffs between hardware complexity and decoding performance. Two hardware architectures targeting high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID decoding kernels. ASIC post synthesis implementation results on 65-nm CMOS technology show that NS-FAIDs yield significant improvements in the throughput to area ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly degraded error correction performance.
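As background for the Min-Sum (MS) kernel that serves as the baseline here, the standard MS check-node update can be sketched as follows: each outgoing message takes the product of the signs and the minimum magnitude of all the other incoming messages. This is a plain floating-point illustration of textbook MS only; it does not model NS-FAIDs, quantization, or the proposed architectures. The name `min_sum_check_node` is hypothetical, and the sketch assumes a check node of degree at least two.

```python
def min_sum_check_node(msgs):
    """Min-Sum check-node update on a list of incoming LLR messages."""
    out = []
    for i in range(len(msgs)):
        others = msgs[:i] + msgs[i + 1:]      # exclude the target edge
        sign = 1
        for m in others:
            if m < 0:
                sign = -sign                  # product of the other signs
        out.append(sign * min(abs(m) for m in others))
    return out


assert min_sum_check_node([2.0, -3.0, 1.5]) == [-1.5, 1.5, -2.0]
```

The update uses only comparisons and sign logic, which is why MS decoders tolerate reduced-precision messages well, the robustness that the NS-FAID approach is said to exploit.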
Securing the PRESENT block cipher against combined side-channel analysis and ... | Nxfee Innovation
In this paper, we present and evaluate a hardware implementation of the PRESENT block cipher secured against both side-channel analysis and fault attacks (FAs). The side-channel security is provided by the first-order threshold implementation masking scheme of the serialized PRESENT proposed by Poschmann et al. For the FA resistance, we employ the Private Circuits II countermeasure presented by Ishai et al. at Eurocrypt 2006, which we tailor to resist arbitrary 1-bit faults. We perform a side-channel evaluation using state-of-the-art leakage detection tests, quantify the resource overhead of the Private Circuits II countermeasure, subject the implementation to established differential FAs against the PRESENT block cipher, and reflect on the structural resistance of the countermeasure. This paper provides detailed instructions on how to achieve a secure Private Circuits II implementation for the data path as well as the control logic.
Approximate sum-of-products designs based on distributed arithmetic | Nxfee Innovation
Approximate circuits provide high performance and require low power. Sum-of-products (SOP) units are key elements in many digital signal processing applications. In this brief, three approximate SOP (ASOP) models based on distributed arithmetic are proposed. They are designed for different levels of accuracy. The first ASOP model achieves an improvement of up to 64% in area and 70% in power when compared with the conventional unit. The other two models provide improvements of 32% and 48% in area and 54% and 58% in power, respectively, with a reduced error rate compared with the first model. The third model achieves a mean relative error and a normalized error distance as low as 0.05% and 0.009%, respectively. The performance of the approximate units is evaluated with a noisy image smoothing application, where the proposed models are capable of achieving a higher peak signal-to-noise ratio than the existing state-of-the-art techniques. It is shown that the proposed approximate models achieve higher processing accuracy than existing works, with significant improvements in power and performance.
A 12-bit 40-MS/s SAR ADC with a fast-binary-window DAC switching scheme | Nxfee Innovation
This paper presents a 12-bit 40-MS/s successive approximation register analog-to-digital converter (ADC) for ultrasound imaging systems. By incorporating a fast binary window digital-to-analog converter (DAC) switching technique, the problematic most significant bit transition glitch was removed to improve linearity without increasing the input capacitance or using a calibration scheme. A hybrid DAC was also developed to overcome the yield problem that occurs when a tiny unit capacitance is used in the DAC. Moreover, a reference buffer was used to accelerate the DAC settling to achieve high-speed conversion. The prototype ADC was fabricated using a 130-nm CMOS technology. The ADC core occupied an active area of 0.1 mm² and consumed a total power of 1.32 mW when a 1.2-V supply was used at a conversion rate of 40 MS/s. The measured peak signal-to-noise-and-distortion ratio and spurious-free dynamic range were 64 and 77.5 dB, respectively. The peak effective number of bits was 10.33, which is equivalent to a Walden figure-of-merit of 25.6 fJ/conversion step.
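The quoted figure of merit can be checked from the other numbers in the abstract using the standard Walden definition, FoM = P / (2^ENOB × fs). The short Python sketch below performs that arithmetic; the function name `walden_fom` is a label chosen here, and the inputs are the power, ENOB, and conversion rate stated above.

```python
def walden_fom(power_w: float, enob: float, fs_hz: float) -> float:
    """Walden figure of merit in joules per conversion step."""
    return power_w / (2 ** enob * fs_hz)


fom = walden_fom(power_w=1.32e-3, enob=10.33, fs_hz=40e6)
print(f"{fom * 1e15:.1f} fJ/conversion-step")   # prints 25.6 fJ/conversion-step
```

The result agrees with the 25.6 fJ/conversion step reported in the abstract.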
A 128-tap highly tunable CMOS IF finite impulse response filter for pulsed ra... | Nxfee Innovation
A configurable-bandwidth (BW) filter is presented in this paper for pulsed radar applications. To eliminate dispersion effects in the received waveform, a finite impulse response (FIR) topology is proposed, which has a measured standard deviation of in-band group delay of 11 ns that is primarily dominated by the inherent, fully predictable delay introduced by the sample-and-hold. The filter operates at an IF of 20 MHz and is tunable in BW from 1.5 to 15 MHz, which makes it well suited for use with varying pulse widths in the radar. Employing a total of 128 taps, the FIR filter provides greater than 50-dB sharp attenuation in the stop band in order to minimize all out-of-band noise in the low signal-to-noise received radar signal. Fabricated in a 0.18-µm silicon-on-insulator CMOS process, the proposed filter consumes approximately 3.5 mW/tap with a 1.8-V supply. A 20-MHz two-tone measurement with 200-kHz tone separation shows an IIP3 greater than 8.5 dBm.
Noise-insensitive PLL using a gate-voltage-boosted source-follower regulator ... | Nxfee Innovation
In this brief, we propose a supply noise-insensitive charge pump phase-locked loop (PLL) using a source-follower (SF) regulator and noise cancellation. In order to minimize the voltage drop of the SF regulator while improving supply rejection, a gate-voltage-boosting technique and the body-controlled noise cancellation are proposed. To suppress the phase noise from the ring oscillator, a reference multiplier is employed to maximize the PLL loop bandwidth. Implemented in 65-nm CMOS, a prototype PLL at 3.2 GHz achieves supply noise spur of less than −33 dBc for a 50-mVpp supply noise around the loop bandwidth while consuming 3.12 mW from a 1-V supply.
An energy-efficient programmable manycore accelerator for personalized biome... | Nxfee Innovation
Wearable personalized health monitoring systems can offer a cost-effective solution for human health care. These systems must constantly monitor patients' physiological signals and provide highly accurate and quick processing and delivery of the vast amount of data within a limited power and area footprint. These personalized biomedical applications require sampling and processing multiple streams of physiological signals with a varying number of channels and sampling rates. The processing typically consists of feature extraction, data fusion, and classification stages that require a large number of digital signal processing (DSP) and machine learning (ML) kernels. In response to these requirements, this paper proposes a tiny, energy-efficient, and domain-specific manycore accelerator, referred to as power-efficient nano clusters (PENC), to map and execute the kernels of these applications. Simulation results show that the PENC is able to reduce energy consumption by up to 80% and 25% for DSP and ML kernels, respectively, when optimally parallelized. In addition, we fully implemented three compute-intensive personalized biomedical applications, namely, multichannel seizure detection, multiphysiological stress detection, and a standalone tongue drive system (sTDS), to evaluate the proposed manycore performance relative to commodity embedded CPU, graphics processing unit (GPU), and field-programmable gate array (FPGA)-based implementations. For these three case studies, the energy consumption and the performance of the proposed PENC manycore, when acting as an accelerator along with an Intel Atom processor as a host, are compared with existing commercial off-the-shelf general-purpose, customizable, and programmable embedded platforms, including Intel Atom, Xilinx Artix-7 FPGA, and NVIDIA TK1 ARM-A15 and K1 GPU system on a chip.
For these applications, the PENC manycore is able to significantly improve throughput and energy efficiency by up to 1872× and 276×, respectively. For the most computationally intensive application, seizure detection, the PENC manycore is able to achieve a throughput of 15.22 giga-operations-per-second (GOPs), which is a 14× improvement in throughput over a custom FPGA solution. For stress detection, the PENC achieves a throughput of 21.36 GOPs and an energy efficiency of 4.23 GOP/J, which are 14.87× and 2.28× better than the FPGA implementation, respectively. For the sTDS application, the PENC improves throughput by 5.45× and energy efficiency by 2.37× over the FPGA implementation.
The implementation of the improved omp for aic reconstruction based on parall...Nxfee Innovation
Sparse signal recovery becomes extremely challenging for a variety of real-time applications. In this paper, we improve the orthogonal matching pursuit (OMP) algorithm based on parallel correlation indices selection mechanism in each iteration and Goldschmidt algorithm. Simulation results show that the improved OMP algorithm with a reduced number of iterations and low hardware complexity of matrix operations has higher success rate and recovery signal-to-noise-ratio (RSNR) for sparse signal recovery. This paper presents an efficient complex valued system hardware architecture of the recovery algorithm for analog-to-information structure based on compressive sensing. The proposed architecture is implemented and validated on the Xilinx Virtex6 field-programmable gate array (FPGA) for signal reconstruction with N = 1024, K = 36, and M = 256. The implementation results showed that the improved OMP algorithm achieved a higher RSNR of 31.04 dB compared with the original OMP algorithm. This synthesized design consumes a few percentages of the hardware resources of the FPGA chip with the clock frequency of 135.4 MHZ and reconstruction time of 170 µs, which is faster than the existing design.
Feedback based low-power soft-error-tolerant design for dual-modular redundancy
1. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Feedback-Based Low-Power Soft-Error-Tolerant Design for Dual-
Modular Redundancy
Abstract:
Triple-modular redundancy (TMR), which consists of three identical modules and a
voting circuit, is a common architecture for soft-error tolerance. However, the original
TMR suffers from two major drawbacks: the large area overhead and the vulnerability of
the voter. In order to overcome these drawbacks, we propose a new complementary dual-
modular redundancy (CDMR) scheme for mitigating the effect of soft errors. Inspired by
the Markov random field (MRF) theory, a two-stage voting system is implemented in
CDMR, including a first-stage optimal MRF structure and a second-stage high-
performance merging unit. The CDMR scheme can reduce the voting circuit area by 20%
while saving the area of one redundant module, achieving at least 26% error-rate
reduction at an ultralow supply voltage of 0.25 V with 8.33% faster timing compared to
previous voter designs.
Software Implementation:
ModelSim
Xilinx 14.2
Existing System:
Triple-modular redundancy (TMR) was first proposed by von Neumann and has since been
adopted as a technique to improve error tolerance at the cost of increased circuit
area. TMR can tolerate soft errors only when the probability of two or three modules
failing simultaneously is much lower than that of a single module failing. Its obvious
drawback is the increased area overhead. Partial TMR (PTMR) was therefore proposed to
reduce the area overhead at the expense of some reliability. The dual-modular
redundancy (DMR) scheme uses a three-module structure with self-feedback.
Robust C-elements and multiplexers are used, respectively, to form the voters in
two different DMR designs. An algorithmic noise-tolerant (ANT) technique was
proposed to solve the problem of soft errors caused by voltage overscaling. Algorithmic
soft-error tolerance (ASET) and fine-grain soft-error tolerance (FGSET) designs are both
extended ANT designs. These designs suffer from two drawbacks. First, they still incur a
large area overhead. Second, soft errors in the voting design cause a loss of reliability.
The reason is that redundancies and estimator-based redundancies work well only when
voters never fail, which may be an unrealistic assumption if the circuits are designed
in a deep-submicron technology or operate at an ultralow supply voltage. Under such
conditions, a failure is likely to occur in the voting circuit, which is a main
cause of TMR failure. For a multistage design, three identical voters could be used in
each stage to tolerate errors that occur in one of the TMR voters, but this would add
undesirable overhead to the design. Some approaches, such as generalized modular
redundancy, approximate TMR, and a simulation-based synthesis scheme, improve on the
original TMR, but they offer only an optimal implementation strategy or trade off
accuracy.
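The dependence of TMR on a reliable voter can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes independent module failures with probability p_module and a single voter that fails independently with probability p_voter, which is the textbook model rather than a measured one. The function names are hypothetical.

```python
def majority_ok(p_module: float) -> float:
    """Probability that at least two of three independent modules are correct."""
    r = 1.0 - p_module
    return r**3 + 3.0 * p_module * r**2


def tmr_failure(p_module: float, p_voter: float = 0.0) -> float:
    """Failure probability of TMR whose single voter fails with p_voter.

    Assumption: module and voter failures are independent, and any voter
    failure corrupts the output.
    """
    return 1.0 - (1.0 - p_voter) * majority_ok(p_module)


if __name__ == "__main__":
    p = 0.01  # illustrative per-module failure probability, not a measured value
    print(f"single module             : {p:.6f}")
    print(f"TMR, perfect voter        : {tmr_failure(p):.6f}")
    print(f"TMR, voter as weak as unit: {tmr_failure(p, p_voter=p):.6f}")
```

Under these assumptions, a voter that fails as often as a module makes the TMR system less reliable than a single unprotected module, which is the situation the voter-hardening argument above addresses.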
Fig. 1. CDMR design
A number of error-tolerant methods, such as Markov random field (MRF), differential
cascode voltage switch (DCVS), and DCVS-MRF, have been proposed. In these designs,
the basic elements include feedback loops that help them achieve high soft-error
tolerance. However, these implementations require a higher area overhead than traditional
structures. To solve soft-error issues in the voter and reduce the area overhead, we propose a
new complementary DMR (CDMR) scheme, as shown in Fig. 1. The CDMR scheme
provides significant soft-error tolerance even in the voting circuit. This is
achieved by separately processing one module (M1) through a structure with a stable
logic “1” as output (referred to as structure A in Fig. 1), and processing another identical
module (M2) through a structure with a stable logic “0” as output (shown in Fig. 1 as
structure B). A second-stage feedback structure is then used to merge the stable logic “1”
and stable logic “0” outputs from the first stage, ensuring the best performance from the
first stage (shown in Fig. 1 as structure C). The CDMR scheme outperforms existing
designs in two key aspects: 1) tolerating many soft errors propagated to the voting
circuit and 2) reducing the area overhead.
Disadvantages:
Large area overhead
Soft errors in the voting circuit are not tolerated
Proposed System:
MRF-inspired two-stage feedback design
The second stage in Fig. 2 can compensate for the loss of error tolerance in g2 of the first stage using its
latching property. The proposed structure benefits from the presence of stage 2 to
improve its reliability, which is a feature that TMR, DMR, and other designs lack. Let us
extend the single-error assumption for stage 1 by assuming that only one error can
emerge from one of the complementary propagation chains at the same time. In other
words, when an error occurs from stage 1, the latch structure of g3–g4 in stage 2 does not
propagate errors received from stage 1. With respect to our proposed CDMR, the two
redundant inputs to the voter must be complementary
Fig. 2. Proposed two-stage dual feedback structure
Table I
Values of g3–g4 Feedback
and will propagate through stages 1 and 2 as complementary signals in the absence of
errors. For example, an ideal input bit stream for xa (xa = xb) is {x0 ∼ x4 = 0 and x5 ∼ x9 =
1}. Four bits, x7 and x9 of xd and x1 and x2 of xe, are flipped by noise, as circled
in Fig. 2. Their corresponding bits in the other branch are robust “1” because of the
high tolerance of noisy input bit “0” in both NAND gates g1 and g2. This is why we only
consider the cases where errors occur in weak “0” in xd or xe. This condition causes the
second stage g3–g4 to remain in the hold state in Table I, acting as an RS latch, thus
protecting the final output from the influence of the error bits in xd and xe based on
the previous correct outputs. We adopted the widely used double-exponential current
source to simulate the above cases, where a charged or ionizing particle hits the output
“0” of the stage 1 circuit.
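One commonly used form of the double-exponential current model, given here as an assumption about the expression this passage refers to, is

```latex
\[
I(t) = \frac{Q_{\text{total}}}{\tau_f - \tau_r}
       \left( e^{-t/\tau_f} - e^{-t/\tau_r} \right), \qquad t \ge 0,
\]
```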
where Qtotal is the total charge caused by the particle strike, and τr and τf are the rising
time constant and the falling time constant, respectively. τr and τf are generally set to
50 and 164 ps, respectively, for different process technologies, and we used a current source with Qtotal = 70 fC
in our simulation. Regardless of whether xa and xb are both high or low, when a charged
particle strikes xd or xe, there is one single peak shown in Fig. 2 in output xf. Compared
with a much longer pulse at the output of a TMR voter when an error hits one of its
inner branches, this peak can be regarded as less harmful in the proposed voter after
sampling, because the error is too short to be sampled multiple times. The results in Fig.
3 confirm the same error tolerance as we deduced from the proposed structure in
Fig. 2. Under the extended one-error condition, the output of our module remains correct
as long as the two inner complementary signals are not in error at the same
time.
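The hold behavior described above can be sketched at the logic level. The following is an illustration only, not the authors' circuit: it assumes that stage 2 behaves like a standard cross-coupled NAND RS latch with active-low inputs, and that xd carries the inverse of xa while xe carries xa, so that the flipped bits named in the text land on "0" values. The function names and the initial latch state are hypothetical.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on 0/1 integers."""
    return 0 if (a and b) else 1


def latch_step(s_n: int, r_n: int, q: int, qb: int) -> tuple[int, int]:
    """Settle a cross-coupled NAND latch for one input pair.

    (0, 1) sets q, (1, 0) resets q, and (1, 1) holds the previous state.
    """
    for _ in range(4):  # a few passes are enough for this two-gate loop
        q_new = nand(s_n, qb)
        qb_new = nand(r_n, q_new)
        if (q_new, qb_new) == (q, qb):
            break
        q, qb = q_new, qb_new
    return q, qb


def run_stage2(xd: list[int], xe: list[int]) -> list[int]:
    """Feed two bit streams through the latch and return the q output stream."""
    q, qb = 0, 1  # assumed initial state
    out = []
    for d, e in zip(xd, xe):
        q, qb = latch_step(d, e, q, qb)
        out.append(q)
    return out


xa = [0] * 5 + [1] * 5          # ideal stream: x0..x4 = 0, x5..x9 = 1
xd = [1 - bit for bit in xa]    # assumed mapping: xd is the inverse of xa
xe = list(xa)                   # assumed mapping: xe follows xa

xd_noisy = list(xd)
xe_noisy = list(xe)
for i in (7, 9):
    xd_noisy[i] = 1             # weak "0" bits of xd flipped to "1"
for i in (1, 2):
    xe_noisy[i] = 1             # weak "0" bits of xe flipped to "1"

recovered = run_stage2(xd_noisy, xe_noisy)
print(recovered == xa)
```

In this model, a flipped "0" turns the input pair into (1, 1), which is the hold state, so the latch keeps the previous correct value. If both rails of the same bit are corrupted at once, the latch is driven to the wrong state, matching the condition stated in the text.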
Fig. 3. Simulation of the intermediate propagation injected by a soft error.
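For readers who want to reproduce the injected pulse numerically, the sketch below evaluates a double-exponential current with the values quoted in the text (τr = 50 ps, τf = 164 ps, Qtotal = 70 fC). It assumes the common form I(t) = Qtotal / (τf − τr) · (exp(−t/τf) − exp(−t/τr)); the function names are hypothetical and the script is not the authors' simulation setup.

```python
import math

TAU_R = 50e-12     # rising time constant in seconds (value quoted in the text)
TAU_F = 164e-12    # falling time constant in seconds (value quoted in the text)
Q_TOTAL = 70e-15   # total injected charge in coulombs (value quoted in the text)


def strike_current(t: float, q_total: float = Q_TOTAL,
                   tau_r: float = TAU_R, tau_f: float = TAU_F) -> float:
    """Double-exponential current in amperes at time t (assumed model form)."""
    if t < 0.0:
        return 0.0
    scale = q_total / (tau_f - tau_r)
    return scale * (math.exp(-t / tau_f) - math.exp(-t / tau_r))


def peak_time(tau_r: float = TAU_R, tau_f: float = TAU_F) -> float:
    """Time at which the current peaks, found by setting dI/dt to zero."""
    return tau_r * tau_f / (tau_f - tau_r) * math.log(tau_f / tau_r)


def collected_charge(t_end: float, steps: int = 20000) -> float:
    """Trapezoidal integral of the current from 0 to t_end."""
    dt = t_end / steps
    total = 0.0
    for i in range(steps):
        total += 0.5 * (strike_current(i * dt) + strike_current((i + 1) * dt)) * dt
    return total


if __name__ == "__main__":
    t_pk = peak_time()
    print(f"peak time    : {t_pk * 1e12:.1f} ps")
    print(f"peak current : {strike_current(t_pk) * 1e6:.1f} uA")
    print(f"charge in 5 ns: {collected_charge(5e-9) * 1e15:.2f} fC")
```

Integrating the pulse over a few nanoseconds should return approximately the injected charge, which is a quick sanity check that the scaling factor matches the chosen Qtotal.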
Fig. 4. Voting structure in multistage design. (a) TMR [1]. (b) FGSET [8]. (c) DMR [3]. (d) Proposed
voting module.
We see that the proposed design has better soft-error tolerance. Therefore, the proposed
voting circuit has both higher modular soft-error tolerance and higher reliability than
TMR. For multistage logic, the voter is concatenated in each stage to improve the overall
system reliability, as shown in Fig. 4(a)–(d). The original TMR, FGSET, and DMR
voters for multistage designs are simply duplicated [refer to Fig. 4(a)–(c)]. However, the
proposed voter has enclosed feedback loops and two outputs without voting duplication
between two stages, as shown in Fig. 4(d). Note that this design has two complementary
outputs as references for error correction. Overall, the area overhead is reduced by at least
50% compared to the designs used in TMR and DMR. We consider a 4-bit ripple-carry
adder (RCA) as a case study for the proposed voter in Fig. 4. The proposed
design requires a differential input; thus, we redesigned the full adder (FA). We
present two design schemes for adders. Scheme 1 (S1) in Fig. 4 is designed for a single
unit with DMR, in which the outputs of the two modules are connected to a voter.
Scheme 2 (S2) in Fig. 4 is implemented as a multistage design by adding a voter at every
stage.
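To make the differential-input requirement concrete, the sketch below models a 4-bit ripple-carry adder whose full adders emit each result together with its complement. It is a behavioral illustration only: the authors' redesigned FA and the voter placement of S1 and S2 are not reproduced, and all function names are hypothetical.

```python
def full_adder_dual_rail(a: int, b: int, cin: int):
    """One-bit full adder returning (sum, not sum) and (carry, not carry)."""
    s = a ^ b ^ cin
    c = (a & b) | (cin & (a ^ b))
    return (s, 1 - s), (c, 1 - c)


def rca4_dual_rail(a: int, b: int, cin: int = 0):
    """4-bit ripple-carry adder built from the dual-rail full adder above.

    Returns a list of (sum, not sum) pairs, least significant bit first,
    and the final (carry, not carry) pair.
    """
    sums = []
    carry, carry_n = cin, 1 - cin
    for i in range(4):
        (s, s_n), (carry, carry_n) = full_adder_dual_rail(
            (a >> i) & 1, (b >> i) & 1, carry
        )
        sums.append((s, s_n))
    return sums, (carry, carry_n)


def to_int(sums, cout) -> int:
    """Rebuild the 5-bit result from the true rails."""
    value = sum(s << i for i, (s, _) in enumerate(sums))
    return value | (cout[0] << 4)


if __name__ == "__main__":
    sums, cout = rca4_dual_rail(9, 7)
    print(to_int(sums, cout))
```

Each pair produced here is complementary in the absence of errors, which is the form of input that a complementary voter stage expects. In a multistage arrangement such as S2, a voter would sit between successive full adders; that part is left out of this sketch.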
Advantages:
Area overhead is reduced
Soft-error rate is reduced
References:
[1] J. von Neumann, C. E. Shannon, and J. McCarthy, “Probabilistic logics and the synthesis of reliable
organisms from unreliable components,” in Automata Studies (Annals of Mathematics Studies).
Princeton, NJ, USA: Princeton Univ. Press, 1956, pp. 43–98.
[2] R. Parhi, C. H. Kim, and K. K. Parhi, “Fault-tolerant ripple-carry binary adder using partial triple
modular redundancy (PTMR),” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Lisbon, Portugal, May
2015, pp. 41–44.
[3] J. Teifel, “Self-voting dual-modular-redundancy circuits for single event-transient mitigation,” IEEE
Trans. Nucl. Sci., vol. 55, no. 6, pp. 3435–3439, Dec. 2008.
[4] I.-C. Wey, B.-C. Wu, C.-C. Peng, C.-S. A. Gong, and C.-H. Yu, “Robust C-element design for soft-
error mitigation,” IEICE Elect. Exp., vol. 12, no. 10, pp. 1–6, 2015.
[5] F. Smith, “A new methodology for single event transient suppression in flash FPGAs,”
Microprocess. Microsyst., vol. 37, no. 3, pp. 313–318, May 2013.
[6] I.-C. Wey, C.-C. Peng, and F.-Y. Liao, “Reliable low-power multiplier design using fixed-width
replica redundancy block,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 1, pp. 78–
87, Jan. 2015.
[7] B. Shim and N. R. Shanbhag, “Energy-efficient soft error-tolerant digital signal processing,” IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 4, pp. 336–348, Apr. 2006.
[8] Y.-H. Huang, “High-efficiency soft-error-tolerant digital signal processing using fine-grain subword-
detection processing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 2, pp. 291–304,
Feb. 2010.
[9] H. Kim and K. G. Shin, “Design and analysis of an optimal instruction-retry policy for TMR
controller computers,” IEEE Trans. Comput., vol. 45, no. 11, pp. 1217–1225, Nov. 1996.
[10] A. H. El-Maleh and F. C. Oughali, “A generalized modular redundancy scheme for enhancing fault
tolerance of combinational circuits,” Microelectron. Rel., vol. 54, no. 1, pp. 316–326, 2014.
[11] A. J. Sanchez-Clemente, L. Entrena, R. Hrbacek, and L. Sekanina, “Error mitigation using
approximate logic circuits: A comparison of probabilistic and evolutionary approaches,” IEEE Trans.
Rel., vol. 65, no. 4, pp. 1871–1883, Dec. 2016.
[12] A. H. El-Maleh and K. A. K. Daud, “Simulation-based method for synthesizing soft error tolerant
combinational circuits,” IEEE Trans. Rel., vol. 64, no. 3, pp. 935–948, Sep. 2015.