In this paper, an efficient mapping of the pipeline single-path delay feedback (SDF) fast Fourier transform (FFT) architecture to field-programmable gate arrays (FPGAs) is proposed. By considering the architectural features of the target FPGA, significantly better implementation results are obtained. This is illustrated by mapping an R22SDF 1024-point FFT core to both Xilinx Virtex-4 and Virtex-6 devices. The optimized FPGA mapping is explored in detail. Algorithmic transformations that allow a better mapping are proposed, resulting in implementation results that by far outperform earlier published work. For Virtex-4, the results show a 350% increase in throughput per slice and a 25% reduction in block RAM (BRAM) use, with the same amount of DSP48 resources, compared with the best earlier published result. The resulting Virtex-6 design sees even larger increases in throughput per slice compared with the Xilinx FFT IP core, using half as many DSP48E1 blocks and fewer BRAM resources. The results clearly show that the FPGA mapping is crucial, not only the architecture and algorithm choices.
Design and Power Measurement of 2 And 8 Point FFT Using Radix-2 Algorithm for... IOSRJVSP
The radix-2 decimation-in-time fast Fourier transform (FFT) is the simplest form of the Cooley–Tukey algorithm. The FFT is among the most widely used algorithms in digital signal processing. It computes the discrete Fourier transform (DFT), which converts a time-domain signal into its frequency spectrum. FFT algorithms are used in many applications, for example OFDM, noise reduction, digital audio broadcasting, and digital video broadcasting. The radix-2 algorithm is used to design butterflies for FFTs of different sizes. This paper presents the design and power measurement of 2-point and 8-point FFTs using VHDL. Simulation and synthesis of the design are done using Xilinx ISE 14.2.
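As a rough software illustration of the radix-2 decimation-in-time butterfly mentioned in this abstract, the following Python sketch computes an FFT recursively and can be checked against a direct DFT. It is not the VHDL design from the listed paper; the function names `fft_radix2_dit` and `dft_naive` are hypothetical and chosen only for this example.

```python
import cmath

def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2_dit(x[0::2])  # FFT of even-indexed samples
    odd = fft_radix2_dit(x[1::2])   # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        t = w * odd[k]
        out[k] = even[k] + t            # butterfly, upper output
        out[k + n // 2] = even[k] - t   # butterfly, lower output
    return out

def dft_naive(x):
    """Direct O(n^2) DFT, used here only as a reference for comparison."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

For a 2-point input the recursion reduces to a single butterfly (sum and difference), and an 8-point input uses three such stages.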
Selective fitting strategy based real time placement algorithm for dynamicall... eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
FPGA Implementation of Higher Order FIR Filter IJECEIAES
Digital finite-impulse-response (FIR) filters are mainly employed in digital signal processing applications. The main components of digital FIR filters designed on FPGAs are the register bank that stores the signal samples, the adder that implements the sum operations, and the multiplier that multiplies the filter coefficients by the signal samples. Although the design and implementation of digital FIR filters seem simple, the multiplier block is the design bottleneck for speed, power consumption, and FPGA chip area. The multipliers are an integral part of FIR structures and use a large part of the chip area. This limits the number of processing elements (PEs) available on the chip to realize a higher-order filter. A model is developed in the Matlab/Simulink environment to investigate the performance of the desired higher-order FIR filter. An equivalent FIR filter representation is designed with the Xilinx FIR Compiler using the exported FIR filter coefficients. The Xilinx implementation flow is completed with Xilinx ISE 14.5. It is observed how the use of a higher-order FIR filter impacts the resource utilization of the FPGA and its maximum operating frequency.
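The three components named in this abstract (register bank, multipliers, adder) can be sketched in a few lines of Python. This is only a behavioral model of a direct-form FIR filter under the assumption of one multiplier per tap; it is not the Xilinx FIR Compiler output, and `fir_filter` is a hypothetical name.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k]."""
    regs = [0] * len(coeffs)  # register bank (delay line), newest sample first
    out = []
    for s in samples:
        regs = [s] + regs[:-1]  # shift the new sample into the register bank
        acc = 0
        for c, r in zip(coeffs, regs):
            acc += c * r  # one multiplication per tap, summed by the adder
        out.append(acc)
    return out
```

Feeding a unit impulse returns the coefficients themselves, which is a quick sanity check. The number of multiplications per output grows linearly with the filter order, which is why the multiplier block dominates area in a fully parallel implementation.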
Design and fpga implementation of a reconfigurable digital down converter for... Nxfee Innovation
This brief presents a field-programmable gate array-based implementation of a reconfigurable digital down converter (DDC) that can process an input bandwidth of up to 3.6 GHz and provide a flexible down-converted output. The proposed DDC consists of a mixer and a resampling filter. The resampling filter can work at a much higher clock rate because all the single-cycle recursive loops in the resampling filter are pipelined by using either real/imaginary part-time multiplexing or a parallel processing technique. With features like arbitrary sampling rate conversion and dynamic configuration, the proposed design is highly flexible, so that it can generate a down-converted output with a sampling rate selectable within the range of 1 kS/s–225 MS/s. Moreover, the flexibility is further improved by being able to specify the output sampling rate and center frequency to a resolution of less than 1 S/s. The experimental results show that the proposed design can achieve the same functionality as the existing work but with fewer hardware resources.
The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
A 128 tap highly tunable cmos if finite impulse response filter for pulsed ra... Nxfee Innovation
A configurable-bandwidth (BW) filter is presented in this paper for pulsed radar applications. To eliminate dispersion effects in the received waveform, a finite impulse response (FIR) topology is proposed, which has a measured standard deviation of an in-band group delay of 11 ns that is primarily dominated by the inherent, fully predictable delay introduced by the sample-and-hold. The filter operates at an IF of 20 MHz, and is tunable in BW from 1.5 to 15 MHz, which makes it optimal to be used with varying pulse widths in the radar. Employing a total of 128 taps, the FIR filter provides greater than 50-dB sharp attenuation in the stop band in order to minimize all out-of-band noise in the low signal-to-noise received radar signal. Fabricated in a 0.18-µm silicon on insulator CMOS process, the proposed filter consumes approximately 3.5mW/tap with a 1.8-V supply. A 20-MHz two-tone measurement with 200-kHz tone separation shows IIP3 greater than 8.5dBm.
PARTIAL PRODUCT ARRAY HEIGHT REDUCTION USING RADIX-16 FOR 64-BIT BOOTH MULTI... Hari M
PARTIAL PRODUCT ARRAY HEIGHT REDUCTION USING RADIX-16 FOR 64-BIT BOOTH MULTIPLIER:
The maximum height of the partial product columns is reduced to [n/4] for an n = 64-bit unsigned operand, in contrast to the conventional maximum height of [(n + 1)/4].
This multiplier algorithm is normally used for applications with large operand widths, whereas an ordinary multiplier is adequate for small operand widths.
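To make the two column heights concrete, the expressions can be evaluated for n = 64. This assumes the square brackets in the text stand for the ceiling function, which the extracted text does not confirm.

```latex
\[
\left\lceil \frac{n}{4} \right\rceil = \left\lceil \frac{64}{4} \right\rceil = 16,
\qquad
\left\lceil \frac{n+1}{4} \right\rceil = \left\lceil \frac{65}{4} \right\rceil = \lceil 16.25 \rceil = 17 .
\]
```

Under that reading, the reduction removes one row from the partial product array, from 17 to 16.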
Programmable logic controller performance enhancement by field programmable g... ISA Interchange
The PLC, the core element of modern automation systems, exhibits limitations such as slow speed and poor scan time because of its serial execution. An improved PLC design using an FPGA has been proposed, based on a parallel execution mechanism, to enhance performance and flexibility. ModelSim is used as the simulation platform, and VHDL is used to translate, integrate, and implement the logic circuit in the FPGA. A Xilinx Spartan kit has been used for implementation and testing, and VB for GUI development. Salient merits of the design include cost-effectiveness, miniaturization, user-friendliness, and simplicity, along with lower power consumption, smaller scan time, and higher speed. Various functionalities and applications, such as a typical PLC and an industrial alarm annunciator, have been developed and successfully tested. Results of simulation, design, and implementation have been reported.
A high accuracy programmable pulse generator with a 10-ps timing resolution Nxfee Innovation
Automatic test equipment must have high-precision and low-power pulse generators (PGs) for testing memory and device-under-test ICs. This paper describes a high-accuracy and wide-data-rate-range PG with a 10-ps time resolution. The PG comprises an edge combiner (EC) and a multiphase clock generator (MPCG). The EC can produce an arbitrary waveform through 32 phase outputs of the MPCG. The EC adopts a one/zero detector and phase selection logic to define an operational data rate range and a timing resolution, respectively. Therefore, the EC uses the phase selection logic to combine the period window of the one/zero detector with the MPCG output phases. The EC also uses a countdown counter for a wide operational range. In the MPCG, a multiphase oscillator (MPO) adopts a ring oscillator scheme with sub feedback loops to extend its maximum operational frequency. The MPO also uses a phase error corrector to reduce the output phase error resulting from process and layout mismatches. Thus, the PG can obtain high accuracy waveforms owing to small phase errors. The test chip was implemented using a 0.13-µm CMOS process. The core area and power consumption of the PG were measured to be 250 × 300 µm² and 18.7 mW, respectively. The data rate range of the PG was determined to be from 3.2 kHz to 893 MHz. The time resolution and average accuracy of the PG were measured to be 10 ps and ±0.3 LSB, respectively.
Design of efficient reversible floating-point arithmetic unit on field progr... IJECEIAES
Reversible logic gates are used to reduce power dissipation in modern computer applications. Floating-point numbers with reversible features are an added advantage for performing complex algorithms with high-performance computations. This manuscript implements an efficient reversible floating-point arithmetic (RFPA) unit, and its performance metrics are reported in detail. The RFP adder/subtractor (A/S), RFP multiplier, and RFP divider units are designed as part of the RFPA unit. The RFPA unit is designed using basic reversible gates. The mantissa part of the RFP multiplier is created using a 24x24 Wallace tree multiplier, whereas the reciprocal unit of the RFP divider is designed using the Newton–Raphson method. The RFPA unit and its submodules are executed in parallel, each using one clock cycle. The RFPA unit and its submodules are synthesized separately in the Vivado IDE environment, and implementation results are obtained on an Artix-7 field programmable gate array (FPGA). The RFPA unit utilizes only 18.44% of the slice look-up tables (LUTs) and consumes 0.891 W total power on the Artix-7 FPGA. The RFPA unit submodules are compared with existing approaches and show better performance metrics and chip resource utilization.
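The Newton–Raphson reciprocal step mentioned for the divider can be illustrated with ordinary floating-point arithmetic. This sketch is not the reversible-gate circuit from the listed paper: the initial estimate, the iteration count, and the name `reciprocal_newton` are assumptions made for the example.

```python
def reciprocal_newton(d, iterations=5):
    """Approximate 1/d for d in [0.5, 1) using x <- x * (2 - d * x)."""
    if not 0.5 <= d < 1.0:
        raise ValueError("d must be normalized to [0.5, 1)")
    # Assumed starting point: a common linear estimate for this input range.
    x = 48.0 / 17.0 - (32.0 / 17.0) * d
    for _ in range(iterations):
        x = x * (2.0 - d * x)  # each step roughly squares the relative error
    return x
```

A divider built this way computes a quotient as the dividend times the reciprocal of the normalized divisor, so it reuses the multiplier rather than needing a dedicated division array.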
Noise insensitive pll using a gate-voltage-boosted source-follower regulator ... Nxfee Innovation
In this brief, we propose a supply noise-insensitive charge pump phase-locked loop (PLL) using a source-follower (SF) regulator and noise cancellation. In order to minimize the voltage drop of the SF regulator while improving supply rejection, a gate-voltage-boosting technique and the body-controlled noise cancellation are proposed. To suppress the phase noise from the ring oscillator, a reference multiplier is employed to maximize the PLL loop bandwidth. Implemented in 65-nm CMOS, a prototype PLL at 3.2 GHz achieves supply noise spur of less than −33 dBc for a 50-mVpp supply noise around the loop bandwidth while consuming 3.12 mW from a 1-V supply.
More Related Content
Similar to Efficient fpga mapping of pipeline sdf fft cores
An efficient fault tolerance design for integer parallel matrix vector Nxfee Innovation
Parallel matrix processing is a typical operation in many systems, and in particular matrix–vector multiplication (MVM) is one of the most common operations in the modern digital signal processing and digital communication systems. This paper proposes a fault tolerant design for integer parallel MVMs. The scheme combines ideas from error correction codes with the self-checking capability of MVM. Field-programmable gate array evaluation shows that the proposed scheme can significantly reduce the overheads compared to the protection of each MVM on its own. Therefore, the proposed technique can be used to reduce the cost of providing fault tolerance in practical implementations.
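The self-checking capability of integer MVM referred to in this abstract can be sketched as a checksum test: the sum of the output elements must equal the dot product of the matrix column sums with the input vector. The Python sketch below shows only this detection idea, not the paper's combined error-correction scheme; `checked_mvm` and its `fault` parameter are hypothetical and exist only to demonstrate a detected error.

```python
def mvm(matrix, vector):
    """Integer matrix-vector product, one output element per matrix row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def checked_mvm(matrix, vector, fault=None):
    """Return (y, ok). fault=(index, delta) corrupts one output for demonstration."""
    y = mvm(matrix, vector)
    if fault is not None:
        i, delta = fault
        y[i] += delta  # simulated fault in one output element
    col_sums = [sum(col) for col in zip(*matrix)]            # checksum row
    expected = sum(c * x for c, x in zip(col_sums, vector))  # checksum result
    return y, sum(y) == expected
```

The check costs one extra row of multiply-accumulate work regardless of the number of rows, which is why it is cheaper than duplicating the whole multiplication.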
The implementation of the improved omp for aic reconstruction based on parall... Nxfee Innovation
Sparse signal recovery becomes extremely challenging for a variety of real-time applications. In this paper, we improve the orthogonal matching pursuit (OMP) algorithm based on a parallel correlation indices selection mechanism in each iteration and the Goldschmidt algorithm. Simulation results show that the improved OMP algorithm, with a reduced number of iterations and low hardware complexity of matrix operations, has a higher success rate and recovery signal-to-noise ratio (RSNR) for sparse signal recovery. This paper presents an efficient complex-valued system hardware architecture of the recovery algorithm for an analog-to-information structure based on compressive sensing. The proposed architecture is implemented and validated on the Xilinx Virtex-6 field-programmable gate array (FPGA) for signal reconstruction with N = 1024, K = 36, and M = 256. The implementation results show that the improved OMP algorithm achieves a higher RSNR of 31.04 dB compared with the original OMP algorithm. The synthesized design consumes only a few percent of the hardware resources of the FPGA chip, with a clock frequency of 135.4 MHz and a reconstruction time of 170 µs, which is faster than the existing design.
Securing the present block cipher against combined side channel analysis and ... Nxfee Innovation
In this paper, we present and evaluate a hardware implementation of the PRESENT block cipher secured against both side-channel analysis and fault attacks (FAs). The side-channel security is provided by the first-order threshold implementation masking scheme of the serialized PRESENT proposed by Poschmann et al. For the FA resistance, we employ the Private Circuits II countermeasure presented by Ishai et al. at Eurocrypt 2006, which we tailor to resist arbitrary 1-bit faults. We perform a side-channel evaluation using the state-of-the-art leakage detection tests, quantify the resource overhead of the Private Circuits II countermeasure, subdue the implementation to established differential FAs against the PRESENT block cipher, and contemplate on the structural resistance of the countermeasure. This paper provides the detailed instructions on how to successfully achieve a secure Private Circuits II implementation for the data path as well as the control logic.
Multilevel half rate phase detector for clock and data recovery circuits Nxfee Innovation
In this brief, a half-rate (HR) bang-bang (BB) phase detector (PD) with multiple decision levels is proposed for clock and data recovery (CDR) circuits. The combination allows the oscillator to run at half the input data rate while providing information about the sign and magnitude of the phase shift between the PD inputs. This allows a finer control of the frequency of the oscillator in the phase-locked loop (PLL) of the CDR circuit, which results in up to 30% less output clock jitter than with a conventional two-levels HR BB PD. Thanks to this, the bit error rate can be decreased by up to 5× in a 5-Gb/s CDR circuit. The proposed topology was implemented in a 28-nm FDSOI CMOS technology providing average power consumption below 76 µW with a supply voltage of 1 V. Although multilevel (ML) BB PDs have already been proposed in some PLL-based CDR with very interesting results, a specific design of the PD has to be implemented for an HR system. This brief provides the first ML-HR-BBPD.
Low complexity methodology for complex square-root computation Nxfee Innovation
In this brief, we propose a low-complexity methodology to compute a complex square root using only a circular coordinate rotation digital computer (CORDIC), as opposed to the state-of-the-art techniques that need both circular and hyperbolic CORDICs. Subsequently, an architecture has been designed based on the proposed methodology and implemented on the ASIC platform using the UMC 180-nm technology node with 1.0 V at 5 MHz. Field programmable gate array (FPGA) prototyping using a Xilinx Virtex-6 (XC6VLX240T) has also been carried out. After thorough theoretical analysis and experimental validation, it can be inferred that the proposed methodology reduces slice look-up tables by 21.15% (on the FPGA platform), reduces silicon area overhead by 20.25%, and decreases power consumption by 19% (on the ASIC platform) when compared with the state-of-the-art method, without compromising the computational speed, throughput, or accuracy.
Feedback based low-power soft-error-tolerant design for dual-modular redundancy Nxfee Innovation
Triple-modular redundancy (TMR), which consists of three identical modules and a voting circuit, is a common architecture for soft-error tolerance. However, the original TMR suffers from two major drawbacks: the large area overhead and the vulnerability of the voter. In order to overcome these drawbacks, we propose a new complementary dual-modular redundancy (CDMR) scheme for mitigating the effect of soft errors. Inspired by the Markov random field (MRF) theory, a two-stage voting system is implemented in CDMR, including a first stage optimal MRF structure and a second-stage high-performance merging unit. The CDMR scheme can reduce the voting circuit area by 20% while saving the area of one redundant module, achieving at least 26% error-rate reduction at an ultralow supply voltage of 0.25 V with 8.33% faster timing compared to previous voter designs.
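For reference, the voting circuit of the original TMR architecture described in the first sentence amounts to a bitwise two-of-three majority function. The Python sketch below models only that baseline voter; it does not model the MRF-based two-stage voting of the proposed CDMR scheme, and `majority_vote` is a hypothetical name.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words from redundant modules."""
    return (a & b) | (b & c) | (a & c)
```

With three identical module outputs the voter passes the value through, and a soft error in any single module is outvoted by the other two. An error inside the voter itself is not masked, which is the vulnerability the abstract points out.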
Fast neural network training on fpga using quasi newton optimization method Nxfee Innovation
In this brief, a customized and pipelined hardware implementation of the quasi-Newton (QN) method on a field-programmable gate array (FPGA) is proposed for fast onsite training of artificial neural networks, targeting embedded applications. The architecture is scalable to cope with different neural network sizes, and it supports batch-mode training. Experimental results demonstrate the superior performance and power efficiency of the proposed implementation over CPU, graphics processing unit, and FPGA QN implementations.
Design of an area efficient million-bit integer multiplier using double modul... Nxfee Innovation
This brief proposes a double modulus number theoretic transform (NTT) method for million-bit integer multiplication in fully homomorphic encryption. In our method, each NTT point is processed simultaneously under two moduli, and the final result is generated through the Chinese remainder theorem. The use of a double modulus enlarges the permitted NTT sample size from 24 to 32 bits and thus improves the transform efficiency. Based on the proposed double modulus method, we accomplish a VLSI design of a million-bit integer multiplier with the Schönhage–Strassen algorithm. Implementation results on an Altera Stratix-V FPGA show that the design is able to compute a product of two 1024k-bit integers every 4.9 ms at the cost of only 7.9k ALUTs and 3.6k registers, which is more area-efficient than current competitors.
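The final step described here, recombining results computed under two moduli with the Chinese remainder theorem, can be sketched in Python as follows. The small moduli in the usage note are illustrative placeholders, not the moduli used in the listed paper, and `crt_pair` is a hypothetical name.

```python
def crt_pair(r1, m1, r2, m2):
    """Return x in [0, m1*m2) with x % m1 == r1 and x % m2 == r2.

    Assumes m1 and m2 are coprime, 0 <= r1 < m1, and 0 <= r2 < m2.
    """
    inv = pow(m1, -1, m2)        # modular inverse of m1 modulo m2 (Python 3.8+)
    k = ((r2 - r1) * inv) % m2   # multiple of m1 to add so the m2 residue matches
    return r1 + m1 * k
```

For example, with the placeholder moduli 97 and 101, any value below 97 × 101 = 9797 is recovered exactly from its two residues, which is the sense in which working under two moduli enlarges the usable value range.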
Combating data leakage trojans in commercial and asic applications with time ... Nxfee Innovation
Globalization of microchip fabrication opens the possibility for an attacker to insert hardware Trojans into a chip during the manufacturing process. While most defensive methods focus on detection or prevention, a recent method, called Randomized Encoding of Combinational Logic for Resistance to Data Leakage (RECORD), uses data randomization to prevent hardware Trojans from leaking meaningful information even when the entire design is known to the attacker. Both RECORD and its sequential variant require significant area and power overhead. In this paper, a time-division multiplexed version of the RECORD design process is proposed, which reduces area overhead by 63% and power by 56%. This time-division multiplexing (TDM) concept is further refined to allow commercial off-the-shelf (COTS) products and IP cores to be safely operated from a separate chip. These new methods trade off latency (5.3× for TDM and 3.9× for COTS) and energy use to accomplish area and power savings and achieve greater security than the original RECORD process.
Approximate sum of-products designs based on distributed arithmeticNxfee Innovation
Approximate circuits provide high performance and require low power. Sum-of-products (SOP) units are key elements in many digital signal processing applications. In this brief, three approximate SOP (ASOP) models which are based on the distributed arithmetic are proposed. They are designed for different levels of accuracy. First model of ASOP achieves an improvement up to 64% on area and 70% on power, when compared with conventional unit. Other two models provide an improvement of 32% and 48% on area and 54% and 58% on power, respectively, with a reduced error rate compared with the first model. Third model achieves the mean relative error and normalized error distance as low as 0.05% and 0.009%, respectively. Performance of approximate units is evaluated with a noisy image smoothing application, where the proposed models are capable of achieving higher peak signal to-noise ratio than the existing state-of-the-art techniques. It is shown that the proposed approximate models achieve higher processing accuracy than existing works but with significant improvements in power and performance.
Approximate hybrid high radix encoding for energy efficient inexact multipliersNxfee Innovation
Approximate computing forms a design alternative that exploits the intrinsic error resilience of various applications and produces energy-efficient circuits with small accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding. The approximations are performed by rounding the high radix values to their nearest power of two. The proposed technique can be configured to achieve the desired energy–accuracy tradeoffs. Compared with the accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55% area savings, when operating at the same frequency, while the imposed error is bounded by a Gaussian distribution with near-zero average. Moreover, the proposed multipliers are compared with state-of-the-art inexact multipliers, outperforming them by up to 40% in energy consumption, for similar error values. Finally, we demonstrate the scalability of our technique.
Analysis and design of cost effective, high-throughput ldpc decodersNxfee Innovation
This paper introduces a new approach to cost effective, high-throughput hardware designs for low-density parity-check (LDPC) decoders. The proposed approach, called nonsurjective finite alphabet iterative decoders (NS-FAIDs), exploits the robustness of message-passing LDPC decoders to inaccuracies in the calculation of exchanged messages, and it is shown to provide a unified framework for several designs previously proposed in the literature. NS-FAIDs are optimized by density evolution for regular and irregular LDPC codes, and are shown to provide different tradeoffs between hardware complexity and decoding performance. Two hardware architectures targeting high-throughput applications are also proposed, integrating both Min-Sum (MS) and NS-FAID decoding kernels. ASIC post synthesis implementation results on 65-nm CMOS technology show that NS-FAIDs yield significant improvements in the throughput to area ratio, by up to 58.75% with respect to the MS decoder, with even better or only slightly degraded error correction performance.
An energy efficient programmable many core accelerator for personalized biome...Nxfee Innovation
Wearable personalized health monitoring systems can offer a cost-effective solution for human health care. These systems must constantly monitor patients’ physiological signals and provide highly accurate, and quick processing and delivery of the vast amount of data within a limited power and area footprint. These personalized biomedical applications require sampling and processing multiple streams of physiological signals with a varying number of channels and sampling rates. The processing typically consists of feature extraction, data fusion, and classification stages that require a large number of digital signal processing (DSP) and machine learning (ML) kernels. In response to these requirements, in this paper, a tiny, energy efficient, and domain-specific manycore accelerator referred to as power-efficient nano clusters (PENC) is proposed to map and execute the kernels of these applications. Simulation results show that the PENC is able to reduce energy consumption by up to 80% and 25% for DSP and ML kernels, respectively, when optimally parallelized. In addition, we fully implemented three compute-intensive personalized biomedical applications, namely, multichannel seizure detection, multi physiological stress detection, and standalone tongue drive system (sTDS), to evaluate the proposed manycore performance relative to commodity embedded CPU, graphical processing unit (GPU), and field programmable gate array (FPGA)-based implementations. For these three case studies, the energy consumption and the performance of the proposed PENC manycore, when acting as an accelerator along with an Intel Atom processor as a host, are compared with the existing commercial off-the-shelf general purpose, customizable, and programmable embedded platforms, including Intel Atom, Xilinx Artix-7 FPGA, and NVIDIA TK1 advanced RISC machine -A15 and K1 GPU system on a chip. 
For these applications, the PENC manycore is able to significantly improve throughput and energy efficiency by up to 1872× and 276×, respectively. For the most computationally intensive application, seizure detection, the PENC manycore is able to achieve a throughput of 15.22 giga-operations-per-second (GOPs), which is a 14× improvement in throughput over a custom FPGA solution. For stress detection, the PENC achieves a throughput of 21.36 GOPs and an energy efficiency of 4.23 GOP/J, which are 14.87× and 2.28× better than the FPGA implementation, respectively. For the sTDS application, the PENC improves throughput by 5.45× and energy efficiency by 2.37× over the FPGA implementation.
Algorithm and vlsi architecture design of proportionate type lms adaptive fil...Nxfee Innovation
Proportionate-type normalized LMS (Pt-NLMS) family of adaptive filtering algorithms for sparse system identification pose significant implementation challenges due to their high computational complexity especially for real-time applications like network echo cancelation. In this paper, we make the first attempt to implement Pt-NLMS algorithms in hardware. Several reformulations are proposed to simplify the original Pt-NLMS algorithms, thereby making them amenable to real time VLSI implementations and the reformulated algorithms referred as delayed µ-law proportionate LMS (DMPLMS) algorithm for white input and delayed wavelet MPLMS (DWMPLMS) for colored input are then implemented in hardware. Simulation studies demonstrate that the performance loss is very small for the proposed reformulations. We implemented the proposed designs considering 16-bit fixed point representation in hardware, and synthesis results show that the DMPLMS architecture with ≈30% increase in hardware over the state-of-the-art conventional delayed LMS architecture achieves 3× improvement in convergence rate for white input and the DWMPLMS architecture with ≈70% increase in hardware achieves 10× improvement in convergence rate for correlated input conditions.
A reconfigurable ldpc decoder optimized applicationsNxfee Innovation
This paper presents a high data-rate low-density parity-check (LDPC) decoder, suitable for the 802.11n/ac (WiFi) standard. The innovative features of the proposed decoder relate to the decoding algorithms and the interconnection between the processing elements. The reduction of the hardware complexity of decoders based on the min-sum (MS) algorithms comes at the cost of performance degradation, especially at high-noise regions. We introduce more accurate approximations of the log sum-product algorithm that also operate well for low signal-to-noise ratio values. Telecommunication standards, including WiFi, support more than one quasi-cyclic LDPC code of different characteristics, such as codeword length and code rate. A proposed design technique derives networks capable of supporting a variety of codes and efficiently realizing connectivity between a variable number of processing units, with a relatively small hardware overhead over the single-code case. As a demonstration of the proposed technique, we implemented a reconfigurable network based on barrel rotators, suitable for LDPC decoders compatible with the WiFi standard. Our approach achieves low complexity and high clock frequency, compared with related prior works. A 90-nm application-specific integrated circuit implementation of the proposed highly parallel WiFi decoder occupies 4.88 mm² and achieves an information throughput rate of 4.5 Gbit/s at a clock frequency of 555 MHz.
A flexible wildcard pattern matching accelerator via simultaneous discrete fi...Nxfee Innovation
Regular expression matching becomes indispensable elements of Internet of Things network security. However, traditional ternary content addressable memory (TCAM) search engine is unable to handle patterns with wildcards, as it precisely tracks only one active state with single transition. This paper proposes a promising simultaneous pattern matching methodology for wildcard patterns by two separated engines to represent discrete finite automata. A key preprocessing to encode possible postfix pattern by a unique key ensures that follow-up patterns can accurately traverse all possible matches with limited hardware resources. This approach is practical and scalable for achieving good performance and low space consumption in network security, and it can be applicable to any regular expressions even with multi wildcard patterns. The experimental results demonstrate that this scheme can efficiently and accurately recognize wildcard patterns by simultaneously tracking only two active states. By adopting SRAM TCAM in the proposed architecture, the energy consumption is reduced to around 39%, compared with the energy consumption using a computing system that contains a large memory lookup and comparison overhead.
A fast and low complexity operator for the computation of the arctangent of a...Nxfee Innovation
The computation of the arctangent of a complex number, i.e., the atan2 function, is frequently needed in hardware systems that could profit from an optimized operator. In this brief, we present a novel method to compute the atan2 function and a hardware architecture for its implementation. The method is based on a first stage that performs a coarse approximation of the atan2 function and a second stage that improves the output accuracy by means of a lookup table. We present results for fixed-point implementations in a field-programmable gate array device, all of them guaranteeing last-bit accuracy, which provide an advantage in latency, speed, and use of resources, when compared with well-established fixed-point options.
A closed form expression for minimum operating voltage of cmos d flip-flopNxfee Innovation
In this paper, a closed-form expression for estimating the minimum operating voltage (VDDmin) of D flip-flops (FFs) is proposed. VDDmin is defined as the minimum supply voltage at which the FFs are functional without errors. The proposed expression indicates that VDDmin of FFs is a linear function of the square root of the logarithm of the number of FFs; its slope depends on the within-die variation of the threshold voltage (VTH), and its intercept depends on the balance between PMOS and NMOS, which is mainly due to the die-to-die VTH variation. The proposed expression of VDDmin is validated by the simulation results as well as the silicon measurements. Finally, we discuss the dependence of VDDmin on the device parameters.
Efficient fpga mapping of pipeline sdf fft cores
NXFEE INNOVATION
(SEMICONDUCTOR IP & PRODUCT DEVELOPMENT)
(ISO 9001:2015 Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Efficient FPGA Mapping of Pipeline SDF FFT Cores
Abstract:
In this paper, an efficient mapping of the pipeline single-path delay feedback (SDF) fast
Fourier transform (FFT) architecture to field-programmable gate arrays (FPGAs) is
proposed. By considering the architectural features of the target FPGA, significantly
better implementation results are obtained. This is illustrated by mapping an R22SDF
1024-point FFT core toward both Xilinx Virtex-4 and Virtex-6 devices. The optimized
FPGA mapping is explored in detail. Algorithmic transformations that allow a better
mapping are proposed, resulting in implementation results that by far outperform
earlier published work. For Virtex-4, the results show a 350% increase in throughput per
slice and 25% reduction in block RAM (BRAM) use, with the same amount of DSP48
resources, compared with the best earlier published result. The resulting Virtex-6 design
sees even larger increases in throughput per slice compared with the Xilinx FFT IP core,
using half as many DSP48E1 blocks and fewer BRAM resources. The results clearly show
that the FPGA mapping is crucial, not only the architecture and algorithm choices.
Software Implementation:
ModelSim
Xilinx ISE 14.2
Existing System:
Field-programmable gate array (FPGA) technology keeps maturing and continuously finds
applications where application-specific integrated circuits (ASICs) and
application-specific standard products (ASSPs) were used earlier.
Advantages of FPGAs include shorter design time and lower nonrecurring expenses
compared with ASICs, and higher performance and lower power consumption compared
with ASSPs. It is crucial for an efficient FPGA implementation that the hardware
structure of the FPGA family is considered. In this paper, we propose transformations for
a single-stream pipeline fast Fourier transform (FFT) core that not only reduce the
amount of resources, but also reduce the critical path, when mapped to two different
FPGA families. The discrete Fourier transform (DFT) is a commonly used transform in
signal processing applications. It is used in communication schemes that use orthogonal
frequency division multiplexing and in spectral analysis. The FFT is a collection of
algorithms for efficient computation of the DFT. Many different FFT architectures have
been proposed, processing a different number of samples per iteration. For many
high-speed applications, it is useful to process one or a few samples per iteration in a
streaming manner. Suitable architectures for this are often referred to as pipeline FFT
architectures. As the name suggests, they consist of a pipeline of butterfly (BF) stages
that consume either one (single-stream) or several (parallel-stream) samples per clock
cycle. In this paper, single-stream pipeline FFT implementation is considered. However,
many of the presented techniques can also be used when mapping parallel-stream
FFT architectures to FPGAs. Most FFT architectures can use any of the
many possible algorithms, since the only difference is the coefficients of the twiddle
factor multipliers. The potential benefit of these different algorithms is that some of the
multipliers need only a few simple coefficients and, hence, can be simplified, or that
smaller coefficient memories are required. Earlier attempts at optimizing the FPGA
implementations of FFT cores have, however, mainly concentrated on the algorithmic
and/or architectural level, or on the mapping between algorithm and architecture.
As is evident from this paper, significant improvements can be obtained when
mapping a given FFT architecture by utilizing the architectural features of the FPGA.
Hence, it is not only the FFT architecture that affects the results but also the mapping to
the hardware. We illustrate this by mapping a radix-2^2 SDF FFT to two contemporary
FPGAs. One reason for selecting the radix-2^2 SDF FFT is that it is well established and
often used. Hence, the benefits provided in this paper come from the proposed
techniques. In this paper, we propose transformations to map butterflies to fewer lookup
tables (LUTs), propose transformations that efficiently enable the use of DSP block preadders
for implementing BF adders, propose efficient mapping of data and twiddle factor storage
to BRAM and distributed resources, propose efficient sharing of twiddle factor memories
for radix-2^k algorithms, and carefully discuss how retiming and pipelining are added to
improve timing. This results in large reductions in the FPGA slice resources needed for
implementing the SDF BF processing elements, efficient use of FPGA embedded
multiplier and memory resources, and very high FPGA clock frequencies that reach the
maximal frequencies for hard FPGA components. We choose to target two different
FPGA families from Xilinx: one with four-input LUTs, Virtex-4, and one with six-input
LUTs, Virtex-6. Since those are the LUT sizes
commonly used, it should be easy to generalize our results to other FPGA families.
The later 7 series and UltraScale/UltraScale+ FPGAs from Xilinx use
virtually the same slice architecture as Virtex-6, so in that case, the results should be very
easy to generalize. Since all FFT architecture stages (as well as many other DSP
algorithms) consist of the same building blocks, i.e., adders, multiplexers, and delay
elements, it should be possible to apply similar transformations to improve mapping
efficiency in their implementation as well.
Disadvantages:
Mapping efficiency is lower
Power consumption is higher
Proposed System:
SDF pipeline FFT architecture
Fig. 1. N-point radix-2^2 SDF pipeline FFT core.
Fig. 2. Structure of (a) SDF BF and (b) trivial multiplier (W4).
The structure of an N-point radix-2^2 SDF pipeline is shown in Fig. 1. As can be seen, it
consists of a pipeline of S = log2 N radix-2 SDF BF processing elements, i.e., ordinary
radix-2 BF elements with data management multiplexers at their outputs. The total amount
of memory required is $\sum_{i=0}^{S-1} 2^i = N - 1$, which is minimal for single-stream FFT
architectures. Between every stage, there is a twiddle factor multiplier. Every second of
these is considered trivial, since it only involves a conditional multiplication by −j
(where j is the imaginary unit), which can be achieved by swapping the real and imaginary
components and negating the resulting imaginary part.
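The stage structure described above can be sketched in software. The following Python function is only an illustrative reference model, assuming a decimation-in-frequency radix-2^2 decomposition; it is not the hardware pipeline, and the names `dft`, `mul_minus_j`, and `fft_radix22` are hypothetical. It shows where the trivial −j multiplication and the general twiddle factors occur. Note that it writes results to natural-order indices, whereas the SDF pipeline delivers them in bit-reversed order.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def mul_minus_j(z):
    """Trivial multiplication by -j: swap real and imaginary parts,
    then negate the resulting imaginary part."""
    return complex(z.imag, -z.real)


def fft_radix22(x):
    """Recursive radix-2^2 DIF FFT for power-of-two lengths (illustrative model)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:  # a single radix-2 butterfly remains when log2(N) is odd
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        s1 = -1 if k1 else 1
        # First butterfly stage: combine samples N/2 apart.
        top = [x[i] + s1 * x[i + 2 * q] for i in range(q)]
        bot = [x[i + q] + s1 * x[i + 3 * q] for i in range(q)]
        if k1:
            # Trivial multiplier between the two butterfly stages.
            bot = [mul_minus_j(b) for b in bot]
        for k2 in (0, 1):
            s2 = -1 if k2 else 1
            # Second butterfly stage: combine samples N/4 apart.
            h = [t + s2 * b for t, b in zip(top, bot)]
            m = k1 + 2 * k2
            # General twiddle factor W_N^(i * (k1 + 2*k2)).
            h = [h[i] * cmath.exp(-2j * cmath.pi * i * m / n) for i in range(q)]
            sub = fft_radix22(h)
            for k3 in range(q):
                out[m + 4 * k3] = sub[k3]
    return out
```

A quick check is to compare `fft_radix22(x)` against `dft(x)` for a short power-of-two input; the two should agree up to floating-point rounding.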
Trivial multipliers are indicated with × and general multipliers with a different symbol in Fig. 1.
The general multipliers have a 75% utilization. The internal structures of the SDF BF and
the trivial multiplier (W4) are shown in Fig. 2. The control path of this architecture is
simple. To generate the control signals, a log2(N)-bit synchronous counter is used. This
counter can also be used to address the twiddle factor memories. The least significant bit
of the counter controls the data management multiplexers of the last stage (as the control
signal S).
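As a behavioral illustration of one SDF BF stage, the sketch below models the feedback shift register and the data management multiplexers in Python. It is a minimal model under stated assumptions: the class name `SdfButterflyStage` is hypothetical, and the polarity of the control signal (s = 1 meaning that the butterfly is active) is assumed rather than taken from Fig. 2.

```python
from collections import deque


class SdfButterflyStage:
    """Behavioral model of a radix-2 SDF butterfly stage (illustrative only)."""

    def __init__(self, delay):
        # Feedback shift register of the given length, initially zero.
        self.fifo = deque([0j] * delay, maxlen=delay)

    def step(self, x, s):
        """Consume one input sample and produce one output sample.

        s = 0: pass the oldest stored value to the output and store the input.
        s = 1: output the butterfly sum and store the butterfly difference.
        (This polarity is an assumption of the sketch.)
        """
        a = self.fifo[0]  # oldest value in the shift register
        if s:
            out, feedback = a + x, a - x
        else:
            out, feedback = a, x
        self.fifo.append(feedback)  # maxlen discards the oldest entry
        return out
```

For a stage with a delay of 2, feeding the samples 1, 2, 3, 4 with s = 0, 0, 1, 1 yields the sums 4 and 6 during the second half, and the differences −2 and −2 appear at the output during the following s = 0 phase.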
Similarly, the remaining counter bits, in order, control the output multiplexers of the other
stages. The trivial multipliers are controlled by the control signals of the next stage (as S)
and the previous stage (as T). SDF architectures like this, where the length of the feedback
shift registers is halved for each stage, operate on normal-order input data and return the
output in bit-reversed order. If an architecture with bit-reversed input is used, the feedback
shift register sizes appear in the opposite order and the twiddle factor values and control
signals differ, but all the proposed techniques can still be applied.
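The following is a minimal behavioural sketch of the counter-driven control and the bit-reversed output order. It assumes a plain radix-2 decimation-in-frequency twiddle placement, with a general twiddle factor after every stage. It does not model the split into trivial and general multipliers, the T control signal, or any pipelining. The function name `sdf_fft` and all variable names are hypothetical.

```python
import cmath


def sdf_fft(x):
    """Cycle-by-cycle model of a radix-2 SDF pipeline (normal-order input)."""
    n = len(x)
    stages = n.bit_length() - 1  # S = log2(N)
    # Feedback shift registers: N/2, N/4, ..., 1 words.
    fifos = [[0j] * (n >> (s + 1)) for s in range(stages)]
    out = []
    for t in range(2 * n - 1):  # t plays the role of the control counter
        sample = x[t] if t < n else 0j  # flush the pipeline with zeros
        for s in range(stages):
            length = n >> (s + 1)
            # Stage s is controlled by counter bit S-1-s; the last stage uses the LSB.
            ctrl = (t >> (stages - 1 - s)) & 1
            delayed = fifos[s].pop(0)
            if ctrl:
                # Butterfly: the difference is fed back, the sum goes to the output.
                fifos[s].append(delayed - sample)
                sample = delayed + sample
            else:
                # Store the input; emit the earlier difference times its twiddle factor.
                fifos[s].append(sample)
                k = t % length
                sample = delayed * cmath.exp(-2j * cmath.pi * k / (2 * length))
        if t >= n - 1:  # the first valid output appears after N - 1 cycles
            out.append(sample)
    return out


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)


data = [complex(v, 0) for v in range(8)]
spectrum = sdf_fft(data)
print([round(abs(v), 3) for v in spectrum])
```

In this model, output position i holds frequency bin `bit_reverse(i, 3)` for N = 8. A single counter suffices because the accumulated delay in front of each stage is a multiple of that stage's block length.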
Fig. 3. Simplified view of the implementation of a full adder with registered sum in Virtex-4 FPGAs.
To describe the mapping of the FFT architecture to the chosen FPGAs, its constituent
parts are first studied in isolation. Then, some optimization opportunities that are
relevant only for FPGAs with preadder-equipped hard multipliers are considered for the
Virtex-6 design. After this, pipelining and retiming are performed to increase the
maximum clock frequency, f_max, of the Virtex-4 and Virtex-6 designs, respectively.
Complex Multipliers
Virtex-4 DSP48 blocks contain hard 18 × 18 bit multipliers followed by 48-bit
accumulators. In the Virtex-6 DSP48E1 block, the hard multiplier is 18 × 25 bit instead. If
such blocks are available, they are suitable for the implementation of the complex
multipliers in the FFT core. We have implemented the complex multipliers using four
real multipliers. This requires four DSP48(E1) units per complex multiplier, and the
latency of one complex multiplier is four clock cycles. This level of pipelining
enables the complex multiplier to work at the maximal frequency of the multiplier blocks
in both Virtex-4 and Virtex-6. The option to implement the complex multiplication with
three multipliers was not considered, since this introduces an additional tradeoff between
area and performance that is better addressed in other works. It is also worth noting
that twiddle factor multipliers with few angles can be efficiently implemented without
DSP blocks using add-and-shift techniques (in slice logic). For the same tradeoff
reasons, this is not considered here. However, such techniques can readily be combined
with the proposed mapping.
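The two options for complex multiplication can be compared arithmetically with the sketch below. The function names are hypothetical, and the three-multiplier form shown is one common textbook rearrangement, not necessarily the one the alternative works use. The code says nothing about DSP48 pipelining or word lengths.

```python
def complex_mul_4(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) with four real multiplications and two additions."""
    real = ar * br - ai * bi
    imag = ar * bi + ai * br
    return real, imag


def complex_mul_3(ar, ai, br, bi):
    """Same product with three real multiplications and five additions."""
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    return k1 - k3, k1 + k2


print(complex_mul_4(3, 4, 5, -2))
print(complex_mul_3(3, 4, 5, -2))
```

Both functions return the same product. The three-multiplier variant saves one multiplier at the cost of extra adders, which is the area and performance tradeoff the text sets aside.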
Butterflies
Adders Implemented in FPGAs: To allow implementation of fast ripple-carry adders,
contemporary FPGAs have dedicated logic in each fundamental building block for this
purpose. In both Virtex-4 and Virtex-6, each bit in an adder is implemented partly in an
LUT and partly in dedicated logic located at the LUT output. The Virtex-4
implementation of 1 bit of an adder is outlined in Fig. 3. For brevity, significant
simplifications are made. As can be seen, the XOR operation on the two input bits is
performed in the LUT. The output of the LUT is then used to produce both the carry-out
and the sum bits. A subtractor is implemented by inverting the B input and applying a one
at the carry input of the least significant bit.
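The adder bit described above can be modelled in a few lines of Python. The sketch assumes the usual carry-chain arrangement, in which the LUT output selects between the incoming carry and one of the operand bits for the carry out, and a dedicated XOR forms the sum. The function names `adder_bit` and `ripple_add` are hypothetical.

```python
def adder_bit(a, b, carry_in):
    """One adder bit: XOR in the LUT, carry mux and sum XOR in dedicated logic."""
    propagate = a ^ b                          # computed in the LUT
    carry_out = carry_in if propagate else a   # carry multiplexer (assumed structure)
    sum_bit = propagate ^ carry_in             # dedicated XOR at the LUT output
    return sum_bit, carry_out


def ripple_add(a, b, width, subtract=False):
    """Ripple-carry add or subtract, result taken modulo 2**width."""
    carry = 1 if subtract else 0               # a one at the least significant carry input
    result = 0
    for i in range(width):
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        if subtract:
            bit_b ^= 1                         # invert the B input
        sum_bit, carry = adder_bit(bit_a, bit_b, carry)
        result |= sum_bit << i
    return result


print(ripple_add(5, 3, 4))                 # 5 + 3
print(ripple_add(5, 3, 4, subtract=True))  # 5 - 3
```

Inverting B and setting the first carry input to one computes A + (NOT B) + 1, which is A − B in two's complement.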
Advantages:
Power consumption is lower
Mapping efficiency is higher