1. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Efficient FPGA Mapping of Pipeline SDF FFT Cores
Abstract:
In this paper, an efficient mapping of the pipeline single-path delay feedback (SDF) fast
Fourier transform (FFT) architecture to field-programmable gate arrays (FPGAs) is
proposed. By considering the architectural features of the target FPGA, significantly
better implementation results are obtained. This is illustrated by mapping an R2^2SDF
1024-point FFT core to both Xilinx Virtex-4 and Virtex-6 devices. The optimized
FPGA mapping is explored in detail. Algorithmic transformations that allow a better
mapping are proposed, resulting in implementations that by far outperform
earlier published work. For Virtex-4, the results show a 350% increase in throughput per
slice and a 25% reduction in block RAM (BRAM) use, with the same amount of DSP48
resources, compared with the best earlier published result. The resulting Virtex-6 design
sees even larger increases in throughput per slice compared with the Xilinx FFT IP core,
using half as many DSP48E1 blocks and fewer BRAM resources. The results clearly show
that the FPGA mapping is crucial, not only the architecture and algorithm choices.
Software Implementation:
Modelsim
Xilinx 14.2
Existing System:
Field-programmable gate array (FPGA) technology keeps maturing
and continuously finds applications where application-specific integrated circuits
(ASICs) and application-specific standard products (ASSPs) were used earlier.
Advantages of FPGAs include shorter design time and lower nonrecurring expenses
compared with ASICs, and higher performance and lower power consumption compared
with ASSPs. For an efficient FPGA implementation, it is crucial that the hardware
structure of the FPGA family is considered. In this paper, we propose transformations for
a single-stream pipeline fast Fourier transform (FFT) core that reduce not only the
amount of resources but also the critical path when mapped to two different
FPGA families. The discrete Fourier transform (DFT) is a commonly used transform in
signal processing applications. It is used in communication schemes based on orthogonal
frequency division multiplexing and in spectral analysis. The FFT is a collection of
algorithms for efficient computation of the DFT. Many different FFT architectures have
been proposed, processing different numbers of samples per iteration. For many
high-speed applications, it is useful to process one or a few samples per iteration in a
streaming manner. Suitable architectures for this are often referred to as pipeline FFT
architectures. As the name suggests, they consist of a pipeline of butterfly (BF) stages
that consume either one (single-stream) or several (parallel-stream) samples per clock
cycle. In this paper, single-stream pipeline FFT implementation is considered. However,
many of the presented techniques can also be used when mapping parallel-stream
FFT architectures to FPGAs. Most FFT architectures can use any of the
many possible algorithms, since the only difference lies in the coefficients of the twiddle
factor multipliers. The potential benefit of the different algorithms is that some of the
multipliers need only a few simple coefficients and can hence be simplified, or that
smaller coefficient memories are required. Earlier attempts at optimizing the FPGA
implementations of FFT cores have, however, mainly concentrated on the algorithmic
and/or architectural level, or on the mapping between algorithm and architecture.
As is evident from this paper, significant improvements can be obtained when
mapping a given FFT architecture by utilizing the architectural features of the FPGA.
Hence, it is not only the FFT architecture that affects the results but also the mapping to
the hardware. We illustrate this by mapping a radix-2^2 SDF FFT to two contemporary
FPGAs. One reason for selecting the radix-2^2 SDF FFT is that it is well established and
often used. Hence, the benefits provided in this paper come from the proposed
techniques. In this paper, we propose transformations that map butterflies to fewer lookup
tables (LUTs), propose transformations that make efficient use of DSP block pre-adders
for implementing BF adders, propose efficient mapping of data and twiddle factor storage
to BRAM and distributed resources, propose efficient sharing of twiddle factor memories
for radix-2^k algorithms, and carefully discuss how retiming and pipelining are added to
improve timing. This results in large reductions in the FPGA slice resources needed for
implementing the SDF BF processing elements, efficient use of FPGA embedded
multiplier and memory resources, and very high FPGA clock frequencies reaching the
maximal frequencies for hard FPGA components. We choose to target two different
FPGA families: one with four-input LUTs, Virtex-4, and one with six-input LUTs,
Virtex-6. Both FPGA families are from Xilinx. Since those are the commonly used
FPGA LUT sizes, it should be easy to generalize our results to other FPGA families given
this choice. The later 7-series and UltraScale/UltraScale+ FPGAs from Xilinx use
virtually the same slice architecture as Virtex-6, so in that case, the results should be very
easy to generalize. Since all FFT architecture stages (as well as many other DSP
algorithms) consist of the same building blocks, i.e., adders, multiplexers, and delay
elements, it should be possible to apply similar transformations to improve mapping
efficiency in their implementation as well.
Disadvantages:
Mapping efficiency is lower
Power consumption is higher
Proposed System:
SDF pipeline FFT architecture
The structure of an N-point radix-2^2 SDF pipeline is shown in Fig. 1. As can be seen, it
consists of a pipeline of $S = \log_2 N$ radix-2 SDF BF processing elements, i.e., ordinary
radix-2 BF elements with data management multiplexers at their outputs. The total amount
of memory required is $\sum_{i=0}^{S-1} 2^i = N - 1$, which is minimal for single-stream FFT
architectures. Between every stage, there is a twiddle factor multiplier. Every second of
these is considered trivial, since it only involves a conditional multiplication by −j
(where j is the imaginary unit), which can be achieved by swapping the real and imaginary
components and negating the resulting imaginary part.
Fig. 1. N-point radix-2^2 SDF pipeline FFT core.
Fig. 2. Structure of (a) SDF BF and (b) trivial multiplier (W4).
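The memory total and the trivial multiplication described above can be checked with a few lines of Python. This is a minimal sketch, not the paper's implementation: the function names `feedback_lengths` and `mul_minus_j` are illustrative, and the stage ordering (longest feedback register first) is assumed from the normal-input-order pipeline described in this section.

```python
def feedback_lengths(n):
    """Feedback shift-register length of each stage of an n-point SDF pipeline,
    first stage first (n/2, n/4, ..., 1). Assumes n is a power of two."""
    stages = n.bit_length() - 1          # S = log2(n)
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two >= 2")
    return [n >> (s + 1) for s in range(stages)]


def mul_minus_j(re, im, enable=True):
    """Conditional multiplication of (re + j*im) by -j:
    swap the two components, then negate the new imaginary part."""
    if not enable:
        return re, im
    return im, -re


lengths = feedback_lengths(1024)
print(len(lengths), sum(lengths))   # 10 1023
print(mul_minus_j(3, 4))            # (4, -3)
```

For N = 1024 the ten stages together store 1023 samples, matching N − 1, and the swap-and-negate result agrees with the product (3 + 4j)(−j) = 4 − 3j.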
Trivial multipliers are indicated with × and general multipliers with a different symbol in Fig. 1.
The general multipliers have a 75% utilization. The internal structures of the SDF BF and
the trivial multiplier (W4) are shown in Fig. 2. The control path of this architecture is
simple. To generate the control signals, a log2(N)-bit synchronous counter is used. This
counter can also be used to address the twiddle factor memories. The least significant bit
of the counter controls the data management multiplexers of the last stage (as the control
signal S).
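One way to see where the 75% utilization figure comes from is to count how often a general multiplier receives a coefficient other than 1. The sketch below assumes the textbook radix-2^2 decimation-in-frequency twiddle exponent n3·(k1 + 2·k2) for the first general multiplier. That formula and the function names are assumptions for illustration and are not taken from this document. The order in which the exponents appear in hardware may differ, but the count does not depend on the order.

```python
def twiddle_exponents(n):
    """Exponents e of W_n^e seen by the first general multiplier of an n-point
    radix-2^2 DIF pipeline, assuming the form e = n3 * (k1 + 2*k2)."""
    quarter = n // 4
    return [n3 * (k1 + 2 * k2)
            for k2 in (0, 1)
            for k1 in (0, 1)
            for n3 in range(quarter)]


def utilization(n):
    """Fraction of cycles in which the coefficient differs from 1."""
    exps = twiddle_exponents(n)
    nontrivial = sum(1 for e in exps if e % n != 0)
    return nontrivial / len(exps)


print(round(utilization(1024), 3))   # 0.747
```

Under this assumption, one quarter of the coefficients (the k1 = k2 = 0 block) are exactly 1, which gives a utilization of roughly 75%.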
The remaining counter bits, in order, control the output multiplexers of the other stages. The
trivial multipliers are controlled by the control signals of the next stage (as S) and the
previous stage (as T). SDF architectures like this, where the length of the feedback shift
registers is halved for each stage, operate on normal-input-order data and return the
output in bit-reversed order. If an architecture with bit-reversed input is used, the sizes
of the feedback shift registers will be in the opposite order, and the twiddle factor
values and control signals will differ, but all the proposed techniques can still be applied.
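The counter-to-stage assignment and the bit-reversed output order can be sketched as follows. This is an illustrative model only: the names `stage_selects` and `bit_reverse` are hypothetical, and the pipeline latency between stages, which a real design has to compensate for, is ignored here.

```python
def stage_selects(count, stages):
    """Multiplexer select S of every stage (first stage first) for one counter
    value: the most significant bit drives the first stage, the least
    significant bit drives the last stage."""
    return [(count >> (stages - 1 - s)) & 1 for s in range(stages)]


def bit_reverse(index, stages):
    """Reverse the `stages` low-order bits of index."""
    out = 0
    for _ in range(stages):
        out = (out << 1) | (index & 1)
        index >>= 1
    return out


S = 3  # 8-point example
print(stage_selects(0b101, S))                 # [1, 0, 1]
print([bit_reverse(i, S) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```

The second line lists which frequency bin leaves an 8-point pipeline at each output position when the input arrives in normal order.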
Fig. 3. Simplified view of the implementation of a full adder with registered sum in Virtex-4 FPGAs.
To describe the mapping of the FFT architecture to the chosen FPGAs, its constituent
parts are first studied in isolation. Then, some optimization opportunities that are
relevant only for FPGAs with pre-adder-equipped hard multipliers are considered for the
Virtex-6 design. After this, pipelining and retiming are applied to increase the
maximum clock frequency, f_max, of the Virtex-4 and Virtex-6 designs, respectively.
Complex Multipliers
Virtex-4 DSP48 blocks contain hard 18 × 18 bit multipliers followed by 48-bit
accumulators. In the Virtex-6 DSP48E1 block, the hard multiplier is 18 × 25 bit instead. If
such blocks are available, they are suitable for the implementation of the complex
multipliers in the FFT core. We have implemented the complex multipliers using four
real multipliers. This requires four DSP48(E1) units per complex multiplier, and the
latency of one complex multiplier is four clock cycles. This level of pipelining
enables the complex multiplier to work at the maximal frequency for the multiplier blocks
in both Virtex-4 and Virtex-6. The option to implement the complex multiplication with
three multipliers was not considered, since this introduces an additional tradeoff between
area and performance that is better addressed in other works. It is also worth noting
that twiddle factor multipliers with few angles can be efficiently implemented without
DSP blocks using add-and-shift techniques (using slice logic). For the same tradeoff
reasons, this is not considered here. However, such techniques can readily be combined
with the proposed techniques.
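The two complex-multiplication options mentioned above differ only in how the real products are arranged. The following sketch uses plain Python integers to show both. The function names are illustrative, and the code models arithmetic only, not the DSP48 pipelining or word lengths.

```python
def cmul4(a, b, c, d):
    """(a + jb)(c + jd) with four real multiplications and two additions."""
    return a * c - b * d, a * d + b * c


def cmul3(a, b, c, d):
    """Same product with three real multiplications, at the cost of extra
    additions. For a fixed twiddle factor c + jd, the terms (d - c) and
    (c + d) could be precomputed and stored instead of c and d."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2


print(cmul4(3, 4, 5, -2))   # (23, 14)
print(cmul3(3, 4, 5, -2))   # (23, 14)
```

The three-multiplier form trades one multiplier for additional adders and wider operands, which is the area and performance tradeoff the text refers to.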
Butterflies
Adders Implemented in FPGAs: To allow implementation of fast ripple-carry adders,
contemporary FPGAs have dedicated carry logic in each fundamental building block.
In both Virtex-4 and Virtex-6, each bit in an adder is implemented partly in an
LUT and partly in dedicated logic located at the LUT output. The Virtex-4
implementation of one bit of an adder is outlined in Fig. 3. For brevity, significant
simplifications are made. As can be seen, the XOR operation of the two input bits is
performed in the LUT. The output of the LUT is then used to produce both the carry-out
and the sum bits. A subtractor is implemented with an inversion on the B input and a one
as the least significant carry input bit.
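The adder bit described above can be modeled at the bit level. Fig. 3 is not reproduced here, so the sketch below assumes the usual carry-chain arrangement: the LUT output selects a carry multiplexer and feeds a sum XOR. That arrangement and the function names are assumptions for illustration.

```python
def adder_bit(a, b, cin):
    """One adder bit: p models the LUT output, the carry multiplexer and the
    sum XOR model the dedicated logic at the LUT output."""
    p = a ^ b                  # LUT: XOR of the two input bits
    cout = cin if p else a     # carry multiplexer selected by p
    s = p ^ cin                # sum bit
    return s, cout


def ripple(a, b, width, subtract=False):
    """width-bit ripple-carry add or subtract, result modulo 2**width.
    Subtraction inverts every B bit and sets the first carry input to one."""
    carry = 1 if subtract else 0
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if subtract:
            bi ^= 1
        s, carry = adder_bit(ai, bi, carry)
        result |= s << i
    return result


print(ripple(100, 27, 8))                  # 127
print(ripple(100, 27, 8, subtract=True))   # 73
```

Because the inversion of B can be absorbed into the same LUT that computes the XOR, a subtractor costs no more LUTs than an adder in this model.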
Advantages:
Power consumption is lower
Mapping efficiency is higher