D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www...
D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www...
D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www...
D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www...
Upcoming SlideShare
Loading in …5



Published on

International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1152-11551152 | P a g eDesign and Implementation of Pipelined FFT Processor*D.Venkata Kishore, **C.Ram Kumar*Lecturer, ECE Department, JNTUA College of Engineering Pulivendula**Lecturer, ECE Department, JNTUA College of Engineering PulivendulaAbstractIt is important to develop a high-performance FFT processor to meet therequirements of real time and low cost in manydifferent systems. So a radix-2 pipelined FFTprocessor based on Field Programmable GateArray (FPGA) for Wireless Local Area Networks(WLAN) is proposed. Unlike being stored in thetraditional ROM, the twiddle factors in ourpipelined FFT processor can be accessed directly.A novel simple address mapping scheme is alsoproposed. The FFT processor has two pipelines,one is in the execution of complex multiplicationof the butterfly unit, and the other is between theRAM modules, which read input data, storetemporary variables of butterfly unit and outputthe final results. Finally, the pipelined 64-pointFFT processor can be completely implementedwithin only 67 clock cycles.Keywords-FFT; FPGA; address mappingI. INTRODUCTIONFast Fourier Transform (FFT) processor iswidely used in different applications, such asWLAN, image process, spectrum measurements,radar and multimedia communication services [1].However, the FFT algorithm is a demanding taskand it must be precisely designed to get an efficientimplementation. If the FFT processor is madeflexible and fast enough, a portable device equippedwith wireless transmission system is feasible.Therefore, an efficient FFT processor is required forreal-time operations [2] and designing a fast FFTprocessor is a matter of great significance.In the past twenty years, FPGA hasdeveloped rapidly and gradually become universal.Compared with design flow of traditional ASIC,designs based on FPGA have the advantages offlexibility and high performance price ratio. Manyresearchers have studied on pipelined FFT based onFPGA [3], [4], [5]. For instance, in [3], theyproposed an approach to design an FFT processorfor wireless applications, but his design has toomany clock cycles and isn’t fast enough. Incomparison to their designs, we propose a simpleand feasible pipelined implementation of a 32-bit64-point FFT processor based on FPGA for WLAN.This paper is organized as follows. In thenext section, we introduce a basic radix-2 FFTalgorithm to briefly discuss which decimation isbetter to the system, a three-multiplication method,and a novel address mapping scheme which reducesdelay and increases the speed of the system. Insection III, the pipelined FFT architecture isproposed and each unit is also illustrated. Section IVis the implementation of the 64-point FFT processorbased on FPGA, and hardware resources areexplicitly listed out. The last section gives theconclusion.II. FFT ALGORITHM AND ADDRESSMAPPING SCHEMEA. Radix-2 FFT AlgorithmThe FFT algorithm can compute theDiscrete Fourier Transform (DFT) effectively.Given a sequence {x(n)} of N complex numbers, wecan compute its DFT, another sequence {X(k)} of Ncomplex numbers, according to the followingformula [6]N-1NX(k) = ∑ x(n)W nk, k= 0,1,……,N-1. (1)n = 0And according to the different way todecimate, it can be divided into two types, DIF(Decimation in Frequency) and DIT (Decimation inTime). The DIF algorithm is easier to design thanDIT. And considering the finite word length effect,DIF has much more advantages than DIT, such asreducing the additive noise, which is introduced bythe multiplication when it is implemented with thefixed point [7] and reducing the complexity of thewhole system. Consequently, we use the DIFalgorithm to design radix-2 FFT module and most ofcurrent FFT processors are also based on thisalgorithm [8].B. Three-multiplication MethodIt’s undeniable that complex multiplicationis the dominant factor affecting the speed and thethroughput of FFT processor. Computing a complexmultiplication requires four real multipliers and tworeal adders. As we all know, the hardware area of areal multiplier is larger than that of a real adder inFPGA. So we should do our best to convert thecomplex multiplication into addition and subtractionto optimize the whole performance as high aspossible. Having taken into account all operands are32-bit complex numbers, the difference of two
  2. 2. D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1152-11551153 | P a g einputs X m (i) and X m ( j) can beexpressed by z1 = x1 + jy1 and the twiddlefactor WNek= exp(2n π)can be expressed byNz 2 = x2 + jy 2 . So the product of them can be alsoexpressed by z = x + jy . That’s to say, the 16-bit realpart x of the product is equivalent to x1 x2 − y1 y2and the 16-bit imaginary part y is equivalent to x1 y2 + x2 y1 .Therefore, we can transform the product zeasily as the following equationsx = x1 ( x 2 + y 2 ) − y 2 ( x1 + y1 ), (2)y = x1 ( x 2 + y 2 ) − x 2 ( x1 − y1 ). (3)Obviously, using this factorization scheme,the system has some advantages [9]. The number ofreal multiplications is reduced from four to three.And addition has less consumption thanmultiplication. So the system power consumption isalso reduced. In this study, we can save sixteenembedded multiplier 9-bit elements in FPGA. As forthis 64-point FFT processor, the numerical values ofx2 + y2 , x1 − y1 , x1 + y1 , x2 and y2 can be gottenbefore they participate in the real multiplication.C. The Novel Address Mapping SchemeIn this paper, the block size of our systemis 64 points. Having considered the properties ofradix-2 64-point FFT, it needs to read 8 operandsfrom memories at a time so as to achieve a high-speed FFT. As we all know, parallel accessing datais crucial to a system [10]. Thus, these 8 operands inour design are located in different row or column ofmemory blocks and this arrangement ensures that 8conflict-free memory accesses can be performed inparallel. Initially, we use 8 32-bit dual-portmemories to store 64 operands in sequence. Andthen a new linear shift conflict-free address mappingscheme is adopted to change the addresses ofoperands. The primary two-dimensional addressesof operands will be mapped to new ones.For instance, we assume that the originaltwo-dimensional coordinate is (a, b), in which a andb represent the address of the data in one memoryand the number of the 8 memories, respectively.Then, we obtain a new conflict-free address (A, B)by means of the following equationsA=b, (4)B=(a+b)%8. (5)In our design, it can be ensured that nomemory location is read from or written to at thesame time, and this new mapping scheme isfeasible, effective and simple.III. THE PIPELINED FFT PROCESSORARCHITECTUREFor high throughput systems, pipelinedarchitecture is a good choice, and it is also an idealmethod to implement high-speed long-size FFTowing to its regular structure and simple control.The performance of pipelined FFT processor can beimproved by optimizing the structure and savinghardware resources. The block diagram of ourproposed FFT processor is illustrated in Fig.1. Itconsists of four essential units. Control unit, thekernel of the FFT processor, harmonizes the wholesystem. Butterfly unit (BU), which has three-stagepipelined structure, carries out the complexmultiplication. Two dual-port RAMs are used tostore and output data. And AGU, the abbreviationFig 1. The pipelined FFT architecture.of address generator unit, produces 8 3-bit readaddresses and write addresses.A. Control UnitControl unit, which generates all controlsignals for the whole system, is responsible foroperation control of the processor. A 48-bit signalw_con controls the whole FFT processor. And thissignal w_con generates two parameters, write_enand read_en, to control AGU. It also generates sel1and sel2 signals to select data from two RAMs, eachof which is made up of 8 32-bit registers. The BUand the remaining parts are controlled by w_con aswell. This control unit harmonizes all steps of theFFT processor based on a 7-bit counter.B. The DIF Butterfly UnitFor FFT algorithm, the central componentis the BU that calculates the sum and difference oftwo input data, and plays an extremely importantrole in computing the product of the difference andtwiddle factors.
  3. 3. D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1152-11551154 | P a g eWe only use 11 factors W064 , W164 , W264 , W364 ,W464, W564 , W664 , W764 , W864 , W1664 , W2464 toexpress all 32 factors that we need in this FFTprocessor owing to the nk fact that the twiddle factorW N can be separated into two components. Forinstance, W2464 can be derived from the product ofW464 and W2464 . Moreover, the value of W064 is aconstant 1, and for W1664 , only a negative sign isneeded to add to the real part of the relevant data,and then to inverse the real part and imaginary part.So we can eliminate these two factors, that’s to say,actually we just need 9 twiddle factors. In addition,we use 16-bit fixed point decimal to express these 9twiddle factors. Although the fixed point decimalarithmetic isn’t precise enough, it can satisfy therequirements of general systems.Fig 2. Block diagram of BU.Due to the fact that the multiplication oftwiddle factors and corresponding data is veryimportant, three-stage pipeline structure is usedfor the complex multiplication to obtain a highspeed computation. The architecture of BU is shownin Fig. 2.In the BU, both of the complex inputs are32 bits, including 16-bit real part and 16-bitimaginary part. The sum of them needs to be scaleddown by a factor of 2 to avoid arithmetic overflow,and the same operation is applied to the differenceof them. On the other hand, the factor W064 and thefirst parameter con_s of the second multiplexer arenot involved in the complex multiplier, and theycan be used as a constant 1, just as Fig.2 hasdepicted. Thus, the power consumption of thecomplex multiplier can be reduced and the hardwareresources will be saved. There’re some points to beemphasized. The difference and twiddle factors areboth 32 bits, so the result of the first complexmultiplier will be 64 bits. But because we adopt thefixed point decimal computation, we shouldintercept it to a 32-bit parameter f_mult as the inputof the second complex multiplier.The most remarkable advantage in this unitis that we use 3 32-bit registers to realize the three-stage pipeline of butterfly transform, using register1to store the difference, register2 to store theintercepted result of second stage f_mult andregister3 to store the final result cmul_b.C. The RAM UnitThe RAM1 and RAM2 are made up of 832-bit registers respectively. And data is alwayswritten to the outside memories from RAM2, and itis always read to RAM1 from the outside memories.Then let introduce the key algorithm used in thisunit. Considering the properties of 64-point FFT, wecan use the radix-2 DIF 8-point FFT as a whole unit,so thereFig 3 The example of radix-2 8-point FFT.are only two stages to accomplish the 64-point FFT.And these two stages are identical to the six stagesof the standard radix-2 DIF 64-point FFT. Systemparallel reads 8 32-bit operands from outermemories to RAM1 at a time, and we need onlyread sixteen times. An example of this algorithm isshown in Fig.3.The pipeline in RAM units is brieflydiscussed as follows. Firstly, system reads 8operands from outer memories and writes them toRAM1, and then the results will be stored in RAM2after they are computed. Meanwhile, system readsthe next 8 operands. Subsequently, system willaccess operands from RAM1 or RAM2 to computethese 16 operands. Furthermore, within a singleclock four butterfly computations will be executedsimultaneously. The final results of these 16operands will be all stored in RAM2, and then theyare written to outer memories. At the same time,another 8 operands will be read. Accordingly, only76 clock cycles are needed to complete this radix-2DIF 64-point FFT.D. AGUCompared with other units, AGU is alsoquite important. It will create 8 read and 8 writeaddresses, which determine the data access to outermemories.
  4. 4. D.Venkata Kishore, C.Ram Kumar / International Journal of Engineering Research andApplications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1152-11551155 | P a g eIn this FFT processor, we adopt the in-place computation method so as to make it a moresimplified and faster system. That’s to say, we writethe results into where they are read. In contrast tothe sequence of 64 input operands configured byaddress mapping, the final output sequence of thisFFT processor are in bit-reversed order and need tobe adjusted to normal order. And we can make theseappropriate adjustments before the FFTcomputation. So, although we adjust the sequence ofinputs or outputs, the performance of our FFTprocessor won’t be degraded.IV. HARDWARE RESOURCESThe functional simulation and timingsimulation are successfully made. The mainhardware resources of this design are given asfollows. The device is the EP2C70F896C6 ofCyclone II family. The total logic elements are562/68,416 (8%), the total pins are 563/622(91%)and the total embedded multiplier 9-bit elements are48/300(16%). Meanwhile, the clock frequency is31.69MHz. As in [3], they proposed a fixed-pointl6-bit 64-point FFT processor with 92 clock cyclesin total, but our clock cycles is 67. And our FFTprocessor has a higher speed and lower powerconsumption.V. CONCLUSIONThis paper proposes a novel radix-2 FFTprocessor based on FPGA for WLAN, using VerilogHDL as hardware description language and QuartusII as design and synthesis tool. To achieve high-throughput, pipelined architectures have been usedin the butterfly unit and the dual-port RAM. Thededicated parallel-pipelined FFT processorarchitecture can process input data at high speed,and the whole system performance can be greatlyimproved due to adopting a novel simple addressmapping scheme. For radix-two system, thismapping scheme is better and simpler than most ofothers. The design is implemented on a FPGA chip.And this pipelined FFT completes a complex 64-point FFT within 2.1μs. The hardware testing resultexplains that it can meet the requirements of theWLAN.REFERENCES[1] J. A. C. Bingham, “Multicarrier modulationfor data transmission: an idea whosetime has come,” IEEE CommunicationMagazine, vol. 28, no. 5, pp. 5-14, May1990.[2] J. Palicot and C. Roland, “FFT: a basicfunction for a reconfigurable receiver,”10th International Conference onTelecommunications, vol. 1, pp. 898-902,March 2003.[3] Min Jiang, Bing Yang, Yiling Fu, et al.,“Design Of FFT processor with LowPower complex mutliplier for OFDM-based high-speed wireless applications,”International Symposium onCommunications and InformationTechnology, vol. 2, pp. 639-641, Oct.2004.[4] Kai Zhong, Hui He, and Guangxi Zhu, “Anultra-high speed FFT processor,”International Symposium on Signals,Circuits and Systems, vol. 1, pp. 37 -40, July 2003.[5] Hongjiang He and Hui Guo, “TheRealization of FFT Algorithm based onFPGA Co-processor,” Second InternationalSymposium on IntelligentInformation Technology Application, vol.3, pp. 239-243, Dec. 2008.[6] J. G. Proakis and D. G. Manolakis,“Introduction to Digital Signal Processing,”New York: Macmillan, 1988.[7] R. B. Perlow and T. C. Denk, “Finite wordlength design for VLSI FFT processors,”Conference Record of the Thirty-FifthAsilomar Conference on Signals, Systemsand Computers, vol. 2, pp. 1227– 1231,Nov. 2001.[8] J. W. Cooky and J. W. Tukey, “Analgorithm for the machinecalculation of complex Fourier series,”Math. of Comp., vol. 19, No. 90, pp.297-301, April 1965.[9] S. Oraintara, Y. J. Chen, and T. Q. Nguyen,“Integer fast Fourier transform,” IEEETrans. Acoustics, Speech, SignalProcessing, vol. 50, No. 3, pp. 607-618,March 2002.[10] J. H. Takala, T. S. Jarvinen, and H. T.Sorokin, "A Conflict-free parallel memoryaccess scheme for FFT processors,"International Symposium on Circuits andSystems, vol. 4, pp. 524-527, May2003