FPGA Based Design of High Performance Decimator using DALUT Algorithm





ACEEE International Journal on Signal and Image Processing, Vol. 1, No. 2, July 2010
© 2010 ACEEE, DOI: 01.ijsip.01.02.02

Rajesh Mehra (1), Swapna Devi (2)
(1) National Institute of Technical Teachers' Training & Research, Chandigarh, India. Email: rajeshmehra@yahoo.com
(2) National Institute of Technical Teachers' Training & Research, Chandigarh, India. Email: swapna_devi_p@yahoo.co.in

Abstract: This paper presents a multiplier less approach to implement a high speed and area efficient decimator for the down converter of Software Defined Radios. The technique substitutes multiply-and-accumulate (MAC) operations with look up table (LUT) accesses. The proposed decimator has been implemented using the partitioned distributed arithmetic look up table (DALUT) algorithm, taking optimal advantage of the embedded LUTs of the target FPGA device. This method enhances system performance in terms of speed and area. The proposed decimator uses a half band polyphase decomposition FIR structure. The decimator has been designed with Matlab 7.6, simulated with the Modelsim 6.3XE simulator, synthesized with Xilinx Synthesis Tool (XST) 10.1 and implemented on a Spartan-3E based 3s500efg320-4 FPGA device. The proposed DALUT approach has shown an improvement of 24% in speed while saving almost 50% of the resources of the target device as compared to the MAC based approach.

Index Terms: ASIC, DALUT, FPGA, MAC, SDR

I. INTRODUCTION

The widespread use of digital representation of signals for transmission and storage has created challenges in the area of digital signal processing [1]. Digital FIR filters and up/down sampling techniques are found everywhere in modern electronic products. For every electronic product, lower circuit complexity is always an important design target since it reduces the cost [2]. There are many applications where the sampling rate must be changed. Interpolators and decimators are used to increase or decrease the sampling rate. Up samplers and down samplers are used to change the sampling rate of a digital signal in multirate DSP systems. This rate conversion produces undesired signals associated with aliasing and imaging errors, so some kind of filter should be placed to attenuate these errors [3]-[5].

Today's consumer electronics such as cellular phones and other multimedia and wireless devices often require digital signal processing (DSP) algorithms for several crucial operations [6] in order to increase speed and reduce area and power consumption. Due to a growing demand for such complex DSP applications, high performance, low-cost SoC implementations of DSP algorithms are receiving increased attention among researchers and design engineers. Although ASICs and DSP chips have been the traditional solution for high performance applications, technology and market demands are now looking for changes. On one hand, the high development costs and time-to-market factors associated with ASICs can be prohibitive for certain applications while, on the other hand, programmable DSP processors can be unable to meet the desired performance due to their sequential-execution architecture [7]. In this context, embedded FPGAs offer a very attractive solution that balances high flexibility, time-to-market, cost and performance. Therefore, in this paper, a decimator is designed and implemented on an FPGA device.

The output of an FIR filter may be expressed as:

$$ y = \sum_{k=1}^{K} C_k x_k \tag{1} $$

where $C_1, C_2, \ldots, C_K$ are fixed coefficients and $x_1, x_2, \ldots, x_K$ are the input data words. A typical digital implementation will require K multiply-and-accumulate (MAC) operations, which are expensive to compute in hardware due to logic complexity, area usage, and throughput. Alternatively, the MAC operations may be replaced by a series of look-up-table (LUT) accesses and summations. Such an implementation of the filter is known as distributed arithmetic (DA).

Digital signal processing with variable sampling rates can improve the flexibility of a software defined radio. It reduces the need for expensive anti-aliasing analog filters and enables processing of different types of signals with different sampling rates. It allows partitioning of the high-speed processing into multiple parallel lower speed processing tasks, which can lead to a significant saving in computational power and cost. Wideband receivers take advantage of multirate signal processing for efficient channelization, and it offers flexibility for symbol synchronization.

II. DECIMATORS

Typically lowpass filters are used to reduce the bandwidth of a signal prior to reducing the sampling rate. This is done to minimize aliasing due to the reduction in the sampling rate. The down sampler is the basic sampling rate alteration device used to decrease the sampling rate by an integer factor [8]. A down-sampler with a down-sampling factor M, where M is a positive integer, develops an output sequence y[n] with
a sampling rate that is (1/M)-th of that of the input sequence x[n]. The down sampler is shown in Figure 1.

Figure 1. Down Sampler

The down-sampling operation is implemented by keeping every Mth sample of x[n] and removing the M-1 in-between samples to generate y[n]. The input-output relation of the down sampler can be expressed as:

$$ y[n] = x[nM] \tag{2} $$

Applying the z-transform to the input-output relation of a factor-of-M down-sampler, we get

$$ Y(z) = \sum_{n=-\infty}^{\infty} x[Mn] \, z^{-n} \tag{3} $$

The expression on the right-hand side of Eq. (3) cannot be directly expressed in terms of X(z). To get around this problem, a new sequence $x_{int}[n]$ is defined as:

$$ x_{int}[n] = \begin{cases} x[n], & n = 0, \pm M, \pm 2M, \ldots \\ 0, & \text{otherwise} \end{cases} \tag{4} $$

Then

$$ Y(z) = \sum_{n=-\infty}^{\infty} x[Mn] \, z^{-n} = \sum_{n=-\infty}^{\infty} x_{int}[Mn] \, z^{-n} = \sum_{k=-\infty}^{\infty} x_{int}[k] \, z^{-k/M} = X_{int}(z^{1/M}) \tag{5} $$

Now, $x_{int}[n]$ can be formally related to x[n] as follows:

$$ x_{int}[n] = c[n] \cdot x[n] \tag{6} $$

where

$$ c[n] = \begin{cases} 1, & n = 0, \pm M, \pm 2M, \ldots \\ 0, & \text{otherwise} \end{cases} \tag{7} $$

A convenient representation of c[n] is given by

$$ c[n] = \frac{1}{M} \sum_{k=0}^{M-1} W_M^{kn} \tag{8} $$

where

$$ W_M = e^{-j 2\pi / M} \tag{9} $$

Taking the z-transform of Eq. (6) and making use of Eq. (8), we get

$$ X_{int}(z) = \frac{1}{M} \sum_{n=-\infty}^{\infty} \left( \sum_{k=0}^{M-1} W_M^{kn} \right) x[n] \, z^{-n} \tag{10} $$

$$ X_{int}(z) = \frac{1}{M} \sum_{k=0}^{M-1} \left( \sum_{n=-\infty}^{\infty} x[n] \, W_M^{kn} z^{-n} \right) = \frac{1}{M} \sum_{k=0}^{M-1} X\!\left(z W_M^{-k}\right) \tag{11} $$

The spectrum of a factor-of-2 down-sampler with an input x[n] is shown in Fig. 2. The DTFTs of the output and the input sequences of this down-sampler are then related as

$$ Y(e^{j\omega}) = \frac{1}{2} \left\{ X(e^{j\omega/2}) + X(-e^{j\omega/2}) \right\} \tag{12} $$

The two terms overlap, due to which the original "shape" of $X(e^{j\omega/2})$ is lost when x[n] is down-sampled. This overlap causes the aliasing that takes place due to under-sampling. There is no overlap, i.e., no aliasing, only if

$$ X(e^{j\omega}) = 0 \quad \text{for } |\omega| \ge \pi/2 \tag{13} $$

In general, aliasing is absent if and only if

$$ X(e^{j\omega}) = 0 \quad \text{for } |\omega| \ge \pi/M \tag{14} $$

To overcome the effect of aliasing, decimation filters are used. The specification for the lowpass decimation filter is given by

$$ |H(e^{j\omega})| = \begin{cases} 1, & |\omega| \le \omega_c / M \\ 0, & \pi/M \le |\omega| \le \pi \end{cases} \tag{15} $$

III. DALUT ALGORITHM

The DALUT algorithm is an efficient method for computing inner products when one of the input vectors is fixed. It uses look-up tables and accumulators instead of multipliers for computing inner products and has been widely used in many DSP applications such as DFT, DCT, convolution, and digital filters. The direct DA inner-product generation starts from Eq. (1), where $x_k$ is a 2's-complement binary number scaled such that $|x_k| < 1$. We may express each $x_k$ as

$$ x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \tag{16} $$

where the $b_{kn}$ are the bits, 0 or 1, $b_{k0}$ is the sign bit, and N is the number of bits per word. Now combining Eq. (1) and (16) in order to express y in terms of the bits of $x_k$, we see

$$ y = \sum_{k=1}^{K} C_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \tag{17} $$

Eq. (17) is the conventional form of expressing the inner product. Interchanging the order of the summations gives us:

$$ y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} C_k b_{kn} \right] 2^{-n} + \sum_{k=1}^{K} C_k (-b_{k0}) \tag{18} $$

Eq. (18) shows a DA computation where the bracketed term is given by

$$ \sum_{k=1}^{K} C_k b_{kn} \tag{19} $$

Each $b_{kn}$ can take the values 0 and 1, so Eq. (19) can have $2^K$ possible values. Rather than computing these values on line, we may pre-compute the values and store them in a ROM. The input data can be used to directly address the memory, and the addressed value is added to an accumulator. After N such cycles, the accumulator contains the result, y. As an example, let us consider K = 4, $C_1$ = 0.45, $C_2$ = -0.65, $C_3$ = 0.15, and $C_4$ = 0.55. The memory must contain all possible combinations ($2^4$ = 16 values) and their negatives in order to accommodate the term which occurs at the sign-bit time.
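The table-lookup form of Eq. (18) can be illustrated with a minimal Python sketch using the example coefficients above. This is an illustration only, not the authors' hardware design: the function names `build_lut` and `da_inner_product`, the 4-bit word length and the sample inputs are all assumptions made for the example, and the sign-bit term is handled by subtracting the looked-up value instead of storing the negated entries in memory.

```python
def build_lut(coeffs):
    """Precompute all 2**K sums of Eq. (19): entry `addr` holds the sum of
    coeffs[k] for every bit k that is set in `addr`."""
    K = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << K)]


def da_inner_product(coeffs, x_ints, nbits):
    """Bit-serial distributed-arithmetic inner product (hypothetical helper).

    x_ints are signed integers; each one represents the fraction
    x_k = x_int / 2**(nbits - 1) in nbits-wide two's complement.
    """
    lut = build_lut(coeffs)
    mask = (1 << nbits) - 1
    words = [x & mask for x in x_ints]          # two's-complement bit patterns
    acc = 0.0
    for bit in range(nbits):                    # LSB first, sign bit last
        addr = 0
        for k, w in enumerate(words):           # one bit from every input word
            addr |= ((w >> bit) & 1) << k
        if bit == nbits - 1:
            acc -= lut[addr]                    # sign-bit time: weight is -1
        else:
            acc += lut[addr] * 2.0 ** (bit - (nbits - 1))
    return acc


if __name__ == "__main__":
    C = [0.45, -0.65, 0.15, 0.55]               # coefficients from the text
    x_ints = [3, -5, 7, -8]                     # assumed 4-bit sample inputs
    direct = sum(c * x / 8 for c, x in zip(C, x_ints))
    print(da_inner_product(C, x_ints, 4), direct)
```

No multiplier appears inside the loop apart from the power-of-two weighting, which in hardware reduces to a shift in the accumulator.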
$$ \sum_{k=1}^{K} C_k (-b_{k0}) \tag{20} $$

The structure that can be used to compute these equations is shown in Fig. 6. The term $x_k$ may be written as

$$ x_k = \frac{1}{2} \left[ x_k - (-x_k) \right] \tag{21} $$

and in 2's-complement notation the negative of $x_k$ may be written as

$$ -x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)} \tag{22} $$

where the overscore symbol indicates the complement of a bit. By substituting Eq. (16) and (22) into Eq. (21), we get

$$ x_k = \frac{1}{2} \left[ -(b_{k0} - \bar{b}_{k0}) + \sum_{n=1}^{N-1} (b_{kn} - \bar{b}_{kn}) 2^{-n} - 2^{-(N-1)} \right] \tag{23} $$

In order to simplify the notation later, it is convenient to define the new variables as

$$ a_{kn} = b_{kn} - \bar{b}_{kn} \quad \text{for } n \ne 0 \tag{24} $$

and

$$ a_{k0} = \bar{b}_{k0} - b_{k0} \tag{25} $$

where the possible values of the $a_{kn}$, including n = 0, are ±1. Then Eq. (23) may be written as

$$ x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} a_{kn} 2^{-n} - 2^{-(N-1)} \right] \tag{26} $$

By substituting the value of $x_k$ from Eq. (26) into Eq. (1), we obtain

$$ y = \frac{1}{2} \sum_{k=1}^{K} C_k \left[ \sum_{n=0}^{N-1} a_{kn} 2^{-n} - 2^{-(N-1)} \right] = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0) \tag{27} $$

where

$$ Q(b_n) = \sum_{k=1}^{K} \frac{C_k}{2} a_{kn}, \qquad Q(0) = -\sum_{k=1}^{K} \frac{C_k}{2} \tag{28} $$

It may be seen that $Q(b_n)$ has only $2^{(K-1)}$ possible amplitude values, with a sign that is given by the instantaneous combination of bits. The computation of y is obtained by using a $2^{(K-1)}$ word memory, a one-word initial condition register for Q(0), and a single parallel adder/subtractor with the necessary control-logic gates.

IV. PROPOSED DECIMATOR DESIGN

An equiripple based half band polyphase decimator is designed and implemented using Matlab [9]. The proposed decimator filter has order 66 (67 coefficients), a transition width of 0.1 and a stop band attenuation of 60 dB; its magnitude response is shown in Figure 2.

Figure 2. Decimator Output (magnitude response in dB versus normalized frequency, ×π rad/sample)

Nyquist decimators provide the same stop band attenuation and transition width with a much lower order. An Lth-band Nyquist filter with L = 2 is called a half-band filter. The transfer function of a half-band filter is thus given by

$$ H(z) = \alpha + z^{-1} E_1(z^2) \tag{29} $$

with its impulse response satisfying

$$ h[2n] = \begin{cases} \alpha, & n = 0 \\ 0, & \text{otherwise} \end{cases} \tag{30} $$

In half band filters about 50% of the coefficients of h[n] are zero. This reduces the hardware requirement of the proposed decimator significantly. The first decimator design is implemented using the multiplier technique, where the 67 coefficients are processed by a MAC unit as shown in Figure 3. The second decimator design replaces the MAC unit with a LUT unit, which is the proposed multiplier less technique shown in Figure 4.

Figure 3. MAC based Multiplier Implementation

Figure 4. LUT based Multiplier Less Implementation

All 67 coefficients are divided into two parts by using polyphase decomposition. The 2-branch polyphase decomposition of an FIR decimator is shown in Figure 5 and can be expressed as:

$$ H(z) = E_0(z^2) + z^{-1} E_1(z^2) \tag{31} $$

Figure 5. Polyphase Decomposition
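The two-branch decomposition of Eq. (31) can be checked numerically. The following is a minimal Python sketch, assuming zero initial state and small integer test data; the helper names `fir`, `decimate_direct` and `decimate_polyphase` are hypothetical and are not taken from the paper or from any toolbox. It compares filtering followed by factor-of-2 down-sampling against the polyphase form, in which each branch runs at the lower output rate.

```python
def fir(h, x):
    """Direct-form FIR with zero initial state: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate_direct(h, x):
    """Filter at the input rate, then keep every second output sample."""
    return fir(h, x)[::2]


def decimate_polyphase(h, x):
    """Factor-of-2 decimator built from H(z) = E0(z^2) + z^-1 E1(z^2)."""
    e0, e1 = h[0::2], h[1::2]          # even- and odd-indexed coefficients
    x_even = x[0::2]                   # x[2m], input to branch E0
    x_odd = [0] + x[1::2]              # x[2m - 1], with x[-1] = 0
    y0 = fir(e0, x_even)
    y1 = fir(e1, x_odd[:len(x_even)])
    return [a + b for a, b in zip(y0, y1)]


if __name__ == "__main__":
    h = [1, 2, 3, 4, 5]                # assumed test coefficients
    x = list(range(1, 12))             # assumed test input
    print(decimate_direct(h, x) == decimate_polyphase(h, x))
```

For a half-band filter as in Eq. (29), the branch E0 collapses to the single coefficient α, which is where the hardware saving mentioned above comes from.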
Figure 6. Computationally Efficient Structure

The proposed computationally efficient equivalent structure is shown in Figure 6. In a DA realization of an FIR filter structure, a sequence of input data words of width W is fed through a parallel to serial shift register, producing a serialized stream of bits. The serialized data is then fed to a bit-wide shift register. This shift register serves as a delay line, storing the bit serial data samples. The delay line is tapped (based on the input word size W) to form a W-bit address that indexes into a lookup table (LUT). The LUT stores all possible sums of partial products over the filter coefficient space. The LUT is followed by a shift and adder (scaling accumulator) that adds the values obtained from the LUT sequentially. A table lookup is performed sequentially for each bit (in order of significance, starting from the LSB). On each clock cycle, the LUT result is added to the accumulated and shifted result from the previous cycle. For the last bit (MSB), the lookup table result is subtracted, accounting for the sign of the operand. This basic form of DA is fully serial, operating on one bit at a time. If the input data sequence is W bits wide, then an FIR structure takes W clock cycles to compute the output. Symmetric and asymmetric FIR structures are an exception, requiring W+1 cycles, because one additional clock cycle is needed to process the carry bit of the pre-adders.

The inherently bit serial nature of DA can limit throughput. To improve throughput, the basic DA algorithm can be modified to compute more than one bit sum at a time. The number of simultaneously computed bit sums is expressed as a power of two called the DA radix. For example, a DA radix of 2 (2^1) indicates that one bit sum is computed at a time; a DA radix of 4 (2^2) indicates that two bit sums are computed at a time, and so on. To compute more than one bit sum at a time, the LUT is replicated. For example, to perform DA on 2 bits at a time (radix 4), the odd bits are fed to one LUT and the even bits are simultaneously fed to an identical LUT. The LUT results corresponding to odd bits are left-shifted before they are added to the LUT results corresponding to even bits. This result is then fed into a scaling accumulator that shifts its feedback value by 2 places. Processing more than one bit at a time introduces a degree of parallelism into the operation, improving performance at the expense of area.

The size of the LUT grows exponentially with the order of the filter. For a filter with N coefficients, the LUT must have 2^N values. For higher order filters, the LUT size must be reduced to reasonable levels. To reduce the size in this proposed work, we can subdivide the LUT into a number of LUTs, called LUT partitions. Each LUT partition operates on a different set of taps. The results obtained from the partitions are summed. For example, for a 160 tap filter, the LUT size is (2^160)*W bits, where W is the word size of the LUT data. Dividing this into 16 LUT partitions, each taking 10 inputs (taps), the total LUT size is reduced to 16*(2^10)*W bits, a significant reduction. So in this proposed design the 67 coefficients are divided into two sections with 34 and 33 coefficients respectively to perform polyphase decomposition. Then the 34 coefficients of one part have been processed by using (6 6 6 6 6 4) DALUT partitioning to limit the size of the LUTs. This multiplier less DALUT technique consists of input registers, a 4-input LUT unit and a shifter/accumulator unit.

V. IMPLEMENTATION RESULTS & DISCUSSION

The multiplier based and multiplier less decimators are implemented and synthesized on the Spartan-3E based 3s500efg320-4 target device. The Modelsim based simulated output of the proposed decimator with 16 bit precision is shown in Figure 7.

Figure 7. Simulated Decimator Output

Table 1 shows the area and speed comparison of both techniques. The proposed DA based design shows a 24% enhancement in speed while saving almost 50% of the resources as compared to the MAC based design.

Table 1. Resource Utilization

Logic Utilization    Multiplier Approach       Multiplier Less Approach
# of Slices          1055 out of 4656 (22%)    472 out of 4656 (10%)
# of Flip Flops      1210 out of 9312 (12%)    515 out of 9312 (5%)
# of LUTs            857 out of 9312 (9%)      590 out of 9312 (6%)
# of Multipliers     1 out of 20 (5%)          0 out of 20 (0%)
Speed (MHz)          49.574                    61.215
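Returning to the LUT partitioning described in Section IV, the size arithmetic and the summing of partition outputs can be sketched in Python as follows. This is a sketch under stated assumptions, not the synthesized design: the function names are hypothetical, the integer coefficients in the demo are placeholders for the real filter taps, and table sizes are counted in words (multiply by the LUT word size W to get bits).

```python
def lut_words(partition_sizes):
    """Total number of LUT words when taps are split into the given partitions."""
    return sum(2 ** size for size in partition_sizes)


def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every bit k set in `addr`."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def build_partitioned_luts(coeffs, sizes):
    """One small LUT per partition, each covering a different set of taps."""
    assert sum(sizes) == len(coeffs)
    luts, start = [], 0
    for size in sizes:
        luts.append(build_lut(coeffs[start:start + size]))
        start += size
    return luts


def partitioned_lookup(luts, sizes, bits):
    """Sum the partition outputs for one bit slice (one bit per tap)."""
    total, start = 0, 0
    for lut, size in zip(luts, sizes):
        addr = 0
        for k in range(size):
            addr |= bits[start + k] << k
        total += lut[addr]
        start += size
    return total


if __name__ == "__main__":
    sizes = [6, 6, 6, 6, 6, 4]            # partitioning quoted in the text
    coeffs = list(range(1, 35))           # placeholder taps, 34 of them
    bits = [k % 2 for k in range(34)]     # assumed bit slice
    luts = build_partitioned_luts(coeffs, sizes)
    print(lut_words(sizes), 2 ** 34, partitioned_lookup(luts, sizes, bits))
```

With the (6 6 6 6 6 4) split, the 34-tap branch needs 5*64 + 16 = 336 words instead of 2^34, at the cost of an adder tree that sums the six partition outputs.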
Figure 8. Resource Comparison (bar chart of the resources used by the Multiplier and Multiplier Less designs)

The resource comparison of the multiplier and multiplier less techniques is shown in Figure 8. The multiplier approach has consumed 9-22% of the resources as compared to 5-10% for the multiplier less approach, due to efficient LUT partitioning using the proposed DALUT algorithm.

CONCLUSION

In this paper, an optimized half band polyphase decomposition technique has been presented to implement the decimator for wireless applications. The DA algorithm has been used to further enhance the speed and area utilization of the proposed design by taking optimal advantage of the look up table structure of the target FPGA. The proposed multiplier less approach has shown an improvement of 24% in speed while saving almost 50% of the resources of the target device as compared to the multiplier based approach. The proposed design is therefore a cost effective solution for the down converter section of Software Defined Radios.

REFERENCES

[1] Vijay Sundararajan, Keshab K. Parhi, "Synthesis of Minimum-Area Folded Architectures for Rectangular Multidimensional", IEEE Transactions on Signal Processing, vol. 51, no. 7, pp. 1954-1965, July 2003.
[2] ShyhJye Jou, Kai-Yuan Jheng, Hsiao-Yun Chen and An-Yeu Wu, "Multiplierless Multirate Decimator/Interpolator Module Generator", IEEE Asia-Pacific Conference on Advanced System Integrated Circuits, pp. 58-61, Aug. 2004.
[3] Amir Beygi, Ali Mohammadi, Adib Abrishamifar, "An FPGA-Based Irrational Decimator for Digital Receivers", in 9th IEEE International Symposium on Signal Processing and its Applications (ISSPA), pp. 1-4, 2007.
[4] Zhao Yiqiang, Xing Dongyang, Zhao Hongliang, "Optimized Design of Digital Filter in Sigma-Delta A/D Converter", International Conference on Neural Networks and Signal Processing, pp. 502-505, 2008.
[5] S.B. Nerurkar, K.H. Abed, "Low-Power Decimator Design Using Approximated Linear-Phase N-Band IIR Filter", IEEE Trans. on Signal Processing, vol. 54, pp. 1550-1553, 2006.
[6] D.J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. Anderson, "A Novel High Performance Distributed Arithmetic Adaptive Filter Implementation on an FPGA", in Proc. IEEE Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), vol. 5, pp. 161-164, 2004.
[7] Patrick Longa and Ali Miri, "Area-Efficient FIR Filter Design on FPGAs using Distributed Arithmetic", IEEE International Symposium on Signal Processing and Information Technology, pp. 248-252, 2006.
[8] S.K. Mitra, Digital Signal Processing, Tata McGraw Hill, Third Edition, 2006.
[9] Mathworks, "Users Guide Filter Design Toolbox", March 2007.