A Modified Radix-2^4
SDF Pipelined OFDM Module
for FPGA based MB-OFDM UWB Systems
M.Santhi, S.Arun Kumar, G.S.Praveen Kalish, K.Murali, S.Siddharth, G.Lakshminarayanan
Department of ECE, National Institute of Technology, Thiruchirapalli.
santhiphd@gmail.com laksh@nitt.edu
Abstract - The OFDM module in the MB-OFDM UWB transmitter must
operate at 528 MHz. This is a challenging task because the OFDM
module has to calculate a 128-point IFFT. Earlier papers used the
radix-2^4 SDF algorithm with parallel processing architectures of
block size two to achieve the required speed and implemented the
module on ASIC. In this paper a novel scheme, the "modified
radix-2^4 SDF algorithm", is proposed for calculating the 128-point
IFFT. In the proposed scheme, the order of the twiddle factor
sequence differs from that of the earlier radix-2^4 SDF algorithm.
The change in twiddle factor sequence makes the CSD multiplier used
for the IFFT calculation easier to implement. It is also proposed
that the required speed can be achieved on an FPGA itself without
using parallel processing architectures, by pipelining the OFDM
module and using LPMs. This leads to a reduction in area compared
to the earlier approach of using parallel processing architectures
of block size two. To improve accuracy, the internal wordlength in
the proposed scheme is maintained at 13 bits, which is 7 bits more
than the input, to account for the overflows at each of the 7
stages of the OFDM module. The proposed scheme, with increased
complexity for better accuracy, is tested on an ALTERA Stratix III
EP3SL50F484C2 device. From the implementation, it is verified that
the OFDM module achieves a maximum speed of 528 MSamples/s. Since
ASICs are in general three times faster than FPGAs, operating an
ASIC-based OFDM module at 528 MHz with the proposed modified
radix-2^4 SDF pipelined algorithm should be considerably easier.
Keywords - MB-OFDM, SDF, FFT, FPGA.
I. INTRODUCTION
Ultra wideband (UWB) communication systems, which
enable the delivery of data from a rate of 110 Mb/s at a
distance of 10 m to a rate of 480 Mb/s at a distance of 2 m, are
ideally suited to application in short range wireless
communications because they can share a frequency band with
existing narrowband systems and offer a higher data rate than
802.11 or Bluetooth [1]. One of the communication methods
for IEEE 802.15.3a standard is Multiband Orthogonal
Frequency Division Multiplexing (MB-OFDM), which offers
528 MHz bandwidth [2][3]. MB-OFDM-based UWB not only
has reliably high-data-rate transmission in time-dispersive or
frequency-selective channels without having complex time-
domain channel equalizers but also can provide high-spectral
efficiency.
The FFT/IFFT processor is one of the modules having high
computational complexity in the physical layer of the UWB
system, and the execution time of the 128-point FFT/IFFT in
UWB system is only 312.5 ns. Power consumption and hardware
cost are reduced in our processor by using a higher-radix FFT
algorithm, less memory, and fewer complex multipliers.
This paper is organized as follows. Section II describes the
design issues of MB-OFDM UWB communication systems.
Section III describes the proposed 128-point radix-2^4
FFT/IFFT algorithm. Section IV describes the proposed
128-point radix-2^4 FFT/IFFT architecture. In Section V, the
implementation and performance of the proposed FFT/IFFT
architecture are discussed. Conclusions and further work are
presented in Sections VI and VII respectively.
II. DESIGN ISSUES OF THE FFT PROCESSOR
A block diagram of the proposed physical layer of the OFDM-
based UWB system is shown in Fig. 1 [4]. In the UWB system,
the data rate ranges from 53.3 Mb/s to 480 Mb/s with code rates
of 1/3, 11/32, 1/2, 5/8, and 3/4. The bandwidth of the transmitted
signal is 528 MHz and the OFDM symbol duration is 312.5
ns, including 60.61 ns for cyclic prefix duration and 9.47 ns
for guard interval duration [2][3]. Thus, an FFT/IFFT has to
compute one OFDM symbol within 312.5 ns, and the
throughput rate this specification requires of the 128-point
FFT/IFFT is up to 409.6 MSamples/s.
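The throughput figure follows directly from the numbers quoted above. The short Python sketch below reproduces the arithmetic; the function and constant names are illustrative only and do not come from the paper.

```python
# Timing figures quoted in Section II (all taken from the text above).
FFT_POINTS = 128            # samples processed by the FFT/IFFT per OFDM symbol
SYMBOL_DURATION_NS = 312.5  # full OFDM symbol, including prefix and guard
SAMPLE_RATE_MHZ = 528.0     # signal bandwidth / sampling clock


def required_throughput_msps(points, duration_ns):
    """MSamples/s needed to process `points` samples within `duration_ns`."""
    return points / duration_ns * 1e3  # samples per ns -> MSamples/s


def sample_periods(duration_ns, rate_mhz):
    """Number of sample periods at `rate_mhz` that fit in `duration_ns`."""
    return duration_ns * rate_mhz / 1e3


throughput = required_throughput_msps(FFT_POINTS, SYMBOL_DURATION_NS)
periods = sample_periods(SYMBOL_DURATION_NS, SAMPLE_RATE_MHZ)
print(f"required throughput: {throughput:.1f} MSamples/s")
print(f"sample periods per symbol at 528 MHz: {periods:.0f}")
```

A processor that accepts one sample per clock at 528 MHz therefore exceeds the 409.6 MSamples/s minimum.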
Various FFT architectures, such as single-memory
architecture, dual-memory architecture, pipelined architecture,
array architecture, and cached-memory architecture, have been
proposed in the last three decades. In our view, the pipelined
architecture should be the best choice for UWB systems since
it can provide high throughput rate with acceptable hardware
cost.
The pipelined FFT architecture typically falls into one of the
two following categories: multipath delay commutator (MDC)
and single-path delay feedback (SDF) [5]. In general, the MDC
scheme can achieve a higher throughput rate, while the SDF
scheme needs less memory and hardware cost. In addition, a
higher radix FFT algorithm is difficult to implement in
the traditional MDC architecture. Table 1 compares the
hardware requirements for various architectures. The proposed
architecture, based on the radix-2^4 SDF architecture, was
selected for implementation owing to its low hardware cost and
greater area efficiency; it also provides a throughput rate
sufficient to meet the UWB specifications.
Proceedings of the 2008 International Conference on Computing, Communication and Networking (ICCCN 2008)
978-1-4244-3595-1/08/$25.00 ©2008 IEEE
Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on May 14, 2009 at 02:04 from IEEE Xplore. Restrictions apply.
Fig. 1. Block diagram of the MB-OFDM UWB receiver system
TABLE 1 COMPARISON OF HARDWARE REQUIREMENTS FOR N-LENGTH FFT
WITH DIFFERENT ARCHITECTURES
Architecture   Complex multiplier #   Complex adder #   Memory size   Control circuit
R2SDF          log2(N)-2              log2(N)           N-1           simple
R2MDC          log2(N)-2              4log4(N)          3N/2-2        simple
R4SDF          log4(N)-1              log4(N)           N-1           medium
R4MDC          3(log4(N)-1)           8log4(N)          5N/2-4        simple
R2^2SDF        log4(N)-1              4log4(N)          N-1           simple
R2^3SDF        log8(N)-1              4log4(N)          N-1           simple
R2^4SDF        log16(N)-1             4log4(N)          N-1           simple
III. PROPOSED RADIX-2^4 SDF ALGORITHM
A Discrete Fourier transform (DFT) of length N (=128)
is defined as
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$
where W_N, the so-called "twiddle factor", denotes the N-th
primitive root of unity, with its exponent evaluated modulo N.
Here k is the frequency index and n is the time index. In
order to derive the radix-2^4 algorithm, consider the first 4
steps of decomposition [6]. Applying a 5-dimensional linear
index map, wherein the 5th dimension is itself decomposed
into a 2-bit and a 1-bit index, we have
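$$n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + \tfrac{N}{8} n_3 + \tfrac{N}{16} n_4 + \tfrac{N}{64} n_5 + n_6 \right\rangle_N$$
$$k = \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 + 32k_5 + 16k_6 \right\rangle_N \tag{2}$$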
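As a quick sanity check of the index map in Eq. (2), the Python sketch below enumerates every digit combination and confirms that both maps cover 0..127 exactly once. Two things are assumptions rather than statements from the paper: the weights are read as N/64 for n5 and 32, 16 for k5, k6, and the digit ranges are taken as 1-bit for indices 1 to 4 and 6 and 2-bit for index 5.

```python
from itertools import product

N = 128

# Assumed digit ranges (not stated explicitly in the text): indices 1..4
# are 1-bit; of the last pair, index 5 is taken as the 2-bit digit and
# index 6 as the 1-bit digit.
DIGIT_RANGES = [range(2)] * 4 + [range(4), range(2)]


def time_index(n1, n2, n3, n4, n5, n6):
    """Time index n of Eq. (2), evaluated modulo N."""
    return (N // 2 * n1 + N // 4 * n2 + N // 8 * n3
            + N // 16 * n4 + N // 64 * n5 + n6) % N


def freq_index(k1, k2, k3, k4, k5, k6):
    """Frequency index k of Eq. (2), evaluated modulo N."""
    return (k1 + 2 * k2 + 4 * k3 + 8 * k4 + 32 * k5 + 16 * k6) % N


time_indices = sorted(time_index(*d) for d in product(*DIGIT_RANGES))
freq_indices = sorted(freq_index(*d) for d in product(*DIGIT_RANGES))
print(time_indices == list(range(N)), freq_indices == list(range(N)))
```

Under these assumptions both maps are one-to-one, which is what the decomposition needs in order to visit every input and output sample exactly once.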
The common factor algorithm (CFA) takes the form of
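$$X(k_1 + 2k_2 + 4k_3 + 8k_4 + 32k_5 + 16k_6)$$
$$= \sum_{n_6} \sum_{n_5} \sum_{n_4=0}^{1} \sum_{n_3=0}^{1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + \tfrac{N}{8} n_3 + \tfrac{N}{16} n_4 + \tfrac{N}{64} n_5 + n_6\right) W_N^{nk}$$
$$= \sum_{n_6} \sum_{n_5} \left[ G(n_5, n_6, k_1, k_2, k_3, k_4)\, W_N^{\left(\frac{N}{64} n_5 + n_6\right)(k_1 + 2k_2 + 4k_3 + 8k_4)} \right] W_N^{\left(\frac{N}{64} n_5 + n_6\right)(32k_5 + 16k_6)} \tag{3}$$
[Equations (4) and (5) are illegible in the source.]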
Where H(n) denotes the second butterfly unit
$$H(n) = H(n, k_1, k_2) = B(n, k_1) + (-j)^{(k_1 + 2k_2)}\, B\left(n + \tfrac{N}{4}, k_1\right)$$
Where B(n, k1) denotes the first butterfly unit as follows.
$$B(n, k_1) = x(n) + (-1)^{k_1}\, x\left(n + \tfrac{N}{2}\right)$$
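The role of the two butterfly units can be checked numerically. The sketch below is only an illustration and not the authors' implementation: it assumes the forward-DFT convention W_N = exp(-j2π/N), applies B and H as defined above, multiplies by the twiddle factor W_N^{n(k1+2k2)}, and hands the remaining N/4-point transform to NumPy. The helper names are hypothetical.

```python
import numpy as np


def butterfly_b(x, n, k1):
    """First butterfly: B(n, k1) = x(n) + (-1)^k1 * x(n + N/2)."""
    N = len(x)
    return x[n] + (-1) ** k1 * x[n + N // 2]


def butterfly_h(x, n, k1, k2):
    """Second butterfly: H = B(n, k1) + (-j)^(k1 + 2*k2) * B(n + N/4, k1)."""
    N = len(x)
    return (butterfly_b(x, n, k1)
            + (-1j) ** (k1 + 2 * k2) * butterfly_b(x, n + N // 4, k1))


def dft_from_two_butterflies(x):
    """DFT built from the first two butterfly stages plus N/4-point DFTs."""
    N = len(x)
    n = np.arange(N // 4)
    X = np.zeros(N, dtype=complex)
    for k1 in (0, 1):
        for k2 in (0, 1):
            # Assumed convention: W_N = exp(-j*2*pi/N).
            twiddle = np.exp(-2j * np.pi * n * (k1 + 2 * k2) / N)
            rest = np.fft.fft(butterfly_h(x, n, k1, k2) * twiddle)
            X[k1 + 2 * k2 + 4 * np.arange(N // 4)] = rest
    return X


rng = np.random.default_rng(0)
x = rng.standard_normal(128) + 1j * rng.standard_normal(128)
matches = np.allclose(dft_from_two_butterflies(x), np.fft.fft(x))
print(matches)
```

The only non-trivial multiplication left after the two butterflies is the twiddle factor, which is why the second butterfly can absorb the -j rotation without a multiplier.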
The algorithm can use complex constant multipliers in place of
programmable complex multipliers. The Canonic Signed Digit
(CSD) representation of a constant contains the fewest number
of non-zero digits, so a CSD constant multiplier can be used to
reduce the area and power consumption [7]. Fig. 2 shows the
signal flow graph (SFG) of the 128-point radix-2^4 SDF FFT
algorithm.
Fig 2. Signal flow graph of the proposed R2^4SDF algorithm
IV. PROPOSED FFT ARCHITECTURE FOR THE MB-OFDM
UWB SYSTEM
A block diagram of the proposed single data-path 128-point
R2^4SDF FFT/IFFT processor is shown in Fig. 3. The
proposed architecture consists of a memory block, butterfly
units (BF1, BF2), programmable complex multipliers, CSD
complex constant multipliers, register files, and some
multiplexers. The FFT processor can be transformed to an
IFFT block by performing the operation shown in Fig. 4.
The outputs of the butterfly units are the complex sum
and complex difference of two input data x[n] and x[N/2+n],
where N=128.
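One widely used way to reuse an FFT datapath for the IFFT is to swap the real and imaginary parts at the input and output and scale by 1/N. Whether Fig. 4 uses exactly this arrangement cannot be confirmed from the text, so the Python sketch below is an assumption offered only to illustrate the idea; the function names are hypothetical.

```python
import numpy as np


def swap_re_im(z):
    """Exchange the real and imaginary parts of every sample."""
    return z.imag + 1j * z.real


def ifft_via_fft(X):
    """IFFT computed with a forward FFT, two swaps, and a 1/N scaling."""
    return swap_re_im(np.fft.fft(swap_re_im(X))) / len(X)


rng = np.random.default_rng(1)
X = rng.standard_normal(128) + 1j * rng.standard_normal(128)
same = np.allclose(ifft_via_fft(X), np.fft.ifft(X))
print(same)
```

The swap is free in hardware (it is only wiring), and for N = 128 the 1/N scaling is a 7-bit right shift.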
Due to the spatial regularity of the radix-2^4 algorithm, the
synchronization control of the processor is very simple. A
(log2 N)-bit binary counter serves two purposes:
synchronization controller and address counter for twiddle
factor reading in each stage. For the first N/2 cycles, the 2-to-1
multiplexers in butterfly module 1 (as shown in Fig. 5.i)
switch to position "0", and the butterfly is idle. The input data
from the left is directed to the shift registers until they are
filled. In the next N/2 cycles, the multiplexers turn to position
"1", and the butterfly computes a 2-point DFT with the incoming
data and the data stored in the shift registers.
$$Z_1(n) = x(n) + x(n + N/2), \qquad 0 \le n < N/2$$
$$Z_1(n + N/2) = x(n) - x(n + N/2) \tag{6}$$
The butterfly output Z1(n) is sent on to the twiddle factor
multiplication, and Z1(n + N/2) is sent back to the shift
registers to be "multiplied" in the following N/2 cycles, while
the first half of the next frame of the time sequence is loaded
in. The operation of the second butterfly is similar to that of
the first one, except that the "distance" of the butterfly input
sequence is just N/4 and the trivial twiddle factor
multiplication is implemented by real-imaginary swapping with a
commutator and controlled add/subtract operations, as in
Fig. 5.ii, which requires a two-bit control signal from the
synchronizing counter. The data then goes through a full complex
multiplier, working at 75% utility, which accomplishes the
result of the first level of radix-4 DFT word by word. Further
processing repeats this pattern, with the distance of the input
data decreasing by half at each consecutive butterfly stage.
After N-1 clock cycles, the complete DFT transform result
streams out to the right, in bit-reversed order. The next frame
of transform can be computed without pausing due to the
pipelined processing of each stage.
Fig 3. Block diagram of FFT/IFFT processor (legend: programmable complex multiplier, CSD complex multiplier, data path; horizontal axis: time)
Fig 4. Block diagram of the proposed 128-point R2^4SDF FFT/IFFT processor
Fig 5.i Structure of BF1
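The delay-feedback behaviour described above can be modelled cycle by cycle. The Python sketch below is a simplified behavioural model, not the authors' Verilog: it assumes a zero-initialised shift register, handles plain numbers rather than fixed-point complex words, and leaves out the twiddle multiplication that follows the butterfly. The function name is hypothetical.

```python
def sdf_butterfly_stage(samples, depth):
    """Behavioural model of one SDF butterfly with a `depth`-word shift register.

    For `depth` cycles the multiplexers are in position 0: the input is
    shifted in and the word leaving the register is passed to the output.
    For the next `depth` cycles they are in position 1: the butterfly
    outputs (stored + input) and feeds (stored - input) back into the
    register, from where it emerges `depth` cycles later.
    """
    shift_register = [0] * depth
    output = []
    for cycle, incoming in enumerate(samples):
        stored = shift_register[0]
        if (cycle // depth) % 2 == 0:
            # Position 0: butterfly idle, fill the register.
            output.append(stored)
            feedback = incoming
        else:
            # Position 1: 2-point DFT of stored and incoming data.
            output.append(stored + incoming)
            feedback = stored - incoming
        shift_register = shift_register[1:] + [feedback]
    return output


# One 8-sample frame followed by 4 idle cycles to flush the differences.
frame = list(range(8)) + [0] * 4
result = sdf_butterfly_stage(frame, depth=4)
print(result[4:8], result[8:12])
```

With the frame 0..7 and a depth of N/2 = 4, the sums x(n) + x(n+4) appear during cycles 4 to 7 and the differences x(n) - x(n+4) during cycles 8 to 11, while the next frame could already be loading.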
Single-data-path architectures based on the radix-2^4 FFT
algorithm have fewer multipliers than those based on lower
radix FFT algorithms. For example, the radix-2^4 algorithm has
the same number of multipliers as the radix-2^2 algorithm but
reduces the multiplicative complexity by replacing half of the
full complex multipliers with trivial constant multipliers [8].
In the CSD complex constant multiplier, the multiplication by
the twiddle factors is processed according to their scheduling
in the signal flow graph. The output data generated by the BF
in the sixth stage are multiplied by a trivial twiddle factor,
-j, W(16) or W(48), before they are fed to the last stage.
The Simplification of the Complex Multiplication
Complex multiplication is the key design element in the FFT
algorithm. Consider a complex multiplication whose two
inputs are the data xr + j xi and the coefficient W =
exp(j2π/N) = cos a + j sin a; the result can be expressed as
Y = yr + j yi, where
$$y_r = x_r \cos a - x_i \sin a = x_i(\cos a - \sin a) - (x_i - x_r)\cos a$$
$$y_i = x_i \cos a + x_r \sin a = x_r(\cos a + \sin a) + (x_i - x_r)\cos a \tag{7}$$
Fig 5.ii Structure of BF2
After the transformation in Eq. (7), the complex multiplication
needs only 3 real multiplications, 1 addition and 2 subtractions
when the sum and the difference of the real and imaginary parts
of the coefficient are precomputed and stored in the ROM. This
algorithm is used for the programmable complex multiplier to
reduce the hardware complexity and to increase the speed.
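The three-multiplication scheme can be sketched in a few lines of Python. The sketch uses the identities yr = xi(cos a - sin a) - (xi - xr)cos a and yi = xr(cos a + sin a) + (xi - xr)cos a, with the three coefficient-dependent constants precomputed as they would be in the ROM. Floating point stands in for the fixed-point hardware, and the function names are illustrative assumptions.

```python
import cmath
import math


def twiddle_rom_entry(a):
    """Constants stored per coefficient W = cos a + j sin a."""
    c, s = math.cos(a), math.sin(a)
    return c, c - s, c + s


def complex_mult_3(xr, xi, rom_entry):
    """(xr + j*xi) * (cos a + j*sin a) using three real multiplications."""
    c, c_minus_s, c_plus_s = rom_entry
    shared = (xi - xr) * c        # 1 subtraction, multiplication 1
    yr = xi * c_minus_s - shared  # multiplication 2, 1 subtraction
    yi = xr * c_plus_s + shared   # multiplication 3, 1 addition
    return yr, yi


angle = 2 * math.pi * 5 / 128
yr, yi = complex_mult_3(3.0, -7.0, twiddle_rom_entry(angle))
reference = complex(3.0, -7.0) * cmath.exp(1j * angle)
print(abs(complex(yr, yi) - reference) < 1e-12)
```

A direct complex product needs four real multiplications and two additions, so the scheme trades one multiplier for one extra adder and a wider ROM word.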
CSD Multiplier
Since the twiddle factors in the FFT processor are known in
advance, we propose the use of a multiplier-less architecture
to perform the multiplication with the twiddle factors using
shift-and-add operations. The canonic signed digit (CSD)
algorithm has been applied to this architecture to further
reduce the number of shift-and-add operations required. In this
architecture trivial multiplications are implemented without
any multipliers by passing the data through, swapping the real
and imaginary parts of the complex data, or changing a sign.
The design presented in the paper takes advantage of the
symmetries of the twiddle factors in the complex plane.
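To make the shift-and-add idea concrete, the Python sketch below recodes an integer constant into canonic signed digits and multiplies by it using only shifts, additions and subtractions. It is a generic textbook recoding rather than the paper's circuit; the 8-bit coefficient precision and all names are hypothetical.

```python
def to_csd(constant):
    """Signed digits in {-1, 0, 1}, least significant first, no two
    adjacent digits non-zero (canonic signed digit form)."""
    digits = []
    while constant != 0:
        if constant & 1:
            digit = 2 - (constant & 3)  # +1 if constant mod 4 == 1, else -1
            constant -= digit
        else:
            digit = 0
        digits.append(digit)
        constant >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply integer x by the constant encoded in `digits`."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


FRACTION_BITS = 8  # hypothetical coefficient precision
coeff = round(0.7071067811865476 * (1 << FRACTION_BITS))  # cos(pi/4), fixed point
digits = to_csd(coeff)
product = csd_multiply(25, digits)
print(digits, product == 25 * coeff)
```

Each non-zero digit costs one adder or subtractor in hardware, so fewer non-zero digits means a smaller constant multiplier.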
When the real and imaginary values of a twiddle factor are the
same, two CSD constant multipliers and two adder/subtractors
are used to generate the output. When the real and imaginary
values are not the same, three CSD constant multipliers are
used. If the inputs do not need to be multiplied by a twiddle
factor, the output results are generated from the input directly.
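The two-multiplier case can be illustrated with the twiddle factor W(16). Assuming the convention W(m) = exp(-j2πm/128), W(16) = (1 - j)/√2, so its real and imaginary parts share one magnitude. The Python sketch below, with hypothetical names and floating point in place of CSD hardware, shows that two constant multiplications and two add/subtract operations suffice.

```python
import cmath
import math

C = math.sqrt(0.5)  # common magnitude of Re and Im of W(16) = exp(-j*pi/4)


def mult_w16(xr, xi):
    """(xr + j*xi) * (C - j*C): one addition, one subtraction,
    and two multiplications by the same constant C."""
    return C * (xr + xi), C * (xi - xr)


yr16, yi16 = mult_w16(5.0, -2.0)
expected = complex(5.0, -2.0) * cmath.exp(-2j * math.pi * 16 / 128)
print(abs(complex(yr16, yi16) - expected) < 1e-12)
```

W(48) = (-1 - j)/√2 differs only in the signs applied to the same two products.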
Pipelining
The radix-2^4 architecture was thoroughly analyzed to find
possible areas to be pipelined based on the design and the
critical path delays between various implemented blocks. The
processor was extensively pipelined to achieve the high
working frequency needed to meet the UWB specification.
Shimming registers are also needed for the control signals to
comply with the revised timing.
V. IMPLEMENTATION AND PERFORMANCE
The external data word length of the proposed FFT/IFFT is 6 bits
[9] for both the real and imaginary parts. The 2's
complement representation of numbers is used in the
processor. Because of overflow in each adder of the butterfly
units, 13-bit internal FFT precision is maintained. The
chosen word length keeps the quantization noise low while
limiting the hardware complexity.
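The 13-bit figure can be reproduced with a worst-case range calculation. The Python sketch below assumes that each of the 7 butterfly stages performs one addition or subtraction of two values from the previous stage and that the twiddle multiplications add no further growth; both are simplifications made for illustration, and the function names are hypothetical.

```python
def signed_bits_needed(lo, hi):
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def growth_through_stages(input_bits, stages):
    """Word length required after each butterfly stage, worst case."""
    lo = -(1 << (input_bits - 1))
    hi = (1 << (input_bits - 1)) - 1
    widths = []
    for _ in range(stages):
        # A butterfly produces a + b and a - b for a, b in [lo, hi].
        lo, hi = min(lo + lo, lo - hi), max(hi + hi, hi - lo)
        widths.append(signed_bits_needed(lo, hi))
    return widths


widths = growth_through_stages(input_bits=6, stages=7)
print(widths)
```

Each stage adds one bit, so a 6-bit input needs 6 + 7 = 13 bits after the seventh stage if no result is ever to overflow.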
After the appropriate word length of the proposed FFT/IFFT
processor was chosen, the architecture of the processor was
modeled in Verilog for an ALTERA Stratix III FPGA. Some of
the modules were generated with the ALTERA MegaWizard
Plug-in Manager and others were written at the RTL level,
including the top level wrapper file, which contains all the
instantiated modules and the connectivity information in RTL
(Verilog HDL). The TimeQuest timing analyzer and Chip
Planner (Floorplan and Chip Editor) of QUARTUS II 8.0 were
used to analyze timing, hardware expenditure and so on.
Vector waveforms associated with the RTL description were
created and the stimulus provided in an external file. Using the
vector waveform file, simulations were carried out for the
design to validate the behavioral description. The results were
obtained incrementally, first for a sub-block comprising one
module of the FFT. Finally the results were obtained for the
whole design comprising seven such sub-blocks, a global
clock and dual-port RAMs. The output of the Verilog-coded
architecture agreed with the output data of MATLAB and the
FFT/IFFT in our UWB platform, which was designed on an
EXCEL worksheet that clearly depicts the outputs with the
signal flow graph.
TABLE 2 IMPLEMENTATION RESULTS OF THE PROPOSED PROCESSOR
Family                               ALTERA Stratix III     ALTERA Stratix II
Device                               EP3SL50F484C2          EP2S60F1020C4
ALUTs                                7972/38000 (3%)        7822/48352 (16%)
ALMs                                 3986/19000 (3%)        4375/19000 (3%)
DSP block elements                   6/216 (<3%)            6/288 (2%)
Total memory bits                    3328/1880064 (<1%)     8192/2544192 (<1%)
Word length                          I: 6 bits, Q: 6 bits   I: 6 bits, Q: 6 bits
Number of registers                  7580/38000 (20%)       7697/38000 (20%)
Programmable complex multipliers #   1                      1
Constant complex multipliers #       2                      2
Number of complex adders             28                     28
Clock rate                           528 MHz                350 MHz
Throughput rate                      528 MSamples/s         350 MSamples/s
Critical path delay                  1.87 ns                2.87 ns
The implementation of the proposed FFT/IFFT processor
was carried out on a Stratix II EP2S60F1020C4 device and
simulated for ALTERA Stratix III EP3SL50F484C2. The
input data is supplied through a dual-port RAM and a PLL unit
is used to generate the required clock frequency. The output is
checked using a dual-port RAM and the in-system memory
content editor. Table 2 shows the performance and resource
usage of the implemented processor. This shows the processor
is area efficient, so the entire MB-OFDM receiver/transmitter
with the other modules can be accommodated in a
single chip. It has a significantly reduced number of complex
multiplications and complex additions. The critical path delay
occurs between the input RAM and the first butterfly unit, and
so the processor is capable of running at UWB speeds if
implemented within a larger system.
All the previous implementations were on ASIC [9], and
so comparison with them is not meaningful. Table 3 compares
the performance of different FFT processors
implemented on FPGA. The validity and efficiency of the
proposed architecture have been verified by extensive
simulation and implementation. Fig 6 shows the
implementation results of the proposed FFT/IFFT processor.
TABLE 3 COMPARISON OF THE PERFORMANCE OF DIFFERENT PROCESSORS
Family                                                   Frequency max
Altera FFT Megacore function on Stratix III [10]         456 MHz
Proposed processor on ALTERA Stratix II EP2S60F1020C4    350 MHz
Proposed processor on ALTERA Stratix III EP3SL50F484C2   528 MHz
VI. CONCLUSION
An OFDM module, implemented as a 128-point FFT/IFFT
processor for an FPGA-based MB-OFDM UWB system using
the proposed modified radix-2^4 SDF pipelined algorithm, has
been successfully implemented on ALTERA STRATIX III
and STRATIX II FPGAs without using parallel processing
architectures. The high speed is achieved by extensive
pipelining and by using Altera's LPMs. The hardware cost of
memory and complex multipliers is reduced by adopting delay
feedback and data scheduling approaches. In addition, the
number of complex multiplications is reduced effectively by
using a higher radix algorithm and CSD complex multipliers.
To improve accuracy, the internal wordlength in the proposed
scheme is maintained at 13 bits, which is 7 bits more than the
input, to account for the overflows at each of the 7 stages of
the OFDM module. The implementation results show that the
throughput rate is 350 MSamples/s at 350 MHz on the ALTERA
STRATIX II device and 528 MSamples/s at 528 MHz on the
ALTERA STRATIX III device. The high throughput rate of the
OFDM module, with the internal wordlength increased from 6 bits
to 13 bits to improve accuracy, meets the MB-OFDM UWB
system's specifications.
Fig 6. Results of the implemented processor
VII. REFERENCES
[1] Time Domain, "UWB Applications, Demonstration & Regulatory
Update," Sept 2001 workshop, March 20, 2001.
[2] A. Batra et al., "Multi-band OFDM Physical Layer Proposal for IEEE
802.15 Task Group 3a," IEEE P802.15-03/268r3, March 2004.
[3] A. Batra, J. Balakrishnan, G. R. Aiello, J. R. Foerster, A. Dabak, "Design
of Multiband OFDM System for Realistic UWB Channel Environment,"
IEEE Trans. on Microwave Theory and Techniques, vol. 52, no. 9, pp.
2123-2138, Sept. 2004.
[4] Y-W. Lin, H-Y. Liu, and C-Y. Lee, "A 1-GS/s FFT/IFFT processor for
UWB applications," IEEE Journal of Solid-State Circuits, vol. 40, no. 8,
pp. 1726-1735, August 2005.
[5] S. He and M. Torkelson, "Designing pipeline FFT processor for
OFDM (de)modulation," in Proc. URSI Int. Symp. Signals, Systems,
and Electronics, vol. 29, Oct. 1998, pp. 257-262.
[6] J. Lee, H. Lee, S-I. Cho, S-S. Choi, "A High-Speed, Low-Complexity
Radix-2^4 FFT Processor for MB-OFDM UWB Systems," IEEE Inter.
Symp. on Circuits and Systems, pp. 4719-4722,
[7] S-M. Kim, J-G. Chung, and K. K. Parhi, "Low Error Fixed-width CSD
Multiplier with Efficient Sign Extension," IEEE Transactions on
Circuits and Systems-II, vol. 50, no. 12, Dec. 2003.
[8] H. Lee, M. Shin, "A High-Speed Low-Complexity Two-Parallel Radix-2^4
FFT/IFFT Processor for UWB Applications," IEEE Asian Solid-State
Circuits Conference, November 2007.
[9] R. S. Sherratt, S. Makino, "Numerical Precision Requirements on the
Multiband Ultra-Wideband System for Practical Consumer Electronic
Devices," IEEE Transactions on Consumer Electronics, vol. 51, no. 2,
May 2005.
[10] FFT MegaCore Function User Guide, MegaCore Version 7.2,
www.altera.com