IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 4, FEBRUARY 15, 2013 921
A High-Performance Energy-Efficient Architecture
for FIR Adaptive Filter Based on New Distributed
Arithmetic Formulation of Block LMS Algorithm
Basant K. Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE
Abstract—In this paper, we present an efficient distributed-
arithmetic (DA) formulation for the implementation of block
least mean square (BLMS) algorithm. The proposed DA-based
design uses a novel look-up table (LUT)-sharing technique for
the computation of filter outputs and weight-increment terms of
BLMS algorithm. Besides, it offers significant saving of adders
which constitute a major component of DA-based structures. Also,
we have suggested a novel LUT-based weight updating scheme
for BLMS algorithm, where only one set of LUTs out of $M$ sets
needs to be modified in every iteration, where $N = ML$, and $N$ and $L$
are, respectively, the filter length and input block-size. Based
on the proposed DA formulation, we have derived a parallel
architecture for the implementation of BLMS adaptive digital
filter (ADF). Compared with the best of the existing DA-based
LMS structures, the proposed one involves nearly $L/6$ times adders and
$L$ times LUT words, and offers nearly $L$ times throughput of the
other. It requires nearly 25% more flip-flops and does not involve
variable shifters like those of existing structures. It involves less
LUT access per output (LAPO) than the existing structure for
block-size higher than 4. For block-size 8 and filter length 64, the
proposed structure involves 2.47 times more adders, 15% more
flip-flops, 43% less LAPO than the best of existing structures, and
offers 5.22 times higher throughput. The number of adders of the
proposed structure does not increase proportionately with block
size; and the number of flip-flops is independent of block-size.
This is a major advantage of the proposed structure for reducing
its area delay product (ADP); particularly, when a large order
ADF is implemented for higher block-sizes. ASIC synthesis results
show that the proposed structure for filter length 64 has almost
14% and 30% less ADP and 25% and 37% less energy per output (EPO) than the best
of the existing structures for block size 4 and 8, respectively.
Index Terms—Adaptive filters, block LMS, distributed arith-
metic, VLSI.
Manuscript received June 18, 2012; accepted October 07, 2012. Date of publication October 25, 2012; date of current version January 25, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Zhiyuan Yan.
B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (e-mail: bk.mohanti@juet.ac.in).
P. K. Meher is with the Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (e-mail: pkmeher@i2r.a-star.edu.sg, url: http://www1.i2r.a-star.edu.sg/~pkmeher/).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSP.2012.2226453
I. INTRODUCTION
ADAPTIVE DIGITAL FILTERS (ADFs) are widely used
in various signal-processing applications, such as echo
cancellation, system identification, noise cancellation and
channel equalization etc. [1]. Amongst the existing ADFs,
least mean square (LMS)-based finite impulse response (FIR)
adaptive filter is the most popular one due to its inherent sim-
plicity and satisfactory convergence performance. However,
the delay in availability of the feedback-error for updating the
weights according to the LMS algorithm does not favor its
pipeline implementation when sampling rate is high. Haimi
et al. [2] have proposed the delayed LMS (DLMS) algorithm
for pipeline implementation of LMS-based ADF. The delayed
LMS is similar to the LMS algorithm except that the correction
terms for updating the filter weights of the current iteration are
calculated from the error corresponding to a past iteration.
Several schemes have been proposed to implement the
DLMS-based ADFs efficiently in a systolic VLSI with min-
imum adaptation delay [2]–[4], [7], [8]. To avoid adaptation
delay in pipelined LMS ADF, Poltmann [5] has proposed a
modified DLMS algorithm which is used by Douglas et al.
[6] to derive a systolic architecture. But, the structure of [6]
involves large amount of hardware resources compared to the
earlier one [2].
The block LMS (BLMS) ADF [9] is one of the useful deriva-
tives of the LMS ADF for fast and computationally-efficient
implementation of ADFs. Unlike the conventional LMS ADF,
BLMS ADF accepts a block of input for computing a block of
output and updates the weights using a block of errors in every
training cycle. The BLMS ADF has convergence performance
similar to the LMS ADF, but the BLMS ADF of block-length $L$
offers $L$-fold higher throughput compared with the other.
Keeping this in view, many variants of the BLMS algorithm, like time-
and frequency-domain block filtered-X LMS (BFXLMS), have
been proposed for specific applications [20]. Das et al. [21] have
proposed efficient BFXLMS using FFT and fast Hartley trans-
form (FHT), which is computationally more efficient. We have
proposed a delayed block LMS (DBLMS) algorithm [15], and
a concurrent multiplier-based architecture for high-throughput
pipeline implementation of BLMS ADFs. The structure of [15]
provides $L$-fold higher throughput rate and demands $L$ times
more resources compared to those of DLMS ADF. Baghel et al.
[17], [18] have suggested a distributed-arithmetic (DA)-based
structure for FPGA implementation of BLMS ADFs. A low-
complexity design has been proposed in [19] for BLMS ADFs.
This structure supports a very low sampling rate since it uses
single multiply-accumulate (MAC) cell for the computation of
filter output and weight-increment term.
To take the advantage of DA-based hardware designs [12],
Allred et al. [10] have suggested a scheme to derive a DA-based
design for LMS-ADF. The structure of [10] requires separate
look-up-tables (LUTs) for the calculation of filter output and
weight-increment terms. The LUT used for the computation of
filter output and weight-increment term of DA LMS-ADF is
named as DA-F-LUT and DA-A-LUT, respectively. In every it-
eration, entire content of DA-F-LUT is updated to compute the
weight-increment term, where half the content of DA-A-LUT
is updated to accommodate the new input sample arriving at
the current iteration. Updating the LUTs is the most time con-
suming operation in DA-based LMS-ADF, since the updating
is performed sequentially at different LUT locations. The LUT
update time, therefore, depends on the size of the LUT to be
updated. For most practical adaptive filters, we need to use a
decomposition scheme, where small size LUTs can be used in
DA-based LMS-ADF which not only helps in reducing the LUT
size but also in reducing LUT-update time. Recently Guo et
al. [16] have suggested a scheme to avoid the DA-A-LUT in
DA-based LMS-ADF, where both filtering and weight-updating
are performed using DA-F-LUT. On the other hand, throughput
rate of existing DA-LMS ADFs could be slow for real-time applications
due to the bit-serial nature of DA computation. Although
there is some interesting work on DA-based LMS ADF [10],
[16], we find that the potential application of DA for the implementation
of BLMS ADF is yet to be explored.
In order to reduce the power consumption of DA-based designs,
we aim at reducing the number of words in the LUT and
the number of LUT accesses. A DA-based BLMS ADF structure can be derived
by extending the scheme of [10], but this structure would
demand $L$ times more hardware (memory and combinational
logic) for $L$ times more throughput rate. The scheme of [16] offers
sharing of LUT for the computation of both filter output and
weight-increment term, but this scheme cannot be applied to
derive a DA-based structure for BLMS ADFs, because separate
inner-product computation (IPC) is performed for calculation of
filter output and weight-increment term of BLMS ADF whereas
in case of LMS ADF, IPC is performed to calculate the filter
output only. In this paper, we have formulated the DA-BLMS al-
gorithm for sharing of LUTs for the computation of filter output
and weight-increment terms.
The key contributions of this paper are:
• DA-based formulation of BLMS algorithm where both
convolution operation to compute filter output and corre-
lation operation to compute weight-increment term could
be performed by using the same LUT.
• A novel approach for minimization of number of LUT
words to be updated per output. This helps to save external
logic and power consumption.
We have derived a DA-based structure for BLMS-ADF
using the proposed DA-formulation and a novel LUT updating
scheme. The most remarkable aspect of the proposed scheme
is that the number of adders required by the structure does
not increase proportionately with filter order, and the number
of flip-flops required by the structure is independent of the
block-size. Apart from that, the proposed structure has signifi-
cantly less LUT access than the existing DA-LMS structure for
higher block-sizes.
The rest of this paper is organized as follows: Mathematical
formulation is presented in Section II. The new LUT-update
scheme is discussed in Section III, and the proposed structure for
DA-based BLMS ADF is presented in Section IV. Hardware-
and time-complexities of the proposed structure are discussed
in Section V. Conclusion is presented in Section VI.
II. MATHEMATICAL FORMULATION
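The BLMS algorithm for updating the filter weights in the $k$-th iteration is given by
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \mu\,\Delta\mathbf{w}_k \tag{1}$$
where $\Delta\mathbf{w}_k$ is defined as
$$\Delta\mathbf{w}_k = \mathbf{X}_k^T\,\mathbf{e}_k \tag{2}$$
$\mathbf{w}_k$ and $\mathbf{e}_k$ are, respectively, the weight-vector and the error-vector of the $k$-th iteration defined as:
$$\mathbf{w}_k = [w_k(0),\, w_k(1),\, \ldots,\, w_k(N-1)]^T$$
$$\mathbf{e}_k = [e(kL),\, e(kL-1),\, \ldots,\, e(kL-L+1)]^T$$
where $\mu$ is the step-size; and the input matrix $\mathbf{X}_k$ is derived from the current input block $\mathbf{x}_k = \{x(kL),\, x(kL-1),\, \ldots,\, x(kL-L+1)\}$ of length $L$, and $(N-1)$ past samples, given by
$$\mathbf{X}_k = \begin{bmatrix} x(kL) & x(kL-1) & \cdots & x(kL-N+1) \\ x(kL-1) & x(kL-2) & \cdots & x(kL-N) \\ \vdots & \vdots & \ddots & \vdots \\ x(kL-L+1) & x(kL-L) & \cdots & x(kL-L-N+2) \end{bmatrix}$$
The error-vector $\mathbf{e}_k$ is computed as
$$\mathbf{e}_k = \mathbf{d}_k - \mathbf{y}_k \tag{3}$$
where the desired response vector $\mathbf{d}_k$ is defined as
$$\mathbf{d}_k = [d(kL),\, d(kL-1),\, \ldots,\, d(kL-L+1)]^T$$
The $k$-th block of filter output $\mathbf{y}_k$ is computed by the matrix-vector product:
$$\mathbf{y}_k = \mathbf{X}_k\,\mathbf{w}_k \tag{4}$$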
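The block recursion above can be sketched in a few lines of Python. This is only an illustrative floating-point model, not the hardware described in this paper: the helper names (`sample`, `blms_step`, `identify`), the zero-padding of samples before time 0, and the test system `h` are all assumptions made for the example, and the sample ordering assumes that the newest sample of block $k$ is $x(kL)$.

```python
import random

def sample(seq, n):
    """Return seq[n]; samples outside the record are assumed to be zero."""
    return seq[n] if 0 <= n < len(seq) else 0.0

def blms_step(w, x, d, k, L, mu):
    """One BLMS iteration for block k (assumed newest sample: x(kL))."""
    N = len(w)
    # Row i of the L x N input matrix: [x(kL-i), x(kL-i-1), ..., x(kL-i-N+1)]
    X = [[sample(x, k * L - i - n) for n in range(N)] for i in range(L)]
    y = [sum(X[i][n] * w[n] for n in range(N)) for i in range(L)]    # output block
    e = [sample(d, k * L - i) - y[i] for i in range(L)]              # error block
    dw = [sum(X[i][n] * e[i] for i in range(L)) for n in range(N)]   # X^T e
    return [w[n] + mu * dw[n] for n in range(N)], y, e               # weight update

def identify(h, L, mu, num_samples, seed=0):
    """Hypothetical system-identification run: adapt w towards the unknown FIR h."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    d = [sum(h[n] * sample(x, t - n) for n in range(len(h)))
         for t in range(num_samples)]
    w = [0.0] * len(h)
    for k in range(num_samples // L):
        w, _, _ = blms_step(w, x, d, k, L, mu)
    return w
```

Calling `identify([0.5, -0.25, 0.125, 0.0625], L=2, mu=0.05, num_samples=4000)` should return weights close to `h`, since the desired signal is noise-free in this toy setup.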
A. Computation of Filter Output
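The input matrix $\mathbf{X}_k$ of size $(L \times N)$ can be decomposed into $M$ square matrices $\mathbf{S}_k^j$ of size $(L \times L)$ each, where $N = ML$. Similarly, the weight vector $\mathbf{w}_k$ can be decomposed into $M$ short weight-vectors $\mathbf{c}_k^j$ of size $L$, for $0 \le j \le M-1$. The computation of (4) can then be expressed as the sum of $M$ matrix-vector products:
$$\mathbf{y}_k = \sum_{j=0}^{M-1} \mathbf{S}_k^j\,\mathbf{c}_k^j \tag{5}$$
where $\mathbf{S}_k^j$ and $\mathbf{c}_k^j$ are defined as
$$\mathbf{S}_k^j = \begin{bmatrix} x((k-j)L) & x((k-j)L-1) & \cdots & x((k-j)L-L+1) \\ x((k-j)L-1) & x((k-j)L-2) & \cdots & x((k-j)L-L) \\ \vdots & \vdots & \ddots & \vdots \\ x((k-j)L-L+1) & x((k-j)L-L) & \cdots & x((k-j)L-2L+2) \end{bmatrix}$$
$$\mathbf{c}_k^j = [w_k(jL),\, w_k(jL+1),\, \ldots,\, w_k(jL+L-1)]^T$$
for $0 \le j \le M-1$, and $\mathbf{X}_k = [\mathbf{S}_k^0\ \mathbf{S}_k^1\ \cdots\ \mathbf{S}_k^{M-1}]$.
Each filter output $y(kL-i)$ now can be written as the sum of $M$ inner-products as
$$y(kL-i) = \sum_{j=0}^{M-1} r(i,j) \tag{6}$$
where $r(i,j)$ is an $L$-point inner-product of an input-vector $\mathbf{s}^{ij}$ and $\mathbf{c}^j$, given by
$$r(i,j) = \sum_{m=0}^{L-1} s^{ij}(m)\,c^j(m) \tag{7}$$
and $\mathbf{s}^{ij}$ is the $(i+1)$-th row of $\mathbf{S}_k^j$ given by
$$\mathbf{s}^{ij} = [x((k-j)L-i),\, x((k-j)L-i-1),\, \ldots,\, x((k-j)L-i-L+1)]$$
for $0 \le i \le L-1$, $0 \le j \le M-1$, and $0 \le m \le L-1$. Note that we have dropped the subscript $k$ of $\mathbf{s}_k^{ij}$ and $\mathbf{c}_k^j$ in (7) only for convenience of further discussion, without loss of generality.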
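As a quick numerical check of this decomposition, the following sketch compares the sum of $M$ short $L$-point inner-products per output with the direct $N$-tap FIR sum. The function names and the indexing convention (row $i$ of the $j$-th sub-matrix starting at $x((k-j)L-i)$, with zero samples before time 0) are assumptions chosen for illustration.

```python
def sample(x, n):
    """Return x[n]; samples outside the record are assumed to be zero."""
    return x[n] if 0 <= n < len(x) else 0

def input_vector(x, k, i, j, L):
    """Row i of the j-th L x L sub-matrix: [x((k-j)L-i), ..., x((k-j)L-i-L+1)]."""
    return [sample(x, (k - j) * L - i - m) for m in range(L)]

def block_output_decomposed(x, w, k, L):
    """Each output y(kL-i) as a sum of M short inner-products r(i, j)."""
    M = len(w) // L
    y = []
    for i in range(L):
        total = 0
        for j in range(M):
            s = input_vector(x, k, i, j, L)      # input-vector of column j
            c = w[j * L:(j + 1) * L]             # short weight-vector j
            total += sum(s[m] * c[m] for m in range(L))
        y.append(total)
    return y

def block_output_direct(x, w, k, L):
    """Reference: plain N-tap FIR sum for the same L outputs."""
    return [sum(sample(x, k * L - i - n) * w[n] for n in range(len(w)))
            for i in range(L)]
```

Both functions return the block in the order $[y(kL), y(kL-1), \ldots, y(kL-L+1)]$, so their results can be compared element by element.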
B. Computation of Weight Increment Term
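The weight-increment vector $\Delta\mathbf{w}_k$ can be decomposed into $M$ short vectors $\Delta\mathbf{c}_k^j$ of size $L$ each, for $0 \le j \le M-1$. Computation of (2) can be performed through $M$ independent matrix-vector multiplications using the relation
$$\Delta\mathbf{c}_k^j = (\mathbf{S}_k^j)^T\,\mathbf{e}_k \tag{8}$$
where $0 \le j \le M-1$, and $\Delta\mathbf{c}_k^j$ is defined as
$$\Delta\mathbf{c}_k^j = [\Delta w_k(jL),\, \Delta w_k(jL+1),\, \ldots,\, \Delta w_k(jL+L-1)]^T \tag{9}$$
Using (8), the individual weight increment terms could be evaluated by the following equation
$$\Delta w_k(jL+i) = u(i,j) \tag{10}$$
where $u(i,j)$ is the inner-product between the vectors $\mathbf{s}^{ij}$ and $\mathbf{e}_k$, given by
$$u(i,j) = \sum_{m=0}^{L-1} s^{ij}(m)\,e(m), \qquad e(m) = e(kL-m) \tag{11}$$
Here also we have dropped the subscript $k$ of $\mathbf{s}_k^{ij}$ and $\mathbf{e}_k$ for convenience of further discussions. The entries of $\mathbf{S}_k^j$ depend only on the sum of the row and column indices, so the matrix is symmetric and its $(i+1)$-th column equals its $(i+1)$-th row $\mathbf{s}^{ij}$. As shown in (7) and (11), the input-vector $\mathbf{s}^{ij}$ is therefore the same for a pair of inner-products $r(i,j)$ and $u(i,j)$. This is a major advantage in order to optimize the LUTs when the inner-products of (7) and (11) are performed using the DA principle.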
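The sharing of input-vectors between the filtering and the weight-increment computation can be checked numerically. The sketch below, with hypothetical function names and the same assumed indexing as before (newest sample of block $k$ is $x(kL)$, error block ordered $[e(kL), \ldots, e(kL-L+1)]$), computes the weight increments once as a plain correlation and once from the short input-vectors used for the filter output.

```python
def sample(x, n):
    """Return x[n]; samples outside the record are assumed to be zero."""
    return x[n] if 0 <= n < len(x) else 0

def weight_increment_direct(x, e, k, L, N):
    """Reference correlation: dw(n) = sum_i x(kL - i - n) * e(kL - i)."""
    return [sum(sample(x, k * L - i - n) * e[i] for i in range(L))
            for n in range(N)]

def weight_increment_shared(x, e, k, L, N):
    """Same result from the L-point input-vectors used for the filter output."""
    M = N // L
    dw = []
    for j in range(M):
        for i in range(L):
            # identical vector to the one paired with the weights in filtering
            s = [sample(x, (k - j) * L - i - m) for m in range(L)]
            dw.append(sum(s[m] * e[m] for m in range(L)))   # increment for w(jL+i)
    return dw
```

The two functions agree because the short sub-matrices are symmetric under this indexing, which is what allows one table per input-vector to serve both computations.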
C. DA-Formulation
Let $c^j(m)$ and $e(m)$, respectively, be the $(m+1)$-th components of the $L$-point vectors $\mathbf{c}^j$ and $\mathbf{e}$, and assumed to be $B$-bit numbers in 2's complement representation:
$$c^j(m) = -c^j_{B-1}(m)\,2^{B-1} + \sum_{l=0}^{B-2} c^j_l(m)\,2^l \tag{12a}$$
$$e(m) = -e_{B-1}(m)\,2^{B-1} + \sum_{l=0}^{B-2} e_l(m)\,2^l \tag{12b}$$
$c^j_l(m)$ and $e_l(m)$ are the $(l+1)$-th bit of $c^j(m)$ and $e(m)$, respectively. Substituting (12a) in (7), we have
$$r(i,j) = \sum_{m=0}^{L-1} s^{ij}(m)\left[-c^j_{B-1}(m)\,2^{B-1} + \sum_{l=0}^{B-2} c^j_l(m)\,2^l\right] \tag{13}$$
Rearranging the order of summation, (13) may otherwise be expressed as:
$$r(i,j) = \sum_{l=0}^{B-1} a_l \left[\sum_{m=0}^{L-1} s^{ij}(m)\,c^j_l(m)\right] \tag{14}$$
where $a_l = 2^l$, for $0 \le l \le B-2$, and $a_l = -2^{B-1}$ for $l = B-1$. Each term in the inner sum in (14) represents the inner-product of $\mathbf{s}^{ij}$ with a bit-vector (or bit-slice) of weight-vector $\mathbf{c}^j$. Corresponding to $2^L$ possible values of a bit-vector of length $L$, there could be $2^L$ possible values of such inner-products of $\mathbf{s}^{ij}$ with any possible bit-vector of length $L$. All those possible inner-products could be pre-computed and stored in an LUT, such that when the $(l+1)$-th bit-vector (or bit-slice) of the weight vector, $\mathbf{c}^j_l = [c^j_l(0),\, c^j_l(1),\, \ldots,\, c^j_l(L-1)]$ for $0 \le l \le B-1$, is fed to the LUT as address, its inner-product with $\mathbf{s}^{ij}$ is read from the LUT. The computation of the inner sum of (14), therefore, could be expressed in the form of memory read operation as:
$$r(i,j) = \sum_{l=0}^{B-1} a_l\,F(\mathbf{c}^j_l) \tag{15}$$
where $F(\cdot)$ is a memory-read operation, and its argument $\mathbf{c}^j_l$, for $0 \le l \le B-1$, is used as LUT-address. The inner-product of (11) may, similarly, be expressed in the form of memory-read operation as
$$u(i,j) = \sum_{l=0}^{B-1} a_l\,F(\mathbf{e}_l) \tag{16}$$
where $\mathbf{e}_l$ is the $(l+1)$-th bit-vector of error-vector $\mathbf{e}$ defined as: $\mathbf{e}_l = [e_l(0),\, e_l(1),\, \ldots,\, e_l(L-1)]$, which is used as address of an LUT to read its inner-products with $\mathbf{s}^{ij}$. LUT contents for the computation of $r(i,j)$ and $u(i,j)$ are exactly the same, since the LUT content depends on the input-vector $\mathbf{s}^{ij}$, and is generated for all possible bit-slices of $L$-bit length, irrespective of whether that is of the weight-vector or the error-vector. When the bit-vector $\mathbf{c}^j_l$ is used as address, the partial results of $r(i,j)$ are read from the LUT, and when $\mathbf{e}_l$ is used as address, then partial results of $u(i,j)$ are read from the same LUT. Therefore, by using the proposed scheme, a common set of LUTs could be used for the computation of filter outputs and weight-increment terms. Since the block of input samples changes after every iteration, the LUTs are required to be updated in every iteration to accommodate the new input-block. In the next Section, we present a novel LUT-updating scheme for the DA-based BLMS ADFs.
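A small software model may help to see why one table can serve both computations. In the sketch below, which is an illustration rather than the authors' implementation, `build_lut` stores all $2^L$ partial sums of one input-vector, and `da_inner_product` reads that table once per bit-slice, subtracting the word read for the sign-bit slice. The function names and the convention that address bit $m$ corresponds to vector component $m$ are assumptions.

```python
def build_lut(s):
    """LUT[a] = sum of s[m] over the set bits m of address a (2**L words)."""
    L = len(s)
    return [sum(s[m] for m in range(L) if (a >> m) & 1) for a in range(1 << L)]

def bit_slice(v, l, B):
    """Address formed by bit l of every component of v (B-bit two's complement)."""
    mask = (1 << B) - 1
    return sum((((c & mask) >> l) & 1) << m for m, c in enumerate(v))

def da_inner_product(lut, v, B):
    """Inner product of the LUT's input-vector with v using B table reads."""
    acc = 0
    for l in range(B):
        word = lut[bit_slice(v, l, B)]
        weight = -(1 << (B - 1)) if l == B - 1 else (1 << l)   # sign-bit slice is negative
        acc += weight * word
    return acc

s = [3, -1, 4, 2]            # one input-vector (hypothetical values)
c = [5, -8, 7, -3]           # short weight-vector, 4-bit two's complement
e = [-2, 6, -7, 1]           # error block, 4-bit two's complement
lut = build_lut(s)           # built once, shared by both computations
r = da_inner_product(lut, c, 4)   # filtering inner product, equals 45
u = da_inner_product(lut, e, 4)   # weight-increment inner product, equals -38
```

The same `lut` object is addressed by weight bit-slices for `r` and by error bit-slices for `u`, which mirrors the LUT sharing described above.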

Fig. 1. (a) Inner-products of FIR filter of length $N = 6$, and block-size $L = 2$. The input-vector corresponding to each inner-product is shown inside the box. (b) LUT arrangement for DA-based computation of the FIR filter of $N = 6$, and $L = 2$. Each LUT here stores 4 possible values of partial inner-product of input-vector $\mathbf{s}^{ij}$ and bit-vector of $\mathbf{c}^j$ of length 2, for $0 \le i \le 1$ and $0 \le j \le 2$.
III. LUT-UPDATING SCHEME
Before we discuss the proposed LUT-updating scheme, we summarize here the proposed decomposition of input-matrix and weight-vector into small vectors, and their participation in the inner-product computation for filtering operation. The input-matrix $\mathbf{X}_k$ of size $(L \times N)$ is decomposed into $M$ square matrices $\mathbf{S}_k^j$ of size $(L \times L)$ and $\mathbf{w}_k$ is decomposed into $M$ short-vectors $\mathbf{c}_k^j$ of size $L$, for $0 \le j \le M-1$ where $N = ML$. Each of the $L$ rows of $\mathbf{S}_k^j$ represents an input-vector, so that $L$ such input-vectors ($\mathbf{s}^{i0}$, for $0 \le i \le L-1$) are derived from $\mathbf{S}_k^0$, and $L$ such input-vectors are derived from each $\mathbf{S}_k^j$, for $1 \le j \le M-1$. All these $N$ input-vectors are arranged in $L$ rows and $M$ columns such that the input-vectors of $\mathbf{S}_k^j$ belong to the $(j+1)$-th column. According to (5), $M$ weight-vectors are multiplied independently with $M$ matrices which, in total, involves $N$ inner-products. According to (6), results of the $M$ inner-products corresponding to each row of input-vectors are added together for obtaining a filter output. From $L$ such rows of inner-products, $L$ filter outputs are obtained.
We have illustrated here the aforementioned scheme for the implementation of an FIR filter of length $N = 6$ and block-size $L = 2$. Suppose, during the $k$-th iteration the filter receives an input-block $\{x(2k), x(2k-1)\}$ and computes a block of output $\{y(2k), y(2k-1)\}$. As discussed above, the input-matrix $\mathbf{X}_k$ of size $2 \times 6$ is decomposed into 3 square-matrices $\mathbf{S}_k^0$, $\mathbf{S}_k^1$ and $\mathbf{S}_k^2$ of size $2 \times 2$. $\mathbf{S}_k^0$ consists of a pair of input-vectors ($\mathbf{s}^{00}$ and $\mathbf{s}^{10}$), and similarly $\mathbf{S}_k^1$ and $\mathbf{S}_k^2$ consist of the pairs of input-vectors $(\mathbf{s}^{01}, \mathbf{s}^{11})$ and $(\mathbf{s}^{02}, \mathbf{s}^{12})$, respectively. The 6-point weight-vector is decomposed into three 2-point weight-vectors $\{\mathbf{c}^0, \mathbf{c}^1, \mathbf{c}^2\}$. Fig. 1(a) shows the arrangement of input-vectors and weight-vectors; and the corresponding inner-products are shown on the top of the rectangular boxes for clarity. Results of odd-numbered inner-products (on upper row) and even-numbered inner-products (on lower row) are added separately (not shown in the figure) to obtain $y(2k)$ and $y(2k-1)$, respectively.
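As a worked instance of this example, and assuming the sample ordering in which the newest sample of the $k$-th block is $x(2k)$ and the short weight-vectors are $\mathbf{c}^j = [w(2j),\, w(2j+1)]^T$ (notation adopted here for illustration), the two outputs of the block expand as:

$$y(2k) = [x(2k)\ \ x(2k-1)]\,\mathbf{c}^0 + [x(2k-2)\ \ x(2k-3)]\,\mathbf{c}^1 + [x(2k-4)\ \ x(2k-5)]\,\mathbf{c}^2$$
$$y(2k-1) = [x(2k-1)\ \ x(2k-2)]\,\mathbf{c}^0 + [x(2k-3)\ \ x(2k-4)]\,\mathbf{c}^1 + [x(2k-5)\ \ x(2k-6)]\,\mathbf{c}^2$$

Under this assumption the six input-vectors together span the seven samples $x(2k)$ through $x(2k-6)$, and each row vector on the right is the address-independent content from which one 4-word LUT is generated.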
Fig. 2. DA-based computation of the block FIR filter for $N = 6$ and $L = 2$. (a) for $(k+1)$-th iteration, (b) for $(k+2)$-th iteration.
As shown in Fig. 1(a), the same weight-vector is used for the computation of inner-products of a particular column of input-vectors. For DA realization, the LUT corresponding to each $i$ and $j$ stores $2^L$ partial inner-products generated by the inner-product of the corresponding input-vector $\mathbf{s}^{ij}$ with all possible values of a bit-vector of length $L$. DA-based parallel computation of the filter outputs of Fig. 1(a) for the $k$-th iteration is shown in Fig. 1(b). As shown in Fig. 2(a), the DA-based structure receives an input-block $\{x(2k+2), x(2k+1)\}$ during the $(k+1)$-th iteration, so that two new samples enter into the set of 7 samples, and the two oldest samples are discarded. Consequently, samples of all the 6 input-vectors are changed. But, it occurs in a particular order. We can find from Fig. 1(b) and Fig. 2(a) that the contents of only the first column of LUTs of Fig. 2(a) are changed by the new samples, while in other columns the LUT values remain the same. But the positions of those unchanged LUTs are shifted right by one column. For instance, values stored in the LUTs of the second column of Fig. 2(a) are the same as values stored in the LUTs of the first column of Fig. 1(b), and similarly values stored in the LUTs of the third column of Fig. 2(a) are the same as those of the LUTs of the second column of Fig. 1(b). This feature can be observed in the LUT contents of Fig. 2(b) for the $(k+2)$-th iteration also. In other words, contents of a particular column of LUTs during a particular iteration are simply transferred to the adjacent column of LUTs on its right during the next iteration. In this way, the oldest input samples of a particular set are shifted out through the $M$-th column ($M = 3$ in the example) of LUTs, and new values are entered at the first column of LUTs.
Fig. 3. (a) Equivalent DA-based structure of Fig. 2(a) which is derived from the structure of Fig. 1(b) by changing the content of the 5th and 6th LUT (shown in grey color) and left-shifting the weight-vectors by one position. (b) Equivalent DA-based structure of Fig. 2(b) derived from the structure of Fig. 3(a) by changing the content of the 3rd and 4th LUT (shown in grey color) and left-shifting the weight-vectors by one position.
Shifting of values physically from one LUT to the next across the array of LUTs is highly time consuming and power consuming. Therefore, we have proposed a novel LUT updating scheme, where the LUT content need not be shifted. Since each column of LUTs uses the same weight-vector as LUT-address, the column-wise right-shift of LUT values can be achieved by a left-shift of the weight-vectors. This technique could save a lot of time and power, since the shifting of weight-vectors is significantly less expensive than the shifting of LUT contents. In the proposed LUT update scheme, contents of only one column of LUTs out of 3 such columns (for $M = 3$) need to be updated in every iteration. We can find from Fig. 1(b) and Fig. 2(a) that the values of the third-column LUTs of the $k$-th iteration are not used during the $(k+1)$-th iteration, since they correspond to the oldest block of samples $\{x(2k-4), x(2k-5)\}$. The LUTs of the third column are updated as shown in Fig. 3(a) in grey color. To feed weight-vectors to the LUTs of Fig. 3(a) in the same order as that of Fig. 2(a), the weight-vectors of Fig. 1(b) are simply left-shifted by one location. As shown in Fig. 3(a), the second column of LUTs contains the values corresponding to the samples $\{x(2k-2), x(2k-3)\}$, which is the oldest block of samples in the $(k+1)$-th iteration; this input-block is discarded and the corresponding LUTs are updated by the partial inner-products of the new input-block $\{x(2k+4), x(2k+3)\}$. Weight-vectors of Fig. 3(a) are left-shifted by one column, and fed to the LUTs of Fig. 3(b) as addresses. In the following, we summarize the proposed scheme for updating LUTs of BLMS-based adaptive filter:
• LUTs are updated column-by-column in every iteration in
cyclic order.
• The LUTs which store the values of partial inner-products
corresponding to samples of the oldest input block are
overwritten by those of the new input block.
• The weight-vectors $\mathbf{c}^j$ are circularly left-shifted after
every iteration to change the columns of LUT to be read
circularly.
• The values required for updating a column of LUTs for any
particular iteration are calculated from the $L$ samples of the
current input-block and the $(L-1)$ most recent
samples of the previous block.
Based on the above scheme, the LUT-matrix is updated column-by-column from right to left after every iteration. The updating process starts from the $M$-th column of LUTs and goes to the first column in a cyclic manner, and then again from the first column it goes to the $M$-th column and then to the $(M-1)$-th column and so on. Hence, LUTs of one particular column are updated once in a period of $M$ iterations.
Fig. 4. Proposed DA-based structure for implementation of BLMS adaptive FIR filters (for $N = 16$ and $L = 4$).
IV. PROPOSED ARCHITECTURE
Proposed DA-BLMS structure is comprised of one
DA-module, one error bit-slice generator (EBSG) and one
weight-update cum bit-slice generator (WBSG). WBSG up-
dates the filter weights and generates the required bit-vectors
in accordance with the DA-formulation. EBSG computes the
error block according to (3) and generates its bit-vectors. The
DA-module updates the LUTs and makes use of the bit-vectors
generated by WBSG and EBSG to compute the filter output
and weight-increment terms according to (15) and (16).
A. Structure for Block-Size $L = 4$
The proposed structure for DA-based BLMS adaptive filter for $N = 16$ and $L = 4$ is shown in Fig. 4. The DA-module receives a block of input samples $\{x(4k), x(4k-1), x(4k-2), x(4k-3)\}$ in every iteration, and computes a block of filter output $\{y(4k), y(4k-1), y(4k-2), y(4k-3)\}$. It also receives a block of errors in every iteration, and computes the weight-increment term for all the components of the weight-vector $\mathbf{w}_k$.
The structure of the proposed DA-module is shown in Fig. 5. It consists of 4 identical processing elements (PEs) for $M = 4$, one LUT-update block and one MUX-array. Structure of the PE is shown in Fig. 6. It consists of 4 identical subcells (SCs) for $L = 4$. Internal structure and function of the $(i+1)$-th SC is shown in Fig. 7. As required by (15), the LUT of the $(i+1)$-th SC of the PE that holds the $(j+1)$-th column of input-vectors stores 16 possible values corresponding to the samples $\{x(4(k-j)-i),\, x(4(k-j)-i-1),\, x(4(k-j)-i-2),\, x(4(k-j)-i-3)\}$, where $0 \le i \le 3$.
Fig. 5. Structure of DA-module of the proposed DA-BLMS ADF (for $N = 16$ and $L = 4$). The subscript $l$ of the bit-vectors varies from 0 to $(B-1)$ in $B$ cycles.
Fig. 6. Internal structure of the $(j+1)$-th processing element (PE) of the DA-module for block-size $L = 4$.
Fig. 7. Internal structure and function of the $(i+1)$-th subcell (SC) of a PE. Convergence factor $\mu$ is assumed to be power of 2.
Fig. 8. Internal structure of LUT-update block for block-size $L = 4$.
Fig. 9. Internal structure of the $(i+1)$-th AC of the LUT-update block for block-size $L = 4$.
Fig. 10. Internal structure of MUX-array for $N = 16$ and $L = 4$.
The LUT-update block of the DA-module generates the required values to update the LUTs of a particular PE. Structure of the LUT-update block is shown in Fig. 8. It consists of one adder-block and an input delay unit (IDU), which stores $(L-1)$ samples of the previous block. During each iteration, the adder block receives $(2L-1)$ samples ($L$ samples from the current input block and $(L-1)$ past samples from the IDU), and feeds these samples to $L$ adder-cells (ACs) (see Fig. 8) such that each AC receives $L$ samples, and input blocks of adjacent ACs are overlapped by $(L-1)$ samples. During the $k$-th iteration, AC-1 receives the input samples $\{x(kL),\, x(kL-1),\, \ldots,\, x(kL-L+1)\}$ and AC-$L$ receives the samples $\{x(kL-L+1),\, x(kL-L),\, \ldots,\, x(kL-2L+2)\}$. For block-size $L = 4$, each AC receives a block of four samples in every iteration (shown in Fig. 9). As shown in the figure, each of the four inputs of the AC is ANDed with a bit of the four-bit address $\mathbf{a} = (a_3 a_2 a_1 a_0)$ by four AND cells. Each AND cell consists of $W$ AND gates, where $W$ is the word-length of input samples. All those AND gates are fed with a bit of the address, while the other inputs of the AND gates are fed with the bits of the input sample. The outputs of the AND cells are fed to an adder-tree (AT). The AC receives the 16 possible values of $\mathbf{a}$ in 16 clock cycles, and calculates 16 values $A_p$ to be stored in the LUT, where $\mathbf{a}$ is used as the address of the LUT location and $p$ is the equivalent integer value of $\mathbf{a}$. All the ACs of the adder block (see Fig. 8) work in parallel, and generate all the required values to update the LUTs of the $L$ SCs of a PE. According to the proposed LUT-update scheme, LUTs of one PE out of $M$ PEs are updated in every iteration. LUTs of all the PEs are updated once in $M$ iterations in a cyclic order. Each PE uses a separate control signal (one for each $j$, for $0 \le j \le M-1$) to enable the specific column of LUTs to be updated. The LUT-update operation of the proposed structure is completed during the first $2^L$ clock cycles of every iteration.
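The behaviour of the adder-cells can be sketched as follows. This is a behavioural model only: the pairing of address bit $m$ with the $m$-th sample of the cell, the newest-first ordering of the sample window, and the names `adder_cell` and `lut_update_block` are assumptions made for illustration, and samples are modelled as Python integers rather than $W$-bit words.

```python
def adder_cell(samples):
    """Generate the 2**L LUT words of one subcell from its L input samples.

    For every address p, each sample is gated (ANDed) by one address bit and
    the gated values are summed pairwise, as an adder-tree would do.
    """
    words = []
    for p in range(1 << len(samples)):           # one address per clock cycle
        gated = [s if (p >> m) & 1 else 0 for m, s in enumerate(samples)]
        while len(gated) > 1:                    # adder-tree reduction
            if len(gated) % 2:
                gated.append(0)
            gated = [gated[q] + gated[q + 1] for q in range(0, len(gated), 2)]
        words.append(gated[0])
    return words

def lut_update_block(current_block, idu, L):
    """Feed L overlapping windows of the 2L-1 newest samples to L adder-cells.

    current_block: [x(kL), ..., x(kL-L+1)]; idu: the L-1 most recent past samples.
    """
    window = current_block + idu                 # 2L-1 samples, newest first
    return [adder_cell(window[i:i + L]) for i in range(L)]

tables = lut_update_block([7, -2, 5, 3], [-4, 1, 6], 4)
```

Here `tables[i]` holds the 16 words for the $(i+1)$-th subcell of the PE being updated; adjacent windows share three samples, as described for $L = 4$.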
Fig. 11. Structure of error computation cum bit-slice generator (EBSG) for block-size $L = 4$.
Each PE receives the bit-vectors $\mathbf{a}$, $\mathbf{c}^j_l$ and $\mathbf{e}_l$ through the MUX array (shown in Fig. 10) for updating the LUTs or computation of filter outputs or weight-increment terms, respectively. After completion of the LUT-update, filtering computation follows immediately for the next $B$ clock cycles by a series of LUT-read operations using the bit-slices of the corresponding weight-vector in LSB to MSB order, as successive addresses according to (15). During the $(l+1)$-th cycle of filtering, the WBSG generates $M$ parallel bit-vectors of width $L$ bits each for the $M$ PEs to perform the filtering operation. Each SC receives a sequence of $B$ bit-vectors $\mathbf{c}^j_l$ (for $0 \le l \le B-1$, where $B$ is the wordlength of the filter-coefficients) from the WBSG in $B$ clock cycles. The LUT-read values are shift-accumulated in an accumulator (ACC) to obtain a partial filter output. During the $B$-th cycle the LUT output is subtracted from the accumulated result, since the bit-vector $\mathbf{c}^j_{B-1}$ during this cycle contains the sign-bits of the weight-vector $\mathbf{c}^j$. Each SC uses control signal CTR1 to control the add/subtract operation in the ACC. At the end of the $B$-th cycle, the ACC contents are sent to the DMUX as input, and the ACC register (not shown in Fig. 7) is cleared to be used for the computation of the weight-increment term from the next cycle (CTR1 is used for clearing the register). Finally the DMUX sends the computed partial results of inner-products $r(i,j)$ to the output line using the select signal CTR6. From the $L$ SCs of each PE, $L$ partial results are obtained in parallel; the corresponding outputs of each SC from the $M$ PEs are added by an AT (Fig. 5) to obtain $y(kL-i)$ (the $(i+1)$-th component of the $k$-th block of filter output). A block of $L$ parallel filter outputs ($\mathbf{y}_k$) is obtained from the $L$ ATs of the DA-module in each iteration.
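The cycle-by-cycle filtering operation can be modelled roughly as below. The sketch assumes integer weights in $B$-bit two's complement, one LUT per subcell, and a simplified accumulator that weights each LUT word by $2^l$ instead of shifting the accumulated value; all names are hypothetical and control signals are not modelled.

```python
def sample(x, n):
    """Return x[n]; samples outside the record are assumed to be zero."""
    return x[n] if 0 <= n < len(x) else 0

def build_lut(s):
    """LUT[a] = sum of s[m] over the set bits m of address a."""
    return [sum(v for m, v in enumerate(s) if (a >> m) & 1)
            for a in range(1 << len(s))]

def bit_slice(v, l, B):
    """Address formed by bit l of every component of v (B-bit two's complement)."""
    mask = (1 << B) - 1
    return sum((((c & mask) >> l) & 1) << m for m, c in enumerate(v))

def da_block_output(x, w, k, L, B):
    M = len(w) // L
    # luts[j][i]: table of subcell i in the PE holding the j-th column of input-vectors
    luts = [[build_lut([sample(x, (k - j) * L - i - m) for m in range(L)])
             for i in range(L)] for j in range(M)]
    acc = [[0] * L for _ in range(M)]            # one accumulator per subcell
    for l in range(B):                           # one clock cycle per bit-slice, LSB first
        for j in range(M):
            addr = bit_slice(w[j * L:(j + 1) * L], l, B)   # shared by all SCs of PE j
            for i in range(L):
                word = luts[j][i][addr] * (1 << l)
                acc[j][i] += -word if l == B - 1 else word  # subtract on the sign-bit cycle
    # adder-trees: add the outputs of subcell i across all M PEs
    return [sum(acc[j][i] for j in range(M)) for i in range(L)]

def direct_block_output(x, w, k, L):
    """Reference: plain N-tap FIR sums for the same block."""
    return [sum(sample(x, k * L - i - n) * w[n] for n in range(len(w)))
            for i in range(L)]
```

After $B$ passes of the outer loop, each accumulator holds one partial inner-product, and the final list comprehension plays the role of the $L$ adder-trees.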
EBSG receives one block of filter output ($\mathbf{y}_k$) from the DA-module, and calculates a block of error ($\mathbf{e}_k$) in every iteration using one block of desired response $\mathbf{d}_k$ according to (3). As shown in Fig. 11, the $L$ error values are loaded in $L$ parallel-in serial-out (PISO) shift-registers (SRs) of the bit-slice generator (BSG) to generate the bit-vectors $\mathbf{e}_l$ of the error-vector $\mathbf{e}_k$. CTR4 enables the clock for the BSG and CTR2 controls the load-shift operation of each SR.
Bit-vectors $\mathbf{e}_l$, for $0 \le l \le B-1$, are fed serially in LSB to MSB order to the DA-module in $B$ successive clock cycles to compute the weight-increment terms for the $k$-th iteration. According to (16), the LUT values of the $k$-th block of filter output are also used to compute the weight-increment terms for the $k$-th iteration. In general, LUT values of the $(i+1)$-th SC of the $(j+1)$-th PE (for $0 \le i \le L-1$, $0 \le j \le M-1$) are used to compute the weight-increment term $\Delta w_k(jL+i)$. The $(j+1)$-th PE, therefore, computes the weight-increment terms $\{\Delta w_k(jL),\, \Delta w_k(jL+1),\, \ldots,\, \Delta w_k(jL+L-1)\}$. The computation of weight-increment terms is similar to that of the partial filter outputs. But in this case the same bit-vector $\mathbf{e}_l$ is used by all the PEs of the DA-module to compute the weight-increment terms. In each SC (see Fig. 7), the ACC contents corresponding to the weight-increment term are sent to the output line of the DMUX. The weight-increment terms are scaled by a factor $\mu$. Here we have assumed $\mu$ is a power of 2, so that the scaling by $\mu$ is realized by a right-shift operation using a fixed-shifter (see Fig. 7).
TABLE I
LUT UPDATING SCHEME FOR $N = 16$ AND $L = 4$, WHERE $M = N/L$, $L$: BLOCK SIZE, $N$: FILTER ORDER
According to (1), the WBSG of the proposed DA-BLMS structure requires only the weight-increment terms of the current iteration to update the weight-vector for the next iteration. It does not require the LUT values of the current iteration. Therefore, once the weight-increment terms of the current iteration are computed, the LUT-updating operation for the next iteration can be started immediately in the next clock cycle. As we discussed earlier, the filter computation follows the LUT-update operation, and the first $2^L$ clock cycles of every iteration are used to complete the LUT-update operation. During this period, the weight-update operation of WBSG also can be performed concurrently. A bit-parallel (word-serial) structure of WBSG requires one clock-cycle to complete the weight-update operation, while a bit-serial structure of WBSG requires $B$ clock-cycles to complete the same task. If the wordlength of the filter-coefficients ($B$) is less than or equal to the LUT-size ($2^L$), then bit-serial realization of WBSG does not increase the iteration period of the DA-BLMS structure, but it certainly helps to reduce the hardware complexity of the DA-BLMS structure. For $L = 4$ and $B \le 16$, we can have a bit-serial structure for the WBSG. The bit-serial structure of WBSG receives the weight-increment terms from the DA-module in bit-serial LSB to MSB order, and updates the weight-vector accordingly. For bit-serial realization of WBSG, the weight-increment terms computed by each PE of the DA-module are finally loaded into a separate BSG (see Fig. 5) to generate the weight-increment terms in bit-serial order. All the BSGs of the DA-module use common control signals CTR6 and CTR5 to perform the loading and shifting operations, respectively.
WBSG is an important block of the proposed structure. It performs three operations: (i) updates filter weights using the weight-increment values calculated by the DA-module, (ii) generates bit-vectors $\mathbf{c}^j_l$ for the DA-module to compute the next block of filter output, (iii) gives one left-shift (circularly) to the weight-vectors $\mathbf{c}^j$ as necessitated by the proposed LUT-update scheme. We have shown LUT updating of the DA-BLMS ADF for $N = 16$ and $L = 4$ in Table I for the first 5 iterations using the proposed LUT-updating scheme. As shown in Table I, for $N = 16$ and $L = 4$, the LUT-matrix has 4 columns (for $M = 4$). LUTs of all these 4 columns are updated once in a period of 4 iterations. At any given iteration, the LUT-matrix contains the values corresponding to the $(N+L-1) = 19$ most recent input samples to compute a block of 4 filter outputs. As shown in Table I, during the 5-th iteration the LUT-matrix contains the values corresponding to exactly the set of 19 input samples required to compute the 4 filter outputs of that iteration. Similarly, during the 6-th iteration the LUT-matrix contains the values corresponding to the set of 19 samples that are exactly required to compute the 4 filter outputs of the 6-th iteration.
The bit-serial structure of WBSG is shown in Fig. 12. It con-
sists of serial-in serial-out (SISO) SRs and carry-save
full-adders (CSFAs) corresponding to filter weights. SRs
are arranged in matrix form; and filter-weights are
stored in the SR matrix column-wise, such that weight-vector
is stored in -th column of SRs. As shown in Table I for
, that bit-slices of the weight-vector are re-
ceived by the PE whose LUTs are to be updated during the -th
iteration, and are generated from
the first column of filter weights
. The weight-vector to be aligned with the corresponding
PE. If during the -th iteration, LUT of PE-1 is to be updated,
then the first column of SR-matrix is required to contain the
components of weight-vector and the -th column of SRs
should contain components of weight-vector . As shown
in Fig. 12, weight-increment values of the -th column of
filter coefficients (available in the -th column of SR-ma-
trix) are obtained from the -th PE, and these values are
added with the corresponding filter-weights bit-serially using a
carry-save full-adder (CSFA). Results of CSFA of -th
column constitute a bit-vector of . SR contents are shifted
left for clock cycles to generate the shifted weight-vectors
in accordance with the proposed LUT-update scheme. The shifting
operation of the SRs starts at the -th clock cycle of every
iteration, and continues for clock cycles. The control signal
CTR5 is used in WBSG to enable the shifting operation. The D
flip-flop of each CSFA is cleared during the first clock cycle
of every iteration to flush out the final carry of the previous
iteration of the weight-update operation.
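The bit-serial weight update described above can be modelled in software. The following is a behavioural sketch only: it assumes LSB-first bit streams and a fixed word-length, and the helper names are hypothetical. It is not a description of the circuit of Fig. 12.

```python
def to_bits(value, width):
    """Two's-complement value -> list of bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """List of bits (LSB first) -> unsigned integer."""
    return sum(b << i for i, b in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two equal-length bit streams, one bit per clock cycle.

    The carry variable plays the role of the D flip-flop of the full-adder.
    It is cleared before the first bit, so no carry leaks in from the
    previous word, and the final carry is discarded.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (a & carry) | (b & carry)
    return out


# weight -3 plus increment 5 in 8-bit two's complement
w = to_bits(-3 & 0xFF, 8)
dw = to_bits(5, 8)
print(from_bits(bit_serial_add(w, dw)))
```

Clearing the carry at the start of every word is what keeps successive weight updates independent of each other.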
B. Structure for Higher Block-Size
To derive DA-based BLMS structure for higher block sizes
using LUT of 16 words, we can take the block-size to be a
multiple of 4, i.e. , where is an integer. The structures
of EBSG and WBSG of the DA-BLMS filter for (for
) are the same as those of block-size shown in
Fig. 11 and Fig. 12, respectively. However, the AC of the LUT-
update block and the SC of each PE of the DA-module need to
be modified according to the value of . Each SC in this case
comprises LUTs of size 16 words each. The bit-vectors
of weight-vectors and error-vectors of bits each are split
into segments of 4-bit size, and fed to LUTs of each SC
to read the LUTs in parallel. The values read from the LUTs
are added using an AT and subsequently shift-accumulated in
the ACC for obtaining a partial output. To generate the weight
update-values for LUTs, each AC of the LUT-update block in
this case comprises AND-TA blocks of size 4 (as shown
in Fig. 9). For block-size , each SC involves RAM
words and adders along with one ACC and 2 DMUX.
Similarly, the LUT-update block involves AND-gates and
adders.

Fig. 12. Bit-serial structure of weight-update cum bit-slice generator (WBSG) for , and .
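The use of several 16-word LUTs addressed by 4-bit segments can be illustrated with a small software model of a DA inner product. This is a sketch under stated assumptions: the LUTs hold sums of input samples, the weights are two's-complement integers of a fixed width, the bit-slices are taken LSB first, and the sign-bit slice is handled by a plain subtraction instead of the XOR-based complement used in hardware. All names are illustrative.

```python
def build_lut(x4):
    """16-word LUT: entry a holds the sum of the x4[i] whose bit i is set in a."""
    return [sum(x for i, x in enumerate(x4) if (a >> i) & 1) for a in range(16)]


def da_inner_product(x, w, width=8):
    """Inner product of x and w computed bit-slice by bit-slice.

    x: input samples (length a multiple of 4)
    w: two's-complement weights that fit in `width` bits
    """
    assert len(x) == len(w) and len(x) % 4 == 0
    luts = [build_lut(x[s:s + 4]) for s in range(0, len(x), 4)]
    mask = (1 << width) - 1
    acc = 0
    for j in range(width):                    # one bit-slice per cycle
        partial = 0
        for s, lut in enumerate(luts):        # LUTs are read in parallel
            addr = 0
            for i in range(4):                # 4-bit segment of the bit-slice
                addr |= (((w[4 * s + i] & mask) >> j) & 1) << i
            partial += lut[addr]              # adder tree
        if j == width - 1:
            acc -= partial << j               # sign-bit slice has negative weight
        else:
            acc += partial << j               # shift-accumulate
    return acc


x = [3, -1, 4, 1, -5, 9, 2, -6]
w = [2, -7, 1, 8, -2, 8, 1, -8]
print(da_inner_product(x, w), sum(a * b for a, b in zip(x, w)))
```

Keeping every LUT at 16 words means the memory grows linearly with the number of segments, whereas a single LUT addressed by the whole bit-slice would grow exponentially.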
V. HARDWARE-TIME COMPLEXITY AND
PERFORMANCE COMPARISON
A. Hardware Complexity
The proposed structure comprises one DA-module, one
WBSG, one EBSG and a control unit. The DA-module consists
of one LUT-update block, PEs, adder-trees of words
each, one MUX-array and BSG, where and the
block-size . The LUT-update block consists of one IDU and
ACs, where the IDU is comprised of registers of size ,
and each AC is comprised of AND-gates and adders.
LUT-block, therefore, involves registers, adders
and AND-gates. Each PE consists of SCs, where each
SC is comprised of LUTs of 16 words each, adders,
one ACC, one 1-to-2 line DMUX and number of 2-input
XOR-gates (used by ACC (not shown in Fig. 7) to compute 1’s
complement of the LUT-outputs when the bit-vector contains
sign-bits), where ACC involves one adder, one register and
one 2-to-1 line DMUX ( ). Each PE, there-
fore, involves memory words, adders, registers,
DMUXes (2-to-1 line) and XOR-gates. Each BSG is
comprised of SRs (bit-level) of size . MUX-array involves
2-to-1 line MUXes. The DA-module, therefore, in-
volves memory words, adders,
D-type flip-flops (FFs),
2-to-1-line MUXes/DMUXes (word-level), AND-gates
and XOR-gates. WBSG involves D-type FFs
and FAs. EBSG involves D-type FFs and adders.
Proposed structure, therefore, requires memory words,
adders, FAs,
D-type FFs, MUXes/DMUXes (word-level),
AND-gates and XOR-gates.
B. Time Complexity
The proposed structure performs four operations sequen-
tially in every iteration. Those are (i) LUT update, (ii) filter
output computation, (iii) error calculation and (iv) computation
of weight-increment term. It involves 16 clock cycles
to complete LUT-update operation. It takes clock cycles
to calculate partial results of a block of filter output. It cal-
culates a block of filter output from the partial results and
then block of error in one clock cycle. Finally it takes
clock cycles to compute the weight-increment term for the
weight vector. In every iteration, proposed structure pro-
cesses one block of samples, where one iteration involves
clock cycles and duration of one clock cycle is
,
where is the delay of one -bit adder. For comparison
purpose, we have also estimated number of clock cycles re-
quired by the structure of [10] and [16] for one iteration. We
assumed the read and write operations are performed in two
separate clock cycles in a LUT to maintain uniformity in the
comparison. Structure of [10] requires 16 clock cycles to update
the DA-A-LUT of size 16 words, clock cycles to compute
one filter output and 32 clock cycles to update the DA-F-LUT
of size 16 words. It involves 48 clock cycles for one iteration
and computes one output per iteration, where the duration of
one clock cycle is and . Since,
the structure of [16] does not involve DA-F-LUT, it requires 16
clock cycles for updating the DA-A-LUT and clock cycles
to compute one filter output. The structure of [16], therefore,
involves clock cycles for one iteration, where the
duration of the clock period is the same as that of [10].

TABLE II
GENERAL COMPARISON OF HARDWARE COMPLEXITY OF THE PROPOSED STRUCTURE (FOR BLOCK-SIZE ) AND THE STRUCTURE OF [10] AND [16] (WITH DECOMPOSITION FACTOR 4) AND THE DA-BLMS STRUCTURE OF [18]
LEGEND: ADD: adder, MULT: multiplier, FF: flip-flop, VSH: variable shifter, TR: throughput rate, LAPO: LUT access per output.
In addition to the above list of components the proposed structure involves FAs, 2-input AND-gates and 2-input XOR-gates,
where : word length of the sequence , and , : word-length of input sequence, . For the proposed structure, ,
and in case of [10] and [16], , and in case of [18], , , and block-size , where and are
relatively prime to each other.
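At the algorithm level, the filter-output, error and weight-increment steps of one iteration correspond to the standard block LMS recursion. The sketch below uses plain multiplications in place of the LUT-based DA computation, so it illustrates only what is computed per iteration and not how the proposed structure computes it. The function name, argument layout and step-size symbol `mu` are assumptions made for illustration.

```python
def blms_iteration(w, x_hist, d_block, mu):
    """One block LMS iteration.

    w       : current weights, length N
    x_hist  : the N + L - 1 most recent input samples, oldest first
    d_block : the L desired-response samples of the current block
    mu      : step size
    Returns (updated weights, filter outputs, errors).
    """
    N, L = len(w), len(d_block)
    assert len(x_hist) == N + L - 1
    # filter outputs (convolution): y[l] = sum_i w[i] * x(n_l - i)
    y = [sum(w[i] * x_hist[N - 1 + l - i] for i in range(N)) for l in range(L)]
    # block of errors
    e = [d_block[l] - y[l] for l in range(L)]
    # weight-increment terms (correlation of the error block with the input)
    inc = [sum(e[l] * x_hist[N - 1 + l - i] for l in range(L)) for i in range(N)]
    w_new = [w[i] + mu * inc[i] for i in range(N)]
    return w_new, y, e


w_new, y, e = blms_iteration([0, 0], [1, 2, 3], [2, 3], 1)
print(w_new, y, e)
```

The same window of input samples is used for both the convolution and the correlation, which is the property the common-LUT formulation relies on.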
C. Number of LUT Access
During every iteration, proposed structure computes filter
outputs, and performs write operations for updating the
LUTs, LUT read operations for filter output computation
and LUT read operations for the computation of weight-in-
crement terms. The number of LUT access per output (LAPO)
of the structure is, therefore, . Similarly, the
number of LAPO of [10] and [16] are found to be
and , respectively, where is the bit-width of the
input samples and is the bit-width of all the intermediate and
output samples. Note that, LUTs of DA-based ADF are required
to be implemented by RAM, and the total energy consumption
of the structure, therefore, increases significantly with LAPO.
D. Performance Comparison
Hardware and time complexities of the proposed structure and
the DA-LMS structures of [10], [16], and the DA-BLMS structure
of [18] are listed in Table II for comparison. The structure of
[16] is the most efficient one amongst the existing DA-LMS
structures. Compared with [16], proposed structure requires
times more LUT words, nearly times more adders, 4/3 times
more FFs and offers nearly times higher throughput rate. It in-
volves 16 more LAPO for block-size 4 and less
LAPO for block-size 8 than those of [16] for 16-bit internal
bit precision. Interestingly, the number of adders of the proposed
structure does not increase proportionately with block-size,
and the number of flip-flops is independent
of block-size. Besides, it does not require variable shifters,
unlike those of [10] and [16].
We have estimated hardware and time complexity of pro-
posed structure for and 8, and that of [10] and [16] for
filter size ( , 32 and 64) using the complexity counts
of Table II. The estimated values are listed in Table III for
comparison. Compared with the structure of [16], proposed
structure for involves 8 times more LUT words; 3.27
times more adders on average for different filter orders, and
offers 5.22 times higher throughput. However, it involves
37.5%, 24.4% and 17.8% more flip-flops, and 25%, 37.5% and
47.6% less LAPO than those of [16] for filter orders 16, 32 and 64,
respectively.
E. Simulation Result
To validate the proposed design, we have coded it in VHDL
for filter order 16, 32 and 64 with block-size 4 and 8. We have
also coded the design of [10] and [16] for the same filter orders.
We have considered and , and synthesized
both the designs by Synopsys Design Compiler using TSMC 90
nm CMOS library. Synthesis reports obtained from the Design
Compiler are listed in Table IV.
Synthesis results are in accordance with the theoretical es-
timation given in Table III. The minimum clock period of the
proposed structure and the structure of [16] are slightly higher
than those of [10] due to the extra MUX/DMUX in the critical
path. As shown in Table IV, structure of [16] is the most efficient
amongst the existing structures. Compared with [16], proposed
structure for block size and 8 involves, respectively, 2.13
and 3.69 times more area on average for different filter orders
and offers nearly 2.61 and 5.22 times higher throughput rate, re-
spectively.

TABLE III
HARDWARE AND TIME-COMPLEXITY OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16] FOR DIFFERENT SIZE FILTERS. ,
TABLE IV
COMPARISON OF AREA, DELAY, AND POWER COMPLEXITIES OBTAINED FROM SYNTHESIS RESULT OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16]
We have estimated ADP, PPO and energy per output
(EPO) at 20 MHz clock. As shown in Table IV, for block-size
4, the proposed structure has 17.47%, 18.49% and 13.66% less
ADP than that of [16] for filter order 16, 32 and 64, respectively.
For block-size 8, it has 31.6% less ADP than [16] on average
for different filter orders. For block-size 4, it consumes 27.5%,
28.8% and 24.6% less EPO than that of [16] for filter order 16,
32 and 64, respectively. Similarly, for block-size 8, it consumes
40%, 39.8% and 37.4% less EPO than that of [16], respectively, for
16, 32 and 64 order filters. One can extrapolate these results to
obtain the approximate values of ADP, PPO and EPO of the
proposed structure for filter order greater than 64. One can
also extrapolate these observations to obtain the approximate
estimate of the advantages of proposed structure for filter order
greater than 64.
VI. CONCLUSION
We have derived a DA formulation of BLMS algorithm where
both convolution and correlation are performed using a common
LUT for the computation of filter outputs and weight increment
terms, respectively. This results in a significant saving of LUT
words and adders which constitute the major hardware com-
ponents in DA-based computing structures. Also we have sug-
gested a novel LUT updating scheme to update the LUT con-
tents for DA-based BLMS ADF, where only one set of LUTs out
of sets need to be modified in every iteration such that LUT
contents are modified once in every iterations, where
, is the filter length and is the input block-size. Using
the proposed scheme, we have derived a parallel architecture for
the implementation of DA-based BLMS ADF. Unlike the ex-
isting DA-based LMS structure, number of adders required by
the proposed structure does not increase linearly with . Com-
pared with the best of the existing DA-based LMS designs, pro-
posed one involves nearly times more adders, and times
more LUT words and offers nearly times throughput. It re-
quires nearly 25% more flip-flops irrespective of the block-size,
but does not involve variable shifters like others. It involves
fewer LUT accesses per output than the existing structure
for block-size higher than 4. This is a major advantage of
the proposed structure for reducing its ADP and EPO when im-
plemented for large order ADF, and for higher block-sizes. For
block-size 8 and filter length 64, the proposed structure involves
2.47 times more adders, 15% more flip-flops, 43% less LAPO
than the best of the existing structures, and offers 5.22 times
higher throughput. ASIC synthesis results show that the proposed
structure for filter order 64 has almost 14% and 30% less
ADP and 25% and 37% less EPO than the best of the existing
structures for block size 4 and 8, respectively.
REFERENCES
[1] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Fil-
ters. Hoboken, NJ: Wiley-Interscience, 2003.
[2] R. Haimi-Cohen, H. Herzberg, and Y. Beery, “Delayed adaptive LMS
filtering: Current results,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Process., Albuquerque, NM, Apr. 1990, pp. 1273–1276.
[3] M. D. Meyer and D. P. Agrawal, “A modular pipelined implementation
of a delayed LMS transversal adaptive filter,” in Proc. IEEE Int.
Symp. Circuits Syst., New Orleans, LA, May 1990, pp. 1943–1946.
[4] V. Visvanathan and S. Ramanathan, “A modular systolic architecture
for delayed least mean square adaptive filtering,” in Proc. IEEE Int.
Conf. VLSI Des., Bangalore, 1995, pp. 332–337.
[5] R. D. Poltmann, “Conversion of the delayed LMS algorithm into the
LMS algorithm,” IEEE Signal Process. Lett., vol. 2, p. 223, Dec. 1995.
[6] S. C. Douglas, Q. Zhu, and K. F. Smith, “A pipelined LMS adap-
tive FIR filter architecture without adaptive delay,” IEEE Trans. Signal
Process., vol. 46, pp. 775–779, Mar. 1998.
[7] L. D. Van and W. S. Feng, “Efficient systolic Architectures for 1-D and
2-D DLMS adaptive digital filters,” in Proc. IEEE Asia Pacific Conf.
Circuits Syst., Tianjin, China, Dec. 2000, pp. 399–402.
[8] L. D. Van and W. S. Feng, “An efficient architecture for the DLMS
adaptive filters and its applications,” IEEE Trans. Circuits Syst. II,
Analog Digit. Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.
[9] G. A. Clark, S. K. Mitra, and S. R. Parker, “Block implementation of
adaptive digital filters,” IEEE Trans. Circuit Syst., vol. 28, pp. 584–592,
Jun. 1981.
[10] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. An-
derson, “LMS adaptive filters using distributed arithmetic for high
throughput,” IEEE Trans. Circuits Syst., vol. 52, no. 7, pp. 1327–1337,
Jul. 2005.
[11] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson,
“A novel high performance distributed arithmetic adaptive filter im-
plementation on an FPGA,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Process. (ICASSP), 2004, vol. 5, p. V-161-4.
[12] S. A. White, “Applications of distributed arithmetic to digital signal
processing: A tutorial review,” IEEE ASSP Mag., vol. 6, pp. 4–19, Jul.
1989.
[13] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson,
“An FPGA implementation for a high throughput adaptive filter using
distributed arithmetic,” in Proc. 12th Annu. IEEE Symp. Field-Pro-
grammable Custom Comput. Mach., 2004, pp. 324–325.
[14] W. Huang and D. V. Anderson, “Adaptive filters using modified
sliding-block distributed arithmetic with offset binary coding,” in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP),
2009, pp. 545–548.
[15] B. K. Mohanty and P. K. Meher, “Delayed block LMS algorithm and
concurrent architecture for high-speed implementation of adaptive FIR
filters,” presented at the IEEE Region 10 TENCON2008 Conf., Hyder-
abad, India, Nov. 2008.
[16] R. Guo and L. S. DeBrunner, “Two high-performance adaptive filter
implementation schemes using distributed arithmetic,” IEEE Trans.
Circuits Syst. II, Exp. Briefs, vol. 58, no. 9, pp. 600–604, Sep. 2011.
[17] S. Baghel and R. Shaik, “FPGA implementation of fast block LMS
adaptive filter using distributed arithmetic for high-throughput,” in
Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10–12, 2011,
pp. 443–447.
[18] S. Baghel and R. Shaik, “Low power and less complex implementation
of fast block LMS adaptive filter using distributed arithmetic,” in Proc.
IEEE Students Technol. Symp., Jan. 14–16, 2011, pp. 214–219.
[19] R. Jayashri, H. Chitra, H. Kusuma, A. V. Pavitra, and V. Chan-
drakanth, “Memory based architecture to implement simplified block
LMS algorithm on FPGA,” in Proc. Int. Conf. Commun. Signal
Process. (ICCSP), Feb. 10–12, 2011, pp. 179–183.
[20] Q. Shen and A. S. Spanias, “Time and frequency domain X block LMS
algorithm for single channel active noise control,” Control Eng. J., vol.
44, no. 6, pp. 281–293, 1996.
[21] D. P. Das, G. Panda, and S. M. Kuo, “New block filtered-X LMS algorithms
for active noise control systems,” IET Signal Process., vol. 1,
no. 2, pp. 73–81, Jun. 2007.
[22] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation.
New York: Wiley, 1999.
[23] C. S. Burrus, “Index mappings for multidimensional formulation of the
DFT and convolution,” IEEE Trans. Acoust., Speech, Signal Process.,
vol. 25, pp. 239–242, Jun. 1977.
Basant K. Mohanty (M’06–SM’11) received the M.Sc.
degree in physics from Sambalpur University, India,
in 1989 and the Ph.D. degree in the field of VLSI for
digital signal processing from Berhampur University,
Orissa, in 2000.
In 2001, he joined as Lecturer in Electrical and
Electronic Engineering Department, BITS Pilani,
Rajasthan. Then, he joined as an Assistant Professor
in the Department of Electronics and Communi-
cation Engineering, Mody Institute of Education
Research (Deemed University), Rajasthan. In 2003,
he joined Jaypee University of Engineering and Technology, Guna, Madhya
Pradesh, where he became Associate Professor in 2005 and full Professor in
2007. His research interest includes design and implementation of low-power
and high performance systems for multimedia applications, multi-core pro-
cessor design and algorithm for concurrent processing. He has published nearly
40 technical papers.
Dr. Mohanty is a life time member of The Institution of Electronics and
Telecommunication Engineering, New Delhi, India. He was the recipient of the
Rashtriya Gaurav Award conferred by India International Friendship Society,
New Delhi, India for 2012.
Pramod Kumar Meher (SM’03) received the M.Sc.
degree in physics and the Ph.D. degree in science
from Sambalpur University, India, in 1978, and 1996,
respectively.
Currently, he is a Senior Scientist with the Institute
for Infocomm Research, Singapore, and Adjunct Pro-
fessor with the School of Electrical Sciences, Indian
Institute of Technology Bhubaneswar, India. Previ-
ously, he was a Professor of Computer Applications
with Utkal University, India, from 1997 to 2002, and
a Reader in electronics with Berhampur University,
India, from 1993 to 1997. His research interest includes design of dedicated and
reconfigurable architectures for computation-intensive algorithms pertaining to
signal, image and video processing, communication, bio-informatics and intel-
ligent computing. He has contributed nearly 200 technical papers to various
reputed journals and conference proceedings.
Dr. Meher has served as a speaker for the Distinguished Lecturer Program
(DLP) of IEEE Circuits Systems Society and Associate Editor of the IEEE
TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS. Currently, he
is serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS—I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE
SCALE INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits, Systems, and
Signal Processing. He was the recipient of the Samanta Chandrasekhar Award
for excellence in research in engineering and technology for 1999.
