Fx3111501156

Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 1, January -February 2013, pp.1150-1156
Embedded Reconfigurable Architectures for Multidimensional
Transforms
Ayman Elnaggar, Mokhtar Aboelaze
Department of Computer Science & Engineering, German University in Cairo, New Cairo City, Egypt
Department of Computer Science, York University, Toronto, Canada M3J 1P3

Abstract
This paper presents a general approach for generating higher-order (longer size) multidimensional (m-d) architectures from $2^m$ lower-order (shorter size) architectures. The objective of our work is to derive a unified framework and a design methodology that allows direct mapping of the proposed algorithms into embedded reconfigurable architectures such as FPGAs. Our methodology is based on manipulating tensor product forms so that they can be mapped directly into modular parallel architectures. The resulting circuits have a very simple modular structure and a regular topology.

Keywords – Reconfigurable architectures, recursive algorithms, multidimensional transforms, tensor products, permutation matrices.

I. INTRODUCTION
This paper proposes an efficient and cost-effective general methodology for mapping multidimensional transforms onto reconfigurable architectures such as FPGAs. The main objective of this paper is to derive a design methodology and a recursive formulation for multidimensional transforms that support true modularization and parallelization of the resulting computation.
Our methodology employs tensor product (or Kronecker product) decompositions and permutation matrices as the main tools for expressing the general framework for multidimensional DSP transforms. We employ several techniques to manipulate such decompositions into suitable recursive expressions which can be mapped efficiently onto reconfigurable FPGA structures.
Our work is based on a non-trivial generalization of the one-dimensional DSP transforms. It has been shown that, when coupled with stride permutation matrices, tensor products provide a unifying framework for describing and programming a wide range of fast recursive algorithms for various transforms. This unifying framework is suited for parallel processing machines and vector processing machines [6], [10].
Some of the tensor product properties that will be used throughout this paper are [6], [10]:

$$ AB \otimes CD = (A \otimes C)(B \otimes D) \qquad (1) $$

$$ (A \otimes B) \otimes C = A \otimes (B \otimes C) \qquad (2) $$

If $n = n_1 n_2$, then

$$ A_{n_1} \otimes B_{n_2} = P_{n,n_1} (I_{n_2} \otimes A_{n_1}) P_{n,n_2} (I_{n_1} \otimes B_{n_2}) \qquad (3) $$

If $n = n_1 n_2 n_3$, then

$$ I_{n_1} \otimes A_{n_2} \otimes I_{n_3} = P_{n,n_1 n_2} (I_{n_1 n_3} \otimes A_{n_2}) P_{n,n_3} \qquad (4) $$

If $2n = n_1 n_2$, then

$$ P_{n,2} = P_{n,n_1} P_{n,n_2} \qquad (5) $$

where $\otimes$ denotes the tensor product, $I_n$ is the identity matrix of size $n$, and $P_{n,s}$, the permutation matrix, is an $n \times n$ binary matrix whose entries are zeroes and ones, such that each row and each column has a single 1 entry. If $n = rs$, then $P_{n,s}$ is an $n \times n$ binary matrix specifying an $n/s$-shuffle (or $s$-stride) permutation. The effect of the permutation matrix $P_{n,s}$ on an input vector $X_n$ of length $n$ is to shuffle the elements of $X_n$ by grouping together all the $r$ elements separated by distance $s$. The first $r$ elements will be $x_0, x_s, x_{2s}, \ldots, x_{(r-1)s}$, the next $r$ elements are $x_1, x_{1+s}, x_{1+2s}, \ldots, x_{1+(r-1)s}$, and so on.

The main result reported in this paper shows that a large two-dimensional (2-d) computation for a given DSP transform on an $n \times n$ input array can be decomposed recursively into three stages as shown in Fig. 1 for the case $n = 4$. The middle stage is constructed recursively from $2^2$ parallel (data-independent) blocks, each realizing a smaller-size computation of the same DSP transform. The pre-additions and the post-permutations stages serve as "glue" circuits that combine the $2^2$ lower order blocks to construct the higher order architecture. We also show that the proposed unified approach can be extended such that an m-d DSP transform can be constructed from $2^m$ smaller size m-d ones. The objective of our work is to derive a unified framework and a design methodology that allows direct mapping of the proposed algorithms into reconfigurable FPGA architectures.
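The stride permutation and properties (3) and (5) above can be checked numerically on small cases. The following is a minimal sketch in plain Python, not part of the original paper: the helper names (`stride_perm`, `perm_matrix`, `kron`, `matmul`, `identity`) and the test matrices are illustrative assumptions, and the property (1) used implicitly is the standard mixed-product rule.

```python
def stride_perm(n, s):
    """Index list for P_{n,s}: output position a*r + b takes input a + b*s (r = n/s)."""
    if n % s != 0:
        raise ValueError("s must divide n")
    r = n // s
    return [a + b * s for a in range(s) for b in range(r)]

def perm_matrix(n, s):
    """P_{n,s} as a 0/1 matrix, so that (P x)[row] = x[idx[row]]."""
    idx = stride_perm(n, s)
    return [[1 if col == idx[row] else 0 for col in range(n)] for row in range(n)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kron(A, B):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

# The 2-stride shuffle of x_0..x_7 groups elements that are 2 apart.
shuffled = [list(range(8))[i] for i in stride_perm(8, 2)]

# Property (3) with n1 = 2, n2 = 3 (arbitrary integer test matrices).
A = [[1, 2], [3, 4]]
B = [[1, 0, 2], [0, 1, 1], [3, 1, 0]]
n1, n2 = 2, 3
n = n1 * n2
left = matmul(perm_matrix(n, n1), kron(identity(n2), A))
right = matmul(perm_matrix(n, n2), kron(identity(n1), B))
prop3_holds = kron(A, B) == matmul(left, right)

# Property (5) for n = 8: n1 * n2 = 4 * 4 = 16 = 2n, so P_{8,4} P_{8,4} = P_{8,2}.
prop5_holds = matmul(perm_matrix(8, 4), perm_matrix(8, 4)) == perm_matrix(8, 2)
```

In hardware terms, `stride_perm` only describes a fixed wiring pattern; the sketch uses explicit matrices purely so the identities can be compared entry by entry.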
1 0
 1 0 0
A  1 1  ,
architectures.
  B  1 1  1
Observe that, we have drawn our networks
0 1  0 0 1
such that data flows from right to left. We chose this  
convention to show the direct correspondence In this case j  3 (three parallel blocks of the
between the derived algorithms and the proposed convolution of smaller size n / 2) ). The realization
reconfigurable architecture. of the 1-d linear convolution is shown in Fig. 1.
I. A GENERAL FRAMEWORK FOR 1-D RECURSIVE
DSP TRANSFORMS
In this section, we present a general
framework to derive recursive formulations for
multidimensional transforms. Given a 1-d DSP
algorithm in a matrix-vector form
Ym  Tm, n X n (6)

Where, Tm, n is the transform matrix, X n
and Ym are the input vector of size n and the
Fig. 1. The realization of the 1-d linear convolution
output vector of size m , respectively. Then, using
sparse matrix factorization approach [9], the matrix B. The 1-d DCT
Tm, n can be factorized so that
Tm, n  T1 T2 Tk (7) The 1-d DCT Tn of size n can be written as [3], [7]

Where, each of the matrices T1 , T2 ,, Tk Tn  Rn ( I 3  Tn / 2 ) Qn . (10)
is sparse. Sparseness implies that either most of the Where
elements of the matrix are zeros or the matrix in the Rn  Pn, n / 2 ( I n / 2  Ln / 2 ) ,
block diagonal form. By applying tensor product
property (3) to the block diagonal matrices of Qn  ( I 2 V 1 ) ( I n / 2  Cn / 2 ) ( F2  I n / 2 ) Vn .
n/2
equation (7), we have
 1 
C n  diag ,
Tm, n  ( Ri ) ( I j  Tm / 2 , n / 2 ) (Qk ) (8)  2 cosn 

n  (4M  1), M  0, 1, , n  1,
Where, Q k and Ri are the pre- and post- 2n
processing glue structure that combine j blocks in 1 1 0 0  0 0
parallel of the lower-order transform of size 0 1 1 0  0 0
Tm / 2 , n / 2 .  
0 0 1 1  0 0
Lm    ,
A. The 1-d Linear Convolution    0   
The 1-d linear convolution matrix C (n) of 0 0 0 0  1 0
 

size n  2 , where  is an integer can be written 0 0 0 0  0 1
 
as [5], [6] Vn  ( I n / 2  J n / 2 ) Pn,2 ,
C (n)  Rn ( I 3 C (n / 2) ) Qn . (9) 1 1 
F2   ,
1  1
Where,
I n / 2 is the identity matrix of dimension n / 2 , 
Qn ( P  1   2 ) 1[( I  1  A)P   1 ].
2 3,2 3 2 2 ,2 is the direct sum operator, J n / 2 is the exchange
Rn  R  k ( P   B) P 
3( 2 1),3 2 1 3( 2 1),( 2 1)
(I )
2
matrix of order n / 2 defined as


0 0  0 0 third stage) will be replaced by the single
0 0  1 0
permutation P8,2 as shown in Fig. 3 (b). Similarly,
  equation (13) can be simplified to
Jn / 2        , k 1
  Wn   Pn , 2 ( I 2k 1  W2 ) (15)
0 1  0 0 i 0
1
 0  0 0
 Thus, Wn can be computed by the
cascaded product of k similar stages (independent of
In this case j  2 (two parallel blocks of the DCT of i) of double matrix products instead of the triple
matrix products in equation (8). Alternatively, we
smaller size n / 2) ). The realization of the 1-d DCT
can realize (15) by a single block of
is shown in Fig. 2.
Pn , 2 ( I 2k 1  W2 ) and take the output after k
iterations that allows a hardware saving without
slowing down the processing speed and reduction in
the hardware size as shown in Fig. 4 for the case
n 8.
It should be mentioned that we have applied
property (5) to reduce the shuffling inherited in the
original WHT algorithm to allow a uniform
hardware blocks as shown in Fig. 3 (b). We haven’t
modified the original complexity of the WHT that
are centered in the W2 blocks as shown in Fig. 3
and Fig. 4.
Fig. 2. The realization of the 1-d DCT
Applying property (1), equation (12) can be
C. The 1-d WHT
modified to
Our last example is the 1-d WHT. The
original 1-d WHT transform matrix is defined as [1], Wn  W2  Wn / 2  I 2 W2  Wn / 2 I n / 2
[2]  ( I 2  Wn / 2 ) (W2  I n / 2 ) (16)
W Wn / 2  1 1   ( I 2  Wn / 2 ) Qn
Wn   n / 2
W  , W2  
 1  1 ,
 (11)
 n / 2  Wn / 2   
Where, Qn  (W2  I n / 2 ) (17)
Where, W2 is the 2-point WHT. Let k  log 2 n , we
Equation (16) represents the two-stage
can write equation (11) in the iterative tensor- recursive tensor product formulation of the 1-d WHT
product form
(in this case j  2 ) in which the first stage is the pre-
Wn  W2  Wn / 2  W2  W2    W2
additions ( Qn ), followed by the second stage of the
k 1 (12)
  ( I i  W2  I k i 1 )
2 2
core computation I 2  Wn / 2  that consists of a
i 0 parallel blocks of two identical smaller WHT
which using property (4), can be modified to
computations each of size n / 2 as shown in Fig. 5.
k 1
Wn   Pn , 2i 1 ( I 2k 1  W2 ) Pn , 2k i 1 (13)
i 0 I. A GENERAL FRAMEWORK FOR 2-D RECURSIVE
As an example, we can express W8 as DSP TRANSFORMS
  
W8  P8 , 2 ( I 4  W2 ) P8 , 4 P8 , 4 ( I 4  W2 ) P8 , 2 . For a 2-d input data, X n1 , n 2 , of size n1  n2 , and

P8 , 8 ( I 4  W2 ) P8 ,1  a separable 2-d transform, Tn1 , n 2 , we can write the
(14) output, Yn1 , n 2 , in the form
The realization of W8 is shown in Fig. 3 (a).
Yn1 , n 2 = Yn1 , n 2 X n1 , n 2 (18)
Applying property (5) to equation (14)
and noting that now the permutations in two adjacent where, X n1 , n 2 and Yn1 , n 2 are the input and
stages can be grouped together into a single
output column-scanned vectors, respectively. For
permutation, the adjacent permutations P8,2 P8,8
separable matrices, the 2-d transform matrix
(from the first and the second stage) will be replaced Tn1 , n 2 can be written in the tensor product form as
by the single permutation P8,2 and the adjacent
[9]
permutations P8,4 P8,4 (from the second and the


$$ T_{n_1,n_2} = T_{n_1} \otimes T_{n_2} \qquad (19) $$

where $T_{n_1}$ and $T_{n_2}$ are the row and column 1-d transforms, respectively, as defined in (8). By replacing $T_{n_1}$ and $T_{n_2}$ by their corresponding values from equation (8) and applying properties (1) to (4), we derive the 2-d recursive form

$$ T_{n_1,n_2} = (\tilde{R}_{n_1,n_2})\,(I_{j^2} \otimes T_{n_1/2,n_2/2})\,(\tilde{Q}_{n_1,n_2}) \qquad (20) $$
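As a concrete instance of the 1-d factorization (15) and of the 2-d form (20) with $j = 2$, the WHT case can be verified numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the glue matrices `R_glue` and `Q_glue` are obtained here by applying properties (1), (2) and (4) to (16), and all helper names are hypothetical.

```python
def perm_matrix(n, s):
    """Stride permutation P_{n,s}: output position a*r + b takes input a + b*s."""
    r = n // s
    idx = [a + b * s for a in range(s) for b in range(r)]
    return [[1 if col == idx[row] else 0 for col in range(n)] for row in range(n)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def kron(A, B):
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

W2 = [[1, 1], [1, -1]]

def wht(n):
    """W_n as the repeated Kronecker product of W_2 (n a power of two), eq. (12)."""
    W = [[1]]
    while len(W) < n:
        W = kron(W2, W)
    return W

def wht_by_stages(n):
    """Eq. (15): k identical stages P_{n,2} (I_{n/2} (x) W_2)."""
    k = n.bit_length() - 1
    stage = matmul(perm_matrix(n, 2), kron(identity(n // 2), W2))
    out = identity(n)
    for _ in range(k):
        out = matmul(out, stage)
    return out

stages_match = wht_by_stages(8) == wht(8)

# 2-d WHT on a 4 x 4 input built from four 2 x 2 WHT blocks, form (20).
n1 = n2 = 4
Q1 = kron(W2, identity(n1 // 2))                      # eq. (17) for n1
Q2 = kron(W2, identity(n2 // 2))                      # eq. (17) for n2
lower = kron(wht(n1 // 2), wht(n2 // 2))              # lower-order 2-d WHT
R_glue = kron(perm_matrix(2 * n1, n1), identity(n2 // 2))
Q_glue = matmul(kron(perm_matrix(2 * n1, 2), identity(n2 // 2)), kron(Q1, Q2))
recursive_2d = matmul(matmul(R_glue, kron(identity(4), lower)), Q_glue)
two_d_match = recursive_2d == kron(wht(n1), wht(n2))
```

The middle factor `kron(identity(4), lower)` is block diagonal with four identical blocks, which is what allows the four lower-order transforms to run in parallel; the two glue factors contain only permutations and the additions of `Q1` and `Q2`.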


Here, $\tilde{Q}_{n_1,n_2}$ and $\tilde{R}_{n_1,n_2}$ are the 2-d pre- and post-processing glue structures, respectively, that combine $j^2$ of the lower-order (smaller size) 2-d transform $T_{n_1/2,n_2/2}$ of dimension $n_1/2 \times n_2/2$.

A. The 2-d convolution
Let $n_1 = 2^{\alpha_1}$ and $n_2 = 2^{\alpha_2}$. Pratt [9] has shown that for an $n_1 \times n_2$ input data image, the 2-d convolution output is given by

$$ q = C_{n_1,n_2}\,f \qquad (21) $$

where $C_{n_1,n_2}$ is the 2-d convolution transform matrix, and $q$ and $f$ are the output and input column-scanned vectors, respectively, of size $n = n_1 n_2$. Pratt has also shown that, for separable transforms, the matrix $C_{n_1,n_2}$ can be decomposed into the tensor form

$$ C_{n_1,n_2} = C(n_1) \otimes C(n_2) \qquad (22) $$

where $C(n_1)$ and $C(n_2)$ represent row and column 1-d convolution operators on $f$, respectively, as defined in (8) and (9). From (9) and (22), we can express the 2-d convolution matrix as a function of 1-d convolutions as follows [4]

$$ C_{n_1,n_2} = [R_1\,(I_3 \otimes C(n_1/2))\,Q_1] \otimes [R_2\,(I_3 \otimes C(n_2/2))\,Q_2] \qquad (23) $$

Applying property (1) leads to

$$ C_{n_1,n_2} = (R_1 \otimes R_2)\,[(I_3 \otimes C(n_1/2)) \otimes (I_3 \otimes C(n_2/2))]\,(Q_1 \otimes Q_2) = \tilde{R}\,\tilde{C}_{n_1,n_2}\,\tilde{Q} \qquad (24) $$

where

$$ \tilde{R} = (R_1 \otimes R_2), \qquad \tilde{C}_{n_1,n_2} = (I_3 \otimes C(n_1/2)) \otimes (I_3 \otimes C(n_2/2)), \qquad \tilde{Q} = (Q_1 \otimes Q_2) \qquad (25) $$

Note that the matrix $\tilde{C}_{n_1,n_2}$ contains the 1-d convolution matrices $C(n_1/2)$ and $C(n_2/2)$ in an involved tensor product expression. By applying property (2), we can write $\tilde{C}_{n_1,n_2}$ in (25) as

$$ \tilde{C}_{n_1,n_2} = (I_3 \otimes C(n_1/2) \otimes I_3) \otimes C(n_2/2) \qquad (26) $$

Applying property (4) yields

$$ \tilde{C}_{n_1,n_2} = \left( P_{9(n_1-1),\,3(n_1-1)}\,(I_9 \otimes C(n_1/2))\,P_{9(n_1/2),\,3} \right) \otimes C(n_2/2) \qquad (27) $$

Since the convolution matrix $C(n/2)$ is of dimension $[(n-1) \times n/2]$, we can write $C(n_2/2)$ as

$$ C(n_2/2) = I_{n_2-1} \cdot C(n_2/2) \cdot I_{n_2/2} \qquad (28) $$

Substituting (28) in (27) and applying property (1),

$$ \tilde{C}_{n_1,n_2} = \left( P_{9(n_1-1),\,3(n_1-1)}\,(I_9 \otimes C(n_1/2))\,P_{9(n_1/2),\,3} \right) \otimes \left( I_{n_2-1} \cdot C(n_2/2) \cdot I_{n_2/2} \right) $$
$$ = \left( P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right) \left( I_9 \otimes C(n_1/2) \otimes C(n_2/2) \right) \left( P_{9(n_1/2),\,3} \otimes I_{n_2/2} \right) \qquad (29) $$

Now, substituting (29) in (24) gives

$$ C_{n_1,n_2} = \tilde{\tilde{R}}\,(I_9 \otimes C_{n_1/2,n_2/2})\,\tilde{\tilde{Q}} \qquad (30) $$

where $C_{n_1/2,n_2/2} = C(n_1/2) \otimes C(n_2/2)$ is the lower order 2-d convolution matrix for an $n_1/2 \times n_2/2$ input image, and

$$ \tilde{\tilde{Q}} = (P_{9(n_1/2),\,3} \otimes I_{n_2/2})\,(Q_1 \otimes Q_2), \qquad \tilde{\tilde{R}} = (R_1 \otimes R_2)\,(P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1}) $$

are the new 2-d pre- and post-additions, respectively.
Equation (30) represents the recursive 2-d convolution algorithm. In this case we use 9 ($j^2 = 3^2 = 9$) of the lower-order $C_{n_1/2,n_2/2}$ convolution blocks in parallel to generate the higher order $C_{n_1,n_2}$ convolution, as shown in Fig. 6.

B. The 2-d DCT
Since the DCT matrix is separable, the 2-d DCT for an image of dimension $n_1 \times n_2$ can be computed by a stage of $n_2$ parallel 1-d DCT computations on $n_1$ points each, followed by another stage of $n_1$ parallel 1-d DCT computations on $n_2$ points each. This can be represented by the matrix-vector form

$$ X = T_{n_1,n_2}\,x \qquad (31) $$

where $T_{n_1,n_2}$ is the 2-d DCT transform matrix for an $n_1 \times n_2$ image, and $X$ and $x$ are the output and input column-scanned vectors, respectively.
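The equivalence between the two-stage row-column computation described above and a single tensor-product matrix acting on the column-scanned vector can be illustrated on a small image. This is a sketch under assumptions: it uses the orthonormal DCT-II normalization (the paper does not state which normalization it uses), and identifying `T_row` and `T_col` with the paper's $T_{n_1}$ and $T_{n_2}$ is an assumed reading of the scanning convention.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n (assumed normalization)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def column_scan(X):
    """Stack the columns of X into one vector (column-major order)."""
    return [X[r][c] for c in range(len(X[0])) for r in range(len(X))]

n_rows, n_cols = 4, 2
image = [[float(r * n_cols + c + 1) for c in range(n_cols)] for r in range(n_rows)]
T_col = dct_matrix(n_rows)   # 1-d DCT applied along every column
T_row = dct_matrix(n_cols)   # 1-d DCT applied along every row

# Row-column method: one stage of column DCTs, then one stage of row DCTs.
row_column = matmul(matmul(T_col, image), transpose(T_row))

# Tensor-product method: vec(A X B^T) = (B (x) A) vec(X) for column scanning.
x = column_scan(image)
T_2d = kron(T_row, T_col)
X_vec = [sum(t * v for t, v in zip(row, x)) for row in T_2d]

max_err = max(abs(a - b) for a, b in zip(X_vec, column_scan(row_column)))
```

The two results agree up to floating-point rounding, which is the sense in which the separable 2-d transform can be written as one matrix-vector product.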


Substituting (10) in (31), we have

$$ X = (T_{n_1} \otimes T_{n_2})\,x \qquad (32) $$

By further manipulation of equation (32), in a similar way to what we did to (23) for the 2-d convolution, we can write (32) as [3]

$$ X = \left( \tilde{R}_{n_1,n_2}\,(I_4 \otimes T_{n_1/2,n_2/2})\,\tilde{Q}_{n_1,n_2} \right) x \qquad (33) $$

where

$$ \tilde{Q}_{n_1,n_2} = (P_{2n_1,\,2} \otimes I_{n_2/2})\,Q_{n_1,n_2}, \qquad \tilde{R}_{n_1,n_2} = R_{n_1,n_2}\,(P_{2n_1,\,n_1} \otimes I_{n_2/2}) $$

Equation (33) represents the truly recursive 2-d DCT in which $\tilde{Q}_{n_1,n_2}$ and $\tilde{R}_{n_1,n_2}$ are the pre- and post-processing glue structures, respectively, that combine $2^2$ ($j^2 = 2^2 = 4$) identical lower-order 2-d DCT modules, each of size $n_1/2 \times n_2/2$, in parallel to construct the higher order 2-d DCT of size $n_1 \times n_2$.

IV. A GENERAL FRAMEWORK FOR M-D RECURSIVE DSP TRANSFORMS
We can extend the steps in deriving recursive formulae of the 1-d and the 2-d transforms to the multidimensional case. For an m-d transform $\bigotimes_{i=1}^{m} T_{n_i}$, the general form will be

$$ \bigotimes_{i=1}^{m} T_{n_i} = \hat{R}\,\left( I_{j^m} \otimes \bigotimes_{i=1}^{m} T_{n_i/2} \right)\,\hat{Q} \qquad (34) $$

where $\hat{Q}$ and $\hat{R}$ are the m-d pre- and post-processing glue structures that combine $j^m$ parallel blocks of the lower-order m-d transforms $\bigotimes_{i=1}^{m} T_{n_i/2}$.

A. The m-d WHT
We can extend the 2-d WHT derivation to the m-d case. From (12) and (34), the m-d WHT can be written in the tensor product form

$$ X = (W_{n_1} \otimes W_{n_2} \otimes \cdots \otimes W_{n_m})\,x = \left( \bigotimes_{i=1}^{m} W_{n_i} \right) x \qquad (35) $$

where $(W_{n_1} \otimes W_{n_2} \otimes \cdots \otimes W_{n_m})$ is the m-d WHT transform matrix for an m-d input, $W_{n_i}$ is the 1-d WHT coefficient matrix for an input vector of length $n_i$ as defined in (16), and $X$ and $x$ are the output and input column-scanned vectors, respectively. Using properties (1) to (4), we can write (35) in the form [2]

$$ X = \left[ \hat{R}\,\left( I_{2^m} \otimes \bigotimes_{i=1}^{m} W_{n_i/2} \right)\,\hat{Q} \right] x \qquad (36) $$

where $\bigotimes_{i=1}^{m} W_{n_i/2}$ is the lower order m-d WHT and

$$ \hat{Q} = \left( \prod_{i=1}^{m-1} \prod_{k=1}^{i} (P_{u_1,u_2} \otimes I_{u_3}) \right) \left( \bigotimes_{i=1}^{m} Q_i \right), \qquad \hat{R} = \prod_{i=1}^{m-1} \prod_{l=i}^{m-1} (P_{w_1,w_2} \otimes I_{w_3}) \qquad (37) $$

The permutation sizes $u_1, u_2, u_3$ and $w_1, w_2, w_3$ are products of the sizes $n_j$ scaled by powers of two, as specified in [2], and $Q_i$ is the 1-d pre-processing as defined by (17). Equation (37) extends our results by showing that a large m-d WHT can be computed from a single stage of smaller m-d WHTs.

V. CONCLUSIONS
In this paper, we presented a general approach for generating higher order (longer size) multidimensional (m-d) architectures from $2^m$ lower order (shorter size) architectures. We have shown several examples for common 1-d and 2-d transforms such as linear convolution, DCT, and WHT. We have extended our results to cover the m-d case as well. The objective of our work was to derive a unified framework and a design methodology that allows direct mapping of the proposed algorithms into reconfigurable architectures. The resulting circuits have a very simple modular structure and a regular topology that can be mapped directly to FPGAs.

REFERENCES
1. E. Cetin, O. N. Gerek, and S. Ulukus, "Block Wavelet Transforms for Image Coding," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 3, pp. 433-435, 1993.
2. A. Elnaggar and M. Aboelaze, "A Scalable Formulation for 2-D WHT," Proc. of the IEEE International Symposium on Circuits and Systems (ISCAS 2003), pp. IV484-IV487, Thailand, May 2003.
3. A. Elnaggar and H. M. Alnuweiri, "A New Multi-Dimensional Recursive Architecture for Computing the Discrete Cosine Transform," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 1, pp. 113-119, February 2000.


4. A. Elnaggar and M. Aboelaze, "An Efficient Architecture for Multi-Dimensional Convolution," IEEE Trans. on Circuits and Systems II, Vol. 47, No. 12, pp. 1520-1523, 2000.
5. A. Elnaggar and M. Aboelaze, "A Modified Shuffle Free Architecture for Linear Convolution," IEEE Trans. on Circuits and Systems II, Vol. 48, No. 9, pp. 862-866, 2001.
6. J. Granata, M. Conner, and R. Tolimieri, "A Tensor Product Factorization of the Linear Convolution Matrix," IEEE Trans. on Circuits and Systems, Vol. 38, pp. 1364-1366, 1991.
7. H. S. Hou, "A Fast Recursive Algorithm for Computing the Discrete Cosine Transform," IEEE Trans. on ASSP, Vol. ASSP-35, No. 10, 1987.
8. K. R. Rao and P. Yip, "Discrete Cosine Transform: Algorithms, Advantages, and Applications," Academic Press, 1990.
9. W. K. Pratt, Digital Image Processing, John Wiley & Sons, Inc., 1991.
10. R. Tolimieri, M. An, and C. Lu, Algorithms for Discrete Fourier Transform and Convolution, Springer-Verlag, New York, 1989.
