Upcoming SlideShare
×

Fx3111501156

80

Published on

0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total Views
80
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
1
0
Likes
0
Embeds 0
No embeds

No notes for slide

Fx3111501156

1. 1. Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1150-1156 Embedded Reconfigurable Architectures for Multidimensional Transforms Ayman Elnaggar, Mokhtar Aboelaze Department of Computer Science & Engineering, German University in Cairo, New Cairo City, Egypt Department of Computer Science, York University, Toronto, Canada M3J 1P3Abstract This paper presents a general approach AB  CD  ( A  B)(C  D) (1)for generating higher order (longer size)multidimensional (m-d) architectures from 2 m ( A  B)  C  A  ( B  C) (2)lower order (shorter sizes) architectures. The If n  n1n2 , thenobjective of our work is to derive a unifiedframework and a design methodology that allows An1  Bn2  Pn,n1 ( I n2  An1 ) Pn,n2 ( I n1  Bn2 ) (3)direct mapping of the proposed algorithms into If n  n1n2 n3 , thenembedded reconfigurable architectures such asFPGAs. Our methodology is based on I n1  An2  I n3  Pn, n1n2 ( I n1n3  An2 ) Pn, n3 (4)manipulating tensor product forms so that they If 2n  n1n2 , thencan be mapped directly into modular parallel Pn, 2  Pn, n1 Pn, n2 (5)architectures. The resulting circuits have verysimple modular structure and regular topology. Where  denotes the tensor product, I n is the identity matrix of size n, and Pn, s , theKeywords – Reconfigurable Architectures,Recursive algorithms, multidimensional transforms, permutation matrix, is n  n binary matrix whosetensor products, permutation matrices. entries are zeroes and ones, such that each row or column of has a single 1 entry. If n  rs then Pn, sI.INTRODUCTION n This paper proposes an efficient and cost- is an n  n binary matrix specifying an -shuffleeffective general methodology for mapping smultidimensional transforms onto efficient (or s-stride) permutation. The effect of thereconfigurable architectures such as FPGAs. The permutation matrix Pn, s on an input vector X n ofmain objective of this paper is to derive a designmethodology and recursive formulation for the length n is to shuffle the elements of X n bymultidimensional transforms which is useful for the grouping all the r elements separated by distance strue modularization and parallelization of the together. The first r element will beresulting computation. x0 , x s , x 2 s ,  , x( r 1) s , the next r elements are Our methodology employs tensor product(or Kronecker products) decompositions and x1 , x1 s , x1 2 s ,  , x1( r 1) s , and so on.permutation matrices as the main tools forexpressing the general framework for The main result reported in this papermultidimensional DSP transforms. We employ shows that a large two-dimensional (2-d)several techniques to manipulate such computation for a given DSP transform on an n  ndecompositions into suitable recursive expressions input array can be decomposed recursively into threewhich can be mapped efficiently onto reconfigurable stages as shown in Fig. 1 for the case n  4 . TheFPGAs structures. middle stage is constructed recursively from 22Our work is based on a non-trivial generalization of parallel (data-independent) blocks each realizing athe one-dimensional DSP transforms. It has been smaller-size computation of the same DSPshown that when coupled with stride permutation transform. The pre-additions and the post-matrices, tensor products provide a unifying permutations stages serve as "glue" circuits thatframework for describing and programming a wide combine the 22 lower order blocks to construct therange of fast recursive algorithms for various higher order architecture. We also show that thetransform. This unifying framework is suited for proposed unified approach can be extended such thatparallel processing machines and vector processing an m-d DSP transform can be constructed from 2mmachines [6], [10]. smaller size m-d ones. The objective of our work isSome of the tensor product properties that will be to derive a unified framework and a designused throughout this paper are [6], [10]: methodology that allows direct mapping of the 1150 | P a g e
2. 2. Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1150-1156proposed algorithms into reconfigurable FPGAs 1 0  1 0 0 A  1 1  ,architectures.   B  1 1  1 Observe that, we have drawn our networks 0 1  0 0 1such that data flows from right to left. We chose this  convention to show the direct correspondence In this case j  3 (three parallel blocks of thebetween the derived algorithms and the proposed convolution of smaller size n / 2) ). The realizationreconfigurable architecture. of the 1-d linear convolution is shown in Fig. 1.I. A GENERAL FRAMEWORK FOR 1-D RECURSIVEDSP TRANSFORMS In this section, we present a generalframework to derive recursive formulations formultidimensional transforms. Given a 1-d DSPalgorithm in a matrix-vector form Ym  Tm, n X n (6) Where, Tm, n is the transform matrix, X nand Ym are the input vector of size n and the Fig. 1. The realization of the 1-d linear convolutionoutput vector of size m , respectively. Then, usingsparse matrix factorization approach [9], the matrix B. The 1-d DCTTm, n can be factorized so thatTm, n  T1 T2 Tk (7) The 1-d DCT Tn of size n can be written as [3], [7] Where, each of the matrices T1 , T2 ,, Tk Tn  Rn ( I 3  Tn / 2 ) Qn . (10)is sparse. Sparseness implies that either most of the Whereelements of the matrix are zeros or the matrix in the Rn  Pn, n / 2 ( I n / 2  Ln / 2 ) ,block diagonal form. By applying tensor productproperty (3) to the block diagonal matrices of Qn  ( I 2 V 1 ) ( I n / 2  Cn / 2 ) ( F2  I n / 2 ) Vn . n/2equation (7), we have  1  C n  diag ,Tm, n  ( Ri ) ( I j  Tm / 2 , n / 2 ) (Qk ) (8)  2 cosn   n  (4M  1), M  0, 1, , n  1, Where, Q k and Ri are the pre- and post- 2nprocessing glue structure that combine j blocks in 1 1 0 0  0 0parallel of the lower-order transform of size 0 1 1 0  0 0Tm / 2 , n / 2 .   0 0 1 1  0 0 Lm    ,A. The 1-d Linear Convolution    0    The 1-d linear convolution matrix C (n) of 0 0 0 0  1 0   size n  2 , where  is an integer can be written 0 0 0 0  0 1  as [5], [6] Vn  ( I n / 2  J n / 2 ) Pn,2 ,C (n)  Rn ( I 3 C (n / 2) ) Qn . (9) 1 1  F2   , 1  1 Where, I n / 2 is the identity matrix of dimension n / 2 , Qn ( P  1   2 ) 1[( I  1  A)P   1 ]. 2 3,2 3 2 2 ,2 is the direct sum operator, J n / 2 is the exchangeRn  R  k ( P   B) P  3( 2 1),3 2 1 3( 2 1),( 2 1) (I ) 2 matrix of order n / 2 defined as 1151 | P a g e
3. 3. Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1150-1156 0 0  0 0 third stage) will be replaced by the single 0 0  1 0 permutation P8,2 as shown in Fig. 3 (b). Similarly,   equation (13) can be simplified toJn / 2        , k 1   Wn   Pn , 2 ( I 2k 1  W2 ) (15) 0 1  0 0 i 0 1  0  0 0  Thus, Wn can be computed by the cascaded product of k similar stages (independent ofIn this case j  2 (two parallel blocks of the DCT of i) of double matrix products instead of the triple matrix products in equation (8). Alternatively, wesmaller size n / 2) ). The realization of the 1-d DCT can realize (15) by a single block ofis shown in Fig. 2. Pn , 2 ( I 2k 1  W2 ) and take the output after k iterations that allows a hardware saving without slowing down the processing speed and reduction in the hardware size as shown in Fig. 4 for the case n 8. It should be mentioned that we have applied property (5) to reduce the shuffling inherited in the original WHT algorithm to allow a uniform hardware blocks as shown in Fig. 3 (b). We haven’t modified the original complexity of the WHT that are centered in the W2 blocks as shown in Fig. 3 and Fig. 4.Fig. 2. The realization of the 1-d DCT Applying property (1), equation (12) can beC. The 1-d WHT modified to Our last example is the 1-d WHT. Theoriginal 1-d WHT transform matrix is defined as [1], Wn  W2  Wn / 2  I 2 W2  Wn / 2 I n / 2[2]  ( I 2  Wn / 2 ) (W2  I n / 2 ) (16) W Wn / 2  1 1   ( I 2  Wn / 2 ) QnWn   n / 2 W  , W2    1  1 ,  (11)  n / 2  Wn / 2    Where, Qn  (W2  I n / 2 ) (17)Where, W2 is the 2-point WHT. Let k  log 2 n , we Equation (16) represents the two-stagecan write equation (11) in the iterative tensor- recursive tensor product formulation of the 1-d WHTproduct form (in this case j  2 ) in which the first stage is the pre- Wn  W2  Wn / 2  W2  W2    W2 additions ( Qn ), followed by the second stage of the k 1 (12)   ( I i  W2  I k i 1 ) 2 2 core computation I 2  Wn / 2  that consists of a i 0 parallel blocks of two identical smaller WHTwhich using property (4), can be modified to computations each of size n / 2 as shown in Fig. 5. k 1Wn   Pn , 2i 1 ( I 2k 1  W2 ) Pn , 2k i 1 (13) i 0 I. A GENERAL FRAMEWORK FOR 2-D RECURSIVEAs an example, we can express W8 as DSP TRANSFORMS   W8  P8 , 2 ( I 4  W2 ) P8 , 4 P8 , 4 ( I 4  W2 ) P8 , 2 . For a 2-d input data, X n1 , n 2 , of size n1  n2 , and  P8 , 8 ( I 4  W2 ) P8 ,1  a separable 2-d transform, Tn1 , n 2 , we can write the (14) output, Yn1 , n 2 , in the formThe realization of W8 is shown in Fig. 3 (a). Yn1 , n 2 = Yn1 , n 2 X n1 , n 2 (18) Applying property (5) to equation (14)and noting that now the permutations in two adjacent where, X n1 , n 2 and Yn1 , n 2 are the input andstages can be grouped together into a single output column-scanned vectors, respectively. Forpermutation, the adjacent permutations P8,2 P8,8 separable matrices, the 2-d transform matrix(from the first and the second stage) will be replaced Tn1 , n 2 can be written in the tensor product form asby the single permutation P8,2 and the adjacent [9]permutations P8,4 P8,4 (from the second and the 1152 | P a g e
4. 4. Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1150-1156Tn1 , n 2  Tn1  Tn 2 (19) Where Tn1 and T n 2 are the row and column 1- Applying properties (1) to (4) to derive the 2-d recursive formd transforms, respectively as defined in (8). Byreplacing Tn1 and T n 2 by their corresponding ~ ~ Tn1, n2  ( Rn1, n2 ) ( I 2  Tn1 / 2, n2 / 2 ) (Qn1, n2 ) (20) jvalues from equation (8) and 1153 | P a g e
5. 5. Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1150-1156 ~ ~ Since the convolution matrix C(n/2) is of dimensionWhere, Qn , n and Rn , n are the 2-d pre- and 1 2 1 2 [(n  1)  n / 2 ] , we can write C(n2 / 2) aspost-processing glue structure, respectively that C(n2 / 2)  I n2 1 .C(n2 / 2) . I n2 / 2 (28)combine j 2 of the lower-order (smaller size) 2-d Substituting (21) in (20) and applying property (1),transform Tn1 / 2, n 2 / 2 of dimension ~ C n1 , n 2  ( P9( n1 1),3( n1 1) ( I 9 C (n1 / 2) n1 / 2  n2 / 2 . P9( n1 / 2),3 )  ( I n 2 1.C (n 2 / 2). I n 2 / 2 )A. The 2-d convolution ( P9( n1 1),3( n1 1)  I n 2 1 ) Let n1  21 and n2  2 2 . Pratt [9] has ( I 9 C (n1 / 2) C (n 2 / 2))shown that for an n1 n2 input data image, the 2-d ( P9( n1 / 2),3  I n 2 / 2 ).convolution output is given by (29) q  Cn1, n2 f (21) Now, substituting (29) in (24) gives ~ ~ ~ ~Where, Cn1 ,n2 is the 2-d convolution transform Cn1 ,n2  R ( I 9  Cn1 / 2,n2 / 2 ) Q , (30)matrix; and q and f are the output and input column-scanned vectors, respectively of size n  n1 n2 . Where, Cn1 / 2, n2 / 2  C(n1 / 2)  C(n2 / 2) is thePratt has also shown that, for separable transforms, lower order 2-d convolution matrix for anthe matrix Cn1, n2 can be decomposed into the n1 / 2  n2 / 2 input image, ~ ~tensor form Q  ( P9( n1 / 2),3  I n2 / 2 ) (Q1  Q2 ) and Cn1,n2  C ( n1)C ( n2 ) (22) ~ ~ R ( R1 R2 ) ( P9(n1 1),3(n1 1) I n 2 1 ) areWhere, C (n1 ) and C (n 2 ) represent row andcolumn 1-d convolution operators on f, respectively, the new 2-d pre- and post-additions, respectively.as defined in (8) and (9). From (9) and (22), we can Equation (30) represents the recursive 2-dexpress the 2-d convolution matrix as a function of convolution algorithm. In this case we use 9 (1-d convolutions as follows [4] j 2  32  9 ) of the lower-order C n1 / 2 , n2 / 2C n1 , n2  [ R1 ( I 3 C (n1 / 2)Q1 ][ R2 ( I 3 C (n2 / 2)Q2 ] convolution blocks in parallel to generate the higher (23) order Cn1 , n2 convolution as shown before in Fig. 6.Applying property (1), leads to C n1 , n 2  [( R1  R2 )(( I 3 C (n1 / 2)) (( I 3 C ( n 2 / 2))(Q1 Q2 )], (24) ~~ ~  R C n1 , n 2 Q.Where, ~ R ( R1  R2 ), ~ C n1 , n 2  ( I 3 C (n1 / 2))( I 3 C (n2 / 2)) , ~ Q (Q1 Q2 )]. B. The 2-d DCT (25) Since the DCT matrix is separable, the 2-d ~ Note that the matrix Cn1 ,n2 contains the 1-d DCT for an image of dimension n1  n 2 can beconvolutions matrices C(n1 / 2) and C(n2 / 2) in an computed by a stage of n 2 parallel 1-d DCTinvolved tensor product expression. By applying computations on n1 points each, followed byproperty (2), we can write (24) as ~ another stage of n1 parallel 1-d DCT computations C n , n  (( I 3 C (n1 / 2) I 3 ) C (n2 / 2)) , (26) 1 2 on n 2 points each. This can be represented by theApplying property (4), yields to ~ matrix-vector formC n1 , n 2  ( P9(n1 1),3(n1 1) X  Tn1 , n 2 x , (31) ( I 9 C (n1 / 2) P9(n1 / 2),3 ) C (n2 / 2)) , Where Tn1 , n 2 is the 2-d DCT transform matrix for (27) an n1  n 2 image, X and x are the output and input column-scanned vectors, respectively. By 1154 | P a g e
6. 6. Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1150-1156substituting (10) in (31), we have X  (Tn1  Tn 2 ) x . (32) ˆ m ˆ X  [ R (I m  Wni / 2 ) Q] x , (36)By further manipulation of equation (32) in a similar 2 i1way to that we did to (23) of the 2-d convolution, mwe can write (23) as [3] Where  Wni / 2 is the lower order m-d WHT and i 1 ~ ~X  (Rn1 , n2 (I 4  Tn1 / 2, n2 / 2 ) Qn1, n2 ) x (33)  m 1 i  m Where, Q     Pu1 ,u 2  I u3     Qi  , ˆ   k 1    i 1 ~  i 1   Qn1 , n2  ( P2n1 ,2  I n2 / 2 ) Qn1 , n2 , and (37) m 1 m 1 ~ R    Pw1 , w2  I w3  , ˆRn1 , n2  Rn1 , n2 ( P2n1 , n1  I n2 / 2 ) . i 1   l i  Equation (33) represents the truly recursive 2-d m i 1 i ~ ~ u1  2  n j , u 2  2 , u 3   (n m  j 1 ),DCT in which Qn1 , n2 and Rn1 , n 2 are the pre- j 1 2 j 1and post-processing glue structures, respectively, 1 i i 1 i 2 w1   (n k ), w2   (n k ), w3   (n j ) ,that combine 2 ( j 2  2 2  4 ) identical lower- 2 k 1 k 1 2 j  i 1order 2-d DCT modules each of size n1 / 2  n2 / 2 in parallel, to construct the higher Qi is the 1-d pre-processing as defined by (17). Equation (37) extends our results by showing that aorder 2-d DCT of size n1  n 2 . large m-d WHT can be computed from a single stage of smaller m-d WHTs.II. A GENERAL FRAMEWORK FOR M-DRECURSIVE DSP TRANSFORMS III. CONCLUSIONS We can extend the steps in deriving In this paper, we presented a generalrecursive formulae of the 1-d and the 2-d transforms approach for decomposing higher order (longer size)to the multidimensional case. For an m-d transform multidimensional (m-d) architectures from 2 mTni ,The general form will be lower order (shorter sizes) architectures. We have m m shown several examples for the 1-d and 2-d ˆ ˆ  Tn  R ( I m  Tni / 2 ) Q (34) common transforms such as linear convolution, i 1 i j i 1 DCT, and WHT. We have extended our results to ˆ ˆWhere, Q and R are the m-d pre- and post- cover the m-d case as well. The objective of our work was to derive a unified framework and aprocessing glue structures that combine j m design methodology that allows direct mapping of the proposed algorithms into reconfigurableparallel blocks of the lower-order m-d transforms of architectures. The resulting circuits have verysize Tni / 2 . simple modular structure and regular topology that can be mapped directly to FPGAs.A. The m-d WHT We can extend the 2-d WHT derivation to REFERENCESthe m-d case. From (12) and (34), the m-d WHT can 1. E. Cetin, O. N. Gerek, and S. Ulukus,be written in the tensor product form "Block Wavelet Transforms for Image Coding," IEEE Trans. on Circuits and X  (Wn1  Wn 2    Wn m ) x systems for Video Technology, Vol. 3, pp. m (35) 433-435, 1993.  (  Wn i ) x 2. Elnaggar, Mokhtar Aboelaze, “A Scalable i 1 Formulation for 2-D WHT,” Proc. of the IEEE International Symposium on CircuitsWhere, (Wn1  Wn2    Wnm ) is the m-d and Systems (ISCAS 2003), pp IV484- IV487, Thailand, May 2003.WHT transform matrix for an m-d input, Wn is the 3. Elnaggar, H. M. Alnuweiri, "A New Multi- i1-d WHT coefficient matrix for an input vector of Dimensional Recursive Architecture for Computing The Discrete Cosinelength n i as defined in (16), X and x are the output Transform," IEEE Transactions on Circuitsand input column-scanned vectors, respectively. and Systems for Video Technology, Vol.Using properties (1) to (4), we can write (35) in the 10, No. 1, pp. 113-119, February 2000.form [2] 1155 | P a g e
7. 7. Ayman Elnaggar, Mokhtar Aboelaze / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 1, January -February 2013, pp.1150-11564. Elnaggar and M. Aboelaze, "An Efficient Architecture for Multi-Dimensional Convolution," IEEE Trans. on Circuits and Systems II, Vol. 47, No. 12, pp. 1520-1523, 2000.5. Elnaggar and M. Aboelaze, “A Modified Shuffle Free Architecture for Linear Convolution,” IEEE Trans. on Circuits and Systems II, Vol. 48, No. 9, pp. 862-866, 2001.6. J. Granata, M. Conner, R. Tolimieri, "A Tensor Product Factorization of the Linear Convolution Matrix", IEEE Trans on Circuits and Systems, Vol. 38, p. 1364--6, 1991.7. H. S. Hou, "A Fast Recursive Algorithm for Computing the Discrete Cosine Transform," IEEE Trans. On ASSP, Vol. Assp-35, No. 10, 1987.8. K. R. Rao, P. Yip, "Discrete Cosine Transform: Algorithms, Advantages, and Applications," Academic Press, 1990.9. W. K. Pratt, Digital Image Processing, John Wiley & Sons, Inc., 1991.10. R. Tolimieri, M. An, C. Lu, Algorithms for Discrete Fourier Transform and Convolution, Springer-Verlag, New York 1989. 1156 | P a g e