
Wiener Filter Hardware Realization




Wiener Filter Realization using Hardware
QR decomposition of matrices and inversion by Givens' Rotation

7th Semester Project Report

Akashdip Das
Abantika Chowdhury
Sayan Chaudhuri

Guide: Dr. Ayan Banerjee
Electronics and Telecommunication Engineering Department
December, 2016
Contents

1 Abstract
2 Introduction
3 Wiener Filtering
4 Q-R decomposition of a matrix
5 Hardware for inversion of an upper triangular matrix (R)
  5.1 Storage in a RAM
  5.2 Address generation mechanism
  5.3 Hardware for finding the inverse of diagonal elements
  5.4 Hardware for finding the inverse of the other elements
6 Conclusions
  6.1 Multi-port RAM for faster performance
  6.2 Distributed arithmetic for computing the product of the two matrices
7 Acknowledgements
1 Abstract

Super-resolution reconstruction is a method for reconstructing higher resolution images from a set of low resolution observations. The sub-pixel differences among different observations of the same scene make it possible to create higher resolution images of better quality. In the last thirty years, many methods for creating high resolution images have been proposed; however, hardware implementations of such methods are limited. Wiener filter design is one of the techniques we will use initially for this process. Wiener filter design involves matrix inversion, and a novel method for the matrix inversion is proposed in this report. The computational algorithm used will be QR decomposition via Givens rotation.

2 Introduction

The process of super resolution initially requires that the image be restored from the effects of noise and degradation (assumed isotropic). For that purpose the Wiener filter is used, which forms an estimate of the original image from the degraded one. The fundamentals of Wiener filtering are discussed in Section 3. Wiener filtering requires the inverse of a given matrix; the method followed here is QR decomposition (discussed in Section 4). QR decomposition produces an upper triangular matrix, which we invert in the proposed algorithm. Various techniques for decomposition of the matrix have been discussed in papers [3], [4]. However, the matrix inversion proposed there was not a general solution to the problem; rather, the solution was illustrated for a specific 3x3 system. QR decomposition factors a matrix into an upper triangular matrix and an orthogonal matrix. The inverse of an orthogonal matrix is simply its transpose. The inversion of the upper triangular matrix is discussed in this report; the solutions available for this step cover only 3x3 or 4x4 systems, so in this report we generalize the inversion to an n x n system. The hardware required for this purpose is developed in Section 5, along with sound reasoning and justification. The hardware developed has scope for enhanced performance, which is discussed in Section 6.
3 Wiener Filtering

In signal processing, the Wiener filter is a filter used to produce an estimate of a desired or target random process by linear time-invariant (LTI) filtering of an observed noisy process, assuming known stationary signal and noise spectra and additive noise. The Wiener filter minimizes the mean square error between the estimated random process and the desired process. The goal of the Wiener filter is to compute a statistical estimate of an unknown signal using a related signal as an input and filtering that known signal to produce the estimate as an output. For example, the known signal might consist of an unknown signal of interest that has been corrupted by additive noise; the Wiener filter can then be used to filter out the noise from the corrupted signal and provide an estimate of the underlying signal of interest. The Wiener filter is based on a statistical approach built on the MMSE (Minimum Mean Square Error) criterion. The causal finite impulse response (FIR) Wiener filter, instead of using some given data matrix X and output vector Y, finds optimal tap weights by using the statistics of the input and output signals. It populates the input matrix X with estimates of the auto-correlation of the input signal (T) and populates the output vector Y with estimates of the cross-correlation between the output and input signals (V). To derive the coefficients of the Wiener filter, consider the signal w[n] being fed to a Wiener filter of order N with coefficients {a_0, ..., a_N}. The output of the filter, denoted x[n], is given by

x[n] = \sum_{i=0}^{N} a_i w[n-i].

The residual error is denoted e[n] and is defined as e[n] = x[n] - s[n], where s[n] is the desired signal (see the corresponding block diagram). The Wiener filter is designed so as to minimize the mean square error (MMSE criterion), which can be stated concisely as

a_i = \arg\min E[e^2[n]],

where E[·] denotes the expectation operator.
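The filter output and MMSE criterion above can be checked numerically. The following pure-Python sketch is illustrative only: the signals and candidate coefficients are made up for the example, not taken from the report.

```python
# Sketch of the FIR Wiener filter equations (illustrative signals/coefficients).
# x[n] = sum_{i=0}^{N} a_i * w[n-i]; samples before n = 0 are taken as zero.
def fir_output(a, w):
    N = len(a) - 1
    return [sum(a[i] * (w[n - i] if n - i >= 0 else 0.0)
                for i in range(N + 1))
            for n in range(len(w))]

# Empirical mean square error E[e^2[n]] with e[n] = x[n] - s[n].
def mse(a, w, s):
    x = fir_output(a, w)
    return sum((xi - si) ** 2 for xi, si in zip(x, s)) / len(s)

if __name__ == "__main__":
    s = [1.0, 2.0, 3.0, 4.0]        # hypothetical desired signal
    w = [1.1, 1.9, 3.2, 3.8]        # hypothetical noisy observation
    print(mse([1.0, 0.0], w, s))    # identity filter: passes w through
    print(mse([0.0, 0.0], w, s))    # all-zero filter: error is s itself
```

Different coefficient vectors a give different empirical MSE values; the Wiener solution derived next is the vector that minimizes this quantity.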
In the general case, the coefficients a_i may be complex and may be derived for the case where w[n] and s[n] are complex as well. With a complex signal, the matrix to be solved is a Hermitian Toeplitz matrix rather than a symmetric Toeplitz matrix. For simplicity, the following considers only the case where all these quantities are real. The mean square error (MSE) may be rewritten as:
E[e^2[n]] = E[(x[n] - s[n])^2]
          = E[x^2[n]] + E[s^2[n]] - 2E[x[n]s[n]]
          = E[(\sum_{i=0}^{N} a_i w[n-i])^2] + E[s^2[n]] - 2E[\sum_{i=0}^{N} a_i w[n-i] s[n]]

To find the vector [a_0, ..., a_N] which minimizes the expression above, calculate its derivative with respect to each a_i:

\frac{\partial}{\partial a_i} E[e^2[n]] = 2E[(\sum_{j=0}^{N} a_j w[n-j]) w[n-i]] - 2E[s[n]w[n-i]]
                                        = 2\sum_{j=0}^{N} E[w[n-j]w[n-i]] a_j - 2E[w[n-i]s[n]]

Assuming that w[n] and s[n] are each stationary and jointly stationary, the sequences R_w[m] and R_{ws}[m], known respectively as the autocorrelation of w[n] and the cross-correlation between w[n] and s[n], can be defined as follows:

R_w[m] = E{w[n] w[n+m]}
R_{ws}[m] = E{w[n] s[n+m]}

The derivative of the MSE may therefore be rewritten as (note that R_{ws}[-i] = R_{sw}[i]):

\frac{\partial}{\partial a_i} E[e^2[n]] = 2\sum_{j=0}^{N} R_w[j-i] a_j - 2R_{sw}[i],   i = 0, ..., N.

Setting the derivative equal to zero results in

\sum_{j=0}^{N} R_w[j-i] a_j = R_{sw}[i],   i = 0, ..., N,

which can be rewritten in matrix form, T a = v:

[ R_w[0]   R_w[1]    ...  R_w[N]   ] [ a_0 ]   [ R_{sw}[0] ]
[ R_w[1]   R_w[0]    ...  R_w[N-1] ] [ a_1 ] = [ R_{sw}[1] ]
[ ...      ...       ...  ...      ] [ ... ]   [ ...       ]
[ R_w[N]   R_w[N-1]  ...  R_w[0]   ] [ a_N ]   [ R_{sw}[N] ]

These equations are known as the Wiener-Hopf equations. The matrix T appearing in the equation is a symmetric Toeplitz matrix. Under suitable conditions on R_w, these matrices are known to be positive definite and therefore non-singular, yielding a unique solution for the Wiener filter coefficient vector,

a = T^{-1} v.

It is this equation that makes it necessary to design matrix inversion hardware that is faster than the existing ones, so that there is less delay in image processing, and that generalizes to an N x N system. The inversion of the matrix will be done here using QR decomposition via Givens rotation.

4 Q-R decomposition of a matrix

QR decomposition is one of the most important operations in linear algebra. It can be used to find a matrix inverse, to solve a set of simultaneous equations, and in numerous applications in scientific computing. It represents one of the relatively small number of matrix operation primitives from which a wide range of algorithms can be realized. QR decomposition is an elementary operation which decomposes a matrix into an orthogonal and a triangular matrix. The QR decomposition of a real square matrix A is a factorization A = QR, where Q is an orthogonal matrix (Q^T Q = I) and R is an upper triangular matrix. We can also factor m x n matrices (with m >= n) of full rank as the product of an m x n orthogonal matrix (Q^T Q = I) and an n x n upper triangular matrix. There are different methods which can be used to compute the QR decomposition: the Gram-Schmidt orthonormalization method, Householder reflections, and Givens rotations. Each decomposition method has a number of advantages and disadvantages because of its specific solution process. The Givens rotation technique is discussed here. If there are two nonzero vectors, x and y, in a plane, the angle θ between them satisfies

cos(θ) = (x, y) / (||x||_2 ||y||_2).

The rotation will be performed using a 16-bit pipelined CORDIC.
This formula can be extended to n vectors. The angle θ can be defined as

θ = arccos( (x, y) / (||x||_2 ||y||_2) ).

Note that (A^{-1})^{-1} = A, and A = QR, where R is an upper triangular matrix and Q is an orthogonal matrix, so that I = Q Q^T.

Consider a 4x4 system:

A = [ a_{1,1} a_{1,2} a_{1,3} a_{1,4} ]
    [ a_{2,1} a_{2,2} a_{2,3} a_{2,4} ]
    [ a_{3,1} a_{3,2} a_{3,3} a_{3,4} ]
    [ a_{4,1} a_{4,2} a_{4,3} a_{4,4} ]

After triangularization, R has the form

R = [ r_{1,1} r_{1,2} r_{1,3} r_{1,4} ]
    [ 0       r_{2,2} r_{2,3} r_{2,4} ]
    [ 0       0       r_{3,3} r_{3,4} ]
    [ 0       0       0       r_{4,4} ]

The matrix of a Givens rotation (shown here acting in the (2, 3) plane of a 4x4 system) is

G(i, j, θ) = [ 1   0        0        0 ]
             [ 0   cos(θ)   sin(θ)   0 ]
             [ 0  -sin(θ)   cos(θ)   0 ]
             [ 0   0        0        1 ]

The Givens rotation process applies a cycle of rotations, each of which nulls one element in the sub-diagonal part of the matrix, forming the QR factorization. The Q matrix is obtained by concatenating all the Givens rotations. For a 3x3 system, R is found from three rotations, each of which nulls one element. The Givens rotation matrices needed for a 3x3 system are

G1 = [ cos(θ1)   0   sin(θ1) ]
     [ 0         1   0       ]
     [ -sin(θ1)  0   cos(θ1) ]

G2 = [ cos(θ2)   sin(θ2)  0 ]
     [ -sin(θ2)  cos(θ2)  0 ]
     [ 0         0        1 ]

G3 = [ 1   0         0       ]
     [ 0   cos(θ3)   sin(θ3) ]
     [ 0  -sin(θ3)   cos(θ3) ]

The rotations null A(3,1), A(2,1) and A(3,2) in turn. With A1 = A, A2 = G1 A1 and A3 = G2 A2, the cosines and sines are obtained as

c1 = A1(1,1) / sqrt(A1(1,1)^2 + A1(3,1)^2)
s1 = A1(3,1) / sqrt(A1(1,1)^2 + A1(3,1)^2)
c2 = A2(1,1) / sqrt(A2(1,1)^2 + A2(2,1)^2)
s2 = A2(2,1) / sqrt(A2(1,1)^2 + A2(2,1)^2)
c3 = A3(2,2) / sqrt(A3(2,2)^2 + A3(3,2)^2)
s3 = A3(3,2) / sqrt(A3(2,2)^2 + A3(3,2)^2)

Q = G1^T G2^T G3^T
A2 = G1 A1
A3 = G2 A2
R = G3 A3

A = QR
A^{-1} = (QR)^{-1} = R^{-1} Q^{-1} = R^{-1} Q^T

This necessitates forming the inverse of the upper triangular matrix and its subsequent multiplication by the transpose of the orthogonal matrix.

Figure 1: Basic hardware for matrix inversion using QR decomposition. The G matrix is formed using Givens rotation performed with CORDIC.
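The rotation sequence above can be checked numerically. The sketch below is a floating-point software model (pure Python, not the CORDIC hardware) that triangularizes a 3x3 matrix with three Givens rotations in the order described in the text and verifies that Q R reproduces A; the example matrix is made up for illustration.

```python
import math

def givens(n, i, j, c, s):
    """n x n identity with a plane rotation embedded in rows/columns i, j."""
    g = [[1.0 if p == q else 0.0 for q in range(n)] for p in range(n)]
    g[i][i], g[j][j] = c, c
    g[i][j], g[j][i] = s, -s
    return g

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def qr_givens_3x3(a):
    """Null A(3,1), A(2,1), A(3,2) in turn; accumulate Q = G1^T G2^T G3^T."""
    q = [[1.0 if p == t else 0.0 for t in range(3)] for p in range(3)]
    r = [row[:] for row in a]
    for (i, j) in [(0, 2), (0, 1), (1, 2)]:       # planes of G1, G2, G3
        h = math.hypot(r[i][i], r[j][i])
        c, s = r[i][i] / h, r[j][i] / h           # c_k, s_k from the current A_k
        g = givens(3, i, j, c, s)
        r = matmul(g, r)                          # A_{k+1} = G_k A_k
        q = matmul(q, [[g[p][t] for p in range(3)] for t in range(3)])  # append G_k^T
    return q, r

if __name__ == "__main__":
    a = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    q, r = qr_givens_3x3(a)
    print(matmul(q, r))   # reproduces a up to rounding
```

The same model extends to n x n by sweeping (i, j) over all sub-diagonal positions column by column.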
5 Hardware for inversion of an upper triangular matrix (R)

We have designed the hardware for inversion of a generalised N x N upper triangular matrix R, where

R = [ r_{1,1} r_{1,2} ... r_{1,n} ]
    [ 0       r_{2,2} ... r_{2,n} ]
    [ ...     ...     ... ...     ]
    [ 0       0       ... r_{n,n} ]

Let B be R^{-1}. The algorithm is as follows:

for row = 1 to n
    B(row, row) = 1 / R(row, row)
next row
for row = 1 to n
    for col = row + 1 to n
        s = 0
        for k = row to col - 1
            s = s + B(row, k) * R(k, col)
        next k
        B(row, col) = -s / R(col, col)
    next col
next row

We observe that the inverse of an upper triangular matrix is also upper triangular, with diagonal elements that are the reciprocals of the diagonal elements of the original matrix. The inverses of the other elements are calculated recursively using the algorithm above. An example illustrating how the algorithm works is shown below. Let A be an upper triangular matrix and B its inverse:

A = [ a_{1,1} a_{1,2} a_{1,3} ... a_{1,n} ]
    [ 0       a_{2,2} a_{2,3} ... a_{2,n} ]
    [ 0       0       a_{3,3} ... a_{3,n} ]
    [ ...     ...     ...     ... ...     ]
    [ 0       0       0       ... a_{n,n} ]

B = [ b_{1,1} b_{1,2} b_{1,3} ... b_{1,n} ]
    [ 0       b_{2,2} b_{2,3} ... b_{2,n} ]
    [ 0       0       b_{3,3} ... b_{3,n} ]
    [ ...     ...     ...     ... ...     ]
    [ 0       0       0       ... b_{n,n} ]

Since AB = I, multiplying the i-th row of matrix A with the i-th column of B yields a_{i,i} b_{i,i} = 1, hence

b_{i,i} = 1 / a_{i,i}.

Now to solve for the non-diagonal elements of B: multiplying the first row of A by the second column of B gives a_{1,1} b_{1,2} + a_{1,2} b_{2,2} = 0. We already know the value of b_{2,2}, so the only unknown is b_{1,2}. In general, to obtain the value of b_{i,j} we multiply the i-th row of A by the j-th column of B and equate the product to 0, proceeding through the elements in a sequence such that the values of b needed for the forward substitution have already been obtained.

5.1 Storage in a RAM

In any n x n matrix the total number of elements is n^2. In the upper triangular matrix generated here the number of elements that can be non-zero is n(n+1)/2, since the n(n-1)/2 elements in the bottom-left triangle are zero. So, to minimise hardware, we have come up with an addressing scheme that omits storage of these zeros in the RAM. If the zeros were not omitted, the position of the element r_{i,j} would simply be j + (i-1) n. Since this is not the case, we are required to develop an algorithm to generate the RAM location address for given i, j and n.

5.2 Address generation mechanism

As r_{i,j} = 0 for i > j in the upper triangular matrix, there is no need to store these zeros individually in the RAM; instead we omit them and map the inputs (i, j) of an element r_{i,j} to its location in the RAM. In our mechanism, where zeros are not stored, the address in the RAM for r_{i,j} is equal to

n(i-1) + j - i(i-1)/2 - 1.
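The address formula can be verified exhaustively in software. The sketch below (illustrative Python, not the report's address generation circuit) walks the upper triangle of an n x n matrix in row-major order and checks that the formula yields consecutive 0-based RAM addresses with no gaps:

```python
# Packed-RAM address for the upper-triangular element r(i, j), with 1-based
# indices and i <= j, as derived in Section 5.2 (0-based RAM addresses).
def ram_address(i, j, n):
    return n * (i - 1) + j - i * (i - 1) // 2 - 1

if __name__ == "__main__":
    n = 4
    # Row-major walk of the upper triangle must hit 0, 1, 2, ... in order.
    addresses = [ram_address(i, j, n)
                 for i in range(1, n + 1)
                 for j in range(i, n + 1)]
    print(addresses)   # [0, 1, ..., n(n+1)/2 - 1]
```

This confirms that exactly n(n+1)/2 words of RAM suffice for the packed upper triangle.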
This formula is obtained from the fact that in the unpacked layout the address of the element r_{i,j} would be j + (i-1) n, but here row k omits its k-1 leading zeros, so by element (i, j) the cumulative number of zeros omitted is

\sum_{k=1}^{i} (k-1) = i(i-1)/2,

and the final -1 converts the result to a 0-based RAM address.

Figure 2: Block diagram of the address generation block
Figure 3: Circuit diagram of the address generation block

Hardware required:
4 adders/subtractors
2 multipliers
1 one-bit right shifter

5.3 Hardware for finding the inverse of diagonal elements

The circuit in Figure 4 can be used for inversion of the diagonal elements of the upper triangular matrix. The circuit consists of a loadable up counter that counts up to the number of rows in the matrix; a comparator indicates that the process must stop when the value n is reached. The circuit sends the count to the address generator block of RAM A, and the same address is sent to RAM B, so that the data is modified in the same location in both RAM A and RAM B.

Hardware required:
1 loadable up counter
1 comparator
1 inverter block that computes the inverse of a 16-bit number

Time required: n clock pulses.
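The diagonal pass of Section 5.3 can be modeled in a few lines of software. In this sketch (illustrative Python: the packed address formula of Section 5.2 stands in for the address generation block, and floating-point division stands in for the 16-bit inverter block), RAM A holds the packed R and RAM B receives the reciprocals at the same addresses:

```python
def ram_address(i, j, n):
    # Address generation block of Section 5.2 (0-based packed address).
    return n * (i - 1) + j - i * (i - 1) // 2 - 1

def invert_diagonal(ram_a, n):
    """Model of the Section 5.3 pass: one RAM B write per count, n counts."""
    ram_b = [0.0] * len(ram_a)
    for row in range(1, n + 1):             # loadable up counter: 1 .. n
        addr = ram_address(row, row, n)     # same address for RAM A and RAM B
        ram_b[addr] = 1.0 / ram_a[addr]     # inverter block
    return ram_b

if __name__ == "__main__":
    n = 3
    # Packed upper-triangular R = [[2,1,3],[0,4,5],[0,0,8]], row-major.
    ram_a = [2.0, 1.0, 3.0, 4.0, 5.0, 8.0]
    print(invert_diagonal(ram_a, n))   # reciprocals land at addresses 0, 3, 5
```

The loop runs exactly n times, matching the stated time requirement of n clock pulses.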
Figure 4: Schematic hardware design for inversion of diagonal elements

5.4 Hardware for finding the inverse of the other elements

The circuit in Figure 5 can be used for computing all elements of the inverse other than the diagonal elements.

Hardware required:
3 loadable up counters
4 address generation blocks
1 divider
1 multiplier
4 adders/subtractors
1 register
the necessary control circuits for termination of the loops

Number of clock cycles needed: O(n^2)

Figure 5: Schematic hardware design for inversion of elements other than those on the principal diagonal
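The full recursion realised by the circuits of Sections 5.3 and 5.4 can be cross-checked in software. The sketch below (plain Python with 2-D lists rather than packed RAM addressing, and floating point rather than 16-bit words) implements the Section 5 algorithm and verifies that R·B reproduces the identity:

```python
def upper_tri_inverse(r):
    """Invert an upper triangular matrix by the algorithm of Section 5."""
    n = len(r)
    b = [[0.0] * n for _ in range(n)]
    for row in range(n):                        # diagonal pass (Section 5.3)
        b[row][row] = 1.0 / r[row][row]
    for row in range(n):                        # off-diagonal pass (Section 5.4)
        for col in range(row + 1, n):
            s = sum(b[row][k] * r[k][col] for k in range(row, col))
            b[row][col] = -s / r[col][col]      # forward substitution
    return b

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

if __name__ == "__main__":
    r = [[2.0, 1.0, 3.0],
         [0.0, 4.0, 5.0],
         [0.0, 0.0, 8.0]]
    b = upper_tri_inverse(r)
    print(matmul(r, b))   # 3x3 identity up to rounding
```

Each b(row, col) uses only entries b(row, k) with k < col, so the sequencing of the loops matches the forward-substitution ordering required by the hardware.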
6 Conclusions

6.1 Multi-port RAM for faster performance

One of the obstacles in the way of obtaining high performance in computing is the memory wall. If the processing elements cannot get data from the register file (RF) at the processing rate, this causes a bottleneck that adversely affects overall performance. In order to meet the requirement of proper data usage between the computational units, such a computation system needs a register file that can meet the requirements of the different computing units on the FPGA. The demand to process more data per unit time requires multiple read and write operations at a time, which can be achieved by using multi-port register files (MPo-RFs) instead of conventional single-port RFs (SPo-RFs). Multi-ported memories are challenging to implement on FPGAs, since the block RAMs included in the fabric typically have only two ports. Hence we must construct memories requiring more than two ports either out of logic elements or by combining multiple block RAMs. Some conventional multi-port register file implementations that can be used:

1. Distributed memory
2. Replication
3. Banking
4. Multi-pumping

Figure 6: 4 Read + 1 Write block RAM as an example of a multi-port RAM

6.2 Distributed arithmetic for computing the product of the two matrices

Distributed arithmetic is a technique developed for the real-time computation of the inner product of a vector with constant elements and a vector with varying coefficients. The inner product is computed without splitting it into separate multiplication and addition operations. During the calculation, operations of summation and shifting of the inner products of the unchangeable vector with bit-slices of the changeable vector are carried out. All possible values of the partial inner products are precomputed and written to a look-up table (LUT). The contents of the LUT are computed dynamically, online, and remain unchanged for the period of multiplication of the left matrix by a column of the right matrix. Despite the need to compute the contents of the LUT, the total number of addition micro-operations decreases in comparison with the classical way of calculating a matrix product.
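The LUT scheme described above can be illustrated in software. In the sketch below (illustrative Python with unsigned 4-bit variable inputs; the report does not fix these parameters), the LUT stores the partial sums of the constant vector for every bit-pattern of the variable vector, and the inner product is assembled by shift-and-add over the bit positions:

```python
from itertools import product

def build_lut(const_vec):
    """All 2^n partial sums of the constant vector (built once per column)."""
    n = len(const_vec)
    return [sum(c for c, bit in zip(const_vec, bits) if bit)
            for bits in product([0, 1], repeat=n)]

def da_inner_product(const_vec, var_vec, width=4):
    """Inner product via shift-and-add LUT reads (unsigned width-bit inputs)."""
    lut = build_lut(const_vec)
    acc = 0
    for t in range(width - 1, -1, -1):        # MSB first: shift, then add
        slice_index = 0
        for x in var_vec:                     # one bit-slice across all inputs
            slice_index = (slice_index << 1) | ((x >> t) & 1)
        acc = (acc << 1) + lut[slice_index]
    return acc

if __name__ == "__main__":
    print(da_inner_product([3, 5, 7], [1, 2, 4]))   # 3*1 + 5*2 + 7*4 = 41
```

Each bit position costs one LUT read and one shift-add, replacing the multiplications of the classical inner product; the LUT itself is reused for every column of the right matrix.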
7 Acknowledgements

The authors would like to thank their project guide, Dr. Ayan Banerjee, for his invaluable suggestions and proper direction throughout the course of the project. Thankfulness and heartfelt gratitude are also extended to Mr. Anirban Chakraborty, who is currently pursuing his Ph.D. under the guidance of Prof. Ayan Banerjee.

References

[1] Gonzalez, R. C., Woods, R. E. (2002). Digital Image Processing. Upper Saddle River, NJ: Prentice Hall.
[2] Seyid, K., Blanc, S., Leblebici, Y., "Hardware Implementation of Real-Time Multiple Frame Super-Resolution", Very Large Scale Integration (VLSI-SoC), 2015 IFIP/IEEE International Conference on.
[3] Nafiz Ahmed Chisty, "Matrix Inversion Using QR Decomposition by Parabolic Synthesis".
[4] Brown, Robert Grover; Hwang, Patrick Y. C. (1996). Introduction to Random Signals and Applied Kalman Filtering (3rd ed.). New York: John Wiley & Sons. ISBN 0-471-12839-2.
[5] D. Boulfelfel, R. M. Rangayyan, L. J. Hahn, and R. Kloiber, "Three-dimensional restoration of single photon emission computed tomography images", IEEE Transactions on Nuclear Science, 41(5): 1746-1754, October 1994.
[6] Wiener, Norbert (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. New York: Wiley. ISBN 0-262-73005-7.
[7] Thomas Kailath, Ali H. Sayed, and Babak Hassibi, Linear Estimation, Prentice-Hall, NJ, 2000, ISBN 978-0-13-022464-4.
[8] Wiener, N., "The interpolation, extrapolation and smoothing of stationary time series", Report of the Services 19, Research Project DIC-6037, MIT, February 1942.
[9] Kolmogorov, A. N., "Stationary sequences in Hilbert space" (in Russian), Bull. Moscow Univ., 1941, vol. 2, no. 6, 1-40. English translation in Kailath, T. (ed.), Linear Least Squares Estimation, Dowden, Hutchinson & Ross, 1977.
[10] Vladislav Lesnikov, Tatiana Naumovich, Alexander Chastikov, "Modification of the architecture of a distributed arithmetic", East-West Design & Test Symposium (EWDTS), 2015 IEEE, pp. 1-4, 2015.
[11] Álvaro Lopes, "Tips & Tricks: Creating a 2W+4R FPGA Block RAM, Part 1".
[12] "An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition".