Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1458-1462 Verilog Implementation Of Da Based Dct With High Accuracy Error-Compensated Adder Tree Y Venkat Seshaiah B.Chinna rao P.M.Francis (PG student) Prof. &Head, Dept.of ECE Asst.Prof. In Dept. of ECE Gokul Eng College Gokul Eng College Gokul Eng CollegeAbstract In brief, by operating the shifting and area cost [8], [9]. Hence the NEDA architecture isaddition in parallel, an error-compensated adder- the smallest architecture for DA-based DCT coretree (ECAT) is proposed to deal with the designs, but speed limitations exist in the operationstruncation errors and to achieve low-error and of serial shifting and addition after the DA-high-throughput discrete cosine transform (DCT) computation. The high-throughput shift-adder-treedesign. Instead of the 12 bits used in previous (SAT) and adder-tree (AT), those unroll the numberworks, 9-bit distributed arithmetic-precision is of shifting and addition words in parallel for DA-chosen for this work so as to meet peak-signal-to- based computation, were introduced in [10] andnoise-ratio (PSNR) requirements. Thus, an area- [11], respectively. However, a large truncation errorefficient DCT core is implemented to achieve 1 occurred. In order to reduce the truncation errorGpels/s throughput rate with gate counts of 22.2 effect, several error compensation bias methods haveK for the PSNR requirements outlined in the been presented [12]–[14] based on statisticalprevious works. analysis of the relationship between partial products and multiplier-multiplicand. However, the elementsIndex Terms—Distributed arithmetic (DA)- of the truncation part outlined in this work arebased, error-compensated adder-tree (ECAT), 2- independent so that the previously describedD discrete cosine transform (DCT). compensation methods cannot be applied.I. INTRODUCTION This brief addresses a DA-based DCT core Discrete cosine transform (DCT) is a with an error-compensated adder-tree (ECAT). Thewidely used tool in image and video compression proposed ECAT operates shifting and addition inapplications [1]. Recently, the high-throughput DCT parallel by unrolling all the words required to bedesigns have been adopted to fit the requirements of computed. Furthermore, the error-compensatedreal-time applications [2]–[11]. circuit alleviates the truncation error for high accuracy design. Based on low-error ECAT, the DA- The multiplier-based DCTs were presented precision n this work is chosen to be 9 bits instead ofand implemented in [2] and [3].To reduce the traditional 12 bits so as to achieve the peak-area,ROM-based distributed arithmetic (DA)was signal-to-noise-ratio (PSNR) [1] requirements.applied in DCT cores[4]–[6].Uramotoet Therefore, the hardware cost is reduced, and the.[4]implemented the DA-based multipliers speed is improved using the proposed ECAT.usingROMsto produce partial products together withadders that accumulated these partial products. In This brief is organized as follows. Inthis way, instead of multipliers, the DA-based ROM Section II, the mathematical derivation of thecan be applied in a DCT core design to reduce the distributed arithmetic is given. The proposed ECATarea required. In addition, the symmetrical properties architecture is discussed in Section II. The proposedof the DCT transform and parallel DA architecture 8 X 8 2-D DCT core is demonstrated in Section III.can be used in reducing the ROM size in [5] and [6], The comparisons and results are presented in Sectionrespectively. Recently, ROM-free DA architectures IV, and conclusions are drawn in Section V. were presented [7]–[11]. Shams et al.employed a bit-level sharing scheme to construct the II. MATHEMATICAL DERIVATION OFadder-based butterfly matrix called new DA DISTRIBUTED ARITHMETIC(NEDA) [7]. Being compressed, the butterfly-adder-matrix in [7] The inner product is an important tool in utilized 35 adders and 8 shift-addition digital signal processing applications. It can beelements to replace the ROM. Based on NEDA written as follows:architecture, the recursive form and arithmetic logicunit (ALU) were applied in DCT design to reduce 1458 | P a g e
  2. 2. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1458-1462 (3) (1) In general, the shifting and addition computationWhere Ai,Xi and L are ith fixed coefficient, ith input uses a shift-and-add operator [7] in VLSIdata, and number of inputs, respectively. Assume implementation in order to reduce hardware cost.that coefficient 4i is Q-bit two’s complement binary However, when the number of the shifting andfraction number. Equation (1) can be expressed as addition words increases, the computation time willfollows: also increase. Therefore, the shift-adder-tree (SAT) presented in [10] operates shifting and addition in parallel by unrolling all the words needed to be computed for high-speed applications. However, a large truncation error occurs in SAT, and an ECAT architecture is proposed in this brief to compensate for the truncation error in high-speed applications. In Fig. 1, the Q P-bit words operate the shifting and addition in parallel by unrolling all computations. Furthermore, the operation in Fig. 1 can be divided into two parts: the main part (MP) that includes _ most significant bits (MSBs) and the truncation part (TP) that has least significant bits (LSBs). Then, the shifting and addition output can be expressed as follows: (2) (4) The output Y will obtain the P-bit MSBs using a rounding operation called post truncation (Post-T), which is used for high-accuracy applications. Note that y0 may be 0 or a negative number However, hardware cost increases in the VLSIdue to two’s complement representation. In (2), y0 design. In general, the TP is usually truncated tocan be calculated by adding all Xi values when reduce hardware costs in parallel shifting andAi,j=1 and then the transform output Y can be addition operations, known as the direct truncationobtained by shifting and adding all nonzero yi alues. (Direct-T) method. Thus, a large truncation errorThus the inner product computation in (1) can be occurs due to the neglecting of carry propagationimplemented by using shifting and adders instead of from the TP to MP. In order to alleviate themultipliers. Therefore, low hardware cost can be truncation error effect, several error compensationachieved by using DA-based architecture. bias methods have been presented [12]–[14]. All previous works were only applied in the design of a fixed-width multiplier. Because the products in a multiplier have a relationship between the input multiplier and multiplicand, the compensation methods usually use the correlation of inputs to calculate a fixed [12] or an adaptive [13], [14] compensation bias using simulation or statistical analysis. Note that the addition elements yqp in the TP in Fig. 1 (where 1<=Q<=(Q-1) and P-q- 1<=p<=(P-1)) are independent from each other.Fig.1.Q P-bit words shifting and addition operations Therefore, the previous compensation methodin parallel. cannot be applied in this work, and the proposed ECAT is explained as follows. A. Proposed Error-Compensated SchemeIII. ECAT ARCHITECTURE From Fig. 1, (4) can be approximated as From (2), the shifting and additioncomputation can be written as follows: (5) 1459 | P a g e
  3. 3. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1458-1462where is the compensated bias from the TP to the Fig. 2. Proposed ECAT architecture of shifting andMP as listed in (6)–(8) addition operators for the (P,Q)=(12,6) example The proposed ECAT has the highest accuracy with a moderate areadelay product. The shift-and-add [7] method has the smallest area, but the overall computation time is equal to 10.8ns that (6) is the longest. Similarly, the SAT [10], which truncates the TP and computes in parallel, takes 3.72 ns to complete the computation and uses 406 gates, which is the best area-delay product performance. However, for system accuracy, the SAT is the worst option shown in Table I. Therefore, the ECAT is suitable for high-speed and low-error applications. (7) TABLE 1: COMPARISONS OF THE PROPOSED ECAT WITH OTHER ARCHITECTURES FOR A (8) SIX 8-BIT WORDS EXAMPLEB. Performance Simulation for an Error-Compensated Circuit IV. DISCUSSION AND COMPARISONS In this subsection, comparisons of the The test image “Lena” used to checkabsolute average error ξ , the maximum error ξmax, system accuracy is comprised of 512 X 512 pixelsand the mean square error ξ for the proposed with each pixel being represented by 8-bit 256 grayerror-compensated circuit with Direct-T and Post-T level data. After inputting the original test imageare listed. The ξ,, ξmax and ξ are defined as pixels to the proposed 2-D DCT core, the transformfollows: output data is captured and fed into MATLAB to compute the inverse DCT using 64-bit double- precision operations. The PSNRs are close to 44 and 47 dB for test image and for random 8-bit 256 gray level data inputs, receptively. Table II compares the proposed 8 X 8 2-D DCT core with previous 2-D DCT cores. In [3], a multiplier-based DCT core based on pipeline radix-4square single delay feedback path architecture to achieve high-speedC. Proposed ECAT Architecture design. The ROM-based DCT core is presented in The proposed ECAT architecture is [4] to reduce adders to reduce the chip area of DCTillustrated in Fig. 2 for (P,Q)=(12,6) (case 3), where core. Nevertheless, a speed limitation for shift-and-block FA indicates a full-adder cell with three inputs add is in NEDA design. In [10] and [11], the SAT(a, b, and c) and two outputs, a sum (s) and a carry- and AT architectures for DA-based DCTs improveout (co). Also, block HA indicates half-adder cell the throughput rate of the NEDA method. However,with two inputs (a and b) and two outputs, a sum (s) DA-precision must be chosen as 13 bits to meet theand a carry-out (co). The comparisons of area, delay, system accuracy with more area overhead. Thearea-delay product, and accuracy for the proposed proposed DCT core uses low-error ECAT to achieveECAT with other architectures are listed in Table II. a high-speed design, and the DA-precision can beThe area and delay are synthesized using a Synopsys chosen as 9 bits to meet the PSNR requirements forDesign Compiler with the Artisan TSMC 0.18- μm reducing hardware costs. The proposed DCT coreStandard cell library. has the highest hardware efficiency, defined as follows (based on the accuracy required by the presented standards). 1460 | P a g e
  4. 4. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1458-1462 TABLE III: COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS IV.RESULTS AND DISCUSSION The 8-point 1-D DCT architecture and the implementation were discussed in the previous chapters. Now this chapter deals with the simulationTABLE II :COMPARISONS OF DIFFERENT 2-D and synthesis results of the implemented 1-D 8-pointDCT ARCHITECTURES WITH THE PROPOSED DCT. Here ModelSim tool is used in order toARCHITECTURE simulate the design and checks the functionality of the design. Once the functional verification is done, the design will be taken to the Xilinx tool for Synthesis process and the net-list generation. Simulation Result:Fig. 3. Architecture of the proposed 1-D 8-pointDCT Fig : Simulation results of Distributed Arithmatic DCT V. CONCLUSION In this brief, a high-speed and low-error 8 X 8 2-D DCT design with ECAT is proposed to improve the throughput rate significantly up to about 13 folds at high compression rates by operating the shifting and addition in parallel. Furthermore, the proposed error-compensated circuit alleviates the truncation error in ECAT. In this way, the DA-Fig. 4. Core layout and characteristics precision can be chosen as 9 bits instead of 12 bits so as to meet the PSNR requirements. Thus, the Furthermore, the proposed 2-D DCT core proposed DCT core has the highest hardwaresynthesized by using Xilinx ISE 9.1, and the Xilinx efficiency than those in previous works for the sameXC2VP30 FPGA can achieve 792 mega pixels per PSNR requirements. Finally, an area-efficient 2-Dsecond (M-pels/sec) throughput rate (up to about 7 DCT core is implemented using a TSMC 0.18- μmfolds of previous work [16]). Table III compares the process, and the maximum throughput rate is 1proposed 2-D DCT core with previous FPGA Gpels/s. In summary, the proposed architecture isimplementations. 1461 | P a g e
  5. 5. Y Venkat Seshaiah, B.Chinna rao, P.M.Francis / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 Vol. 2, Issue 5, September- October 2012, pp.1458-1462suitable for high compression rate applications in IEEE Trans. Circuits Syst. II, Exp. Briefs,VLSI designs. vol. 43, no. 2, pp. 90–95, Feb. 1996. [13] K. J. Cho, K. C. Lee, J. G. Chung, and K.REFERENCES K. Parhi, “Design of low-error fixed-width [1] Y.Wang, J. Ostermann, and Y. Zhang, modified booth multiplier,” IEEE Trans. Video Processing and Communications, 1st Very Large Scale Integr. (VLSI) Syst., vol. ed. Englewood Cliffs, NJ: Prentice-Hall, 12, no. 5, pp. 522–531, May 2004. 2002. [14] L. D. Van and C. C. Yang, “Generalized [2] Y. Chang and C.Wang, “New systolic array low-error area-efficient fixedwidth implementation of the 2-D discrete cosine multipliers,” IEEE Trans. Circuits Syst. I, transform and its inverse,” IEEE Trans. Reg. Papers, vol. 52, no. 8, pp. 1608–1619, Circuits Syst. Video Technol., vol. 5, no. 2, Aug. 2005. pp. 150–157, Apr. 1995. [15] C. C. Sun, P. Donner, and J. Gotze, “Low- [3] C. T. Lin, Y. C. Yu, and L. D. Van, “Cost- complexity multi-purpose IP core for effective triple-mode reconfigurable quantized discrete cosine and integer pipeline FFT/IFFT/2-D DCT processor,” transform,” in Proc. IEEE Int. Symp. IEEE Trans. Very Large Scale Integr. Syst., Circuits Syst., 2009, pp. 3014–3017. vol. 16, no. 8, pp. 1058–1071, Aug. 2008. [16] A. Tumeo, M. Monchiero, G. Palermo, F. [4] S. Uramoto, Y. Inoue, A. Takabatake, J. Ferrandi, and D. Sciuto, “A pipelined fast Takeda, Y. Yamashita, H. Yerane, and M. 2D-DCT accelerator for FPGA-based Yoshimoto, “A 100-MHz 2-D discrete SoCs,” in Proc. IEEE Comput. Soc. Annu. cosine transform core processor,” IEEE J. Symp. VLSI, 2007, pp. 331–336. Solid-State Circuits, vol. 27, no. 4, pp. 492– [17] S. Ghosh, S. Venigalla, and M. Bayoumi, 499, Apr. 1992. “Design and implementation of a 2D-DCT [5] S. Yu and E. E. S. , Jr., “DCT architecture using coefficient distributed implementation with distributed arithmetic,” in Proc. IEEE Comput. Soc. arithmetic,” IEEE Trans. Comput., vol. 50, Ann. Symp. VLSI, 2005, pp. 162–166. no. 9, pp. 985–991, Sep. 2001. [6] P. K. Meher, “Unified systolic-like architecture for DCT and DST using distributed arithmetic,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 3, no. 12, pp. 2656–2663, Dec. 2006. [7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, “NEDA: A low-power high-performance DCT architecture,” IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, Mar. 2006. [8] M. R. M. Rizk and M. Ammar, “Low power small area high performance 2D- DCT architecture,” in Proc. Int. Design Test Workshop, 2007, pp. 120–125. [9] Y. Chen, X. Cao, Q. Xie, and C. Peng, “An area efficient high performance DCT distributed architecture for video compression,” in Proc. Int. Conf. Adv. Comm. Technol., 2007, pp. 238–241. [10] C. Peng, X. Cao, D. Yu, and X. Zhang, “A 250 MHz optimized distributed architecture of 2D 8� DCT,” in Proc. Int. Conf. ASIC, 8 2007, pp. 189–192. [11] C. Y. Huang, L. F. Chen, and Y. K. Lai, “A high-speed 2-D transform architecture with unique kernel for multi-standard video applications,” in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21–24. [12] S. S. Kidambi, F. E. Guibaly, and A. Antonious, “Area-efficient multipliers for digital signal processing applications,” 1462 | P a g e