2012 Third International Conference on Intelligent Systems Modelling and Simulation
978-0-7695-4668-1/12 $26.00 © 2012 IEEE. DOI 10.1109/ISMS.2012.91

Implementation of Hybrid Wave-pipelined 2D DWT Using ASIC

Venkatasubramanian Adhinarayanan*, Rengaprabhu Paramasivam$, Seetharaman Gopalakrishnan# and T. N. Prabakar@

* Research Scholar, Department of CSE, Sathayabamma University, Chennai, India
$ Research Scholar, Dept. of Info. and Comm. Engineering, Anna University of Technology, Tiruchirappalli, India
# Principal, Oxford Engineering College, Tiruchirappalli, India, Email: jgsraman@gmail.com
@ Dean, Oxford Engineering College, Tiruchirappalli, India

Abstract - Pipelined systems suffer from clock routing complexity and clock skew between different parts of the system. A circuit design technique such as wave-pipelining achieves high speed without these limitations. A wave-pipelined circuit dispenses with the registers used for storing intermediate results and instead uses the inherent capacitance at the inputs of the various blocks. This results in lower power at the cost of speed. The hybrid scheme is aimed at combining the advantages of both pipelining and wave-pipelining. Hence, the design and implementation of a hybrid wave-pipelined 2D-DWT using the lifting scheme is proposed in this paper. For the purpose of comparison, a non-pipelined scheme as well as a scheme with pipelining within the blocks and between the blocks is implemented. From the results, it is concluded that the hybrid WP is faster than the non-pipelined scheme and requires less area, less clock routing complexity and lower power than the pipelined scheme.

Key-Words: FPGA, SOC, ASIC, DWT, lifting.

I. INTRODUCTION

Field-programmable gate arrays (FPGAs) have grown enormously in complexity and can encompass all the major functional elements of a complete end product in a single chip [1]. An FPGA-based system on chip can contain one or more processors, memories, dedicated components for accelerating critical tasks and interfaces to various peripherals. Development tools for FPGAs, such as the system-on-programmable-chip (SOPC) Builder from Altera (San Jose, CA, USA), enable the integration of intellectual property (IP) cores for common DSP functions and user-designed custom blocks with the Nios II softcore processor. The availability of on-chip dedicated multipliers, softcore/hardcore processors and IP cores makes FPGAs an ideal platform for the implementation of area-intensive as well as speed-intensive image processing applications such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT) [2].

Joint Photographic Experts Group 2000 (JPEG2000) is a recently standardized image compression algorithm that provides significant enhancements over the existing JPEG standard. JPEG2000 differs from widely used compression standards in that it relies on the DWT and uses embedded bit-plane coding of the wavelet coefficients. The DWT has traditionally been implemented using convolution or FIR filter bank structures. These structures require both a large number of arithmetic computations and a large memory for storage, which are not desirable for high-speed/low-power image processing applications.

A new multiplier algorithm, denoted the Baugh-Wooley pipelined constant coefficient multiplier (BW-PKCM), is proposed in [3] and used there for the study and comparison of the distributed arithmetic algorithm (DAA) and lifting schemes on FPGAs. The computation of the 2D DWT requires 2's complement multiplications. In the literature, the BW method [4] has been studied with carry-save, carry-ripple and serial-parallel algorithms. These schemes are inefficient in speed, area, or both when one of the operands is fixed. For an N-bit number, the conventional 2's complement multiplier (C2CM) requires [(N-1)/4] arrays of 4-input LUTs, whereas the sign extension and BW methods require [N/4] arrays of 4-input LUTs. The size of each array is equal to the number of product bits. For the C2CM, the 2's complement block and the control logic increase the LUT area and the multiplication time. For the sign extension and BW methods, however, the number of LUT arrays may be the same as that required for the first scheme. The lifting scheme with BW-PKCM requires 4% less area but has the same speed as the one using the distributed arithmetic algorithm with the sign extension scheme. The implementation details are available in [3]. In the 2D DWT, the filter coefficients are constant. Hence BW-PKCM, which combines the pipelined KCM with the Baugh-Wooley multiplication algorithm, is used in this paper.

The operating frequency of the 2D DWT may be increased if it is implemented using either pipelining or wave-pipelining (WP). Pipelining results in the highest operating frequency but has a number of disadvantages such as increased area, power dissipation and clock routing complexity. WP has been proposed as one of the techniques for overcoming these limitations. A number of systems have been implemented using wave-pipelining on ASICs and FPGAs [5], [6]. The concept of wave-pipelining has been described in a number of previous works [7], [8], [9]. WP results in an increase in speed and a reduction in clock routing complexity. The proposed hybrid scheme is aimed at combining the advantages of both pipelining and wave-pipelining.

The rest of the paper is organized as follows. Section II reviews previous work on the 2D DWT. Section III presents the design of wave-pipelined lifting blocks. Section IV presents automation schemes for wave-pipelined circuits. Section V discusses the implementation and study of the lifting blocks and presents the results. Section VI summarizes the conclusions.
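The fixed-operand 2's complement multiplication discussed in the introduction can be illustrated in software. The following is a minimal sketch, assuming each 4-input LUT array is modelled as a 16-entry product table, the operand width is a multiple of 4, and the sign is handled by reading the top 4-bit slice as a signed digit. The function names `build_tables` and `kcm_multiply` are hypothetical, and the sketch is not the BW-PKCM of [3]; it only shows why an N-bit operand needs N/4 tables when one operand is constant.

```python
def build_tables(constant, n_bits):
    """Return one 16-entry product table per 4-bit slice of the operand."""
    if n_bits <= 0 or n_bits % 4:
        raise ValueError("this sketch assumes a width that is a multiple of 4")
    n_slices = n_bits // 4
    tables = []
    for index in range(n_slices):
        is_top = index == n_slices - 1
        table = []
        for nibble in range(16):
            # The top slice carries the sign bit, so it is read as a signed
            # 4-bit digit (-8..7); all lower slices are unsigned (0..15).
            digit = nibble - 16 if (is_top and nibble >= 8) else nibble
            table.append(constant * digit)
        tables.append(table)
    return tables


def kcm_multiply(x, tables):
    """Multiply the signed operand x by the constant baked into the tables."""
    n_bits = 4 * len(tables)
    raw = x & ((1 << n_bits) - 1)  # two's complement bit pattern of x
    product = 0
    for index, table in enumerate(tables):
        nibble = (raw >> (4 * index)) & 0xF
        # Each table output is weighted by the position of its slice.
        product += table[nibble] * (1 << (4 * index))
    return product
```

In hardware the weighted additions would be the adder stages that follow the table lookups; here they are plain integer additions.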
II. REVIEW OF PREVIOUS WORK ON 2D DWT

The 2D wavelet transform may be computed using filter banks. Fig. 1 shows a one-level 2D DWT, where x[n] denotes the input image and LL1 denotes the subset of the transform coefficients that represents the coarse form of the input image. The input samples x[n] are passed through two stages of analysis filters as shown in Fig. 1. They are first processed by the low-pass h[n] and high-pass g[n] horizontal filters and are subsampled by two. Subsequently, the outputs (L1, H1) are processed by the low-pass and high-pass vertical filters.

Fig. 1 One level 2D DWT

The horizontal and vertical filters contain 5 lifting blocks (α, β, γ, δ, ξ). The lifting scheme uses a poly-phase structure for the analysis filter. For a two-level 2D DWT, the input is the LL1 component, and the same procedure is followed for further decomposition. At every level the image is reduced by a factor of four. In the lifting scheme, the odd and even input samples are processed by the lifting blocks (α, β, γ, δ, ξ (ξ1 & ξ2)) in cascade as shown in Fig. 1a. ξ1 and ξ2 are scaling blocks. Details of the α and β blocks are shown in Fig. 2a and Fig. 2b. The γ and δ blocks are obtained by replacing the constants α, β with γ, δ.

Fig. 1a Simplified block diagram of Lifting Scheme for 9/7 filter.

In Fig. 2, since the output from one block is fed as the input to the next block, the maximum rate at which the input can be fed to the system depends on the sum of the delays in all the four stages. The speed is increased by introducing pipelining at the points indicated by dotted lines in Fig. 2 [b]. In this case, the input rate is determined by the largest delay among all the four blocks. The delay in the individual stages is reduced further by using a Constant Coefficient Multiplier (KCM). The KCM uses a ROM for finding the product of a constant and a variable. The variable is fed as the address to the ROM, which contains the products corresponding to all possible values of the variable. When the ROM is implemented using 4-input Look Up Tables (LUTs), a number of stages of LUTs and adders are required to find the product. For example, a 12x12 bit KCM requires one ROM stage consisting of three 16x16 ROMs and two stages of 16-bit adders. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the ROMs and adders.

Fig. 2a α block
Fig. 2b β block

The Pipelined Constant Coefficient Multiplier (PKCM) based on the BW method is referred to as BW-PKCM in [10] and is shown to be superior to the other approaches. Hence, only this multiplier is considered for wave-pipelining in this paper. The detailed diagram of the α block implemented using BW-PKCM is shown in Fig. 3. The same scheme can be adopted for the β, γ, δ, ξ1 and ξ2 blocks.

Fig. 3 α block using BW-PKCM

III. DESIGN OF WAVE-PIPELINED LIFTING BLOCKS ON FPGAS

An RTL model of a circuit consists of a combinational logic circuit placed between input and output registers. The combinational logic circuit may be considered to be a wave-pipelined circuit if a number of waves are made to propagate through it simultaneously, as shown in Fig. 4a [11]. In other words, at any point of time, a sequence of data is being processed in the combinational logic block. In the case of
pipelining, only one data item is processed in the combinational logic block at a time. Further, the maximum data rate in the pipelined circuit depends only on Dmax, the maximum propagation delay in the combinational logic block.

Fig. 4b shows the temporal/spatial diagram of combinational logic circuits [8]. If Dmin denotes the minimum propagation delay of the signal through the combinational logic block, the maximum data rate of the wave-pipelined circuit depends on (Dmax - Dmin). Traditionally, in a wave-pipelined circuit, higher speeds are achieved by equalizing Dmax and Dmin [9]. The output of the wave-pipelined circuit alternates between unstable and stable states. The stable period decreases as the logic depth increases. By adjusting the latching instant at the output register to lie in the stable period, the wave-pipelined circuit can be made to work properly. For large logic depths, however, there may not be any stable period. Hence adjusting the latching instant by itself may not be adequate for storing the correct result at the output register. In such cases, the clock period has to be increased to lengthen the stable period.

Fig. 4a Wave-pipelined circuit
Fig. 4b Temporal/spatial diagram of combinational logic circuits

Equalization of path delays, adjustment of the clock period and adjustment of the clock skew are the three tasks carried out for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. Layout editors, such as the EPIC editor from Xilinx, may be used for this purpose. In [12], [13], these tasks are carried out manually. The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because of the difference between the actual delays and the delays calculated by the layout editor. This is because the layout editor considers only the worst-case delays, and the actual delays may be significantly different due to fabrication variations.

This difference becomes important as the logic depth of the circuit increases. Hence, the design is downloaded to the actual FPGA and its operation is checked using a PC-based test system in [14]. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in this paper.

IV. AUTOMATION SCHEMES FOR WAVE-PIPELINED CIRCUITS

The self-tuned wave-pipelined circuit is obtained by including a BIST circuit that tunes the clock frequency and the clock skew. The block diagram of the self-tuned wave-pipelined circuit is shown in Fig. 5. It consists of the following functional blocks: PRSG block, PRBS sequence generator, signature analyzer, counter, programmable clock generator circuit, programmable skew generator circuit and FSM.

Fig. 5. Self tuned wave-pipelined circuit.

A self-tuned wave-pipelined circuit has two modes of operation, namely test mode and normal mode. The TM signal is used to select the mode of operation. The circuit is placed in test mode by setting the TM signal to 1. In test mode, the FSM first varies the settings of the clock generator and skew generator circuits. The programmable clock generator circuit then
generates the first clock, which is given to the PRSG circuit, the programmable skew generator circuit and the input register. The PRSG block is used for exhaustive testing and generates all 2^n combinations of the inputs for an n-bit input. The programmable skew generator circuit generates the skew, and the skewed clock is applied to the output register and the counter circuit. The counter keeps track of the number of test vectors fed to the combinational block and generates the enable signal (sig_en) after all the test vectors have been applied. Instead of comparing every output with the expected output, a signature is generated from the outputs corresponding to all the applied inputs using the PRBS generator, and it is compared with the stored value in the signature analyzer circuit. The signature analyzer gives two control signals (sig_in & chng) to the FSM block, which indicate whether or not there is a match. Depending on the control signals received from the signature analyzer, the CS and SS values for the clock and skew generator circuits are generated. If there is no match, the FSM steps the SS value from 0 to 15 for every CS value. If there is still no match after all the skews have been applied for a particular CS value, it changes the CS value. In this way, the FSM changes the CS and SS values until it finds a match. When a match is found, the FSM fixes the CS and SS values and the circuit is placed in normal mode by setting TM=0. In normal mode, user inputs can be applied.

A. Procedure for Adjusting the Clock Period and Skew

The adjustment of the clock skew and clock period can be automated by making them programmable, that is, by implementing a programmable clock generator and a programmable clock skew generator. Fig. 6 gives the circuit diagram of a clock generation scheme, which consists of a delay block and an inverter. The actual clock period depends on the interconnect delay. The select input of the multiplexer is varied by either a processor or a Finite State Machine (FSM) to achieve different clock frequencies. For the clock skew generator, the same circuit is used, but the feedback connection is removed and the select line is varied by the processor or FSM to achieve different clock skew ranges.

Fig. 6. Programmable clock generator.

The wave-pipelined circuit using the programmable clock and skew generator can be operated at a higher frequency than can be achieved using commercially available synthesis tools, which use Dmax for fixing the operating frequency. The automation may be carried out using either an off-chip processor or an on-chip processor. The off-chip processor is used when the FPGA serves as a coprocessor or hardware accelerator for a main processor or microcontroller. Since off-chip communication between the FPGA and a processor is bound to be slower than on-chip communication, the built-in self-test approach using the design for testability [14] technique is proposed for this case [15], in order to minimize the time required for adjusting the parameters of the wave-pipelined circuit (clock frequency and skew).

V. IMPLEMENTATION AND STUDY OF THE 2D DWT USING LIFTING SCHEME

The overall block diagram of the one-level 2D DWT is shown in Fig. 7. The input image and the outputs of the horizontal and vertical filters are assumed to be stored in block RAMs. For the horizontal filters, the even and odd inputs are applied from two block RAMs of size 512x11. For testing, the image is assumed to be loaded into the block RAMs using a Memory Initialization File (MIF). The result is written into 4 block RAMs of size 256x1.

Fig. 7 Overall block diagram of one level 2D DWT

A. Implementation results on Spartan-III XC3S200

The one-level 2D DWT is implemented on a Xilinx Spartan-III XC3S200 using all three approaches, and the results are given in Fig. 8. The programmable clock and clock skew blocks are implemented as macro blocks using the Xilinx ISE 8.1i Project Navigator. For tuning the wave-pipelined circuit, the MicroBlaze softcore processor is used. The Xilinx Embedded Development Kit (EDK) software is used to integrate the custom block with the MicroBlaze processor [16]. The rest of the steps are similar to those used for the Altera SOC kit [17]. For all three schemes, the number of logic
elements, number of registers, maximum operating frequency and power dissipated are computed, and the results are given in Fig. 8. From Fig. 8, it may be concluded that for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than the non-pipelined BW-KCM by a factor of 1.07. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.56, and this is achieved with an increase in the number of registers by a factor of 3.157 and an increase in the number of LEs by a factor of 1.54 compared to the hybrid WP-P unit.

Fig. 8 Implementation results on Spartan-III XC3S200

B. Implementation of 1 level 2D DWT using ASIC

The 2D DWT scheme is implemented on an ASIC using the lifting blocks with 9/7 biorthogonal filters and BW-PKCM multipliers. The ASIC implementation uses 180nm technology. Verilog HDL is used to describe the functionality of the circuit, and the functionality is then verified using the ModelSim simulation tool. LeonardoSpectrum is used for synthesizing the circuit, and the results are shown in Fig. 9 and Fig. 10. This is the first time the 2D DWT is implemented using ASIC. In future, the work can be extended to compare with hybrid WP schemes.

Table 1. Implementation results of 2D DWT

Scheme           Area    Freq. (MHz)
Pipelining       6896    346.8
Non-Pipelining   3712    299.48

Fig. 9 Implementation results on 2D DWT (pipelining)
Fig. 10 Implementation results on 2D DWT (non-pipelining)

VI. CONCLUSIONS

In this paper, techniques for the implementation of the hybrid WP-P KCM with the Baugh-Wooley multiplication algorithm are proposed. The 9/7 bi-orthogonal filters are implemented on a Xilinx SOC device using the lifting scheme with the following three multipliers: BW-PKCM, BW-KCM and hybrid WP-P BW-KCM. From the implementation results, it is verified that the hybrid WP-P BW-KCM is faster than the non-pipelined BW-KCM, and that it uses fewer registers and has less clock routing complexity than BW-PKCM. The one-level 2D DWT scheme is also implemented on ASIC using pipelining and non-pipelining. Extending it to the hybrid WP scheme is left for future work.
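The lifting cascade used throughout this paper (α, β, γ, δ followed by the scaling blocks) can be sketched in software for one row of samples. This is a minimal floating-point sketch, not the fixed-point BW-PKCM hardware. It assumes the widely published 9/7 lifting constants from the factorization in [10], since the paper does not list the values; it also assumes mirrored samples at the row edges, an even row length, and a scaling convention in which the low-pass output is multiplied by ξ and the high-pass output is divided by it. The function names are hypothetical.

```python
# Assumed 9/7 lifting constants (not listed in the paper itself).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
XI = 1.149604398


def _lift_odd(even, odd, k):
    """Add k * (left even neighbour + right even neighbour) to each odd sample."""
    last = len(even) - 1
    return [odd[i] + k * (even[i] + even[min(i + 1, last)])
            for i in range(len(odd))]


def _lift_even(even, odd, k):
    """Add k * (left odd neighbour + right odd neighbour) to each even sample."""
    return [even[i] + k * (odd[max(i - 1, 0)] + odd[i])
            for i in range(len(even))]


def forward_97(x):
    """One level of the 1D 9/7 transform; returns (low-pass, high-pass)."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("this sketch assumes an even number of samples")
    even, odd = list(x[0::2]), list(x[1::2])
    odd = _lift_odd(even, odd, ALPHA)    # alpha block
    even = _lift_even(even, odd, BETA)   # beta block
    odd = _lift_odd(even, odd, GAMMA)    # gamma block
    even = _lift_even(even, odd, DELTA)  # delta block
    return [v * XI for v in even], [v / XI for v in odd]  # scaling blocks


def inverse_97(low, high):
    """Undo forward_97 by running the lifting steps backwards."""
    even = [v / XI for v in low]
    odd = [v * XI for v in high]
    even = _lift_even(even, odd, -DELTA)
    odd = _lift_odd(even, odd, -GAMMA)
    even = _lift_even(even, odd, -BETA)
    odd = _lift_odd(even, odd, -ALPHA)
    out = []
    for e, o in zip(even, odd):
        out.extend((e, o))
    return out
```

A one-level 2D transform would apply `forward_97` to every row and then to every column of the two resulting half-width images, which is the horizontal-then-vertical filtering described in Section II. Because each lifting step is undone exactly by the same step with the opposite sign, the inverse needs no separate filter design.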
REFERENCES

1. G. Martin and H. Chang, “System-on-Chip design,” Proc. of Intl. Conf. on ASIC, pp. 12-17, 2001.
2. B. A. Draper, J. R. Beveridge, A. P. W. Bohm, C. Ross, and M. Chawathe, “Accelerated image processing on FPGAs,” IEEE Transactions on Image Processing, vol. 12, no. 12, pp. 1543-1551, 2003.
3. G. Lakshminarayanan, B. Venkataramani, J. S. Kumar, A. K. Yousuf, and G. Sriram, “Design and FPGA implementation of image block encoders with 2D-DWT,” in Proceedings of IEEE Conference on Convergent Technologies for Asia-Pacific Region (TENCON ’03), vol. 3, pp. 1015-1019, Bangalore, India, October 2003.
4. K. K. Parhi, VLSI Signal Processing Systems, John Wiley & Sons, New York, NY, USA, 1999.
5. J. Nyathi and J. G. Delgado-Frias, “A Hybrid wave-pipelined network router,” IEEE Transactions on Circuits and Systems-I, Fundamental Theory and Applications, vol. 49, no. 12, pp. 1764-1772, Dec. 2002.
6. O. Hauck, A. Katoch and S. A. Huss, “VLSI system design using asynchronous wave pipelines: a 0.35 µm CMOS 1.5 GHz elliptic curve public key cryptosystem chip,” Proc. of Sixth Intl. Symposium on Advanced Research in Asynchronous Circuits and Systems 2000 (ASYNC 2000), pp. 188-197, April 2000.
7. W. P. Burleson, M. Ciesielski, F. Klass and W. Liu, “Wave-pipelining: a tutorial and research survey,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 3, pp. 464-474, Sept. 1998.
8. C. Gray, W. Liu and R. Cavin, “Wave-pipelining: Theory and Implementation,” Kluwer Academic Publishers, 1993.
9. W. Tuttlebee, “Software defined radio,” John Wiley & Sons Ltd., USA, 2004.
10. I. Daubechies and W. Sweldens, “Factoring Wavelet Transforms into Lifting Steps,” Journal of Fourier Analysis and Applications, vol. 4, no. 3, pp. 247-269, 1998.
11. G. Lakshminarayanan and B. Venkataramani, “Optimization techniques for FPGA based wave-pipelined DSP blocks,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 7, pp. 783-793, July 2005.
12. E. I. Boemo, S. Lopez-Buedo and J. M. Meneses, “Wave pipelines via look-up tables,” IEEE International Symposium on Circuits and Systems (ISCAS 96), vol. 4, pp. 185-188, 1996.
13. G. Lakshminarayanan, B. Venkataramani, J. Senthil Kumar, A. K. Md. Yousuf and G. Sriram, “Design and FPGA implementation of image block encoders with 2D-DWT,” Proc. TENCON 2003, vol. III, pp. 1015-1019, Bangalore, Oct. 15-17, 2003.
14. G. Seetharaman, B. Venkataramani and G. Lakshminarayanan, “Design and FPGA implementation of self-tuned wave-pipelined filters,” IETE Journal of Research, vol. 52, no. 4, pp. 305-313, July-August 2006.
15. M. J. S. Smith, “Application Specific Integrated Circuits,” Pearson Education Asia Pvt. Ltd, Singapore, 2003.
16. Altera documentation library, 2003, Altera Corporation, USA.
17. Xilinx documentation library, Xilinx Corporation, USA.
