A block based pass-parallel spiht algorithm.bak



IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 7, JULY 2012, p. 1064

A Block-Based Pass-Parallel SPIHT Algorithm

Yongseok Jin, Member, IEEE, and Hyuk-Jae Lee

Abstract: Set-partitioning in hierarchical trees (SPIHT) is a widely used compression algorithm for wavelet-transformed images. One of its main drawbacks is a slow processing speed due to its dynamic processing order that depends on the image contents. To overcome this drawback, this paper presents a modified SPIHT algorithm called block-based pass-parallel SPIHT (BPS). BPS decomposes a wavelet-transformed image into 4 × 4 blocks and simultaneously encodes all the bits in a bit-plane of a 4 × 4 block. To exploit parallelism, BPS reorganizes the three passes of the original SPIHT algorithm and then encodes/decodes the reorganized three passes in a parallel and pipelined manner. The precalculation of the stream length of each pass enables the parallel and pipelined execution of these three passes by not only an encoder but also a decoder. The modification of the processing order slightly degrades the compression efficiency. Experimental results show that the peak signal-to-noise ratio loss by BPS is between approximately 0.23 and 0.59 dB when compared to the original SPIHT algorithm. Both an encoder and a decoder are implemented in hardware that can process 120 million samples per second at an operating clock frequency of 100 MHz. This processing speed allows a video of size 1920 × 1080 in the 4:2:2 format to be processed at the rate of 30 frames/s. The gate count of the hardware is about 43.9K.

Index Terms: Discrete wavelet transform (DWT), set-partitioning in hierarchical trees (SPIHT), wavelet image coding.

Manuscript received March 1, 2010; revised November 12, 2010 and April 20, 2011; accepted November 14, 2011. Date of publication March 5, 2012; date of current version June 28, 2012. This work was supported by the Korean Science and Engineering Foundation, under Grant 2011-0027502 funded by the Ministry of Education, Science, and Technology, Korean Government. This paper was recommended by Associate Editor O. C. Au.
Y. Jin is with the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802 USA (e-mail: yongseok.jin@gmail.com).
H.-J. Lee is with the School of Electrical Engineering and Computer Science, Seoul National University, Seoul 151-742, Korea (e-mail: hyuk_jae_lee@capp.snu.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2012.2189793
1051-8215/$31.00 © 2012 IEEE

I. Introduction

Wavelet-based image coding, such as the JPEG2000 standard [1], is widely used because of its high compression efficiency. There are three important wavelet-based image coding algorithms that have the embedded coding property, enabling easy bit rate control with progressive transmission of information for a wavelet-transformed image. They are the embedded zerotree wavelet algorithm (EZW) [2], the embedded block coding with optimized truncation algorithm (EBCOT) [3], and the set partitioning in hierarchical trees algorithm (SPIHT) [4]. There are many video applications that need image compression with the embedded coding property. These applications include frame memory compression for a video compression chip [5]–[7] and overdrive detection and compensation for an LCD driver chip [8]–[11]. Among the three wavelet-based coding algorithms, EZW and EBCOT need binary arithmetic coding, which requires a large amount of hardware circuitry and memory, increasing the hardware cost and moreover suffering from limited throughput [12]. On the contrary, SPIHT does not need arithmetic coding, providing a cheaper and faster hardware solution. In addition, SPIHT surpasses EZW and is close to EBCOT in compression efficiency [4], [13]. Therefore, extensive research has focused on SPIHT and its variations to improve the efficiency of wavelet-based image coding.

The original SPIHT algorithm processes wavelet coefficients in a dynamic order that depends on the values of the coefficients. Thus, it is not easy to process multiple coefficients in parallel; and consequently, it is difficult to improve the throughput of the original SPIHT. In order to increase the throughput, the SPIHT algorithm is modified such that the processing order is fixed statically (i.e., the processing order is independent of the values of the coefficients) [14]–[19]. Although a fixed-order SPIHT improves throughput, the coding efficiency is degraded because its order is different from the order of the original SPIHT. The no list SPIHT algorithm (NLS) [14] was initially proposed as a fixed-order SPIHT algorithm to reduce the required memory. Later, Corsonello et al. proposed a low cost implementation of NLS [15]. In [14] and [15], the modified algorithm uses an array data structure for storing coding states in the fixed order instead of the list data structure required for the dynamic order of the original SPIHT. Although NLS succeeds in the reduction of the memory size, it does not process coefficients in parallel, so only 1 or 2 bits are produced at each step. Consequently, the throughput of NLS is only 0.092 bit per cycle [15]. To improve the coding speed, Chen et al. [16] proposed a modified SPIHT that processes a 4 × 4 bit-plane in one cycle. However, this algorithm does not exploit pixel parallelism but processes multiple sequential steps in one cycle in its hardware implementation, leading to a significant increase of the critical path delay in combinational logic circuits. Consequently, the operating clock frequency is limited although a 4 × 4 bit-plane is processed in a single cycle. Thus, the overall throughput is also not very high.

Fry et al. [17] proposed a bit-plane parallel SPIHT encoder architecture. This modified SPIHT decomposes wavelet coefficients bit-plane by bit-plane and then processes multiple bit-planes independently in a parallel manner. Then, the results of multiple bit-planes are merged into a single bitstream. This bit-plane-parallel approach achieves very large throughput by processing four pixels in a single cycle. However, there are two drawbacks. One is the low utilization of the parallel hardware and memory because the execution time of bit-plane coding
differs from bit-plane to bit-plane. In addition, depending on the bitrate, the results from less significant bit-planes are truncated and are not merged into the final bitstream. The truncated bitstream implies that the hardware execution cycles used for the generation of the truncated bitstream are wasted. The more serious drawback lies in the fact that this bit-plane parallel approach is not applicable to a decoder. In a decoder, multiple bit-planes cannot be decoded in parallel because the decoder cannot predict the length of each bit-plane, and consequently, cannot divide the bitstream into multiple bit-plane streams for parallel processing at the beginning of the decoding process. As the speed of a decoder is often more important than the encoder speed, the low decoding speed in the bit-plane parallel SPIHT severely limits the application of this algorithm.

This paper proposes a BPS algorithm and its hardware implementation. BPS decomposes a wavelet-transformed image into 4 × 4-bit blocks (4 × 4 blocks in a bit-plane) and processes one 4 × 4-bit block in a single cycle. The three passes in the original SPIHT algorithm are reorganized into three new passes, which can be executed in a pipelined and parallel manner. Parallel execution is possible because the reorganization of the three passes removes data dependence and enables the precalculation of the bit length of each pass before the pass is processed. This bit length precalculation enables parallel execution not only for an encoder but also for a decoder. As a result, the encoder and decoder implementing BPS achieve fast execution and large throughput through parallel execution and efficient hardware utilization. Changing the processing order in BPS makes the coding efficiency slightly lower than that of the original SPIHT algorithm.

This paper is organized as follows. In Section II, the SPIHT algorithm is introduced. Section III describes the proposed BPS algorithm and Section IV presents the hardware implementation and experimental results. Finally, Section V concludes this paper.

II. SPIHT Algorithm

SPIHT is a compression algorithm applied to an image in the wavelet transformed domain. A wavelet-transformed image can be organized as a spatial orientation tree (SOT) [Fig. 1(a)] in which an arrow represents the relationship between a parent and its offspring. Each node of the tree corresponds to a coefficient (also called pixel) in the transformed image. Fig. 1(b) shows the Morton scanning order of the SOT [20] where the number assigned to each pixel represents the scanning order. For an image of size m × n, the upper-leftmost nodes of size (m/2^L) × (n/2^L) are called the root nodes of the SOT when the image is transformed by L-level DWT. Fig. 1(a) shows an image of size 16 × 16 transformed by 3-level DWT. The square denoted by R in Fig. 1(a) represents the root whereas the 2 × 2 pixels numbered 0, 1, 2, and 3 in Fig. 1(b) correspond to the root.

Fig. 1. (a) Spatial orientation tree. (b) Morton scanning order of a 16 × 16 3-level wavelet-transformed image.

For a given set T, SPIHT defines a function of significance which indicates whether the set T has pixels larger than a given threshold. S_n(T), the significance of set T in the nth bit-plane, is defined as

S_n(T) = \begin{cases} 1, & \max_{w(i) \in T} |w(i)| \ge 2^n \\ 0, & \text{otherwise.} \end{cases} \qquad (1)

When S_n(T) is “0,” T is called an insignificant set; otherwise, T is called a significant set. An insignificant set can be represented as a single-bit “0,” but a significant set is partitioned into subsets, whose significances in turn are to be tested again. Based on the zerotree hypothesis [2], SPIHT encodes a given set T and its descendants [denoted by D(T)] together by checking the significance of T ∪ D(T) (the union of T and D(T)) and by representing T ∪ D(T) as a single symbol “zero” if T ∪ D(T) is insignificant. On the other hand, if T ∪ D(T) is significant, T is partitioned into subsets, each of which is tested independently.

To reduce the complexity of SPIHT, an entire picture is decomposed into 4 × 4 sets (sets consisting of 4 × 4 pixels), and the significance of the union of each 4 × 4 set and its descendants is tested. The SPIHT algorithm encodes wavelet coefficients bit-plane by bit-plane from the most significant bit-plane to the least significant bit-plane. Fig. 2 presents the SPIHT algorithm encoding a single bit-plane. The SPIHT algorithm consists of three passes: insignificant set pass (ISP), insignificant pixel pass (IPP), and significant pixel pass (SPP). According to the results of the (n + 1)th bit-plane, the nth bits of pixels are categorized and processed by one of the three passes. Insignificant pixels classified by the (n + 1)th bit-plane are encoded by IPP for the nth bit-plane whereas significant pixels are processed by SPP. The main goal of each pass is the generation of the appropriate bitstream according to wavelet coefficient information. ISP, the second pass in the SPIHT algorithm shown in Fig. 2, handles insignificant sets. If a set in this pass is classified as a significant set in the nth bit-plane, it is decomposed into smaller sets until the smaller sets are insignificant or they correspond to single pixels. If the smaller sets are insignificant, they are handled by ISP. If the smaller sets correspond to single pixels, they are handled by either IPP or SPP depending on their significance.

If the most significant bit-plane is a zero bit-plane (that is, all coefficients have their most significant bit equal to 0), the bit-plane is not encoded, and consequently, the number of encoded bit-planes is decreased.
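The significance test in (1) and the Morton scan of Fig. 1(b) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `significance` and `morton_index` are made up here, and the bit-interleaving convention (column bits in the even positions, so that the root 2 × 2 pixels are numbered 0 to 3 row by row) is an assumption, since Fig. 1(b) is not reproduced in this text.

```python
def significance(coeffs, n):
    """S_n(T) from (1): 1 if some |w(i)| in the set T reaches 2**n, else 0."""
    return 1 if max(abs(w) for w in coeffs) >= (1 << n) else 0


def morton_index(row, col, bits=4):
    """Z-order position of pixel (row, col); bits=4 covers a 16 x 16 image.

    Assumed convention: column bits go to even bit positions and row bits
    to odd bit positions of the index.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


# Hypothetical 2 x 2 set: the largest magnitude is 12, so the set is
# insignificant for n = 4 (threshold 16) and significant for n = 3 (threshold 8).
example_set = [12, 10, -2, 4]
print(significance(example_set, 4), significance(example_set, 3))

# Scan positions of the four pixels in the upper-left 2 x 2 corner.
print([morton_index(r, c) for r in range(2) for c in range(2)])
```

Testing a union such as T ∪ D(T) amounts to passing the concatenated coefficient list to `significance`.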
The following significant bit-planes are not encoded if they are also zero bit-planes. Thus, the SPIHT algorithm starts from the first nonzero bit (FNZB) plane. For example, if the largest coefficient in the SOT is smaller than 2^7, FNZB is 6 and the bit-plane coding starts from the sixth bit-plane. The FNZB is stored in the header of the coded stream. Further details about the SPIHT algorithm are described in [4].

Fig. 2. SPIHT algorithm encoding for a single bit-plane of wavelet coefficients.

In the original SPIHT algorithm, three linked lists are maintained for processing ISP, SPP, and IPP, respectively. In each pass, the entries in the linked list are processed in the first-in-first-out (FIFO) order. This FIFO order causes a large overhead, slowing down the computation speed of the SPIHT algorithm. To speed up the algorithm, sets and pixels are visited in the Morton order as shown in Fig. 1(b) and processed by the appropriate pass. This modified algorithm, called Morton order SPIHT hereafter, is relatively easy to implement in hardware with a slight degradation of the compression efficiency when compared with the original SPIHT [14], [15], [17]–[19]. The algorithm shown in Fig. 2 describes both the original SPIHT and Morton order SPIHT algorithms. The processing order of the for-loops in each pass differentiates the original SPIHT from the Morton order SPIHT algorithm.

III. High Throughput Wavelet Image Coding

This section presents a modified SPIHT algorithm, called the BPS. The proposed algorithm aims to speed up both encoding and decoding times with a slight sacrifice in the compression efficiency.

A. Block-Based Pass-Parallel SPIHT

BPS processes each bit-plane from the most significant bit-plane just like the original SPIHT algorithm. However, the processing order of the pixels in each bit-plane is different from the original SPIHT. BPS first decomposes an entire bit-plane into 4 × 4-bit blocks (4 × 4 blocks in a bit-plane) and processes one 4 × 4-bit block at a time. After one 4 × 4-bit block is processed, the next 4 × 4-bit block is processed in the Morton scanning order [20] as shown in Fig. 1(b).

The encoded stream in the original SPIHT is categorized into three types: sorting bit, magnitude bit, and sign bit. The sorting bit is the result of the significance test for a 2 × 2 or 4 × 4 set indicating whether the set is significant or not. The magnitude and sign bits indicate the magnitude and sign of each pixel, respectively. The magnitude and sign bits output in IPP and SPP are called the “refining bit,” but the magnitude and sign bits output in ISP are called the “first refining bit” because they are the refining bits generated first for each pixel.

Fig. 3. 4 × 4 BPS algorithm.

The proposed BPS algorithm for a single 4 × 4-bit block is described in Fig. 3. The 4 × 4-bit block is denoted by H, which is decomposed into four 2 × 2 blocks. In Fig. 3, Q represents a 2 × 2-bit block that is a subblock of a 4 × 4-bit block H and n represents the bit-plane number. BPS consists of three passes that output refining bits, sorting bits, and first refining bits, respectively. According to the type of generated bits, these three passes are called refinement pass (RP), sorting pass (SP), and first refinement pass (FRP), respectively. The RP is a combination of the IPP and SPP from the original SPIHT and visits each 2 × 2 block which is significant in the previous bit-plane (i.e., S_{n+1}(Q) = 1 as the condition in line 2 of the algorithm in Fig. 3). Then, RP outputs the nth magnitude bit of the significant 2 × 2-bit block (line 4 in Fig. 3). In addition, the sign bit of a pixel is output (line 6) if the pixel becomes significant in the nth bit-plane (i.e., S_{n+1}(w(i)) = 0 ∧ S_n(w(i)) = 1 in line 5). Since the two passes IPP and SPP from the original SPIHT are combined as a single pass RP in BPS, the order of pixels processed in BPS is different from that in SPIHT.
It is noted from experimental results that the degradation of the compression efficiency by this change of the processing order is not very significant.

The ISP pass in the original SPIHT is decomposed into SP and FRP passes in BPS. The SP classifies a block as either a significant block or an insignificant block and transmits sorting bits. The first step of the SP is to generate and transmit the significance of the 4 × 4-bit block (line 11 in Fig. 3). This is done when two conditions are satisfied (line 10 in Fig. 3). The first condition is that the 4 × 4-bit block is insignificant in the (n + 1)th bit-plane (i.e., S_{n+1}(H ∪ D(H)) = 0). The second condition, ∼(parent(H) ∧ S_n(parent(H)) = 0), implies that it is not necessary to generate the significance of the set if the 4 × 4-bit block has a parent whose descendants are insignificant because the insignificance of the parent already indicates that the 4 × 4-bit block is insignificant. SP is the only pass that processes a 4 × 4-bit block. The other two passes, RP and FRP, process a 2 × 2-bit block as the processing unit.

The remaining operation of the SP depends on the significance of the 4 × 4-bit block (tested in line 12). If the block is significant, it is decomposed into four 2 × 2-bit blocks. The significance of each 2 × 2 block is generated (line 14) if it is insignificant in the (n + 1)th bit-plane (line 13). According to its significance, each 2 × 2-bit block is classified either as an insignificant block to be processed by the SP for the (n − 1)th bit-plane (line 16) or as a significant block to be processed by the FRP pass in the current bit-plane. To be processed by the FRP, a 2 × 2 block Q needs its significance S_n(Q) to be set to 1 (line 18). The significant block processed by the FRP is called the new-significant block. When (H ∪ D(H)) is insignificant, all four 2 × 2-bit blocks in H are classified as insignificant blocks for the (n − 1)th bit-plane (lines 19, 20, and 21). The FRP pass processes the new-significant 2 × 2-bit blocks classified by the SP (line 24). FRP outputs the nth magnitude bit of the pixels in the new-significant blocks. If the magnitude bit is significant in the FRP, this implies that the magnitude bit is significant for the first time for the pixel. Thus, the sign bit is also output.

Recall that the ISP in the original SPIHT is decomposed into SP and FRP in the BPS algorithm in Fig. 3. The separation of SP and FRP allows each pass to be processed in a single cycle. It should be noted that the operation of FRP depends on the results of SP, so they cannot be executed in parallel. In the implementation, FRP is delayed by one cycle, so it can be executed in parallel with the RP and SP of the next 4 × 4-bit block. Parallel execution is possible because the FRP in the current 4 × 4-bit block is not dependent on the RP and SP of the next 4 × 4-bit block. As a result, for each cycle, the bitstream of a single 4 × 4-bit block for a given bit-plane is generated.

Additional improvement of the compression efficiency is achieved by a slight modification in the selection of FNZB. When the size of the wavelet-transformed image is relatively small (e.g., 16 × 16), the root pixel(s) has a much larger absolute value than the other pixels in the image. Therefore, only the root pixel(s) is significant for the several most important bit-planes. Thus, the FNZB is obtained from the pixels excluding the root pixel(s). Then, bit-plane coding starts from this FNZB. Consequently, the number of encoded bit-planes can be reduced. For the root pixel(s), the value from MSB to FNZB-1 is stored in the header.

For the FNZBth bit-plane (the most significant bit-plane to be processed), initialization is necessary before the algorithm given in Fig. 3 begins. Initially, the 2 × 2 set that includes the root pixel(s) is classified as a significant block. All other blocks are classified as insignificant. For any 2 × 2 set Q, the parameter dsig is derived. This parameter is used to evaluate the significance S_n(Q ∪ D(Q)). Furthermore, for any 4 × 4 set H, significance S_n(H ∪ D(H)) is also evaluated in advance. The initial derivation of dsig makes the significance evaluation simple enough to be processed in a single cycle. Details of this derivation are explained in Section IV-A.

B. Bitstream Generation for a Fast Decoder

Increasing the speed of a decoder may be more important and more difficult than that of an encoder. The encoder can process RP and SP in parallel because they are independent. In a decoder, RP and SP being independent of each other is not enough for parallel execution. Another condition for parallel execution is the precalculation of the start bit of each pass in the bitstream. This condition is obvious because a decoder cannot start to process a pass unless the start bit of the pass is known prior to the start of the pass. It is not easy for a decoder to find the start bit of each pass because the length of each pass is variable, and the length is known by the decoder only after the pass is completely decoded. Therefore, in order to enable parallel execution of multiple passes in a decoder, the bitstream should be formatted carefully to look ahead for the length of the bitstream for each pass.

Fig. 4 shows the bitstream format of an 8 × 8 block. FNZB is stored in the leftmost position. Next, the magnitude bits of the root pixel(s), from the nth bit to the (FNZB+1)th bit, are stored. Once the compression ratio is determined, the bitstream length is also determined. In this example, the bitstream length must be smaller than or equal to 256 (= 8 × 8 × 8 × 50%) assuming that the compression ratio is 50%. Hence, a decoder knows not only the first bit position but also the last bit position of the bitstream. By exploiting this fact to increase the speed, the proposed decoder parses the bitstream in both directions: from left to right and from right to left. The magnitude bits from RP and FRP and the sorting bit from SP are stored from left to right. On the other hand, sign bits from RP and FRP are stored from right to left in the bitstream. Note that only the first 4 × 4-bit block among the 4 × 4-bit blocks in the FNZB bit-plane has RP magnitude and sign bits because only the 2 × 2 set that includes the root pixel(s) is initially classified as a significant block as explained in Section III-A.

With the bitstream organization given in Fig. 4, each 4 × 4-bit block in a single bit-plane can be processed in a single cycle in a pipelined manner. Fig. 5 shows the pipelined execution of the BPS decoder. Fig. 5(a) shows an 8 × 8 block that is decomposed into four 4 × 4 blocks, denoted by H1, H2, H3, and H4, respectively. Fig. 5(b) shows the execution cycle of the three passes of the four 4 × 4-bit blocks. The RP and SP of a 4 × 4-bit block can be processed in parallel.
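The sorting pass and first refinement pass of Section III-A can be sketched as below. This is a deliberately reduced illustration under stated assumptions, not the algorithm of Fig. 3: it covers only a 4 × 4 block H that was insignificant in bit-plane n + 1 and has no descendants, so S_n(H ∪ D(H)) reduces to S_n(H) and the dsig parameter is not needed. The function names and the `parent_insignificant` flag are hypothetical. The sample block H3 is taken from Table I of the worked example in Section III-C.

```python
def significant(coeffs, n):
    """True when some |w| in coeffs is at least 2**n."""
    return any(abs(w) >= (1 << n) for w in coeffs)


def sorting_pass(blocks, n, parent_insignificant=False):
    """SP for one 4 x 4 block H, given as four 2 x 2 blocks of integers.

    Assumes H was insignificant in bit-plane n+1 and has no descendants.
    Returns (sorting_bits, indices of the new-significant 2 x 2 blocks).
    """
    if parent_insignificant:
        # Second condition of the SP fails: the parent already implies
        # that H is insignificant, so no sorting bit is sent.
        return "", []
    pixels = [w for q in blocks for w in q]
    if not significant(pixels, n):
        return "0", []
    bits = "1"
    new_significant = []
    for index, q in enumerate(blocks):
        if significant(q, n):
            bits += "1"
            new_significant.append(index)
        else:
            bits += "0"
    return bits, new_significant


def first_refinement_pass(block, n):
    """FRP for one new-significant 2 x 2 block.

    Every pixel was insignificant in bit-plane n+1, so a magnitude bit of 1
    means the pixel is significant for the first time and its sign is sent.
    """
    magnitude_bits = "".join(str((abs(w) >> n) & 1) for w in block)
    sign_bits = "".join("-" if w < 0 else "+" for w in block if (abs(w) >> n) & 1)
    return magnitude_bits, sign_bits


h3 = [[-16, -8, 2, 2], [2, -4, 0, -2], [2, 4, 4, 6], [0, 0, -2, 2]]
sorting_bits, new_blocks = sorting_pass(h3, 4)
print(sorting_bits, new_blocks)
for index in new_blocks:
    print(first_refinement_pass(h3[index], 4))
```

Returning the new-significant block indices separately from the sorting bits mirrors the one-cycle delay described above: the FRP consumes what the SP produced in the previous cycle.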
Recall that parallel processing of two passes requires not only the two passes being independent of each other but also prior knowledge of the length of the bitstream of the pass that is stored ahead in the bitstream. In this case, the number of magnitude bits transmitted by an RP is known before RP starts because it is determined by the result from the previous bit-plane. Therefore, the end bit of the RP magnitude (and the start bit of the next SP sorting) is known prior to the beginning of the RP. Thus, both RP magnitude bits and SP sorting bits can be decoded in parallel. On the other hand, the number of sign bits in the RP is known only after the magnitude bits of the corresponding RP are decoded. Thus, the sign bits of RP are decoded one cycle later than the corresponding magnitude bits. For FRP, the number of sorting bits transmitted by the SP is known only after SP is completed. Thus, the next bits, FRP magnitude bits, can be decoded one cycle later than SP. The number of magnitude bits from FRP is determined by the result of the SP in the same bit-plane. Thus, the length of the FRP can be precalculated before the FRP begins. This implies that the start bit of the RP of the next 4 × 4-bit block is also known before the FRP begins. Therefore, the RP (and SP as well) of the next 4 × 4-bit block is performed in the same cycle as FRP. In summary, the RP and SP can be processed in parallel with the FRP of the previous 4 × 4-bit block.

Fig. 4. Bitstream organization.

Fig. 5. Pipelined execution of the BPS decoder. (a) 8 × 8 block decomposed into four 4 × 4 blocks. (b) Pipelined execution of RP, SP, and FRP.

As shown in Fig. 4, sign bits are stored from the right of the bitstream. The length of the sign bits transmitted by RP is not known before the RP is completed because it is determined by the RP according to the magnitude bit. Therefore, the sign bits are processed by the decoder one cycle after the corresponding magnitude bits. The sign bits of FRP can also begin only after the magnitude bits of the FRP are decoded. Thus, the decoding of FRP sign bits is done one cycle after the decoding of FRP magnitude bits. Note that the FRP sign bits can be processed in the same cycle as the RP sign bits of the next 4 × 4-bit block.

C. An Example of Block-Based Pass-Parallel SPIHT

This section explains the proposed BPS using a coding example of size 8 × 8 shown in Fig. 6. The 8 × 8 wavelet coefficients are decomposed into four 4 × 4 blocks that are denoted by H1, H2, H3, and H4, respectively. A 3-level wavelet transform is done and the upper leftmost pixel constitutes the root. The fifth and fourth bit-planes are shown in Fig. 6. In these figures, insignificant set, significant pixel, and insignificant pixel are differentiated by darkness. Significant pixels are indicated by darker shaded areas whereas insignificant pixels are represented by lighter shaded areas. Insignificant sets are represented by white boxes.

Fig. 6. Example of three-level wavelet coefficients used to demonstrate the proposed SPIHT algorithm.

The value of the root pixel is 672, which is the maximum absolute value among all coefficients. Thus, the FNZB is 9. BPS starts its coding from the sixth bit-plane because the maximum magnitude value other than the root is 72. The value 0101₂ from the MSB to the seventh bit of the root 672 (= 01010100000₂) is stored in the header. The bitstream generation of the fourth bit-plane is explained in this example. For each 8 × 8-bit block, the Morton processing order of the 4 × 4-bit blocks is H1, H2, H3, and H4. Among the four 2 × 2 blocks that make up H1, all 2 × 2 blocks except the right and bottom block {12, 10, -2, 4} are significant and therefore they are processed in the RP. For the first 2 × 2 block {672, -72, -72, 32}, magnitude bits “0000” are generated because their fourth magnitude bits are all zero. For the second block {-40, 24, -4, 0}, magnitude bits “0100” are generated. In addition, a sign bit “+” is generated because the first significant magnitude bit is generated for pixel 24. Note that the sign bit is stored separately at the end of the bitstream. For the third block {-52, -8, 0, -18}, magnitude bits “1001” are generated.
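The three RP groups above add up to 12 magnitude bits, a count the decoder can obtain before reading any of them, as Section III-B requires. The sketch below illustrates that bookkeeping under assumptions: the function names are hypothetical, and the per-pixel flags for the fifth bit-plane are written out by hand from the coefficient values quoted above (a real decoder would hold them as state from the previous bit-plane).

```python
def rp_magnitude_length(block_was_significant):
    """Number of RP magnitude bits for one 4 x 4 block in bit-plane n.

    block_was_significant holds the four flags S_{n+1}(Q), one per 2 x 2
    block, so the length is known before the RP bits are read.
    """
    return 4 * sum(block_was_significant)


def rp_sign_count(magnitude_bits, pixel_was_significant):
    """Number of RP sign bits, available only after the magnitude bits.

    A sign bit follows each pixel whose decoded bit is 1 and that was not
    yet significant in bit-plane n+1.
    """
    return sum(
        1
        for bit, was_significant in zip(magnitude_bits, pixel_was_significant)
        if bit == "1" and not was_significant
    )


# H1 in the fourth bit-plane: three of its four 2 x 2 blocks were
# significant after the fifth bit-plane.
length = rp_magnitude_length([True, True, True, False])
print(length)

# Flags |w| >= 32 for {672,-72,-72,32}, {-40,24,-4,0}, {-52,-8,0,-18}.
was_significant = [True, True, True, True,
                   True, False, False, False,
                   True, False, False, False]
print(rp_sign_count("000001001001", was_significant))
```

The two functions run in different pipeline cycles in the decoder described above: the first before the RP magnitude bits are parsed, the second one cycle later.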
Note that the first significant magnitude bit is generated for pixel -18, so the sign bit “-” is also generated. The RP for H1 is completed and followed by the SP for H1. The union of the 2 × 2 block {12, 10, -2, 4} and its descendant (= H4) is insignificant in the fourth bit-plane (S_4({12, 10, −2, 4} ∪ H4) = 0) so that the sorting bit “0” is generated to indicate the insignificance of the block. H1 in the fourth bit-plane does not include any new-significant block, and consequently, no block is processed by the FRP.

Next, the bitstream for H2 is generated. H2 is insignificant in the fifth bit-plane, so all the 2 × 2-bit blocks are insignificant blocks and no 2 × 2-bit blocks are processed in the RP. In SP, “0” is generated because H2 is also insignificant in the fourth bit-plane. Next, H3 is insignificant in the fifth bit-plane, so it is not processed by the RP. In the SP, H3 is significant in the fourth bit-plane, so the sorting bit “1” is generated and H3 is partitioned into four 2 × 2-bit blocks and the significance of each 2 × 2-bit block is tested. The first 2 × 2-bit block {-16, -8, 2, 2} is significant, and sorting bit “1” is generated whereas the other 2 × 2-bit blocks are insignificant, and sorting bit “0” is generated. Block {-16, -8, 2, 2} is a new-significant block which is processed by the FRP, generating “1-000.” Next, H4 consists of four insignificant 2 × 2 blocks, so H4 is not processed by the RP. In the SP, the second “if-condition” is not satisfied, so it is not processed by the SP. Therefore, no bit is generated for H4.

TABLE I
Bitstream Results of the Fourth Bit-Plane in Fig. 6 by Pass-Parallel SPIHT

4 × 4 Block   Pixel or Set           RP      SP   FRP
1             {672, -72, -72, 32}    0000
              {-40, 24, -4, 0}       01+00
              {-52, -8, 0, -18}      1001-
              {12, 10, -2, 4}                0
2             H2                             0
3             H3                             1
              {-16, -8, 2, 2}                1    1-000
              {2, -4, 0, -2}                 0
              {2, 4, 4, 6}                   0
              {0, 0, -2, 2}                  0
4             H4

The bitstream generated in the fourth bit-plane is shown in Fig. 7. The bit results in Table I are stored from the top row to the bottom row. The magnitude bits of the RP pass of H1 are stored in sequence as 0000 0100 1001. Then, the sorting bit of H1 is stored next as 0 followed by the sorting bit of H2 which is also 0. Then, the sorting bits of H3, 11000, are stored followed by the magnitude bits of the FRP which are 1000. The sign bits “+” and “-” are stored as “0” and “1,” respectively, and they are stored from the right of the bitstream. Thus, the sign bits “+” followed by “-” generated by the RP pass of H1 are stored as 01 from the right to the left. Then, the sign bit “-” generated by the FRP pass of H3 is stored as 1. Note that the output orders of the SPIHT and the BPS bitstreams are different, but the values corresponding to each coefficient are the same for both algorithms. Thus, these two algorithms produce different bitstreams only when the length of the output bitstream is limited for compression.

Fig. 7. Bitstream generated from the example in the fourth bit-plane.

In this example, the magnitude bit length of RP and FRP passes can be precalculated prior to the start of the passes. For example, consider the magnitude bit length of the RP pass of H1 in the fourth bit-plane. From the result of the fifth bit-plane, the decoder determines that three 2 × 2 blocks are categorized as significant blocks. This implies that the magnitude bits of 12 pixels are generated by the RP pass so that the magnitude bit length of this RP pass is 12. H2, H3, and H4 are all insignificant in the fifth bit-plane. This indicates that the magnitude bit lengths of the RP passes for H2, H3, and H4 are all zero in the fourth bit-plane. The length of FRP is determined by the result of SP in the same bit-plane. In this example, {-16, -8, 2, 2} in H3 is the only 2 × 2-bit block that is converted from an insignificant block to a significant block. Therefore, four magnitude bits are generated in the corresponding FRP pass. The length of the sign bits in the RP (and also FRP) is known after the magnitude bits of the same pass are decoded. For example, among the 12 magnitude bits generated by the RP, “24” and “-18” are the two pixels that generate the magnitude bit of value “1” for the first time for each pixel. Thus, the two sign bits need to be decoded in the corresponding RP.

D. Hardware Organization of Block-Based Pass-Parallel SPIHT

Fig. 8 shows the hardware organization of the BPS encoder and decoder. The computation is decomposed into two steps that are executed in a pipelined manner for both the encoder and the decoder. In the encoder, three passes, FRP, RP, and SP, are processed in parallel by three dedicated modules. The bit information (magnitude, sign, sorting bits, and their lengths) generated by the three passes is forwarded to the bitstream aligner that merges the information and generates the final bitstream. For example, the sign bits are stored in the temporary buffer and generated at the end of the bitstream. For bitstream merge and alignment, multiple barrel shifters and registers are used by the bitstream aligner.

In the BPS decoder, the first stage is the bitstream parser that identifies the bit information to be passed to the three parallel pass decoders in the second stage. The sign bits are processed separately as they are stored from the end of a given bitstream and parsed in the reverse direction. Similar to the bitstream aligner in the encoder, the parser consists of multiple barrel shifters and registers. Unlike the aligner, the length of the bits in each pass is not available. Instead, the length information is obtained from the previous decoding results. The derivation of the bit length makes the decoder more complex than the encoder.

Fig. 9 shows the organization of the bitstream parser. The 32-bit stream is stored in register D1.
is generated and the stream in D1 is moved to D0. D2 is used for storing the accumulated bit length of the parsed bitstream. The output bits of the barrel shifter BS0 are decomposed into three bit streams by additional barrel shifters BS1 and BS2. The decomposed bits are fed forward to three independent passes, FRP, RP, and SP. The length precalculator derives the lengths of bits to be decoded by FRP and RP (LFRP and LRP) and gives them to BS1 and BS2, which can extract the exact bits that are to be decoded by the RP and SP passes. These lengths are derived from the decoding results of the previous bit-plane. The length of the bits to be decoded in a single cycle (L) is obtained by the summation of LFRP, LRP, and LSP. LSP is the length of the bits decoded by SP and its value ranges from 0 to 5. This length is also calculated by the length precalculator.

For sign bits, a bitstream parser similar to that shown in Fig. 9 is also used. As SP does not generate any sign bit, only FRP and RP decode sign bits. The length of the sign bits is smaller than that of the magnitude bits. As a result, the sign bit parser is less complex than the magnitude bit parser.

Fig. 8. Hardware organization of the BPS encoder and decoder.

Fig. 9. Organization of the bitstream parser in a BPS decoder.

Fig. 10. Data storage pattern for BPS. (a) 8 × 8 wavelet coefficients in the sign and magnitude format. (b) Coefficients stored in the pattern in which each wavelet coefficient is stored in the same address location. (c) Coefficients stored in the pattern in which a 4×4 block in the same bit-plane is stored in the same address location.

Both the encoder and the decoder are designed to process a 4 × 4 block in a single cycle. If the size of a transform-block increases, the processing time also increases. On the other hand, the complexity of the hardware is independent of the size of the transform-block. This is different from the hardware for the wavelet transform, of which the complexity increases as the transform-block size increases.

IV. Experimental Results

A. Implementation of BPS

An image is decomposed into transform-blocks, each of which is transformed by DWT and then coded by BPS encoding. As the size of the transform-block increases, the coding efficiency and the hardware cost of DWT also increase [21]. Section IV-B shows the experimental results on the coding efficiency according to the size of the transform-block.

The DWT module generates wavelet coefficients in the sign and magnitude forms. On the other hand, the BPS encoder module accesses one 4 × 4-bit block at a time. Thus, the storage pattern of the wavelet coefficients generated by the DWT module is not suitable for the BPS algorithm. For efficient data access by the BPS algorithm, the storage pattern presented in [22] is adopted, which is shown in Fig. 10. The wavelet coefficients of size 8×8 are decomposed into four 4×4 blocks of coefficients, H1, H2, H3, and H4. Fig. 10(b) shows the storage pattern generated by a DWT module, whereas Fig. 10(c) shows the pattern stored for BPS. The pattern shown in Fig. 10(c) stores 4 × 4-bit blocks in the bit-plane order from the most significant bit-plane to the least significant bit-plane. Within a bit-plane, 4 × 4-bit blocks are stored in the Morton scanning order. Within a 4 × 4-bit block, each bit is also stored in the Morton scanning order. As BPS can access the magnitude and sign data at the same time, the magnitude and sign buffers are stored separately. If the BPS encoder and decoder use the pattern shown in Fig. 10(b), they should access the memory 16 times to get a 4×4-bit block, which wastes memory bandwidth. With the storage pattern shown in Fig. 10(c), only a single memory access is required to get a 4 × 4-bit block. Further details about the storage pattern are described in [22].

Recall that the initialization step derives the parameter dsig for every 2×2 set (see the last paragraph in Section III-A). For the derivation of dsig of a 2×2 set Q, another parameter dmax
is computed first, where dmax is the maximum absolute value of all entries in the set Q ∪ D(Q). Once dmax is obtained, dsig is derived by setting all the bits from the most significant bit with its value of “1” down to the least significant bit. For example, when dmax is $00010010_2$, the fourth significant bit is the most significant bit with its value “1.” Then, dsig is $00011111_2$, which is obtained by setting all the bits from the fourth significant bit down to the least significant bit. Once dsig is obtained, the significances for the nth bit-plane are obtained by using a bitwise logical operation as follows:

$$S_n(Q \cup D(Q)) = \mathrm{AND}\big(\mathrm{dsig}(Q \cup D(Q)),\ 2^n\big) \qquad (2)$$

where AND represents the bitwise logical AND operation and dsig(Q ∪ D(Q)) represents the value of dsig derived for set (Q ∪ D(Q)). Once the significance is derived for all 2×2 sets, the significance of every 4 × 4 set is obtained by a logical OR operation. Let H denote a 4 × 4 set and Q1, Q2, Q3, and Q4 denote the subsets of H; then the significance of H ∪ D(H) is

$$S_n(H \cup D(H)) = \mathrm{OR}\big(S_n(Q_1 \cup D(Q_1)),\ S_n(Q_2 \cup D(Q_2)),\ S_n(Q_3 \cup D(Q_3)),\ S_n(Q_4 \cup D(Q_4))\big) \qquad (3)$$

where OR represents the bitwise logical OR operation.

B. Comparison

This section evaluates the compression efficiency of the proposed BPS algorithm. Recall that BPS makes three modifications to speed up SPIHT computation. These modifications are block-based processing, pass reorganization, and the Morton-order-based operation in each pass. The increased speed achieved by these modifications may cause a slight degradation of the compression efficiency. Thus, the degradation of the compression efficiency by each of these modifications is evaluated by experimentation in which each modification is employed by SPIHT independently of the other modifications. Note that the number of generated bits is the same for all modified versions of the SPIHT algorithm when no data is lost. For lossy compression, when the number of allowed bits is limited, the compression efficiency depends on the modification because a different modification leads to a different generation order, which affects the compression efficiency.

TABLE II
PSNR (in dB) Comparison of Original SPIHT and Our High Throughput SPIHT

Transform-    b/p    SPIHT    Morton Order    Pass-Parallel    Block-Based      Block-Based Pass-Parallel
Block Size                    SPIHT           SPIHT            Pass-Parallel    SPIHT + Root in Header
                                                               SPIHT
16 × 16       4      44.66    44.64           43.73            43.70            43.97
              2      35.02    34.98           34.52            34.63            34.85
              1      29.66    29.64           29.19            29.41            29.70
              0.5    25.96    25.97           25.68            25.89            26.24
32 × 32       4      45.84    45.82           45.06            45.05            45.00
              2      36.34    36.26           35.75            35.87            35.83
              1      31.08    31.03           30.56            30.73            30.67
              0.5    27.46    27.42           27.07            27.27            27.20
64 × 64       4      46.56    46.54           45.91            45.88            45.84
              2      37.19    37.10           36.55            36.64            36.61
              1      31.91    31.85           31.30            31.51            31.45
              0.5    28.23    28.17           27.75            27.99            27.92
128 × 128     4      47.20    47.17           46.73            46.68            46.64
              2      38.15    38.03           37.54            37.59            37.55
              1      32.73    32.67           32.21            32.34            32.30
              0.5    28.95    28.87           28.45            28.62            28.56
256 × 256     4      47.62    47.59           47.28            47.23            47.20
              2      38.84    38.70           38.18            38.23            38.20
              1      33.41    33.29           32.91            32.96            32.93
              0.5    29.48    29.39           29.01            29.13            29.07
512 × 512     4      48.02    47.97           47.88            47.81            47.78
              2      39.67    39.50           39.18            39.08            39.06
              1      34.31    34.21           33.92            33.92            33.89
              0.5    30.21    30.13           29.90            29.98            29.92
Average       4      46.65    46.62           46.10            46.06            46.07
              2      37.54    37.43           36.95            37.01            37.01
              1      32.18    32.11           31.68            31.81            31.82
              0.5    28.38    28.32           27.97            28.15            28.15

Table II shows the results of the experiments. Test images are Barbara, Gold hill, and Lena [monochrome, 8 bits per pixel (b/p), 512 × 512] and Bike, Cafe, and Woman (monochrome, 8 b/p, 2560 × 2048) from the JPEG 2000 test bed. A test image is partitioned into blocks and then each block is transformed by DWT. The integer Le Gall 5/3 filter is used for DWT. The first column of Table II shows the transform-block size. The second column shows the number of bits per pixel, which determines the compression ratio. For example, a bit rate of 4 b/p is equivalent to a compression ratio of 2, as the original image bit rate without compression is 8 b/p. The peak signal-to-noise ratio (PSNR) performance of the original SPIHT algorithm is shown in the third column. As the compression ratio increases, the PSNR decreases rapidly. Meanwhile, as the DWT block size
increases, the PSNR increases accordingly. The fourth column shows the PSNR performance of the Morton order SPIHT, which is equivalent to no list SPIHT [14]. The PSNR is slightly decreased by an average of 0.07 dB (ranging from 0.03 to 0.11 dB) when compared with the original SPIHT. The fifth column shows the PSNR performance of the modified version of the Morton order SPIHT algorithm that adopts the reorganized three passes, RP, SP, and FRP, instead of the IPP, ISP, and SPP. As shown in the table, this modification decreases the PSNR performance by an average of 0.51 dB (ranging from 0.40 to 0.58 dB). The sixth column shows the PSNR performance of the proposed 4 × 4 BPS. When compared with the data in the fifth column, the PSNR slightly increases except in the case when the bit rate is 4 b/p, as the average PSNR decrease is 0.43 dB (ranging from 0.23 to 0.59 dB). In pass-parallel SPIHT, the refining bit always precedes the first refining bit in a bit-plane. On the other hand, in BPS, the first refining bit of a lower subband precedes the refining bit of a higher subband. In general, the first refining bits of a lower subband are more important than the refining bits of a higher subband. This is the reason why block-based processing may lead to a slight increase of the PSNR. When the bit rate is 4 b/p, the efficiency of BPS may be reduced for the following reason. The integer Le Gall 5/3 DWT used for the experiment always yields zero for the LSB of the low-pass coefficients through the scaling step because the scaling factor is two for the low-pass coefficients. When the bit rate is high, the LSB planes are likely to be coded. In LSB plane coding, BPS always outputs the coefficients of the lower subbands in advance of those of the higher subbands, which does not increase the PSNR performance.

The last column shows the PSNR performance of BPS with the scheme that processes from the FNZBth bit-plane. As the DWT block size increases, the PSNR values drop sharply. This scheme is efficient only when the block size is 16 × 16. The reason is that the FNZB with roots and the FNZB without roots differ significantly only when the transform-block size is 16 × 16. Therefore, as the transform-block size increases, the benefits from this scheme decrease whereas the overhead increases.

Fig. 11 presents the Barbara image that is encoded by the proposed BPS algorithm and then decoded again by the BPS algorithm. The transform-block size is 16 × 16 and the bit rate is 4, 2, 1, and 0.5 bits per pixel. The PSNRs are 43.28 dB, 33.12 dB, 27.74 dB, and 24.42 dB, respectively. As the bit rate decreases, the subjective quality also degrades, with blurring observed at object edges. A blocking effect is also apparent in the image with the bit rate at 0.5 b/p. The blocking effect can be reduced by increasing the transform-block size, although the hardware cost increases dramatically as the transform-block size increases.

Fig. 11. Barbara image compressed by the BPS algorithm with various bit rates. (a) 4 b/p. (b) 2 b/p. (c) 1 b/p. (d) 0.5 b/p.

Fig. 12 compares the PSNRs of six test images compressed by the original SPIHT and BPS. The transform-block size is chosen as 16 × 16. Depending on the complexity of a test image, the PSNR degradation varies substantially. The PSNR degradation is relatively large for complex images such as Barbara, whereas it is small for relatively simple images such as Lena. For example, the difference in PSNR degradation between the Lena image and the Barbara image is 1.32 dB when the bit rate is 4 b/p. The PSNR degradation decreases as the bit rate decreases. The bottom graph shows the averages over the six test images. The PSNR degradation is 0.97 dB on average when the bit rate is 4 b/p, whereas it is 0.07 dB when the bit rate is 0.5 b/p.

Table III compares the throughputs of the proposed design and previous designs. For those algorithms without a decoder algorithm (no list SPIHT and bit-plane parallel SPIHT), the decoder throughput is considered the same as that of the no list SPIHT encoder. This is a reasonable number because a no list SPIHT decoder can process both no list SPIHT and bit-plane parallel SPIHT. As shown in the table, the proposed design achieves a markedly large throughput when compared with previous work. In most systems with an integrated frame memory compression (FMC) module [5]–[7], the encoder and decoder throughputs of the FMC need to be balanced. Otherwise, the overall performance of the system may be significantly decreased by the one with the smaller throughput. Thus, the large throughputs for both the encoder and the decoder of the proposed design allow much improved system performance when they are integrated in the system.

The fourth column shows the operating clock frequency that is obtained from the timing analysis after place and route for FPGA targeting. Previous works also use the timing analysis results from FPGA synthesis. Note that previous FPGA technology may be slower than that used by the proposed design. To eliminate the effect of the technology difference and emphasize the enhancement by the proposed design, the fifth and sixth columns in Table III present additional comparison factors, which are the encoder and decoder throughputs normalized to a 1 MHz clock frequency. As
shown in the table, the normalized decoder throughput is significantly improved by the proposed design, whereas the normalized encoder throughput is somewhere between those of no list SPIHT and bit-plane parallel SPIHT.

Fig. 12. Compression efficiency comparison of SPIHT and BPS with six test images. Transform block size is 16 × 16.

TABLE III
Throughput Comparison of the SPIHT Architecture and Previous Works

                                 Encoder        Decoder        Frequency                      Normalized Encoder     Normalized Decoder
                                 Throughput     Throughput     (MHz)                          Throughput             Throughput
                                 (MPixels/s)    (MPixels/s)                                   (MPixels/(s × MHz))    (MPixels/(s × MHz))
No list SPIHT [15]               18.4           18.4           100                            0.184                  0.184
Two-level SPIHT [16]             4.35           4.35           10                             0.435                  0.435
Bit-plane parallel SPIHT [17]    224            18.4           56                             4                      0.184
BPS                              180            132            150 (encoder)/110 (decoder)    1.2                    1.2

Hardware implementations of the various SPIHT algorithms are summarized in Table IV. The results published in [15]–[17] are presented for no list SPIHT, two-level SPIHT, and bit-plane parallel SPIHT, respectively. The second column shows the implementation technology. For previous work, the numbers published in [15]–[17] are given in the table. No list SPIHT sequentially generates 1 or 2 bits in a cycle, resulting in a very small throughput of 18.4 Mpixels/s. The hardware occupies 4500 slices of Xilinx Virtex-II, which corresponds to 10 125 logic cells (presented in the fourth column in Table IV) [23]. Bit-plane parallel SPIHT provides a large throughput, which is expected because this algorithm processes all bit-planes concurrently. However, bit-plane parallelism cannot be used for a decoder because the decoder cannot decompose a bitstream bit-plane by bit-plane. The hardware of bit-plane parallel SPIHT is implemented with three Xilinx Virtex 2000E FPGAs with 62%, 34%, and 98% of their capacity consumed, respectively [17]. This implies that a large amount of hardware logic is necessary to process all bit-planes in parallel. As the capacity of this FPGA is 43 200 logic cells, this hardware cost corresponds to 83 808 logic cells, which is much larger than that of no list SPIHT [24]. On the contrary, the hardware implementation for the BPS algorithm is not very large because it is designed to meet the target throughput by exploiting pixel-level parallelism. As shown in the table, the BPS encoder and decoder require 22 996 and 21 901 logic cells, respectively. The proposed design requires about twice as many hardware resources as no list SPIHT, whereas it is about one fourth of the hardware cost of bit-plane parallel SPIHT.

TABLE IV
Implementation Comparison of the SPIHT Architecture and Previous Works

                                 Technology                   Size            Logic Cell (FPGA)    Memory (bit)
No list SPIHT [15]               Xilinx Virtex II-XC2V1000    4500 slices     10 125/11 520        24 BRAM (24×18K bits)
Bit-plane parallel SPIHT [17]    Xilinx Virtex 2000E          62%+34%+98%     83 808/43 200        N/A
BPS encoder                      Xilinx Virtex V-LX330        3592 slices     22 996/331 876       8320
BPS decoder                      Xilinx Virtex V-LX330        3421 slices     21 901/331 876       6656

In order to avoid any unfair comparison by using different technologies, only the FPGA synthesis results are used for comparison. Although the FPGA used for the proposed design is different from that for the previous designs, all FPGAs are developed by the same company and the logic cell counts can be used for comparison of the hardware cost because they are normalized for comparisons among different FPGAs. From the logic cell counts in Table IV, it is observed that the proposed design requires much less hardware than bit-plane parallel SPIHT.

The two-level SPIHT presented in Table III is not presented in Table IV because it runs at a very low clock frequency of 10 MHz when compared with the other designs.

Consider the memory requirement for the transform of size 16 × 16 with each wavelet coefficient stored in 13 bits. The wavelet coefficients require 16×16×13 bits when they are stored in the format shown in Fig. 10(b). After the coefficients are converted into the format in Fig. 10(c), a buffer of the same size is also necessary. For the calculation of dsig, 1/4×16×16×13
bits are necessary, and a buffer of the same size is also necessary for the format conversion. Thus, the total memory requirement is 8320 bits (=2.5×16×16×13). Additional buffers temporarily storing the input image and output stream for an encoder (or input stream and output image for a decoder) are necessary, but these buffers are not counted because they may be considered outside the BPS hardware and may also vary depending on the architecture that utilizes the BPS module.

The comparison of the hardware cost given in Table IV is somewhat unfair because the DWT sizes of the previous implementations are different from that of BPS. No list SPIHT [15] performs 64×64 DWT, whereas two-level SPIHT [16] and bit-plane parallel SPIHT [17] implement DWTs of sizes 8 × 8 and 512 × 512, respectively. As the DWT size affects the compression efficiency, a fair comparison requires the same DWT size. The memory buffer size increases in proportion to the DWT size, whereas the logic gate count may not change much because the hardware logic speed is fixed as the target throughput (pixels/cycle), which does not change depending on the DWT size. For a fair comparison, the memory size given in the rightmost column needs to be weighted by the DWT size.

TABLE V
Gate Counts of Hardware Modules

Module                   Gate Count          Logic Cell
                         (ASIC 0.13 μm)      (Xilinx Virtex V-LX330)
DWT                      5.5 K               5214
DSIG                     3.0 K               2750
C2B                      2.6 K               2378
BPS encoder core         12.7 K              12 654
  - 3 parallel pass      9.5 K               8994
  - Bitstream aligner    3.2 K               3661
IDWT                     5.6 K               5232
B2C                      2.6 K               2396
BPS decoder core         14.5 K              14 273
  - 3 parallel pass      8.4 K               8303
  - Bitstream parser     6.1 K               5970

The detailed information about the hardware size is given in Table V. DSIG (third row) represents the module that calculates dsig, which is explained in Section IV-A. DSIG calculation is required only for an encoder. C2B (fourth row) represents the module that converts the data format shown in Fig. 10(b) to that in Fig. 10(c), whereas B2C is the module for the format conversion in the opposite direction. The conversion of the data pattern from Fig. 10(b) to Fig. 10(c) requires additional hardware resources, namely the 16×16×13 buffer, the 1/4×16×16×13 buffer, and the address generation module (C2B in Table V for an encoder and B2C for a decoder). Given that the total buffer size is 2.5×16×16×13, the memory size increases by 50%. On the other hand, the logic gate count increases by 9.2% due to the addition of the C2B and B2C modules. The proposed BPS design requires additional hardware units, a bitstream aligner for the encoder and a bitstream parser for the decoder, due to its pass-parallel processing. The gate counts of these two units are 3.2K and 6.1K gates, respectively. These are about 24% and 44% of the hardware costs of the BPS encoder and decoder, respectively. This implies that a substantial overhead is necessary to implement the proposed algorithm. Furthermore, control signal generation for the datapaths may also be slightly increased by the complex conditional control flow of BPS. However, this increased complexity is very small. The reason is that the repeated operation in Fig. 3 is precomputed in the DSIG module described in Section IV-A and the rest is implemented by bit-level combinational logic rather than a multiplexer and demultiplexer.

For memory bandwidth, the proposed design does not increase the required bandwidth. The reason is that both BPS and SPIHT require just a single memory access for each bit except the sign bit in the buffer to complete either encoding or decoding. The number of sign bit accesses is also the same for both BPS and SPIHT.

V. Conclusion

The proposed BPS algorithm modified the processing order of the original SPIHT in order to increase the speed of both an encoder and a decoder. As a result, a processing speed of 120 million pixels per second can be achieved by both an encoder and a decoder at a slight sacrifice of compression efficiency. This speed allowed 1920×1080 video in the 4:2:2 format to be processed at 30 frames/s. The hardware implementation of BPS with Verilog HDL showed that the required gate count is 43.9 K. BPS is the only algorithm that allows a fast processing time for both an encoder and a decoder with a small hardware cost. The increased throughput with a small hardware cost made it possible for a SPIHT-based compression to be used for many video compression applications that require fast encoding and decoding time. These applications include frame memory compression for an H.264 or MPEG codec chip, and overdrive detection for LCD displays.

A possible modification of the SPIHT algorithm is that an image is partitioned into multiple blocks (or stripes) and that coefficient trees are local to these blocks. Then, these blocks are simply processed in parallel with a relatively simple algorithm. The block boundaries in the bitstream could be given in the header, or the same bitrate could be used for all blocks. As a result, the decoder can also decode these blocks in parallel. This new scheme gives another option that speeds up the algorithm at a sacrifice of compression efficiency. In fact, an implementation of [15] processes four 64 × 64 blocks in parallel, achieving a fourfold speedup. We believe that one limitation of this scheme is that the compression block size cannot be small. This is because the DWT size must also decrease as the compression block size decreases. Note that a small DWT size may substantially reduce the compression efficiency. For applications such as an H.264 video codec chip, a macroblock of size 16 × 16 is used as the basic unit for memory access, and consequently, the memory compression used in an H.264 codec often requires 16 × 16 as the compression block size. Thus, this scheme may not be effectively used for frame memory compression for an H.264 codec. On the other hand, the proposed BPS does not have any limitation on the DWT size. Another limitation with a large compression block size is that the hardware complexity may increase as the block size increases. The increased complexity
makes it difficult to increase the operating clock frequency, and consequently, reduces the throughput. Despite the limitations, this scheme provides a good tradeoff between speed and compression and it can be combined with BPS to obtain the best tradeoff.

References

[1] JPEG2000 Image Coding System, document ISO/IEC 15444-1, 2000.
[2] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3445–3462, Dec. 1993.
[3] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans. Image Process., vol. 9, no. 7, pp. 1158–1170, Jul. 2000.
[4] A. Said and W. Pearlman, “A new, fast, and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, Jun. 1996.
[5] T. Y. Lee, “A new frame-recompression algorithm and its hardware design for MPEG-2 video decoders,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 6, pp. 529–534, Jun. 2003.
[6] Y. Jin, Y. Lee, and H.-J. Lee, “A new frame memory compression algorithm with DPCM and VLC in a 4 × 4 block,” EURASIP J. Adv. Signal Process., vol. 2009, no. 629285, p. 18, 2009.
[7] W.-Y. Chen, L.-F. Ding, P.-K. Tsung, and L.-G. Chen, “Architecture design of high performance embedded compression for high definition video coding,” in Proc. IEEE Int. Conf. Multimedia Expo, Apr.–Jun. 2008, pp. 825–828.
[8] J. Someya, A. Nagase, N. Okuda, K. Nakanishi, and H. Sugiura, “Development of single chip overdrive LSI with embedded frame memory,” in Proc. SID Symp. Dig., vol. 39, 2008, pp. 464–467.
[9] T. B. Yng, B.-G. Lee, and H. Yoo, “A low complexity and lossless frame memory compression for display devices,” IEEE Trans. Consumer Electron., vol. 54, no. 3, pp. 1453–1458, Aug. 2008.
[10] Y.-H. Lee, Y.-Y. Lee, H.-Z. Lin, and T.-H. Tsai, “A high-speed lossless embedded compression codec for high-end LCD applications,” in Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2008, pp. 185–188.
[11] J.-W. Han, M.-C. Hwang, S.-G. Kim, T.-H. You, and S.-J. Ko, “Vector quantizer based block truncation coding for color image compression in LCD overdrive,” IEEE Trans. Consumer Electron., vol. 54, no. 4, pp. 1839–1845, Nov. 2008.
[12] G. Pastuszak, “A novel architecture of arithmetic coder in JPEG2000 based on parallel symbol encoding,” in Proc. Int. Conf. Parallel Comput. Electr. Eng., 2004, pp. 303–308.
[13] W. A. Pearlman, A. Islam, N. Nagaraj, and A. Said, “Efficient, low-complexity image coding with a set-partitioning embedded block coder,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 11, pp. 1219–1235, Nov. 2004.
[14] F. Wheeler and W. Pearlman, “SPIHT image compression without lists,” in Proc. IEEE ICASSP, vol. 4, Jun. 2000, pp. 2047–2050.
[15] P. Corsonello, S. Perri, G. Staino, M. Lanuzza, and G. Cocorullo, “Low bit rate image compression core for onboard space applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 1, pp. 114–128, Jan. 2006.
[16] C.-C. Cheng, P.-C. Tseng, and L.-G. Chen, “Multimode embedded compression codec engine for power-aware video coding system,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 141–150, Feb. 2009.
[17] T. Fry and S. Hauck, “SPIHT image compression on FPGAs,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 9, pp. 1138–1147, Sep. 2005.
[18] A. Nandi and R. Banakar, “Throughput efficient parallel implementation of SPIHT algorithm,” in Proc. Int. Conf. Very Large Scale Integr. Des., 2008, pp. 718–725.
[19] R. Kutil, “Approaches to zerotree image and video coding on MIMD architectures,” Parallel Comput., vol. 28, pp. 1095–1109, Aug. 2002.
[20] V. R. Algazi and J. Estes, “Analysis-based coding of image transform and subband coefficients,” in Proc. SPIE Vis. Commun. Image Process. Conf., 1995, pp. 11–21.
[21] N. Zervas, G. Anagnostopoulos, V. Spiliotopoulos, Y. Andreopoulos, and C. Goutis, “Evaluation of design alternatives for the 2-D-discrete wavelet transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 12, pp. 1246–1262, Dec. 2001.
[22] A. Gupta, S. Nooshabadi, and D. Taubman, “Efficient interfacing of DWT and EBCOT in JPEG2000,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 5, pp. 687–693, May 2008.
[23] Virtex-II Platform FPGAs: Complete Data Sheet, Tech. Doc. DS031 (v3.5) Product Specification, Xilinx, San Jose, CA, 2007, pp. 1–318.
[24] Virtex-E 1.8V FPGAs: Complete Data Sheet, Tech. Doc. DS022 (v3.5) Product Specification, Xilinx, San Jose, CA, 2002, pp. 1–233.

Yongseok Jin (M’09) received the B.S. and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2003 and 2010, respectively. He is currently a Research Associate with the Microsystems Design Laboratory, Department of Computer Science and Engineering, Pennsylvania State University, University Park. His current research interests include computer architectures and system-on-chip design for video coding and computer vision applications.

Hyuk-Jae Lee received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, in 1996. From 1998 to 2001, he was with the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, as a Senior Component Design Engineer. From 1996 to 1998, he was on the faculty of the Department of Computer Science, Louisiana Tech University, Ruston. In 2001, he joined the School of Electrical Engineering and Computer Science, Seoul National University, where he is currently a Professor. He is the Founder of Mamurian Design, Inc., Seoul, a fabless system-on-chip (SoC) design house for multimedia applications. His current research interests include computer architectures and SoC design for multimedia applications.
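The derivation of dsig and the significance tests described in Section IV-A (the dmax example and the AND/OR relations numbered (2) and (3)) can be illustrated with a short sketch. This is only a software illustration under stated assumptions, not the authors' hardware: the function names `dmax_of`, `dsig_from_dmax`, `significance`, and `block_significance` are hypothetical, and the set Q ∪ D(Q) is simply passed in as a flat list of coefficient values.

```python
def dmax_of(coeffs):
    """Maximum absolute value over all entries of Q ∪ D(Q) (given as a flat list)."""
    return max((abs(c) for c in coeffs), default=0)


def dsig_from_dmax(dmax):
    """Set every bit from the most significant '1' of dmax down to bit 0."""
    # bit_length() is the position just above the leading '1', so
    # (1 << bit_length) - 1 is a run of ones covering that bit and all below it.
    return (1 << dmax.bit_length()) - 1


def significance(dsig, n):
    """Relation (2): bitwise AND of dsig with 2**n; nonzero means significant."""
    return dsig & (1 << n)


def block_significance(dsigs, n):
    """Relation (3): OR of the significances of the four 2x2 subsets of a 4x4 set."""
    result = 0
    for d in dsigs:
        result |= significance(d, n)
    return result


if __name__ == "__main__":
    # The example from the text: dmax = 00010010 (binary) gives dsig = 00011111.
    dsig = dsig_from_dmax(0b00010010)
    print(format(dsig, "08b"))
    # The 2x2 block {-16, -8, 2, 2} has dmax = 16, so it is significant at n = 4.
    print(significance(dsig_from_dmax(dmax_of([-16, -8, 2, 2])), 4) != 0)
```

Because dsig is computed once per set, the per-bit-plane test reduces to a single AND, which is the property the text relies on when it says the repeated operation is precomputed in the DSIG module.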
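The buffer-size arithmetic of Section IV-B and the normalized throughputs of Table III can be checked with a few lines of code. The sketch below only reproduces the arithmetic stated in the text; the function names are hypothetical. The decoder case is an inference rather than something the text states: it assumes the decoder omits the two dsig buffers because DSIG calculation is required only for an encoder, which happens to match the 6656 bits listed in Table IV.

```python
def bps_buffer_bits(block=16, coeff_bits=13, with_dsig=True):
    """Total buffer size in bits for one transform-block.

    coefficient buffer in the Fig. 10(b) format : block * block * coeff_bits
    converted buffer in the Fig. 10(c) format   : same size
    dsig buffer and its conversion buffer       : one quarter of that, each
    """
    coeff = block * block * coeff_bits
    total = 2 * coeff                  # original format + converted format
    if with_dsig:
        total += 2 * (coeff // 4)      # dsig buffer + its format conversion
    return total


def normalized_throughput(mpixels_per_s, clock_mhz):
    """Throughput normalized to a 1 MHz clock, in MPixels/(s x MHz)."""
    return mpixels_per_s / clock_mhz


if __name__ == "__main__":
    print(bps_buffer_bits())                    # encoder: 2.5 * 16 * 16 * 13
    print(bps_buffer_bits(with_dsig=False))     # decoder, under the stated assumption
    print(round(normalized_throughput(180, 150), 3))   # BPS encoder
    print(round(normalized_throughput(132, 110), 3))   # BPS decoder
```

Passing a different `block` value shows how the buffer grows in proportion to the DWT size, which is the weighting the text asks for when comparing the memory column of Table IV.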
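The storage pattern of Fig. 10(c), in which the 16 bits of a 4 × 4-bit block are kept in the Morton scanning order so that one memory word holds a whole block of one bit-plane, can be sketched as follows. This is a minimal illustration, not the layout of [22]: the names `morton_index` and `pack_bitplane` are hypothetical, and the choice of putting the column bit below the row bit and of mapping Morton index 0 to the least significant bit of the word are assumptions made for the example.

```python
def morton_index(row, col):
    """Interleave the two row bits and two column bits of a 4x4 position.

    Assumption: the column bit occupies the lower position of each pair, so
    the scan visits (0,0), (0,1), (1,0), (1,1), then the next 2x2 quadrant.
    """
    idx = 0
    for b in range(2):
        idx |= ((col >> b) & 1) << (2 * b)
        idx |= ((row >> b) & 1) << (2 * b + 1)
    return idx


def pack_bitplane(block, n):
    """Pack bit-plane n of a 4x4 block of magnitudes into one 16-bit word."""
    word = 0
    for row in range(4):
        for col in range(4):
            bit = (abs(block[row][col]) >> n) & 1
            word |= bit << morton_index(row, col)
    return word


if __name__ == "__main__":
    # A block whose top-left 2x2 quadrant holds {16, 8, 2, 2}.
    block = [
        [16, 8, 0, 0],
        [2, 2, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    for n in (4, 3, 1):
        print(n, format(pack_bitplane(block, n), "016b"))
```

With this layout each 2×2 quadrant occupies four adjacent bits of the word, so a 2×2 significance test touches one nibble and a whole 4 × 4-bit block is fetched in a single access, in contrast to the 16 accesses the text attributes to the Fig. 10(b) pattern.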