LOW POWER Z-SCAN ARCHITECTURE FOR 2-D DWT
1. A POWER EFFICIENT ARCHITECTURE
FOR 2-D DISCRETE WAVELET
TRANSFORM
Rahul Jain, CoWare India
Preeti Ranjan Panda, IIT-Delhi
2. Agenda
Memory Power Optimization
Existing Z-Scan based Schemes
Low Power Z-Scan (Proposed Architecture )
Results
Conclusion
10 August 2006, 10th IEEE VLSI Design And Test Symposium, 2006
3. Memory Power Optimization
Importance of Optimizing Memory System Energy
Many emerging applications like JPEG2000 are data
intensive
Memory system can contribute up to 90% energy
Concurrently Optimizing Memory Architecture and
Accesses
Algorithm Level
Reduce memory requirement
Improve regularity of accesses
Build optimized memory architecture
Memory Partitioning
Custom Circuits
4. Z-Scan based Schemes [Chiu-SIPS’03]
Suspending a DWT line computation
Store 4 intermediate values
Z-Scan
Column Processing starts early
On-Chip Buffer Required = 4*M
M = Image Tile height
Optimal Z-Scan
EBCOT Code-Block size (CW*CH) considered
On-Chip Buffer Required = 4*M+4*2*CW
Usually CW=CH=64 (values used in experiments)
[Figure: scan pattern, with dimensions labelled 2*CW and 2*CH]
5. Low-Power Z-Scan (1)
Generalize the Z-Scan
Compute r elements in a row
For Z Scan, r =2
For Optimal Z-Scan, r = 2*CW
On-Chip Buffer Required = 4*M+4*r
[Figure: generalized Z-scan; labels r, r and 2*CH]
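The buffer formula above can be tabulated for a few values of r. This is a minimal sketch: the function name `buffer_words` and the choice M = 512 are illustrative assumptions, and sizes are counted in stored intermediate values, not bits.

```python
def buffer_words(m, r):
    """On-chip buffer requirement 4*M + 4*r, counted in stored values."""
    return 4 * m + 4 * r

CW = 64   # code-block width quoted on the slides
M = 512   # example tile height (assumed here for illustration)

# r = 2*CW reproduces the Optimal Z-Scan figure 4*M + 4*2*CW.
for label, r in [("r = 2", 2), ("r = 32", 32), ("r = 2*CW", 2 * CW)]:
    print(f"{label:>10}: {buffer_words(M, r)} values")
```

The 4*M term does not depend on r, so the choice of r only changes the size of the second term.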
6. Low-Power Z-Scan (2)
r will be a submultiple (integer divisor) of 2*CW
This considers the Code Block Size
2 separate buffers used
Row Buffer (RB) = 4*M
Column Buffer (CB) = 4*r
How to decide the value of r ?
Size of CB ∝ r
RB Sleep Time ∝ r
[Figure: timeline showing RB access, RB in Low Power Mode, and CB: r locations]
7. Memory Power Analysis (1)
Let us assume that each element is computed in
unit time (Energy and Power can be used interchangeably)
For a memory of size 2^n, let
Pa(2^n) : memory access power
Ps(2^n) : sleep mode / data retention mode power
Pw(2^n) : wakeup power for each state transition from
sleep mode to active mode
Let Ps(2^n) = s * Pa(2^n) and Pw(2^n) = w * Pa(2^n)
s = 0.1, w = 0.33 (Assumed for Experiments)
Buffer Accesses
Read at Resumption
Write at Suspension
8. Memory Power Analysis (2)
Row Buffer Power
2 accesses per r elements
RB in sleep mode for the remaining r-2 element computations
Wakeup RB once per row
Power per ‘r’ element computation:
Prow_buffer (r, M) = 2* Pa(M) + (r-2) * Ps(M) + Pw(M)
[Figure: RB timeline showing Row Computation Resumes, Row Computation Suspends, RB in Low Power Mode, and Wakeup]
9. Memory Power Analysis (3)
Column Buffer Power
1 access per element
Power consumption per element computation:
Pcol_buffer (r) = Pa(r)
[Figure: CB timeline showing Col Computation Resumes and Col Computation Suspends]
Power per 2-D DWT Element Computation:
Prow_buffer (r, M)/r + Pcol_buffer (r)
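The per-element expression can be evaluated numerically to see how it trades row-buffer savings against column-buffer cost as r grows. The sketch below uses s = 0.1 and w = 0.33 from the slides; the access-power curve `pa_placeholder` is a purely hypothetical stand-in (the slides take Pa from an empirical SRAM model), and all function names are illustrative.

```python
S = 0.1    # Ps = S * Pa (from the slides)
W = 0.33   # Pw = W * Pa (from the slides)

def pa_placeholder(size):
    """Hypothetical access-power curve; NOT the model used on the slides."""
    return size ** 0.5

def p_row_buffer(r, m, pa=pa_placeholder):
    """Row-buffer power per r-element computation:
    2*Pa(M) + (r-2)*Ps(M) + Pw(M)."""
    a = pa(m)
    return 2 * a + (r - 2) * S * a + W * a

def p_col_buffer(r, pa=pa_placeholder):
    """Column-buffer power per element: Pa(r)."""
    return pa(r)

def p_element(r, m, pa=pa_placeholder):
    """Power per 2-D DWT element: Prow_buffer(r, M)/r + Pcol_buffer(r)."""
    return p_row_buffer(r, m, pa) / r + p_col_buffer(r, pa)

def best_r(m, candidates=(2, 4, 8, 16, 32, 64, 128), pa=pa_placeholder):
    """Candidate r with the lowest per-element power under the given Pa."""
    return min(candidates, key=lambda r: p_element(r, m, pa))

for m in (32, 64, 128, 256, 512):
    print(m, best_r(m))
```

The r that minimizes the expression depends entirely on the shape of Pa, so results from the placeholder curve should not be read as the slides' numbers.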
10. Variation of Power with r
[Figure: Energy (J) versus value of r (2, 4, 8, 16, 32, 64, 128), one curve each for M = 32, 64, 128, 256 and 512; energy axis from 0.00E+00 to 6.00E-10; points r=16 and r=32 are annotated]
11. Power Implications of Banking (1)
Banked Buffer
Increases the average idleness of each buffer
Lower Access Power
Predictable state changes, no timing overheads
Let there be ‘b’ RB banks and ‘c’ CB banks
Average RB power per element:
Prow = [Power of bank in use*M/b + Sleep Power*(M-M/b)] / M
= [{Prow_buffer (r, M/b) / r} * M/b + Ps (M/b) * (M-M/b)] / M
Each bank is woken up once per M*r elements
Additional Row Buffer Wakeups per Element = b/(M*r)
12. Power Implications of Banking (2)
Average column-buffer power per element:
Pcol = [{Pcol_buffer (r/c)} * r/c + Ps (r/c) * (r-r/c)] / r
Number of Column Buffer Wakeups per Element = c/r
Additional Wakeup Power :
Pwakeups = [Pw(M/b) * b/(M*r)] + [Pw(r/c) * c/r]
MUX power (Pmux) is also considered
Total Power per Element :
Prow + Pcol + Pwakeups + Pmux
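The banked expressions can be combined into one function and searched over r, b and c. This is a sketch under stated assumptions: `pa_placeholder` is a hypothetical access-power curve, `p_mux` defaults to 0 because the slides give no MUX formula, and the candidate lists are illustrative.

```python
from itertools import product

S, W = 0.1, 0.33   # Ps = S*Pa and Pw = W*Pa, as assumed on the slides

def pa_placeholder(size):
    """Hypothetical access-power curve; NOT the model used on the slides."""
    return size ** 0.5

def p_total_banked(r, m, b, c, pa=pa_placeholder, p_mux=0.0):
    """Total power per element with b row-buffer and c column-buffer banks."""
    mb, rc = m / b, r / c                  # bank sizes M/b and r/c
    pa_rb, pa_cb = pa(mb), pa(rc)
    ps_rb, pw_rb = S * pa_rb, W * pa_rb
    ps_cb, pw_cb = S * pa_cb, W * pa_cb
    # Prow_buffer(r, M/b) = 2*Pa + (r-2)*Ps + Pw, evaluated for one bank
    p_row_buffer = 2 * pa_rb + (r - 2) * ps_rb + pw_rb
    p_row = ((p_row_buffer / r) * mb + ps_rb * (m - mb)) / m
    p_col = (pa_cb * rc + ps_cb * (r - rc)) / r
    p_wake = pw_rb * b / (m * r) + pw_cb * c / r
    return p_row + p_col + p_wake + p_mux

def best_config(m, rs=(2, 4, 8, 16, 32, 64, 128), bs=(1, 2, 4, 8), cs=(1, 2, 4)):
    """Exhaustive search for the (r, b, c) with the lowest modelled power."""
    valid = [(r, b, c) for r, b, c in product(rs, bs, cs) if c <= r and b <= m]
    return min(valid, key=lambda cfg: p_total_banked(cfg[0], m, cfg[1], cfg[2]))

print(best_config(512))
```

With p_mux fixed at 0 the search tends to favour more banks; a realistic MUX cost that grows with b and c would be needed to reproduce the trade-off the slides describe.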
13. r vs Power (Banked Case, M=512)
[Figure: power versus r for the banked case; minimum power with r=64, c=4, b=8]
14. Energy Consumption Comparison
M     Z-scan      Optimal Z-scan  Low-Power Z-scan  r   c  b  % imp
      (10^-11 J)  (10^-11 J)      (10^-11 J)
32    23.4        29.1            8.08              32  4  4  72.2
64    25.5        29.3            8.13              64  4  4  72.3
128   29.9        29.7            8.18              64  4  8  72.5
256   38.5        30.6            8.29              64  4  8  72.9
512   55.8        32.3            8.49              64  4  8  73.7
1024  90.3        35.8            8.89              64  4  8  75.2
Up to 90% and 75% improvement over Z-Scan and Optimal
Z-Scan respectively
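The improvement figures can be rechecked from the tabulated energies. The short sketch below recomputes the percentage saving of Low-Power Z-scan against both baselines; the helper name `improvement` is illustrative. The "% imp" column matches the saving relative to Optimal Z-scan.

```python
# (M, Z-scan, Optimal Z-scan, Low-Power Z-scan, published % imp); energies in 10^-11 J
rows = [
    (32,   23.4, 29.1, 8.08, 72.2),
    (64,   25.5, 29.3, 8.13, 72.3),
    (128,  29.9, 29.7, 8.18, 72.5),
    (256,  38.5, 30.6, 8.29, 72.9),
    (512,  55.8, 32.3, 8.49, 73.7),
    (1024, 90.3, 35.8, 8.89, 75.2),
]

def improvement(baseline, new):
    """Percentage energy saving of `new` relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline

for m, z, opt, low, published in rows:
    print(f"M={m:5d}  vs Z-scan: {improvement(z, low):5.1f}%  "
          f"vs Optimal Z-scan: {improvement(opt, low):5.1f}%  (table: {published}%)")
```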
15. Energy Modelling
Sequential Access Memory [Moon-CICC’02]
Configured as a circular buffer
Address Sequencing logic and decoders replaced with
row sequencer to get low power and high speed
Banked implementation used for big memory
Energy Modelling [Coumeri-TVLSI’00]
Empirical Equations for modelling energy of on-chip
SRAM memory
Model parameters are Size, Bit Width, Access Mode
Individual equations for different memory components
To model SAM, the Row Decoder, Column Decoder and Buffers
are not considered
16. Conclusion
A methodology to arrive at a Low-Power
DWT architecture proposed
Co-Optimization of Memory Architecture
and Access pattern done
Up to 90% energy saving achieved
The derived architecture depends on the
target memory technology
Would lead to different architectures for ASIC
and FPGA implementations
17. References:
[Chiu-SIPS’03]: Mu-Yu Chiu et al. (2003). Optimal data
transfer and buffering schemes for JPEG2000 encoder.
IEEE Workshop on SIPS, Aug. 2003, pp. 177–182
[Moon-CICC’02]: Joong-Seok Moon et al. (2002). Low-power
sequential access memory design. Custom
Integrated Circuits Conference, 2002, pp. 111–114
[Coumeri-TVLSI’00]: Coumeri, S.L. et al. (2000).
Memory modelling for System Synthesis. IEEE Trans.
VLSI Systems, June 2000, pp. 327–334
18. Thank You
Questions!
20. Discrete Wavelet Transform
2D wavelet transform:
1st:1D wavelet transform to all rows
2nd:1D wavelet transform to all columns
Each Row/Column can be computed independently
Store 4 values at line computation suspension
[Figure: data-flow diagram from inputs X(i), i = 0..8, through intermediate values Y(2i+1) and Y(2i) to outputs Z(2i+1) and Z(2i)]
Colored arrows show multiplication by constants a, b, c, d
defined in the JPEG2000 standard
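The row-then-column structure of the 2-D transform can be sketched independently of the filter used. The example below swaps in a one-level unnormalized Haar step as a stand-in for the JPEG2000 filter, whose constants a, b, c, d are not reproduced here; `haar_1d` and `dwt_2d` are illustrative names.

```python
def haar_1d(line):
    """One level of an unnormalized Haar step (stand-in for the JPEG2000 filter):
    pairwise averages first, then pairwise differences."""
    avg = [(line[i] + line[i + 1]) / 2 for i in range(0, len(line), 2)]
    diff = [(line[i] - line[i + 1]) / 2 for i in range(0, len(line), 2)]
    return avg + diff

def dwt_2d(image, transform_1d=haar_1d):
    """Separable 2-D transform: 1-D transform on every row, then on every column."""
    rows = [transform_1d(row) for row in image]              # 1st: all rows
    cols = [transform_1d(list(col)) for col in zip(*rows)]   # 2nd: all columns
    return [list(row) for row in zip(*cols)]                 # back to row-major

image = [[5.0] * 4 for _ in range(4)]
for row in dwt_2d(image):
    print(row)
```

Because each row and each column is transformed independently, a row computation can be suspended and resumed later, which is what the scan schemes on the earlier slides exploit.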
21. Buffer Structure
The Buffers are always full
They are accessed like a circular FIFO
General Memory Row Decoder not required
use a counter
use a shift register loaded with a 1 initially
Every Write Signal
Increments the counter
Shifts the Register
Store all the 4 intermediate values in one Column
No need for the Column Decoder
This would be similar to Sequential Access Memory
(SAM) [Moon-CICC’02]
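The decoder-free addressing described above can be modelled in software. This is a behavioural sketch only: the class `SequentialBuffer` and its methods are hypothetical names, and it keeps both the counter and the one-hot shift register from the slide so that each can be inspected.

```python
class SequentialBuffer:
    """Behavioural model of an always-full circular buffer with no row decoder."""

    def __init__(self, depth, width=4):
        # Each row holds the 4 intermediate values of one suspended line.
        self.rows = [[0] * width for _ in range(depth)]
        self.select = [1] + [0] * (depth - 1)   # shift register loaded with a 1
        self.counter = 0

    def read(self):
        """Read at resumption: return the row selected by the one-hot register."""
        return list(self.rows[self.select.index(1)])

    def write(self, values):
        """Write at suspension: store into the selected row, then advance."""
        self.rows[self.select.index(1)] = list(values)
        self.select = [self.select[-1]] + self.select[:-1]    # shift the register
        self.counter = (self.counter + 1) % len(self.rows)    # increment the counter

buf = SequentialBuffer(depth=3)
for line in ([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]):
    buf.write(line)
print(buf.read())   # pointer has wrapped around to the first row
```

Since every write advances the pointer by exactly one row, the access order is fixed and no general address decoding is needed.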