Efficient floating-point texture decompression

Presentation at SoC 2010 (International Symposium on System-on-Chip) in Tampere, Finland. The full paper is available at IEEE Xplore (http://dx.doi.org/10.1109/ISSOC.2010.5625555).

  • The latest NVIDIA GeForce GTX 480 can fetch 42 billion texels per second, and the decoder must keep up with that.

    1. Efficient Floating-Point Texture Decompression
       Tomi Aarnio (NRC Tampere), Claudio Brunelli (NRC Tampere), Timo Viitanen (TUT)
    2. Texturing pipeline in a GPU
    3. Texturing pipeline in a GPU
       • Memory bandwidth is the worst bottleneck
    4. Texturing pipeline in a GPU
       • Memory bandwidth is the worst bottleneck
       • Cache size is another
    5. Texturing pipeline in a GPU
       • Memory bandwidth is the worst bottleneck
       • Cache size is another
       • Texture compression can alleviate both!
    6. Texturing pipeline in a GPU
       • The decoder must be very fast: ~40 gigatexels/sec
    7. The established solution
       • Nearly all existing schemes work the same way
         • Partition the image into blocks of 4 x 4 pixels
         • Compress each block independently
         • Use a fixed compression ratio (6:1)
       • Our focus is on high dynamic range (HDR) textures
         • RGB colors in 16-bit floating-point (FP16)
         • Compressed from 48 bits per pixel down to 8 bpp
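The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the 42 billion texels per second rate is the GTX 480 figure quoted in the note above, and the bandwidth estimate assumes, pessimistically, that every texel is fetched from memory with no cache hits. The names `BLOCK_PIXELS`, `RAW_BPP` and `bandwidth_gb_per_s` are made up for this example.

```python
# Storage and bandwidth arithmetic for 4 x 4 blocks of FP16 RGB texels.
BLOCK_PIXELS = 4 * 4
RAW_BPP = 3 * 16           # three FP16 channels = 48 bits per pixel
COMPRESSED_BPP = 8         # fixed-rate compressed size from the slide

raw_block_bits = BLOCK_PIXELS * RAW_BPP                 # bits in an uncompressed block
compressed_block_bits = BLOCK_PIXELS * COMPRESSED_BPP   # bits in a compressed block
ratio = raw_block_bits // compressed_block_bits         # fixed compression ratio

# Assumption: texel rate taken from the GTX 480 note, every fetch hits memory.
TEXELS_PER_SECOND = 42e9


def bandwidth_gb_per_s(bits_per_pixel, texels_per_second=TEXELS_PER_SECOND):
    """Worst-case memory traffic in gigabytes per second (no cache hits)."""
    return texels_per_second * bits_per_pixel / 8 / 1e9


if __name__ == "__main__":
    print("bits per block:", raw_block_bits, "->", compressed_block_bits)
    print("compression ratio: %d:1" % ratio)
    print("uncompressed traffic: %.0f GB/s" % bandwidth_gb_per_s(RAW_BPP))
    print("compressed traffic:   %.0f GB/s" % bandwidth_gb_per_s(COMPRESSED_BPP))
```

Under these assumptions a compressed 4 x 4 block occupies 128 bits instead of 768, which is where the 6:1 ratio comes from.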
    8. FP16 texture compression
       • #1: Roimela et al. [SIGGRAPH 2006, I3D 2008]
       • #2: Munkberg et al. [SIGGRAPH 2006, CGF 2008]
       • #3: Sun et al. [Graphics Hardware 2008, IEEE TVCG 2010]
       • #4: BC6H/BPTC [DirectX 11, OpenGL 4]
    9. FP16 texture compression
       • #1: Roimela et al. [SIGGRAPH 2006, I3D 2008]
       • #2: Munkberg et al. [SIGGRAPH 2006, CGF 2008]
       • #3: Sun et al. [Graphics Hardware 2008, IEEE TVCG 2010]
       • #4: BC6H/BPTC [DirectX 11, OpenGL 4]
       • Far too high complexity
    10. FP16 texture compression
        • #1: Roimela et al. [SIGGRAPH 2006, I3D 2008]
        • #2: Munkberg et al. [SIGGRAPH 2006, CGF 2008]
        • #3: Sun et al. [Graphics Hardware 2008, IEEE TVCG 2010]
        • #4: BC6H/BPTC [DirectX 11, OpenGL 4]
        • Our contribution
          • Implemented and optimized #1 (a.k.a. "NXR")
          • Benchmarked against #4
    11. Baseline decoder
        [Block diagram: an "Extract bitfields" stage outputs R, B, Lexponent and Lmantissa. The Red, Green and Blue paths each contain an int-to-fp16 converter and an fp16 multiplier; Lexponent and Lmantissa pass through an fp16 normalizer. The constant 2^10 appears next to the Green path.]
    12. Optimizations
        [Same block diagram as the baseline decoder, with two "Simplify this" callouts.]
    13. Optimizations (Part 1)
        • Red and Blue are in 0.10-bit fixed point
          • Can be treated as fp16 denormals with no conversion logic
        • Simplify the multipliers (L*R and L*B)
          • Exponent can't increase: remove biasing and overflow logic
          • Mantissa will fit in 1.20 fixed point: remove overflow logic
          • At most 10 leading zeros: truncate post-normalizers
          • No need to deal with signs, infinities and NaNs
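One way to read the "fp16 denormals with no conversion logic" point: a 10-bit fraction dropped straight into the mantissa field of an fp16 word, with sign and exponent bits left at zero, is already a valid half-precision number. The sketch below checks this in software using Python's `struct` half-precision format; the helper names are hypothetical, and this is a reading of the slide, not the authors' VHDL.

```python
import struct


def fp16_from_bits(bits):
    """Reinterpret a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def chroma_as_denormal(c):
    """Place a 10-bit value c in the fp16 mantissa field (sign = 0, exponent = 0).

    No arithmetic is performed: the bits are used as they are.
    """
    if not 0 <= c < 1024:
        raise ValueError("expected a 10-bit value")
    return fp16_from_bits(c)


if __name__ == "__main__":
    # An fp16 denormal with mantissa c has the value (c / 1024) * 2**-14,
    # i.e. the 0.10 fixed-point fraction scaled by a constant power of two.
    for c in (0, 1, 512, 1023):
        print(c, chroma_as_denormal(c), (c / 1024) * 2.0 ** -14)
```

The constant factor of 2^-14 is a fixed exponent offset, so it can be absorbed when the exponent of the product is computed.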
    14. Optimized decoder
        [Block diagram, first build: an "Extract bitfields" stage outputs R, B, Lexponent and Lmantissa, feeding the Red, Green and Blue outputs.]
    15. Optimized decoder
        [Block diagram, Red path added. Legend: CLZ = Count Leading Zeros; << = Shift Left; 10 x 11-bit multiplier. Blocks on the Red path: multiplier, CLZ, <<, and "Clamp, Shift & Pack". Signals: R, Lexponent, Lmantissa, Rexponent, Rmantissa.]
    16. Optimized decoder
        [Block diagram, Blue path added. The Blue path mirrors the Red path: multiplier, CLZ, <<, and "Clamp, Shift & Pack", with signals B, Lexponent, Lmantissa, Bexponent, Bmantissa.]
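The blocks named in the optimized-decoder diagram (10 x 11-bit multiplier, CLZ, shift left, pack) can be modelled in a few lines of integer arithmetic. This is a minimal software sketch under stated assumptions, not the authors' hardware: the exponent is kept as an unbiased Python integer, the final clamp to the fp16 exponent range is omitted, and `scale_channel` and `to_float` are hypothetical names.

```python
def scale_channel(l_exp, l_man, c):
    """Multiply luminance (1.l_man * 2**l_exp) by the fraction c / 1024.

    l_man is the 10-bit stored mantissa, c a 10-bit 0.10 fixed-point value.
    Returns (exponent, 10-bit mantissa) of the normalized product,
    or None when c == 0 (the product is zero).
    """
    assert 0 <= l_man < 1024 and 0 <= c < 1024
    if c == 0:
        return None
    # 11-bit mantissa (hidden one restored) times 10-bit fraction:
    # the result fits in 21 bits, i.e. 1.20 fixed point.
    product = (1024 + l_man) * c
    # Count leading zeros within the 21-bit field. Since c >= 1,
    # product >= 1024, so there are never more than 10 of them.
    lz = 21 - product.bit_length()
    assert 0 <= lz <= 10
    normalized = product << lz              # leading one now sits at bit 20
    mantissa = (normalized >> 10) & 0x3FF   # drop hidden one, round toward zero
    return l_exp - lz, mantissa             # exponent can only stay or decrease


def to_float(exp, man):
    """Value of a normalized (exponent, 10-bit mantissa) pair."""
    return (1 + man / 1024) * 2.0 ** exp


if __name__ == "__main__":
    exp, man = scale_channel(3, 512, 256)   # 12.0 * 0.25
    print(exp, man, to_float(exp, man))
```

Because the exponent never grows and the product never exceeds 21 bits, the model needs no overflow handling, which matches the simplifications listed for the L*R and L*B multipliers.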
    17. Optimizations (Part 2)
        • Eliminate the green channel multiplier
          • L*G = L*(1024 - (R + B)) = 1024*L - (L*R + L*B)
          • Two 20-bit adders are much cheaper than a 10-bit multiplier
        • Round to zero instead of nearest
          • Introduces a maximum of 1-bit error
          • Compression error is much larger, 4-8 bits
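Both points on this slide are easy to check numerically. The sketch below reuses the products L*R and L*B to obtain L*G with a shift and two additions, and compares rounding toward zero with rounding to nearest. It assumes R + B <= 1024 so that G is non-negative; the function names are hypothetical.

```python
def green_product(l_man11, r, b):
    """Compute L*G as 1024*L - (L*R + L*B), with G = 1024 - (R + B).

    l_man11 is the 11-bit luminance mantissa (hidden one included).
    Assumption: r + b <= 1024, so the result is non-negative.
    """
    lr = l_man11 * r                      # already needed for the red channel
    lb = l_man11 * b                      # already needed for the blue channel
    return (l_man11 << 10) - (lr + lb)    # one shift, one add, one subtract


def round_to_zero(value, dropped_bits=10):
    """Truncate the low bits (what a plain right shift does)."""
    return value >> dropped_bits


def round_to_nearest(value, dropped_bits=10):
    """Add half an LSB before truncating."""
    return (value + (1 << (dropped_bits - 1))) >> dropped_bits


if __name__ == "__main__":
    l, r, b = 1536, 300, 200
    print(green_product(l, r, b), l * (1024 - (r + b)))
    p = l * r
    print(round_to_zero(p), round_to_nearest(p))
```

Truncation and round-to-nearest never differ by more than one unit in the last place, which is the 1-bit error the slide refers to.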
    18. Optimized decoder
        [Block diagram, final build: the Red and Blue paths as before, plus a Green path with its own CLZ, <<, and "Clamp, Shift & Pack" stages and the signals Lexponent, Lmantissa, Gexponent, Gmantissa. The constant 2^20 appears on the Green path.]
    19. FPGA synthesis (Altera Stratix III)
    20. ASIC synthesis @ 180 nm (Synopsys)
    21. ASIC synthesis @ 180 nm (Synopsys)
        • The competing BC6H decoder implements only one of its 14 modes; a complete decoder would be somewhat larger.
    22. ASIC synthesis @ 180 nm (Synopsys)
        • Relatively long critical path, due to leading-zero counters.
    23. Summary
        • VHDL implementation of a floating-point texture decoder
        • Our optimizations reduced area by ~50%
        • Competing decoder turned out 75% larger
        • Main weakness: long critical path
        • Completely feasible to put on real hardware
    24. Future work
        • Measure power consumption
          • More important than silicon area
        • Optimize the long latency
          • Can also help reduce area & power
        • Implement an encoder in ASIC
          • Textures are increasingly generated in real time
