www.crs4.it/vic/
Visual Computing Group
A Survey of Compressed GPU-based
Direct Volume Rendering
• Enrico Gobbetti
• Jose Díaz
• Fabio Marton
June 2015
E. Gobbetti, F. Marton and J. Diaz
Goal and motivation
• Massive volumetric models anywhere!
– Deal with time and memory limits
• Compute compact representation of volumetric
models suitable for fast GPU rendering
– Focus on rectilinear scalar grids
3
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Visualization
– “The use of computer-supported, interactive, visual representations of (abstract) data to amplify cognition.”
• Volume visualization
– Discrete data samples in 3D space
• Represented as a 3D rectilinear grid
– Applications
• Medical science
• Engineering
• Earth science …
[Figure: the visualization pipeline turns a 3D grid of samples (the evidence to study) into an output image; application examples from http://gallery.ensight.com and http://www.vizworld.com]
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Rendering methods
– Indirect Volume Rendering
• Isosurface extraction
• Data mapped to geometric
primitives
• Rendering of geometry
– Direct Volume Rendering (DVR)
• No need for isosurface extraction
• Data mapped to optical properties
• Direct visualization by compositing
optical properties (color, opacity)
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Direct Volume Rendering
– Optical properties assigned by
transfer functions
• Color and opacity
– Illumination may be added
• Typically Phong shading model
– Most used method: ray casting
• Sampling along viewing rays
• Compositing optical properties
of the samples for each ray
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Visualization challenges
– Present the information in a proper way
• Transfer function definition
• Highlight features of interest
• Deal with occlusion by outer elements
• Convey the spatial arrangement …
– Deal with the ever-increasing size of
the data
• Overcome hardware limitations
– Bandwidth
– Memory consumption …
– Provide a real-time exploration of the
data …
E. Gobbetti, F. Marton and J. Diaz
Big Data
“In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, analysis and visualization.”
• Our interest
– Very large 3D volume data
11
E. Gobbetti, F. Marton and J. Diaz
Big Data
• Volume data growth
12
64 x 64 x 64
[Sabella 1988]
256 x 256 x 256
[Krüger 2003]
21494 x 25790 x 1850
[Hadwiger 2012]
E. Gobbetti, F. Marton and J. Diaz
Data size examples
13
Year | Paper | Data size | Comments
2002 | Guthe et al. | 512 × 512 × 999 (500 MB); 2,048 × 1,216 × 1,877 (4.4 GB) | multi-pass, wavelet compression, streaming from disk
2003 | Krüger & Westermann | 256 × 256 × 256 (32 MB) | single-pass ray-casting
2005 | Hadwiger et al. | 576 × 352 × 1,536 (594 MB) | single-pass ray-casting (bricked)
2006 | Ljung | 512 × 512 × 628 (314 MB); 512 × 512 × 3,396 (1.7 GB) | single-pass ray-casting, multi-resolution
2008 | Gobbetti et al. | 2,048 × 1,024 × 1,080 (4.2 GB) | ‘ray-guided’ ray-casting with occlusion queries
2009 | Crassin et al. | 8,192 × 8,192 × 8,192 (512 GB) | ray-guided ray-casting
2011 | Engel | 8,192 × 8,192 × 16,384 (1 TB) | ray-guided ray-casting
2012 | Hadwiger et al. | 18,000 × 18,000 × 304 (92 GB); 21,494 × 25,790 × 1,850 (955 GB) | ray-guided ray-casting, visualization-driven system
2013 | Fogal et al. | 1,728 × 1,008 × 1,878 (12.2 GB); 8,192 × 8,192 × 8,192 (512 GB) | ray-guided ray-casting
E. Gobbetti, F. Marton and J. Diaz
Scalability
• Traditional HPC, parallel rendering definitions
– Strong scaling (more nodes are faster for same data)
– Weak scaling (more nodes allow larger data)
• Our interest/definition: output sensitivity
– Running time/storage proportional to size of output instead
of input
• Computational effort scales with visible data and screen
resolution
• Working set independent of original data size
14
E. Gobbetti, F. Marton and J. Diaz
Large-Scale Visualization Pipeline
15
E. Gobbetti, F. Marton and J. Diaz
Scalability Issues
16
Scalability issue | Scalable methods
Data representation and storage | Multi-resolution data structures; data layout, compression
Work/data partitioning | In-core/out-of-core; parallel, distributed
Work/data reduction | Pre-processing; on-demand processing; streaming; in-situ visualization; query-based visualization
E. Gobbetti, F. Marton and J. Diaz
Compressed DVR Example
17
Chameleon: 1024³ @ 16 bit (2.1 GB); Supernova: 432³ × 60 timesteps @ float (18 GB)
Compression-domain Rendering from Sparse-Coded Voxel Blocks (EuroVis 2012)
E. Gobbetti, F. Marton and J. Diaz
Many use cases despite current
growth in graphics memory
• Still many «too large» datasets
– High-resolution models, multi-scalar data, time-varying data, multi-volume visualization, …
• Mobile graphics
– Streaming, local rendering
• Need for loading entire datasets in graphics memory
– Fast animation
– Cloud renderers with many clients
18
E. Gobbetti, F. Marton and J. Diaz
Overview
E. Gobbetti, F. Marton and J. Diaz
Compact Data Representation Models
20
E. Gobbetti, F. Marton and J. Diaz
Processing Methods and Architectures
21
E. Gobbetti, F. Marton and J. Diaz
Rendering Methods and Architectures
22
E. Gobbetti, F. Marton and J. Diaz
COMPACT DATA REPRESENTATION
MODELS AND PROCESSING
A Survey of Compressed GPU-based Direct Volume
Rendering
Next Session:
23
www.crs4.it/vic/
Visual Computing Group
• Enrico Gobbetti
• Jose Díaz
• Fabio Marton
June 2015
A Survey of Compressed GPU-based Direct Volume Rendering
Compact Data Representation Models
E. Gobbetti, F. Marton and J. Diaz
Compression-domain DVR
• Trade-off:
– achieve a high compression ratio
– while preserving high image quality
• Reconstruct in real-time
– fast data access
– fast computations
25
E. Gobbetti, F. Marton and J. Diaz
Bases and Coefficients
26
the coefficients show the relationship between the bases and the original data
E. Gobbetti, F. Marton and J. Diaz
Types of Bases
27
pre-defined bases: DFT, DCT, DWT
learned bases: eigenbases (KLT, TA), dictionaries
E. Gobbetti, F. Marton and J. Diaz
Modeling Stages
28
stages that can
be applied
E. Gobbetti, F. Marton and J. Diaz
Compact Models in DVR
• Pre-defined bases
– discrete Fourier transform (1990;1993)
– discrete Hartley transform (1993)
– discrete cosine transform (1995)
– discrete wavelet transform (1993)
– Laplacian pyramid/transform (1995;2003)
– Burrows-Wheeler transform (2007)
• Learned bases
– Karhunen-Loeve transform (2007)
– tensor approximation (2010;2011)
– dictionaries for vector quantization (1993;2003)
– dictionaries for sparse coding (2012)
– fractal compression (1996)
29
most models are lossy or, at best, near lossless; even those are often used in a lossy configuration
E. Gobbetti, F. Marton and J. Diaz
Wavelet Transform
• Use wavelets to represent data
– similar to the Fourier transform (which represents data with sines/cosines)
• A wavelet is a set of basis functions
– realized by high-pass and low-pass filtering (see the sketch below)
• Data transform
– into frequency domain AND spatial domain
– an advantage over the Fourier transform (frequency domain only)
31
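To make the filtering idea concrete, here is a minimal numpy sketch (ours, not from the survey) of one level of a separable 3D Haar transform; it assumes a volume with even dimensions, and compression follows by zeroing small high-band coefficients.

```python
import numpy as np

def haar3d_level(v):
    """One level of a separable 3D Haar wavelet transform (sketch).

    Along each axis, pairs of samples are averaged (low-pass) and
    differenced (high-pass); the low band ends up in the first half.
    """
    for axis in range(3):
        a = np.take(v, range(0, v.shape[axis], 2), axis=axis)
        b = np.take(v, range(1, v.shape[axis], 2), axis=axis)
        lo = (a + b) / np.sqrt(2.0)   # smooth average band
        hi = (a - b) / np.sqrt(2.0)   # detail band
        v = np.concatenate([lo, hi], axis=axis)
    return v

vol = np.random.rand(64, 64, 64).astype(np.float32)
w = haar3d_level(vol)
w[np.abs(w) < 0.05] = 0.0  # truncate insignificant coefficients
```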
E. Gobbetti, F. Marton and J. Diaz
32
2D Wavelet Example
multilevel
DWT
low frequency coefficients of level l
high frequency coefficients of level l
E. Gobbetti, F. Marton and J. Diaz
Wavelet full reconstruction
34
E. Gobbetti, F. Marton and J. Diaz
Quantized wavelet reconstruction 1
35
E. Gobbetti, F. Marton and J. Diaz
Quantized wavelet reconstruction 2
36
E. Gobbetti, F. Marton and J. Diaz
Quantized wavelet reconstruction 3
37
E. Gobbetti, F. Marton and J. Diaz
Quantized wavelet reconstruction 4
38
E. Gobbetti, F. Marton and J. Diaz
39
Dictionaries
E. Gobbetti, F. Marton and J. Diaz
40
Dictionaries
[Figure: a dictionary of words (codewords) used to represent an approximated volume]
E. Gobbetti, F. Marton and J. Diaz
41
Dictionaries
• Basic idea
– many subblocks have a similar pattern
– for each subblock, search for the codeword that best represents it
– the volume is represented by an index list
– fast decompression (table lookup)
– codewords can be pre-defined or learned
– lossless or lossy
• Main challenge
– find good codewords
– find a fast algorithm for dictionary generation
[Figure: the volume is divided into subblocks; each subblock stores an index into a dictionary of M codewords]
E. Gobbetti, F. Marton and J. Diaz
42
Vector Quantization (VQ)
• A vector represents each subblock
• Limit the number of codewords to M (a minimal encode/decode sketch follows below)
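As a hedged illustration of the table-lookup idea (block extraction and function names are our own), a numpy sketch of VQ encoding and decoding over flattened subblocks:

```python
import numpy as np

def vq_encode(blocks, codebook):
    """Map each flattened subblock to the index of its nearest codeword."""
    d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1).astype(np.uint16)  # index list = compressed volume

def vq_decode(indices, codebook):
    """Decompression is a pure table lookup."""
    return codebook[indices]

blocks = np.random.rand(1000, 4 * 4 * 4)  # 1000 subblocks of 4^3 voxels each
codebook = blocks[np.random.choice(1000, 256, replace=False)]  # M = 256 codewords
approx = vq_decode(vq_encode(blocks, codebook), codebook)
```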
E. Gobbetti, F. Marton and J. Diaz
Vector Quantization
43
[Figure: each subblock stores the index of its nearest codeword in the dictionary]
E. Gobbetti, F. Marton and J. Diaz
Hierarchical VQ
• For each block store:
– Average value
– ∆ Delta to first level
– ∆ Delta to second level
45
[Figure: block decomposition with a Laplacian filter]
E. Gobbetti, F. Marton and J. Diaz
• Store differences in a multiresolution data structure
– block-based two-level Laplacian pyramid
• Produces blocks of different frequency bands
• High frequency bands are compressed using VQ
• Covariance approach for dictionary generation
– splitting by PCA
• A similar scheme combines VQ with the Karhunen-Loeve transform (KLT)
46
Hierarchical VQ
E. Gobbetti, F. Marton and J. Diaz
Sparse Coding
• Data-specific representation in which each block is represented by a sparse linear combination of a few dictionary elements
– State-of-the-art in sparse coding [Rubinstein et al. 10]
– Compression is achieved by storing just indices and their
magnitudes
47
E. Gobbetti, F. Marton and J. Diaz
Sparse Coding
48
Generalization of VQ: data is represented with a linear combination of codewords
B = c0 * D[a0] + c1 * D[a1] + c2 * D[a2]
[Figure: a block B reconstructed from three weighted dictionary entries]
E. Gobbetti, F. Marton and J. Diaz
Sparse Coding
• The problem: for each block b, find coefficients γ that minimize ||b − Dγ||₂² subject to ||γ||₀ ≤ S
• Generalization of vector quantization
– Combine vectors instead of choosing single ones
– Overcomes limitations due to dictionary sizes
• Generalization of data-specific bases
– Dictionary is an overcomplete basis
– Sparse projection
49
E. Gobbetti, F. Marton and J. Diaz
Sparse Coding
• Represent each block as linear combination of s
blocks from an overcomplete dictionary
– The dictionary can be learned or predefined (e.g., wavelet)
– Learned dictionaries typically perform better (but optimal
ones are hard to compute)
• Basic idea is that the dictionary is sparsifying
– Few dictionary elements can be combined to get good
approximations
• Combines the benefits of
– Dictionaries
– Projection onto bases
50
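Below is a minimal sketch of greedy sparse coding via orthogonal matching pursuit; this is a standard pursuit algorithm from the family the slides cite [Elad ‘10], not the exact ORMP variant of [Gobbetti ‘12]. D is an overcomplete dictionary with unit-norm columns, b a flattened block, s the sparsity.

```python
import numpy as np

def omp(D, b, s):
    """Greedily pick at most s atoms of D; return indices and coefficients."""
    idx, coef, r = [], np.zeros(0), b.copy()
    for _ in range(s):
        idx.append(int(np.abs(D.T @ r).argmax()))             # most correlated atom
        coef, *_ = np.linalg.lstsq(D[:, idx], b, rcond=None)  # re-fit on chosen atoms
        r = b - D[:, idx] @ coef                              # update residual
    return idx, coef  # compressed block = s indices + s magnitudes
```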
E. Gobbetti, F. Marton and J. Diaz
53
• Higher-order singular value decomposition
– here: the so-called Tucker model used for TA (eigenbases)
[Figure: the approximation is the core tensor (coefficients) multiplied by the factor matrices (bases); the factor matrices span the spatial dimensions, the core tensor the multilinear ranks]
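The Tucker reconstruction is just three mode-n products; a numpy sketch follows, with shapes chosen by us for illustration: core B of shape (R1, R2, R3), factor matrices U(n) of shape (In, Rn).

```python
import numpy as np

def tucker_reconstruct(core, U1, U2, U3):
    """A ≈ B ×1 U1 ×2 U2 ×3 U3, written as a single tensor contraction."""
    return np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

R, I = 8, 32                          # multilinear rank 8, block size 32^3
core = np.random.rand(R, R, R)
U1, U2, U3 = (np.random.rand(I, R) for _ in range(3))
approx = tucker_reconstruct(core, U1, U2, U3)
# storage: R^3 + 3*I*R coefficients instead of I^3 voxels
```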
E. Gobbetti, F. Marton and J. Diaz
55
Feature Extraction with WT and TA
original size: 256³ = 16’777’216 values; approximations with 2’200’000 / 310’000 / 57’500 / 16’500 coefficients
[Figure: multiscale dental growth structures emerge at decreasing coefficient counts]
E. Gobbetti, F. Marton and J. Diaz
Truncation and Quantization
• Lossy
– typically, we lose most information during this stage (see the sketch below)
• Truncation
– threshold insignificant coefficients
– threshold insignificant bases
• Quantization
– floating point to integer conversion
– vector quantization (fewer codewords)
56
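A sketch of both stages applied to a coefficient array (the function and thresholds are our own, assuming 8-bit storage):

```python
import numpy as np

def truncate_and_quantize(coeffs, keep):
    """Zero all but the `keep` largest-magnitude coefficients, then
    uniformly quantize the survivors to 8-bit signed integers."""
    c = coeffs.astype(np.float64).copy()
    cutoff = np.sort(np.abs(c), axis=None)[-keep]
    c[np.abs(c) < cutoff] = 0.0                        # truncation
    scale = max(np.abs(c).max(), 1e-12) / 127.0        # uniform quantization step
    return np.round(c / scale).astype(np.int8), scale  # dequantize: q * scale
```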
E. Gobbetti, F. Marton and J. Diaz
57
Example: Truncate TA Bases
[Figure: truncating high ranks vs. truncating low ranks]
E. Gobbetti, F. Marton and J. Diaz
Sparse Coding vs HVQ vs Tensor
• Comparison of state-of-the-art GPU-based
decompression methods
[Figure: side-by-side comparison of sparse coding, HVQ, and TA reconstructions]
E. Gobbetti, F. Marton and J. Diaz
Sparse Coding Compression Results
60
E. Gobbetti, F. Marton and J. Diaz
61
Encoding Models
• Lossless
• RLE
– count runs of equal values (a toy sketch follows below)
• Entropy encoding
– short codes for frequent words; long codes for infrequent words
– Huffman coding, arithmetic coding, ...
– variable-length codes complicate random access; workarounds:
• fixed-length Huffman coding combined with RLE
• significance maps (wavelets)
• decode before rendering
• Many hybrids
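For illustration, a toy run-length encoder/decoder (ours, not from the survey); real systems pair this with fixed-length Huffman codes to preserve random access:

```python
def rle_encode(values):
    """Collapse runs of equal values into (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

assert rle_decode(rle_encode([0, 0, 0, 7, 7, 1])) == [0, 0, 0, 7, 7, 1]
```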
E. Gobbetti, F. Marton and J. Diaz
62
Hardware Encoding Models
• Block truncation coding (texture compression)
– volume texture compression (VTC); 3D extension of S3TC
– ETC2
– adaptive scalable texture compression (ASTC)
E. Gobbetti, F. Marton and J. Diaz
63
Time-varying Data
• Compression is needed even more
• Three categories
– extend compression approach from 3D to 4D
– use other approach for 4th dimension
– encode each time step separately
• Hybrids of the above
E. Gobbetti, F. Marton and J. Diaz
64
Model Summary
• Wavelet transform ...
– is good for smoothed averaging (multiresolution)
• Vector quantization ...
– makes a very fast reconstruction possible
– fast data access
– can be combined with various methods to generate the
dictionary
• Sparse coding
– sparse linear combination of dictionary words
– fast decompression, high compression rates
• Tensor approximation ...
– is good for multiresolution and multiscale DVR
– feature extraction at multiple scales
E. Gobbetti, F. Marton and J. Diaz
65
Conclusions
• Recently mostly learned bases
– variants of eigenbases are popular
– hybrid forms of dictionaries (VQ, multiresolution, sparse
coding)
• Variable compression ratio
– from near lossless to extremely lossy
• Hardware accelerated models
– block truncation coding (texture compression models)
• Various stages of compact modeling CAN be
applied
– transform or decomposition
– truncation
– quantization
– encoding
www.crs4.it/vic/
Visual Computing Group
A Survey of Compressed GPU-based Direct Volume Rendering
Processing and Encoding
• Enrico Gobbetti
• Jose Díaz
• Fabio Marton
June 2015
E. Gobbetti, F. Marton and J. Diaz
Processing and Encoding
67
E. Gobbetti, F. Marton and J. Diaz
A scalable algorithm!
68
• Able to handle huge datasets
– O(10^10) samples, and more
• Scalable approach in terms of
– Memory
– Time
• Required features:
– Out-of-core
– Asymmetric codecs
– Parallel settings
– High quality coding (depends on representation)
[Figure: the input volume is converted into a bricked compressed volume]
E. Gobbetti, F. Marton and J. Diaz
Static and time-varying data
• Static datasets
– Handle a 3D volume
– Typically use a 3D bricked representation
– Mostly octree or flat multiresolution blocking
• Dynamic datasets
– Use techniques similar to the static case
– Three approaches
• Separate encoding of each timestep
• Static compression method extended from 3D to 4D data
• Treat the 3D data and the time dimension differently, as in video encoding
69
E. Gobbetti, F. Marton and J. Diaz
Output volume
• Bricked
– Fast local and transient decompression
– Visibility
– Empty space skipping
– Early ray termination
• Compressed
– Bandwidth
– GPU memory
• LOD
– adaptive rendering
70
Compressed block
E. Gobbetti, F. Marton and J. Diaz
Processing: two phases
• Processing depends on data representation
– Pre-defined bases …
– Learned bases …
• Global processing (some methods)
– Access all the dataset (works on billions of samples)
– Learn some representation
• Local Processing (all methods)
– Access independent blocks
– Encode [possibly using global phase representation]
72
E. Gobbetti, F. Marton and J. Diaz
Global Processing
• Find representation by globally analyzing whole
dataset
– Early transform coding methods [Levoy’92, Malzbender’93]
• … but modern methods apply transforms to small regions
– All dictionary based methods
• Also combined with other transformations
• Main problems
– Analyze all data within available resources
– Huge datasets introduce numerical instability problems
74
E. Gobbetti, F. Marton and J. Diaz
Finding a dictionary for huge
datasets: solutions
• Online learning
– Streaming all blocks
• Splitting into manageable tiles
– Separate dictionary per tile
• Coreset / importance sampling
– Work on representative weighted subsets
75
E. Gobbetti, F. Marton and J. Diaz
Global phase: Online Learning
76
[Figure: all input blocks are streamed through a processor whose bounded local state updates the dictionary over N iterations]
• How it works
– Does not require concurrent access to all elements
– Stream over all elements
– Keep only local statistics => bounded memory!
– Iterate the streaming pass multiple times
E. Gobbetti, F. Marton and J. Diaz
Global phase: Online Learning
• Main example: Vector Quantization
– Find blocks representative of all the elements
• Classic approach: generalized Lloyd algorithm:
– [Linde 80, Lloyd 82]
– Based on 2 conditions:
• Given a codebook, the optimal partition assigns each vector to its nearest codevector (nearest-neighbor condition)
• Given a partition, the optimal codevector for each cell is its centroid (centroid condition)
– Start from an initial partition and iterate until convergence (a one-iteration sketch follows below):
• Associate all the vectors to the corresponding cells
• Update the codevectors to the cell centroids
77
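A compact numpy sketch of one generalized-Lloyd iteration over in-memory vectors; the streaming (online) variant keeps only per-cell running sums and counts instead of the full vector set.

```python
import numpy as np

def lloyd_step(vectors, codebook):
    """One generalized-Lloyd iteration: assign, then recenter."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)                  # nearest-neighbor condition
    for k in range(len(codebook)):             # centroid condition
        members = vectors[assign == k]
        if len(members):
            codebook[k] = members.mean(axis=0)
    return codebook, assign
```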
E. Gobbetti, F. Marton and J. Diaz
Global phase: Online Learning
• Vector quantization: convergence depends on
initial partition
– Good starting point: initial codevectors
based on PCA analysis (also in streaming!)
[Schneider 03,Knittel et al. 09]
• Each new seed is inserted in cell with max
residual distortion; split plane selected with
PCA of the cell producing two cells with
similar residual distortion
– Also applied hierarchically [Schneider 03]
• Similar concepts applied to online learning for
sparse coding
– [Mairal 2010, Skretting 2010]
78
E. Gobbetti, F. Marton and J. Diaz
Global phase: Online Learning
• Pros
– Easy to implement
– Does not run into memory limitations
• Cons
– Slow due to number of iterations on many data values (slow
convergence)
– Slow because of out-of-core operations
• Used by
– Vector quantization
– Sparse coding (but better results working on the full dataset)
– …
79
E. Gobbetti, F. Marton and J. Diaz
Global phase: Splitting
• Split dataset into
manageable tiles
– Process each tile
independently
– Produce representation
for each tile
80
[Figure: the dataset is split into tiles; an independent global processor produces a separate dictionary for each tile]
E. Gobbetti, F. Marton and J. Diaz
Global phase: Splitting
• Example: Transform Coding [Fout 07]
• The whole dataset is subdivided into tiles; for each tile
– Transformed with the Karhunen-Loeve transform
• Low frequencies contain the most significant part of the signal
– Encoded with VQ (then decoded)
– The residual is encoded in turn, iterating to store 3 representations:
• Low-pass
• Band-pass
• High-pass
– Low-pass is enough to encode most of the elements
– The inverse transform is computationally intensive:
• Data is reprojected during preprocessing
• Reprojections are stored in the codebook instead of coefficients
81
E. Gobbetti, F. Marton and J. Diaz
Global phase: Splitting
• Pros
– Reasonably simple to implement
– Permits solving with a global algorithm
– High quality per tile
• Cons
– Discontinuities at tile boundaries
– Need to handle multiple contexts at rendering time
– Many active dictionaries could lead to cache overflow at rendering time
– Scalability limits
• Used by
– Most dictionary based methods
82
E. Gobbetti, F. Marton and J. Diaz
Global phase: Coreset/Importance
Sampling
• Build dictionary on smartly
subsampled and reweighted
subset of input data
– Coreset concept [Agarwal’05]
– Take a few representative subblocks
from original model
– Higher probability for more
representative samples
– Sample reweighting to remove bias
– Use global method on coreset
83
[Figure: an importance sampler selects a weighted subset of the input blocks (the coreset); the global processor learns the dictionary from it]
E. Gobbetti, F. Marton and J. Diaz
Global phase: Coreset/Importance
Sampling
• Example: learning dictionary for sparse coding
of volume blocks
• The K-SVD algorithm is extremely efficient on moderate size data
– [Aharon et al. 2006]
– A K-means generalization alternating between
• sparse coding of the signals in X, producing Γ given the current D
• updates of D given the current sparse representations
– The update phase needs to solve linear systems of size proportional to the number of training signals!
• Use coreset concept!
– [Feldman & Langberg 11, Feigin et al. 11, Gobbetti ‘12]
84
E. Gobbetti, F. Marton and J. Diaz
Global phase: Coreset/Importance
Sampling
• We associate an importance γᵢ to each of the original blocks yᵢ, with γᵢ proportional to the standard deviation of the entries in yᵢ
– Pick C elements with probability pᵢ proportional to γᵢ
– See [Gobbetti 12] for coreset building and training methods using few streaming passes
• Non-uniform sampling introduces a severe bias
– Scale each selected block by a weight wᵢ ∝ 1/pᵢ, where pᵢ is the associated probability (a sampling sketch follows below)
– Applying K-SVD to the scaled blocks will converge to a dictionary associated with the original problem
85
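A sketch of the sampling/reweighting step under the assumptions stated above (standard-deviation importance, inverse-probability weights); the symbol and function names are ours:

```python
import numpy as np

def build_coreset(blocks, C, seed=0):
    """Pick C blocks with probability proportional to their standard
    deviation, then rescale so the weighted L2 problem stays unbiased."""
    rng = np.random.default_rng(seed)
    gamma = blocks.std(axis=1) + 1e-12           # per-block importance
    p = gamma / gamma.sum()                      # sampling probabilities
    picked = rng.choice(len(blocks), size=C, replace=True, p=p)
    w = 1.0 / (C * p[picked])                    # inverse-probability weights
    return blocks[picked] * np.sqrt(w)[:, None]  # sqrt(w) weights an L2 fit
```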
E. Gobbetti, F. Marton and J. Diaz
Global phase: Coreset/Importance
Sampling
• Coreset scalability
86
E. Gobbetti, F. Marton and J. Diaz
Global phase: Coreset/Importance
Sampling
• Pros
– Proven to work extremely well for KSVD
– Extremely scalable
– No extra run-time overhead
• Cons
– Importance concept needs refinement especially for
structured data
– Tested so far only on limited datasets and few
representations.
– Needs more testing in multiple contexts
• Used by
– KSVD, VQ, HVQ
87
E. Gobbetti, F. Marton and J. Diaz
Local phase: Independent Block
Encoding
• The input model M is split into blocks B0…Bi (a driver-loop sketch follows below)
• For each block Bi
– Possibly use information gathered during the global phase (e.g. dictionaries)
– Find the transformation/combination/representation that best fits the data
– Store it in the DB
89
[Figure: blocks B0…Bi of model M pass through local processing into the DB]
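The local phase is embarrassingly parallel over blocks; a sketch of the driver loop, where `encode` is a placeholder for any block encoder (e.g., VQ or sparse coding against a globally learned dictionary):

```python
import numpy as np

def encode_blocks(volume, block, encode):
    """Split the volume into independent blocks and encode each one."""
    db = {}
    nx, ny, nz = (s // block for s in volume.shape)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                sub = volume[i * block:(i + 1) * block,
                             j * block:(j + 1) * block,
                             k * block:(k + 1) * block]
                db[(i, j, k)] = encode(sub.ravel())  # store compact code in the DB
    return db
```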
E. Gobbetti, F. Marton and J. Diaz
Local phase: Compression models
• Pre-defined bases
– discrete Fourier transform (1990;1993)
– discrete Hartley transform (1993)
– discrete cosine transform (1995)
– discrete wavelet transform (1993)
– Laplacian pyramid/transform (1995;2003)
– Burrows-Wheeler transform (2007)
• Learned bases
– Karhunen-Loeve transform (2007)
– tensor approximation (2010;2011)
– dictionaries for vector quantization (1993;2003)
– dictionaries for sparse coding (2012)
– fractal compression (1996)
90
E. Gobbetti, F. Marton and J. Diaz
Local phase: Discrete Wavelet
Transform
• Projecting into a set of pre-defined bases
– Concentrates most of the signal in a few low-frequency components
• Quantizing the transformed data can significantly
reduce data size
– Error threshold
• Entropy coding techniques are used for further
reducing size
– Exploiting data correlation
92
E. Gobbetti, F. Marton and J. Diaz
Local phase: Discrete Wavelet
Transform
• B-Spline wavelets [Lippert 97]
– Wavelet coefficients and positions (splatting)
– Delta encoding, RLE and Huffman coding
• Hierarchical partition (blocks/sub-blocks)
– 3D Haar wavelet, delta encoding, quantization, RLE and Huffman [Ihm ‘99,
Kim ‘99]
– Biorthogonal 9/7-tap Daubechies and biorthogonal spline wavelets
[Nguyen ‘01, Guthe ‘02]
– FPGA hardware decoder [Wetekam ‘05]
– 3D dyadic wavelet transform [Wu ‘05]
• Separated correlated slices
– Temporal/spatial coherence [Rodler ‘99]
93
Wavelets are not trivial to decode on the GPU
E. Gobbetti, F. Marton and J. Diaz
Local phase: Tensor approximation
• Tensor decomposition is applied to N-dim input
dataset
– Tucker model [Suter ‘11]
– Product of N basis matrices and a N-dimensional core tensor
– HOSVD (Higher-order singular value decomposition)
• Applied along every direction of input data
• Iterative process for finding the core tensor
– Represents a projection of the input dataset
– Core tensor coefficients show the relationship between
original data and bases
94
E. Gobbetti, F. Marton and J. Diaz
Local phase: Tensor approximation
• Tensor decomposition (Tucker model)
A ≈ B ×₁ U⁽¹⁾ ×₂ U⁽²⁾ ×₃ U⁽³⁾
– Decomposition obtained through HOSVD or HOOI (higher-order orthogonal iteration)
– HOOI also produces rank reduction
– Target is the core tensor B (B ∈ ℝ^(R1×R2×R3) with Ri < Ii)
– After each iteration, rankᵢ ≤ rankᵢ₋₁
– The core tensor is produced when the matrices cannot be further optimized
• [Suter 2011] Tensor decomposition coefficients are quantized: 8-bit encoding for core tensor coefficients, 16-bit encoding for basis matrices
95
Convergence can be hard to achieve on large blocks
E. Gobbetti, F. Marton and J. Diaz
Local phase: Dictionary based
methods
• Vector Quantization
– Hierarchical Vector Quantization
– Local phase simple (nearest neighbor search)
• Sparse coding
– NP-hard problem
– Several efficient greedy pursuit approximation algorithms
[Elad ‘10]
– Order-Recursive Matching Pursuit (ORMP) via Cholesky decomposition [Gobbetti ‘12]
96
E. Gobbetti, F. Marton and J. Diaz
Quantization and encoding
• Further reduce data size
– Quantization
– Variable/fixed length encoding
• Variable-length encoding used in early compression algorithms
– High compression rates / typically decoded on the CPU prior to GPU data upload
– Good for reducing network and disk access times
• Fixed-length encoding is better suited to GPU algorithms
– A regular memory access pattern allows better GPU parallelization
• Fixed-length Huffman + RLE provides high decompression speed [Guthe ‘02, Shen ‘06, Ko ‘08]
97
E. Gobbetti, F. Marton and J. Diaz
Preprocessing of Time-varying Data
• Data structures combining spatial/temporal or
temporal/spatial domain grouping
– Time Space Partitioning Tree [Ma ‘00]
– Wavelet-based TSP [Wang ‘04, Shen ‘06]
– Hierarchical structure for each time step + spatial grouping
and encoding [Westermann ‘95, Wang ‘10]
98
E. Gobbetti, F. Marton and J. Diaz
Preprocessing of Time-varying Data
• Video-based encoding
– Delta encoding with respect to previous frame [Mensmann ‘10,
She ‘11] (+Quantization/LZO or HVQ)
– Intra coded frame or predictive frame (e.g. Haar wavelet / delta
encoding + wavelet) [Ko ‘08, Jang ‘12]
• Wavelet transform was frequently applied to time-
varying volume data
– [Westermann ‘95.. Many others, see paper for details]
99
E. Gobbetti, F. Marton and J. Diaz
Preprocessing summary
• Asymmetric codecs
– Most heavy work should be done at preprocessing time
• Specialized encoding techniques trade compression performance with
reconstruction speed (e.g., partial projections for KLT decoding)
• Global and local phases
– Most challenging part is global processing (heavy, hard)
• Recently coreset approaches in addition to online
learning or splitting
– Promising, need more research
• Many model specific methods
– Fixed bases relatively standard
– Recently more focus on computation methods for learned bases
(tensors, sparse coding)
– GPU decoding constrains the type of transformations (e.g., variable length codes are still hard)
100
E. Gobbetti, F. Marton and J. Diaz
Next: Rendering
101
www.crs4.it/vic/
Visual Computing Group
A Survey of Compressed GPU-based Direct Volume Rendering
Decoding and Rendering
• Enrico Gobbetti
• Jose Díaz
• Fabio Marton
June 2015
E. Gobbetti, F. Marton and J. Diaz
Decoding and Rendering
• Introduction
• Architectures
• Decoding of compact representations
• Conclusions
E. Gobbetti, F. Marton and J. Diaz
Introduction
105
[Figure: preprocessing builds a compressed multiresolution data structure (levels 0…L); at runtime, transfer-function and view-dependent criteria select the working set; rendering uses slicing [Guthe’04] or raycasting [Kaehler ‘06]; small blocks add overhead]
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Out-of-core GPU methods
– Adaptively traverse space-partitioned GPU data structures
covering the full working set
Flat-multiresolution blocking [Ljung’06]
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Out-of-core GPU methods
– Adaptively traverse space-partitioned GPU data structures
covering the full working set
• Data structures
– Grids
• Uniform or non-uniform
– Hierarchical grids
• Pyramid of uniform grids
• Bricked 2D/3D mipmaps
Uniform grid
Bricked mipmap
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Out-of-core GPU methods
– Adaptively traverse space-partitioned GPU data structures
covering the full working set
• Data structures
– Tree structures
• KD-tree
• Quadtree
• Octree
Octree
wikipedia.org
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Out-of-core GPU structures
– Adaptively traverse space-partitioned GPU data structures
covering the full working set
Out-of-core GPU octree structure
[Gobbetti’08, Crassin‘09]
E. Gobbetti, F. Marton and J. Diaz
Introduction
• Per-block multiresolution
– Each block is encoded using a few levels-of-detail
• E.g. HVQ [Schneider’03]
• Global and per-block multiresolution
– FPGA Wavelet + Huffman encoding [Wetekam’05]
• DVR requires interactive performance
– We need decompression to run in real-time
• Asymmetric schemes oriented to GPU decoding
– Tight interaction between decompression and rendering
• Bottleneck: GPU memory and bandwidth
110
E. Gobbetti, F. Marton and J. Diaz
Adaptive out-of-core octree DVR
111
E. Gobbetti, F. Marton and J. Diaz
RENDERING ARCHITECTURES
113
E. Gobbetti, F. Marton and J. Diaz
Architecture
114
• Typical architecture for massive datasets
E. Gobbetti, F. Marton and J. Diaz
Working Set Determination
115
• Traditional methods
– Global attribute-based culling (view-independent)
• Cull against transfer function, iso value, enabled objects, etc.
– View frustum culling (view-dependent)
• Cull bricks outside the view frustum
– Occlusion culling?
• Modern methods
– Visibility determined during ray traversal
– Rays determine working set directly
E. Gobbetti, F. Marton and J. Diaz
Working Set Determination (Traditional)
116
• Global attribute-based
culling
– Cull bricks based on attributes;
view-independent
• Transfer function
• Iso value
• Enabled segmented objects
– Often based on min/max bricks
• Empty space skipping
• Skip loading of ‘empty’ bricks
• Speed up on-demand spatial
queries
E. Gobbetti, F. Marton and J. Diaz
Working Set Determination (Traditional)
117
• View frustum, occlusion culling
– Cull all bricks against view frustum
– Cull all occluded bricks
– View-dependent
E. Gobbetti, F. Marton and J. Diaz
Working Set Determination (Modern)
118
• Visibility determined during ray traversal
– Implicit view frustum culling (no extra step required)
– Implicit occlusion culling (no extra steps or occlusion buffers)
E. Gobbetti, F. Marton and J. Diaz
Working Set Determination (Modern)
119
• Rays determine working set directly
– Each ray writes out the list of bricks it requires (intersects), front-to-back
– Use modern OpenGL extensions (GL_ARB_shader_storage_buffer_object, …)
E. Gobbetti, F. Marton and J. Diaz
Architecture
120
• Rendering
E. Gobbetti, F. Marton and J. Diaz
Rendering Techniques
121
• Texture slicing [Cullip and Neumann ’93, Cabral et
al. ’94, Rezk-Salama et al. ‘00]
– (+) Minimal hardware requirements (can run on WebGL)
– (−) Visual artifacts, less flexibility
E. Gobbetti, F. Marton and J. Diaz
Rendering Techniques
122
• GPU ray-casting
[Röttger et al. ‘03, Krüger and
Westermann ‘03]
– (+) standard image-order approach, embarrassingly parallel
– (+) supports many performance and quality enhancements
E. Gobbetti, F. Marton and J. Diaz
Rendering Techniques
123
• Large volume data rendering
– Octree rendering based on texture-slicing
• [LaMar et al. ’99, Weiler et al. ’00, Guthe et al. ’02]
– Bricked single-pass ray-casting
• [Hadwiger et al. ’05, Beyer et al. ’07]
– Bricked multi-resolution single-pass ray-casting
• [Ljung et al. ’06, Beyer et al. ’08, Jeong et al. ’09]
– Optimized CPU ray-casting
• [Knoll et al. ’11]
– GPU octree traversal single-pass ray-casting
• [Gobbetti et al. ’12]
E. Gobbetti, F. Marton and J. Diaz
Architecture
124
• Decompression on CPU before rendering
E. Gobbetti, F. Marton and J. Diaz
Decompression on CPU before rendering
• Decompression (generally lossless) of data
streams
– E.g. Wavelet-based time-space partitioning [Shen et al.
2006]
• (+) Reduces the storage needs
• (+) Supports faster and remote access to data
125
E. Gobbetti, F. Marton and J. Diaz
Decompression on CPU before rendering
• Decompression (generally lossless) of data
streams
– E.g. Wavelet-based time-space partitioning [Shen et al.
2006]
• (−) Large memory resources needed
• (−) Inefficient use of the GPU memory bandwidth
• (−) CPU decompression is generally not sufficient for time-varying datasets
• Common in hybrid architectures [Nagayasu’08, Mensmann’10]
126
E. Gobbetti, F. Marton and J. Diaz
Architecture
127
• Decompression on GPU before rendering
E. Gobbetti, F. Marton and J. Diaz
Decompression on GPU before rendering
• (+) Speeds up decompression by using the GPU
– Full working set decompression
• [Wetekam’05, Mensmann’10, Suter’11]
• (+) Works for time-varying data
– … but each single time step should fit in GPU memory and has to be decoded fast enough
• (−) Limits the maximum working set size …
– … to its size in uncompressed format
128
E. Gobbetti, F. Marton and J. Diaz
Architecture
129
• Decompression during rendering
E. Gobbetti, F. Marton and J. Diaz
Decompression during rendering
• Decoding occurs on-demand during the
rendering process
– The entire dataset is never fully decompressed
– Data travels in compressed format from the CPU to the GPU
• (+) Makes the most of the available memory bandwidth
• On-demand, fast and spatially independent
decompression on the GPU:
– Pure random-access
– Local and transient decoding
130
E. Gobbetti, F. Marton and J. Diaz
Pure random-access
• Data decompression is performed each time the renderer needs to access a single voxel
– Directly supported by the GPU
• Fixed-rate block-coding methods
– E.g. OpenGL VTC [Craighead’04], ASTC [Nystad’12]
• GPU implementations of per-block scalar quantization
– [Yela’08, Iglesias’10]
131
E. Gobbetti, F. Marton and J. Diaz
Pure random-access
• Data decompression is performed each time the renderer needs to access a single voxel
– Not natively supported by the GPU, but easily implemented on top of it
• GPU implementations of VQ and HVQ
– [Schneider’03, Fout’07]
• (−) Costly in performance
• (−) Rarely implements multisampling or HQ shading
– For this, such methods rely on deferred filtering architectures
132
E. Gobbetti, F. Marton and J. Diaz
Local and transient decoding
• Partial working set decompression
• Interleaving of decoding and rendering
• (+) Better exploits the GPU memory resources
133
E. Gobbetti, F. Marton and J. Diaz
Deferred filtering
• Deferred filtering solutions [Fout’05, Wang’10]
– Exploit spatial coherence and support high-quality shading
– Decompress groups of slices which can be reused during rendering for gradient and shading computations
– Originally proposed for single-resolution data
134
E. Gobbetti, F. Marton and J. Diaz
Deferred filtering (single resolution)
135
Rendering from a Custom Memory Format
Using Deferred Filtering [Kniss’05]
Gradient computation using
central differences [Fout’05]
E. Gobbetti, F. Marton and J. Diaz
Deferred filtering (multiresolution)
• Extended deferred filtering to work with
multiresolution data structures [Gobbetti’12]
• Frame-buffer combination of partial portions of an octree
136
Deferred Filtering of an octree structure in GPU [Gobbetti’12]
E. Gobbetti, F. Marton and J. Diaz
Multiresolution deferred filtering in 3 passes
• Octree:
1. Adaptive refinement
• Multiresolution
2. Partitioning into subtrees
• Constraint: GPU cache size
• Bottom-up recombination of
subtrees in GPU cache
3. Subtree decompression,
raycasting and compositing
• Front-to-back octree traversal
– real-time
137
[Figure: front-to-back traversal of the octree subtrees (1…7), composited into the framebuffer]
E. Gobbetti, F. Marton and J. Diaz
COVRA: multiresolution deferred filtering
138
E. Gobbetti, F. Marton and J. Diaz
Deblocking
• Wide range of compressed GPU volume renderers based on block-transform coding
– Blocks are compressed and decompressed individually
– Allows adaptive data loading
– Decoding performed in parallel for different volume parts
– Real-time performance thanks to parallel computing
• Main issue
– High compression ratios may produce artifacts between adjacent blocks
E. Gobbetti, F. Marton and J. Diaz
Main idea
• Filter decompressed data before rendering
– Smooth transitions across block boundaries
– Good balance between artifact removal and feature preservation
– Low impact on performance
E. Gobbetti, F. Marton and J. Diaz
Deblocking architecture
• Compressed data
– Volume octree of bricks divided into small blocks
• At run-time
– LOD selection and uploading of blocks to the GPU
E. Gobbetti, F. Marton and J. Diaz
Deblocking architecture
• For each frame
– Subdivision into slabs orthogonal to the main view direction
– Decompression / deblocking / rendering of each slab
– Accumulation of the result into the frame buffer
• Ray casting algorithm, front-to-back order
E. Gobbetti, F. Marton and J. Diaz
Results
E. Gobbetti, F. Marton and J. Diaz
Results
Johns Hopkins Turbulence
simulation (512 time steps)
512 x 512 x 512 32 bit
256 GB uncompressed
K-SVD [Gobbetti et al. 10]
Compressed to 1.6 MB
(0.2 bps)
(block size = 8, K = 4, D = 1024)
E. Gobbetti, F. Marton and J. Diaz
Summary: Rendering Architectures
• Decompression on CPU before rendering
• Decompression on GPU before rendering
• Decompression during rendering
147
E. Gobbetti, F. Marton and J. Diaz
Conclusions
• Local and transient reconstruction is a key
factor in compression domain volume rendering
– Requires design at different levels
• Compact data model
• Rendering architectures
• Pure voxel-based random-access and
online/transient decoders avoid the full
working set decompression
• Efficient decompression is especially important on lighter clients
– e.g. mobile devices, constrained network speeds
148
E. Gobbetti, F. Marton and J. Diaz
DECODING COMPACT
REPRESENTATIONS
149
E. Gobbetti, F. Marton and J. Diaz
Decoding Compact Representations
• Transform domain
– Pure compression-domain
– Operate directly in the transformed domain
– Decoding: through inverse stages of the forward encoding
• Per-voxel access
– Pure random-access
• Per-block accesses
– Amortize decompression computations for group of voxels
150
E. Gobbetti, F. Marton and J. Diaz
Transform Models (1/4)
• Fourier Transform: projection slice theorem
[Levoy’92]
151
Projection-slice theorem [Levoy’92]
One-dimensional case of the Fourier
forward transform (F) and its inverse (f)
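The theorem is easy to verify numerically: the 2D inverse FFT of the central (w_z = 0) slice of the volume's 3D FFT equals the X-ray projection of the volume along z. A small numpy sketch (ours, for illustration):

```python
import numpy as np

vol = np.random.rand(32, 32, 32)

direct = vol.sum(axis=2)                 # spatial-domain projection along z

F = np.fft.fftn(vol)                     # 3D forward transform
central = F[:, :, 0]                     # slice through the origin (w_z = 0)
via_slice = np.fft.ifft2(central).real   # 2D inverse of the extracted slice

assert np.allclose(direct, via_slice)    # projection-slice theorem
```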
E. Gobbetti, F. Marton and J. Diaz
Transform Models (1/4)
• Fourier Transform: projection slice theorem
[Levoy’92]
– GPU slice-based rendering [Viola’04]
– (−) Attenuation-only renderings (e.g. X-ray modalities)
152
[Viola’04]
E. Gobbetti, F. Marton and J. Diaz
Transform Models (1/4)
• Fourier Transform: projection slice theorem
[Levoy’92]
– GPU slice-based rendering [Viola’04]
– (−) Attenuation-only renderings (e.g. X-ray modalities)
153
[Viola’04]
E. Gobbetti, F. Marton and J. Diaz
Transform Models (1/4)
• Fourier Transform: projection slice theorem
[Levoy’92]
– GPU slice-based rendering [Viola’04]
– (−) Attenuation-only renderings (e.g. X-ray modalities)
• The Wavelet transform also has an inverse transformation; it has commonly been combined with …
– … multiresolution approaches
• [Westermann’94, Guthe’02]
– … other compression techniques
• RLE [Lippert’97], Rice coding [Wu’05]
– … time-varying structures
• Wavelet-based Time-Space Partitioning [Wang’04]
154
E. Gobbetti, F. Marton and J. Diaz
Transform Models (2/4)
• Karhunen-Loeve Transform
– [Fout’07] optimized the inverse transform computation
• Precompute partial reprojections and store them in associated
codebooks
155
[Fout’07]
E. Gobbetti, F. Marton and J. Diaz
Transform Models (2/4)
• Karhunen-Loeve Transform
– [Fout’07] optimized the inverse transform computation
• Precompute partial reprojections and store them in associated
codebooks
156
[Fout’07]
E. Gobbetti, F. Marton and J. Diaz
Transform Models (3/4)
• Tensor Approximation
– [Wu’08] Hierarchical tensor-based transform. Summation of
incomplete tensor approximations at multiple levels
– [Suter’11] Combined with multiresolution octree hierarchy.
Reconstruction in GPU using tensor times matrix
multiplications
158
E. Gobbetti, F. Marton and J. Diaz
Transform Models (4/4)
• Dictionaries
– Decoding is much faster and simpler than encoding
– GPU implementations:
• VQ, HVQ (M = number of dictionaries)
– [Schneider’03, Fout’07]
• RVQ [Parys’09], FCHVQ [Zhao’12]
• Sparse coding (S = sparsity)
– Parallel voxel decompression [Gobbetti’12]
– Dictionaries can be uploaded to GPU memory
• Just one additional memory indirection is needed
159
E. Gobbetti, F. Marton and J. Diaz
Thresholding and quantization
• Per-block scalar quantization [Yela’08,
Iglesias’10]
– Compute min and max values per block
• Difference based compression scheme
[Smelyanskiy’09]
• Hybrid schemes [Mensmann’10]
160
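A sketch of per-block min/max scalar quantization in the spirit of [Yela’08, Iglesias’10] (the exact layout is our assumption): store a per-block minimum and scale plus fixed-rate integer codes, so per-voxel decoding is a single fused multiply-add.

```python
import numpy as np

def quantize_block(block, bits=8):
    """Per-block scalar quantization to fixed-rate unsigned integers."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / (2 ** bits - 1) or 1.0    # guard against flat blocks
    q = np.round((block - lo) / scale).astype(np.uint8)
    return q, lo, scale                            # decode: lo + q * scale
```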
E. Gobbetti, F. Marton and J. Diaz
Encoding Models
• Fixed-rate block coding techniques
– e.g. VTC, S3TC
• Adaptive Scalable Texture Compression
162
Partitions and Multiple Color Spaces [Nystad’12]
E. Gobbetti, F. Marton and J. Diaz
Encoding Models
• Adaptive Scalable Texture Compression
163
Procedural partition
functions (HW-gen)
[Nystad’12]
E. Gobbetti, F. Marton and J. Diaz
Encoding Models
• General approach for applying sequential data
compression schemes to blocks of data
[Fraedrich’07]
• Decoding using an index texture to the beginning of each stripe
• Rendering using axis-aligned texture-based volume rendering
164
[Fraedrich’07]
E. Gobbetti, F. Marton and J. Diaz
Conclusions (1/2)
• Rendering directly in the transformed domain
– Reduced rendering complexity
– But several limitations: no support for perspective rendering, transfer functions, or complex shading
• Methods with real-time performance
– Tensor and sparse coding will improve with variable per-block rank/sparsity (not yet available in GPU reconstructors)
165
E. Gobbetti, F. Marton and J. Diaz
Conclusions (2/2)
• Quality
– All block-based methods working at low bit-rates show
• blocking artifacts between adjacent blocks
• These can be reduced using real-time deblocking
– Uncompressed multiresolution also has LOD artifacts, addressed by
• Data replication schemes [Guthe’04]
• Interblock interpolation schemes [Hege’08]
166
E. Gobbetti, F. Marton and J. Diaz
Related Papers
• Real-time deblocked GPU rendering of compressed
volumes.
– Marton et al., VMV 2014
• COVRA: A compression-domain output-sensitive volume
rendering architecture based on a sparse representation of
voxel blocks.
– Gobbetti et al. in Eurovis 2012
• Interactive Multiscale Tensor Reconstruction for Multiresolution
Volume Visualization.
– Suter et al. in IEEE Visualization, 2011
167
E. Gobbetti, F. Marton and J. Diaz
Related Books and Surveys
• Real-Time Volume Graphics.
– Engel et al., 2006
• High-Performance Visualization.
– Bethel et al. 2012
• Parallel Visualization.
– Wittenbrink 1998, Bartz et al. 2000, Zhang et al. 2005
• Real Time Interactive Massive Model Visualization.
– Kasik et al. 2006
• Vis and Visual Analysis of Multifaceted Scientific Data.
– Kehrer and Hauser 2013
• Compressed GPU-Based Volume Rendering.
– Balsa et al. 2013
• A Survey of GPU-Based Large-Scale Volume Visualization.
– Beyer et al. 2014
168
E. Gobbetti, F. Marton and J. Diaz
Acknowledgments
• Work done within the DIVA Project
– Data Intensive Visualization and Analysis
– EU ITN FP7/2007-2013/ REA grant agreement
290227
169
www.crs4.it/vic/
Visual Computing Group
STAR CGF’14 original authors
CRS4 Visual Computing Group
– Marcos Balsa Rodriguez
– Enrico Gobbetti
– Fabio Marton
University of Zurich VMML
– Renato Pajarola
– Susanne K. Suter
University of Zaragoza
– José Antonio Iglesias Guitián
www.crs4.it/vic/
Visual Computing Group
• Acknowledgments:
• EU FP7 People Programme (Marie
Curie Actions) REA Grant
Agreement
• n°290227 (DIVA)
• n°290227 (GOLEM)
• Swiss National Science Foundation
(SNSF)
• n°200021_132521
171
• Contacts:
• gobbetti@crs4.it
• marton@crs4.it
• jdiaz@crs4.it
• http://www.crs4.it/vic
Big dataBig data
Big data
 
PPT s11-machine vision-s2
PPT s11-machine vision-s2PPT s11-machine vision-s2
PPT s11-machine vision-s2
 
Mobile Visual Search: Object Re-Identification Against Large Repositories
Mobile Visual Search: Object Re-Identification Against Large RepositoriesMobile Visual Search: Object Re-Identification Against Large Repositories
Mobile Visual Search: Object Re-Identification Against Large Repositories
 
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAUNye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
Dense Image Matching - Challenges and Potentials (Keynote 3D-ARCH 2015)
Dense Image Matching - Challenges and Potentials (Keynote 3D-ARCH 2015) Dense Image Matching - Challenges and Potentials (Keynote 3D-ARCH 2015)
Dense Image Matching - Challenges and Potentials (Keynote 3D-ARCH 2015)
 
Image processing ppt
Image processing pptImage processing ppt
Image processing ppt
 
Automatic Image Annotation (AIA)
Automatic Image Annotation (AIA)Automatic Image Annotation (AIA)
Automatic Image Annotation (AIA)
 
KOMPSAT-5 Applications Workshop
KOMPSAT-5 Applications WorkshopKOMPSAT-5 Applications Workshop
KOMPSAT-5 Applications Workshop
 
Introduction to Binocular Stereo in Computer Vision
Introduction to Binocular Stereo in Computer VisionIntroduction to Binocular Stereo in Computer Vision
Introduction to Binocular Stereo in Computer Vision
 
MapInfo Professional 12.5 and Discover3D 2014 - A brief overview
MapInfo Professional 12.5 and Discover3D 2014 - A brief overviewMapInfo Professional 12.5 and Discover3D 2014 - A brief overview
MapInfo Professional 12.5 and Discover3D 2014 - A brief overview
 
Ocean Sciences 2016 Presentation
Ocean Sciences 2016 PresentationOcean Sciences 2016 Presentation
Ocean Sciences 2016 Presentation
 

More from CRS4 Research Center in Sardinia

The future is close
The future is closeThe future is close
The future is close
The future is closeThe future is close
Presentazione Linea B2 progetto Tutti a Iscol@ 2017
Presentazione Linea B2 progetto Tutti a Iscol@ 2017Presentazione Linea B2 progetto Tutti a Iscol@ 2017
Presentazione Linea B2 progetto Tutti a Iscol@ 2017
CRS4 Research Center in Sardinia
 
Iscola linea B 2016
Iscola linea B 2016Iscola linea B 2016
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
CRS4 Research Center in Sardinia
 
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
CRS4 Research Center in Sardinia
 
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
CRS4 Research Center in Sardinia
 
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
CRS4 Research Center in Sardinia
 
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. PirasBig Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
CRS4 Research Center in Sardinia
 
Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
 Big Data Analytics, Giovanni Delussu e Marco Enrico Piras  Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
CRS4 Research Center in Sardinia
 
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
CRS4 Research Center in Sardinia
 
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
CRS4 Research Center in Sardinia
 
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
CRS4 Research Center in Sardinia
 
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
CRS4 Research Center in Sardinia
 
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
CRS4 Research Center in Sardinia
 
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
CRS4 Research Center in Sardinia
 
SmartGeo/Eiagrid portal (Guido Satta, CRS4)
SmartGeo/Eiagrid portal (Guido Satta, CRS4)SmartGeo/Eiagrid portal (Guido Satta, CRS4)
SmartGeo/Eiagrid portal (Guido Satta, CRS4)
CRS4 Research Center in Sardinia
 
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
CRS4 Research Center in Sardinia
 
Mobile Graphics (part2)
Mobile Graphics (part2)Mobile Graphics (part2)
Mobile Graphics (part2)
CRS4 Research Center in Sardinia
 
Mobile Graphics (part1)
Mobile Graphics (part1)Mobile Graphics (part1)
Mobile Graphics (part1)
CRS4 Research Center in Sardinia
 

More from CRS4 Research Center in Sardinia (20)

The future is close
The future is closeThe future is close
The future is close
 
The future is close
The future is closeThe future is close
The future is close
 
Presentazione Linea B2 progetto Tutti a Iscol@ 2017
Presentazione Linea B2 progetto Tutti a Iscol@ 2017Presentazione Linea B2 progetto Tutti a Iscol@ 2017
Presentazione Linea B2 progetto Tutti a Iscol@ 2017
 
Iscola linea B 2016
Iscola linea B 2016Iscola linea B 2016
Iscola linea B 2016
 
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
Sequenziamento Esomico. Maria Valentini (CRS4), Cagliari, 18 Novembre 2015
 
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
Near Surface Geoscience Conference 2015, Turin - A Spatial Velocity Analysis ...
 
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
GIS partecipativo. Laura Muscas e Valentina Spanu (CRS4), Cagliari, 21 Ottobr...
 
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
Alfonso Damiano (Università di Cagliari) ICT per Smart Grid
 
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. PirasBig Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
 
Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
 Big Data Analytics, Giovanni Delussu e Marco Enrico Piras  Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
Big Data Analytics, Giovanni Delussu e Marco Enrico Piras
 
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
Dinamica Molecolare e Modellistica dell'interazione di lipidi col recettore P...
 
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
Innovazione e infrastrutture cloud per lo sviluppo di applicativi web e mobil...
 
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
ORDBMS e NoSQL nel trattamento dei dati geografici parte seconda. 30 Sett. 2015
 
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
Sistemi No-Sql e Object-Relational nella gestione dei dati geografici 30 Sett...
 
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
Elementi di sismica a riflessione e Georadar (Gian Piero Deidda, UNICA)
 
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
Near Surface Geoscience Conference 2014, Athens - Real-­time or full­‐precisi...
 
SmartGeo/Eiagrid portal (Guido Satta, CRS4)
SmartGeo/Eiagrid portal (Guido Satta, CRS4)SmartGeo/Eiagrid portal (Guido Satta, CRS4)
SmartGeo/Eiagrid portal (Guido Satta, CRS4)
 
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
Luigi Atzori Metabolomica: Introduzione e review di alcune applicazioni in am...
 
Mobile Graphics (part2)
Mobile Graphics (part2)Mobile Graphics (part2)
Mobile Graphics (part2)
 
Mobile Graphics (part1)
Mobile Graphics (part1)Mobile Graphics (part1)
Mobile Graphics (part1)
 

Recently uploaded

Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
Scintica Instrumentation
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Texas Alliance of Groundwater Districts
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
Areesha Ahmad
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
PirithiRaju
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
Carl Bergstrom
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
Texas Alliance of Groundwater Districts
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
Sciences of Europe
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
PirithiRaju
 

Recently uploaded (20)

Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
 

A Survey of Compressed GPU-based Direct Volume Rendering

  • 12. E. Gobbetti, F. Marton and J. Diaz Big Data • Volume data growth 12 64 x 64 x 64 [Sabella 1988] 256 x 256 x 256 [Krüger 2003] 21494 x 25790 x 1850 [Hadwiger 2012]
  • 13. E. Gobbetti, F. Marton and J. Diaz Data size examples 13
        Year | Paper               | Data size                                                       | Comments
        2002 | Guthe et al.        | 512 x 512 x 999 (500 MB); 2,048 x 1,216 x 1,877 (4.4 GB)        | multi-pass, wavelet compression, streaming from disk
        2003 | Krüger & Westermann | 256 x 256 x 256 (32 MB)                                         | single-pass ray-casting
        2005 | Hadwiger et al.     | 576 x 352 x 1,536 (594 MB)                                      | single-pass ray-casting (bricked)
        2006 | Ljung               | 512 x 512 x 628 (314 MB); 512 x 512 x 3,396 (1.7 GB)            | single-pass ray-casting, multi-resolution
        2008 | Gobbetti et al.     | 2,048 x 1,024 x 1,080 (4.2 GB)                                  | ‘ray-guided’ ray-casting with occlusion queries
        2009 | Crassin et al.      | 8,192 x 8,192 x 8,192 (512 GB)                                  | ray-guided ray-casting
        2011 | Engel               | 8,192 x 8,192 x 16,384 (1 TB)                                   | ray-guided ray-casting
        2012 | Hadwiger et al.     | 18,000 x 18,000 x 304 (92 GB); 21,494 x 25,790 x 1,850 (955 GB) | ray-guided ray-casting, visualization-driven system
        2013 | Fogal et al.        | 1,728 x 1,008 x 1,878 (12.2 GB); 8,192 x 8,192 x 8,192 (512 GB) | ray-guided ray-casting
  • 14. E. Gobbetti, F. Marton and J. Diaz Scalability • Traditional HPC, parallel rendering definitions – Strong scaling (more nodes are faster for same data) – Weak scaling (more nodes allow larger data) • Our interest/definition: output sensitivity – Running time/storage proportional to size of output instead of input • Computational effort scales with visible data and screen resolution • Working set independent of original data size 14
  • 15. E. Gobbetti, F. Marton and J. Diaz Large-Scale Visualization Pipeline 15
  • 16. E. Gobbetti, F. Marton and J. Diaz Scalability Issues 16
        Scalability issue               | Scalable methods
        Data representation and storage | Multi-resolution data structures; data layout; compression
        Work/data partitioning          | In-core/out-of-core; parallel, distributed
        Work/data reduction             | Pre-processing; on-demand processing; streaming; in-situ visualization; query-based visualization
  • 17. E. Gobbetti, F. Marton and J. Diaz Compressed DVR Example 17 Chameleon: 1024^3@16bit (2.1GB); Supernova: 432^3×60timesteps@float (18GB) Compression-domain Rendering from Sparse-Coded Voxel Blocks (EUROVIS2012)
  • 18. E. Gobbetti, F. Marton and J. Diaz Many use cases despite current growth in graphics memory • Still many «too large» datasets – Hi-res models, multi-scalar data, time-varying data, multi-volume visualization, … • Mobile graphics – Streaming, local rendering • Need to load entire datasets into graphics memory – Fast animation – Cloud renderers with many clients 18
  • 19. E. Gobbetti, F. Marton and J. Diaz Overview
  • 20. E. Gobbetti, F. Marton and J. Diaz Compact Data Representation Models 20
  • 21. E. Gobbetti, F. Marton and J. Diaz Processing Methods and Architectures 21
  • 22. E. Gobbetti, F. Marton and J. Diaz Rendering Methods and Architectures 22
  • 23. E. Gobbetti, F. Marton and J. Diaz COMPACT DATA REPRESENTATION MODELS AND PROCESSING A Survey of Compressed GPU-based Direct Volume Rendering Next Session: 23
  • 24. www.crs4.it/vic/ Visual Computing Group • Enrico Gobbetti • Jose Díaz • Fabio Marton June 2015 A Survey of Compressed GPU-based Direct Volume Rendering Compact Data Representation Models
  • 25. E. Gobbetti, F. Marton and J. Diaz Compression-domain DVR • Trade-off: – achieve high compression ratio – high image quality • Reconstruct in real-time – fast data access – fast computations 25
  • 26. E. Gobbetti, F. Marton and J. Diaz Bases and Coefficients 26 show relationship between bases and original
  • 27. E. Gobbetti, F. Marton and J. Diaz Types of Bases 27 (pre-defined bases: DFT, DCT, DWT; learned bases: eigenbases (KLT, TA), dictionaries)
  • 28. E. Gobbetti, F. Marton and J. Diaz Modeling Stages 28 stages that can be applied
  • 29. E. Gobbetti, F. Marton and J. Diaz Compact Models in DVR • Pre-defined bases – discrete Fourier transform (1990;1993) – discrete Hartley transform (1993) – discrete cosine transform (1995) – discrete wavelet transform (1993) – laplacian pyramid/transform (1995;2003) – Burrows-Wheeler transform (2007) • Learned bases – Karhunen-Loeve transform (2007) – tensor approximation (2010;2011) – dictionaries for vector quantization (1993;2003) – dictionaries for sparse coding (2012) – fractal compression (1996) 29 most models are lossy, or at best near-lossless, and are typically used in a lossy configuration
  • 30. E. Gobbetti, F. Marton and J. Diaz Compact Models in DVR • Pre-defined bases – discrete Fourier transform – discrete Hartley transform – discrete cosine transform – discrete wavelet transform – laplacian pyramid/transform – Burrows-Wheeler transform • Learned bases – Karhunen-Loeve transform – tensor approximation – dictionaries for vector quantization – dictionaries for sparse coding – fractal compression 30
  • 31. E. Gobbetti, F. Marton and J. Diaz Wavelet Transform • Use a wavelet to represent data – similar to the Fourier transform (which represents data with sines/cosines) • A wavelet is a set of basis functions – high-pass and low-pass filtering • Data transform – localized in both the frequency AND the spatial domain – an advantage over the Fourier transform (frequency domain only) 31
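To make the high-pass/low-pass filtering concrete, here is a minimal one-level 1D Haar transform sketch (Python/numpy, illustrative only, not from the slides; assumes an even-length signal; the 3D case applies the same step separably along each axis):

    import numpy as np

    def haar_dwt_1d(x):
        """One level of the 1D Haar transform: low-pass averages and
        high-pass details, each half the length of the input."""
        x = np.asarray(x, dtype=float).reshape(-1, 2)
        low = (x[:, 0] + x[:, 1]) / np.sqrt(2.0)   # smooth averages
        high = (x[:, 0] - x[:, 1]) / np.sqrt(2.0)  # detail coefficients
        return low, high

    def haar_idwt_1d(low, high):
        """Exact inverse of haar_dwt_1d (lossless without quantization)."""
        x = np.empty(2 * len(low))
        x[0::2] = (low + high) / np.sqrt(2.0)
        x[1::2] = (low - high) / np.sqrt(2.0)
        return x

    signal = np.array([8.0, 6.0, 2.0, 4.0, 7.0, 7.0, 1.0, 3.0])
    low, high = haar_dwt_1d(signal)
    assert np.allclose(haar_idwt_1d(low, high), signal)

Applying haar_dwt_1d recursively to the low band yields the multilevel pyramid shown on the next slide.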
  • 32. E. Gobbetti, F. Marton and J. Diaz 32 2D Wavelet Example (figure: multilevel DWT, with low-frequency and high-frequency coefficients at each level l)
  • 33. E. Gobbetti, F. Marton and J. Diaz Wavelet full reconstruction 34
  • 34. E. Gobbetti, F. Marton and J. Diaz Quantized wavelet reconstruction 1 35
  • 35. E. Gobbetti, F. Marton and J. Diaz Quantized wavelet reconstruction 2 36
  • 36. E. Gobbetti, F. Marton and J. Diaz Quantized wavelet reconstruction 3 37
  • 37. E. Gobbetti, F. Marton and J. Diaz Quantized wavelet reconstruction 4 38
  • 38. E. Gobbetti, F. Marton and J. Diaz 39 Dictionaries
  • 39. E. Gobbetti, F. Marton and J. Diaz 40 Dictionaries (figure: a dictionary of codewords used to represent an approximated volume)
  • 40. E. Gobbetti, F. Marton and J. Diaz 41 Dictionaries • Basic idea – many subblocks have a similar pattern – search, for each subblock, for the codeword that best represents it – the volume is represented with an index list – fast decompression (table lookup) – codewords can be pre-defined or learned – lossless or lossy • Main challenge – find good codewords – find a fast algorithm for dictionary generation (figure: volume divided into subblocks; store indices into a dictionary of M codewords)
  • 41. E. Gobbetti, F. Marton and J. Diaz 42 Vector Quantization (VQ) • Vector represents subblock • Limit number of codewords M
  • 42. E. Gobbetti, F. Marton and J. Diaz DICTIONARY Vector Quantization 43
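A minimal sketch of the VQ encode/decode pair (Python/numpy, illustrative; function names and block size are hypothetical): encoding is a nearest-neighbor search, decoding a pure table lookup:

    import numpy as np

    def vq_encode(blocks, dictionary):
        """Map each block (row vector) to the index of its nearest codeword."""
        # Squared Euclidean distances between every block and every codeword.
        d = ((blocks[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    def vq_decode(indices, dictionary):
        """Reconstruction is a table lookup, hence very fast."""
        return dictionary[indices]

    rng = np.random.default_rng(0)
    dictionary = rng.normal(size=(256, 64))  # M = 256 codewords for 4x4x4 blocks
    blocks = rng.normal(size=(200, 64))
    codes = vq_encode(blocks, dictionary)    # one byte per block instead of 64 scalars
    approx = vq_decode(codes, dictionary)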
  • 43. E. Gobbetti, F. Marton and J. Diaz Hierarchical VQ • For each block store: – Average Value – ∆ Delta to first level – ∆ Delta to second level 45 (figure: Laplacian filter)
  • 44. E. Gobbetti, F. Marton and J. Diaz • Store differences in a multiresolution data structure – block-based two-level laplacian pyramid • Produce blocks of different frequency bands • High frequency bands are compressed using VQ • Covariance approach for dictionary generation – splitting by PCA • Similar: VQ and Karhunen-Loeve transform (KLT) 46 Hierarchical VQ
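A sketch of the per-block two-level decomposition (Python/numpy, illustrative; assumes 8x8x8 blocks; in the real pipeline the two delta levels are then VQ-encoded):

    import numpy as np

    def block_pyramid(block):
        """Split an 8x8x8 block into an average value, a coarse delta, and
        a fine delta (a block-based two-level Laplacian pyramid)."""
        mean = block.mean()
        # Coarse level: 4x4x4 averages of 2x2x2 cells, relative to the mean.
        coarse = block.reshape(4, 2, 4, 2, 4, 2).mean(axis=(1, 3, 5))
        delta1 = coarse - mean
        # Fine level: what the up-sampled coarse block fails to explain.
        up = np.repeat(np.repeat(np.repeat(coarse, 2, 0), 2, 1), 2, 2)
        delta2 = block - up
        return mean, delta1, delta2  # delta1 and delta2 go to the VQ stage

    block = np.random.default_rng(1).normal(size=(8, 8, 8))
    mean, delta1, delta2 = block_pyramid(block)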
  • 45. E. Gobbetti, F. Marton and J. Diaz Sparse Coding • Data-specific representation in which each block is represented by a sparse linear combination of few dictionary elements – State-of-the-art in sparse coding [Rubinstein et al. 10] – Compression is achieved by storing just indices and their magnitudes 47
  • 46. E. Gobbetti, F. Marton and J. Diaz DICTIONARY Sparse Coding 48 Generalization of VQ: data is represented with a linear combination of codewords B = c0 * D[a0] + c1 * D[a1] + c2 * D[a2]
  • 47. E. Gobbetti, F. Marton and J. Diaz Sparse Coding • The problem: minimize ||B − Dγ||₂ subject to ||γ||₀ ≤ s, i.e. approximate each block B with at most s atoms of the dictionary D • Generalization of vector quantization – Combine vectors instead of choosing single ones – Overcomes limitations due to dictionary sizes • Generalization of data-specific bases – Dictionary is an overcomplete basis – Sparse projection 49
  • 48. E. Gobbetti, F. Marton and J. Diaz Sparse Coding • Represent each block as linear combination of s blocks from an overcomplete dictionary – The dictionary can be learned or predefined (e.g., wavelet) – Learned dictionaries typically perform better (but optimal ones are hard to compute) • Basic idea is that the dictionary is sparsifying – Few dictionary elements can be combined to get good approximations • Combines the benefits of – Dictionaries – Projection onto bases 50
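A compact sketch of one greedy pursuit, orthogonal matching pursuit, that solves the sparse-coding problem approximately (Python/numpy, illustrative; the ORMP variant used in [Gobbetti ‘12] is related but more elaborate):

    import numpy as np

    def omp(D, x, s):
        """Pick at most s unit-norm atoms (columns of D) to approximate x."""
        residual = x.copy()
        support = []
        coeffs = np.zeros(0)
        for _ in range(s):
            # Atom most correlated with the current residual.
            k = int(np.abs(D.T @ residual).argmax())
            if k not in support:
                support.append(k)
            # Re-fit all selected coefficients by least squares
            # (the 'orthogonal' step), then update the residual.
            coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
            residual = x - D[:, support] @ coeffs
        return support, coeffs  # store just indices and magnitudes, as above

    rng = np.random.default_rng(2)
    D = rng.normal(size=(64, 256))
    D /= np.linalg.norm(D, axis=0)          # overcomplete, unit-norm dictionary
    x = 2.0 * D[:, 3] - 0.5 * D[:, 42]      # a truly 2-sparse block
    support, coeffs = omp(D, x, s=2)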
  • 49. E. Gobbetti, F. Marton and J. Diaz 53 Tensor Approximation (TA) • Higher-order singular value decomposition – here: the so-called Tucker model used for TA (eigenbases) (figure: core tensor of coefficients, factor-matrix bases, and the resulting approximation; spatial dimensions and multilinear ranks annotated)
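A small truncated HOSVD sketch for the Tucker model (Python/numpy, illustrative; unfold and mode_multiply are hypothetical helper names; production encoders use HOOI-style iterations and quantize the core and factor matrices):

    import numpy as np

    def unfold(T, mode):
        """Mode-n unfolding: move axis `mode` to the front and flatten."""
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def mode_multiply(T, M, mode):
        """Multiply tensor T by matrix M along the given mode."""
        return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

    def hosvd(T, ranks):
        """Truncated higher-order SVD: factor matrices from the leading left
        singular vectors of each unfolding; core tensor by projection."""
        U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :r]
             for n, r in enumerate(ranks)]
        core = T
        for n, Un in enumerate(U):
            core = mode_multiply(core, Un.T, n)
        return core, U

    T = np.random.default_rng(3).normal(size=(16, 16, 16))
    core, U = hosvd(T, ranks=(4, 4, 4))      # R_i < I_i gives compression
    approx = core
    for n, Un in enumerate(U):               # reconstruction
        approx = mode_multiply(approx, Un, n)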
  • 50. E. Gobbetti, F. Marton and J. Diaz 55 Feature Extraction with WT and TA (figure: reconstructions from 2‘200‘000, 310‘000, 57‘500 and 16’500 coefficients; original size: 256³ = 16’777‘216; multiscale dental growth structures)
  • 51. E. Gobbetti, F. Marton and J. Diaz Truncation and Quantization • Lossy – typically, we lose most information during this stage • Truncation – threshold insignificant coefficients – threshold insignificant bases • Quantization – floating point to integer conversion – vector quantization (fewer codewords) 56
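A minimal sketch of both stages on a coefficient vector (Python/numpy, illustrative): thresholding drops insignificant coefficients, then uniform scalar quantization converts floats to small integers:

    import numpy as np

    def quantize(x, bits=8):
        """Uniform scalar quantization: floats to integers plus a range."""
        lo, hi = float(x.min()), float(x.max())
        q = np.round((x - lo) / (hi - lo) * (2 ** bits - 1)).astype(np.uint16)
        return q, lo, hi

    def dequantize(q, lo, hi, bits=8):
        return q.astype(np.float32) / (2 ** bits - 1) * (hi - lo) + lo

    coeffs = np.random.default_rng(4).normal(size=1000).astype(np.float32)
    coeffs[np.abs(coeffs) < 0.1] = 0.0   # truncation: threshold small coefficients
    q, lo, hi = quantize(coeffs)         # 8-bit integers instead of 32-bit floats
    restored = dequantize(q, lo, hi)     # lossy: quantization error remains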
  • 52. E. Gobbetti, F. Marton and J. Diaz 57 Example: Truncate TA Bases (figure: truncating high ranks vs. truncating low ranks)
  • 53. E. Gobbetti, F. Marton and J. Diaz K-SVD vs HVQ vs Tensor • Comparison of state-of-the-art GPU-based decompression methods
  • 54. E. Gobbetti, F. Marton and J. Diaz Sparse Coding vs HVQ vs Tensor • Comparison of state-of-the-art GPU-based decompression methods SPARSE CODING HVQ TA
  • 55. E. Gobbetti, F. Marton and J. Diaz Sparse Coding Compression Results 60
  • 56. E. Gobbetti, F. Marton and J. Diaz 61 Encoding Models • Lossless • RLE – count number of equal values • Entropy encoding – short code for frequent words; long code for infrequent words – Huffman coding, arithmetic coding, ... – to avoid inconvenient data access: • fixed-length Huffman coding combined with RLE • significance maps (wavelets) • decode before rendering • Many hybrids
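A toy run-length coder (plain Python, illustrative) showing the “count number of equal values” idea; GPU-friendly systems pair such runs with fixed-length codes so that memory access stays regular:

    def rle_encode(values):
        """Run-length encoding: (value, count) pairs for runs of equal values."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    def rle_decode(runs):
        return [v for v, n in runs for _ in range(n)]

    data = [0, 0, 0, 0, 7, 7, 3, 0, 0]            # e.g. a mostly-empty voxel row
    assert rle_decode(rle_encode(data)) == data   # lossless round trip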
  • 57. E. Gobbetti, F. Marton and J. Diaz 62 Hardware Encoding Models • Block truncation coding (texture compression) – volume texture compression (VTC); 3D extension of S3TC – ETC2 – adaptive scalable texture compression (ASTC)
  • 58. E. Gobbetti, F. Marton and J. Diaz 63 Time-varying Data • Compression is needed even more • Three categories – extend compression approach from 3D to 4D – use another approach for the 4th dimension – encode each time step separately • Hybrids of the above
  • 59. E. Gobbetti, F. Marton and J. Diaz 64 Model Summary • Wavelet transform ... – is good for smoothed averaging (multiresolution) • Vector quantization ... – makes a very fast reconstruction possible – fast data access – can be combined with various methods to generate the dictionary • Sparse coding – sparse linear combination of dictionary words – fast decompression, high compression rates • Tensor approximation ... – is good for multiresolution and multiscale DVR – feature extraction at multiple scales
  • 60. E. Gobbetti, F. Marton and J. Diaz 65 Conclusions • Recently mostly learned bases – variants of eigenbases are popular – hybrid forms of dictionaries (VQ, multiresolution, sparse coding) • Variable compression ratio – from near lossless to extremely lossy • Hardware accelerated models – block truncation coding (texture compression models) • Various stages of compact modeling CAN be applied – transform or decomposition – truncation – quantization – encoding
  • 61. www.crs4.it/vic/ Visual Computing Group A Survey of Compressed GPU-based Direct Volume Rendering Processing and Encoding • Enrico Gobbetti • Jose Díaz • Fabio Marton June 2015
  • 62. E. Gobbetti, F. Marton and J. Diaz Processing and Encoding 67
  • 63. E. Gobbetti, F. Marton and J. Diaz A scalable algorithm! 68 • Able to handle huge datasets – O(10^10) samples, and more • Scalable approach in terms of – Memory – Time • Required features: – Out-of-Core – Asymmetric codecs – Parallel settings – High quality coding (depends on representation) (figure: input volume to bricked compressed volume)
  • 64. E. Gobbetti, F. Marton and J. Diaz Static and time-varying data • Static datasets – Handle a 3D volume – typically use a 3D bricked representation – Mostly octree or flat multi-resolution blocking • Dynamic datasets – Use techniques similar to static datasets – Three approaches • Separate encoding of each timestep • Static compression method extended from 3D to 4D data • Treat 3D data and the time dimension differently, as in video encoding 69
  • 65. E. Gobbetti, F. Marton and J. Diaz Output volume • Bricked – Fast local and transient decompression – Visibility – Empty space skipping – Early ray termination • Compressed – Bandwidth – GPU memory • LOD – adaptive rendering 70 Compressed block
  • 66. E. Gobbetti, F. Marton and J. Diaz Output volume • Bricked – Fast local and transient decompression – Visibility – Empty space skipping – Early ray termination • Compressed – Bandwidth – GPU memory • LOD – adaptive rendering 71 Compressed block
  • 67. E. Gobbetti, F. Marton and J. Diaz Processing: two phases • Processing depends on data representation – Pre-defined bases … – Learned bases … • Global processing (some methods) – Access all the dataset (works on billions of samples) – Learn some representation • Local Processing (all methods) – Access independent blocks – Encode [possibly using global phase representation] 72
  • 68. E. Gobbetti, F. Marton and J. Diaz Processing: two phases • Processing depends on data representation – Pre-defined bases … – Learned bases … • Global processing (some methods) – Access all the dataset (works on billions of samples) – Learn some representation • Local Processing (all methods) – Access independent blocks – Encode [possibly using global phase representation] 73
  • 69. E. Gobbetti, F. Marton and J. Diaz Global Processing • Find representation by globally analyzing the whole dataset – Early transform coding methods [Levoy’92, Malzbender’93] • … but modern methods apply transforms to small regions – All dictionary based methods • Also combined with other transformations • Main problems – Analyze all data within available resources – Huge datasets introduce numerical instability problems 74
  • 70. E. Gobbetti, F. Marton and J. Diaz Finding a dictionary for huge datasets: solutions • Online learning – Streaming all blocks • Splitting into manageable tiles – Separate dictionary per tile • Coreset / importance sampling – Work on representative weighted subsets 75
  • 71. E. Gobbetti, F. Marton and J. Diaz Global phase: Online Learning 76 (figure: N iterations streaming all input blocks through a processor that keeps state and outputs the dictionary) • How it works – Does not require concurrent access to all elements – Stream over all elements – Keep only local statistics => bounded memory! – Iterate the streaming pass multiple times
  • 72. E. Gobbetti, F. Marton and J. Diaz Global phase: Online Learning • Main example: Vector Quantization – Find blocks representative of all the elements • Classic approach: generalized Lloyd algorithm – [Linde 80, Lloyd 82] – Based on 2 conditions: • Given a codebook, the optimal partition assigns each vector to its nearest codevector • Given a partition, the optimal codevector for each cell is its centroid – Start from an initial partition and iterate until convergence: • Associate all vectors to their corresponding cells • Update each cell’s codevector to the centroid 77
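A sketch of one streaming pass of the generalized Lloyd algorithm (Python/numpy, illustrative): only per-cell sums and counts are kept, so memory stays bounded no matter how large the stream is:

    import numpy as np

    def lloyd_pass(stream, codebook):
        """Assign each block to its nearest codeword while accumulating
        per-cell statistics, then move codewords to their cell centroids."""
        sums = np.zeros_like(codebook)
        counts = np.zeros(len(codebook))
        for block in stream:                  # blocks arrive one at a time
            k = ((codebook - block) ** 2).sum(axis=1).argmin()
            sums[k] += block
            counts[k] += 1
        nonempty = counts > 0
        codebook[nonempty] = sums[nonempty] / counts[nonempty][:, None]
        return codebook

    rng = np.random.default_rng(5)
    data = rng.normal(size=(5000, 64))             # stands in for the block stream
    codebook = data[rng.choice(5000, 16)].copy()   # initial partition (PCA seeding is better)
    for _ in range(10):                            # iterate streaming until convergence
        codebook = lloyd_pass(data, codebook)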
  • 73. E. Gobbetti, F. Marton and J. Diaz Global phase: Online Learning • Vector quantization: convergence depends on initial partition – Good starting point: initial codevectors based on PCA analysis (also in streaming!) [Schneider 03,Knittel et al. 09] • Each new seed is inserted in cell with max residual distortion; split plane selected with PCA of the cell producing two cells with similar residual distortion – Also applied hierarchically [Schneider 03] • Similar concepts applied to online learning for sparse coding – [Mairal 2010, Skretting 2010] 78
  • 74. E. Gobbetti, F. Marton and J. Diaz Global phase: Online Learning • Pros – Easy to implement – Does not run into memory limitations • Cons – Slow due to the number of iterations over many data values (slow convergence) – Slow because of out-of-core operations • Used by – Vector quantization – Sparse coding (but better results working on the full dataset) – … 79
  • 75. E. Gobbetti, F. Marton and J. Diaz Global phase: Splitting • Split dataset into manageable tiles – Process each tile independently – Produce representation for each tile 80 (figure: one global processor and dictionary per tile)
  • 76. E. Gobbetti, F. Marton and J. Diaz Global phase: Splitting • Example: Transform Coding [Fout 07] • Whole dataset subdivided into tiles; for each tile – Transformed with Karhunen-Loeve • Low frequencies contain most of the significant information – Encoded with VQ (then decoded) – Residual encoded; iterated to store 3 representations: • Low-pass • Band-pass • High-pass – Low-pass is enough to encode most of the elements – Inverse transform is computationally intensive: • Data is reprojected during preprocessing • Reprojections stored in the codebook instead of coefficients 81
  • 77. E. Gobbetti, F. Marton and J. Diaz Global phase: Splitting • Pros – Reasonably simple to implement – Permits solving with a global algorithm – High quality per tile • Cons – Discontinuities at tile boundaries – Need to handle multiple contexts at rendering time – Many active dictionaries can lead to cache overflow at rendering time – Scalability limits • Used by – Most dictionary-based methods 82
  • 78. E. Gobbetti, F. Marton and J. Diaz Global phase: Coreset/Importance Sampling • Build the dictionary on a smartly subsampled and reweighted subset of the input data – Coreset concept [Agarwal’05] – Take a few representative subblocks from the original model – Higher probability for more representative samples – Sample reweighting to remove bias – Use the global method on the coreset 83 (figure: importance sampler feeding a global processor that outputs the dictionary)
  • 79. E. Gobbetti, F. Marton and J. Diaz Global phase: Coreset/Importance Sampling • Example: learning dictionary for sparse coding of volume blocks • K-SVD algorithm is extremely efficient on moderate size data – [Aharon et al. 2006] – K-means generalization alternating • sparse coding of signals in X, producing Γ given current D • updates of D given the current sparse representations – Update phase needs to solve linear systems of size proportional to number of training signals! • Use coreset concept! – [Feldman & Langberg 11, Feigin et al. 11, Gobbetti ‘12] 84
  • 80. E. Gobbetti, F. Marton and J. Diaz Global phase: Coreset/Importance Sampling • We associate an importance γi to each of the original blocks yi, γi being the standard deviation of the entries in yi – Picking C elements with probability pi proportional to γi – See [Gobbetti 12] methods for coreset building and training using few streaming passes • Non-uniform sampling introduces a severe bias – Scale each selected block by a weight wi ∝ 1/pi, where pi is the associated probability – Applying K-SVD to scaled coefficients will converge to a dictionary associated with the original problem 85
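A sketch of the sampling step (Python/numpy, illustrative): importance proportional to block standard deviation, weights to undo the bias; the square-root scaling below is an assumption consistent with least-squares fitting, see [Gobbetti 12] for the exact construction:

    import numpy as np

    def coreset_sample(blocks, C, rng):
        """Importance-sample C blocks and reweight them so that training
        (e.g. K-SVD) on the coreset approximates training on all blocks."""
        gamma = blocks.std(axis=1)            # importance gamma_i per block
        p = gamma / gamma.sum()               # selection probabilities p_i
        idx = rng.choice(len(blocks), size=C, replace=False, p=p)
        w = 1.0 / (C * p[idx])                # weights w_i remove the bias
        return blocks[idx] * np.sqrt(w)[:, None]

    rng = np.random.default_rng(6)
    blocks = rng.normal(size=(20000, 64))     # stands in for all volume blocks
    coreset = coreset_sample(blocks, C=1000, rng=rng)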
  • 81. E. Gobbetti, F. Marton and J. Diaz Global phase: Coreset/Importance Sampling • Coreset scalability 86
  • 82. E. Gobbetti, F. Marton and J. Diaz Global phase: Coreset/Importance Sampling • Pros – Proved to work extremely well for KSVD – Extremely scalable – No extra run-time overhead • Cons – Importance concept needs refinement especially for structured data – Tested so far only on limited datasets and few representations. – Needs more testing in multiple contexts • Used by – KSVD, VQ, HVQ 87
  • 83. E. Gobbetti, F. Marton and J. Diaz Processing: two phases • Processing depends on data representation – Pre-defined bases … – Learned bases … • Global processing (some methods) – Access all the dataset (works on billions of samples) – Learn some representation • Local Processing (all methods) – Access independent blocks – Encode [possibly using global phase representation] 88
  • 84. E. Gobbetti, F. Marton and J. Diaz Local phase: Independent Block Encoding • Input model M is split into blocks B0…Bi • For each block Bi – Possibly use information gathered during the global phase (e.g. dictionaries) – Find the transformation/combination/representation that best fits the data – Store in the DB 89 (figure: model M split into blocks B0…Bi, local processing, compressed DB)
  • 85. E. Gobbetti, F. Marton and J. Diaz Local phase: Compression models • Pre-defined bases – discrete Fourier transform (1990;1993) – discrete Hartley transform (1993) – discrete cosine transform (1995) – discrete wavelet transform (1993) – laplacian pyramid/transform (1995;2003) – Burrows-Wheeler transform (2007) • Learned bases – Karhunen-Loeve transform (2007) – tensor approximation (2010;2011) – dictionaries for vector quantization (1993;2003) – dictionaries for sparse coding (2012) – fractal compression (1996) 90
  • 86. E. Gobbetti, F. Marton and J. Diaz Local phase: Compression models • Pre-defined bases – discrete Fourier transform (1990;1993) – discrete Hartley transform (1993) – discrete cosine transform (1995) – discrete wavelet transform (1993) – laplacian pyramid/transform (1995;2003) – Burrows-Wheeler transform (2007) • Learned bases – Karhunen-Loeve transform (2007) – tensor approximation (2010;2011) – dictionaries for vector quantization (1993;2003) – dictionaries for sparse coding (2012) – fractal compression (1996) 91
  • 87. E. Gobbetti, F. Marton and J. Diaz Local phase: Discrete Wavelet Transform • Projecting into a set of pre-defined bases – Concentrates most of the signal in a few low-frequency components • Quantizing the transformed data can significantly reduce data size – Error threshold • Entropy coding techniques are used for further reducing size – Exploiting data correlation 92
  • 88. E. Gobbetti, F. Marton and J. Diaz Local phase: Discrete Wavelet Transform • B-Spline wavelets [Lippert 97] – Wavelet coefficients and positions (splatting) – Delta encoding, RLE and Huffman coding • Hierarchical partition (blocks/sub-blocks) – 3D Haar wavelet, delta encoding, quantization, RLE and Huffman [Ihm ‘99, Kim ‘99] – Biorthogonal 9/7-tap Daubechies and biorthogonal spline wavelets [Nguyen ‘01, Guthe ‘02] – FPGA hardware decoder [Wetekam ‘05] – 3D dyadic wavelet transform [Wu ‘05] • Separated correlated slices – Temporal/spatial coherence [Rodler ‘99] 93 Wavelets are not trivial to decode on the GPU
  • 89. E. Gobbetti, F. Marton and J. Diaz Local phase: Tensor approximation • Tensor decomposition is applied to N-dim input dataset – Tucker model [Suter ‘11] – Product of N basis matrices and a N-dimensional core tensor – HOSVD (Higher-order singular value decomposition) • Applied along every direction of input data • Iterative process for finding the core tensor – Represents a projection of the input dataset – Core tensor coefficients show the relationship between original data and bases 94
  • 90. E. Gobbetti, F. Marton and J. Diaz Local phase: Tensor approximation • Tensor decomposition (Tucker model): A ≈ B ×1 U(1) ×2 U(2) ×3 U(3) – Decomposition obtained through HOSVD or HOOI (higher-order orthogonal iteration) – HOOI also produces rank reduction – Target is the core tensor B (B ∈ ℝ^(R1×R2×R3) with Ri < Ii) – After each iteration rank_i ≤ rank_(i−1) – The core tensor is produced when the matrices cannot be further optimized • [Suter 2011]: tensor decomposition coefficients are quantized: 8-bit encoding for core tensor coefficients, 16-bit encoding for basis matrices 95 Convergence can be hard to achieve on large blocks
  • 91. E. Gobbetti, F. Marton and J. Diaz Local phase: Dictionary based methods • Vector Quantization – Hierarchical Vector Quantization – Local phase simple (nearest neighbor search) • Sparse coding – NP-hard problem – Several efficient greedy pursuit approximation algorithms [Elad ‘10] – Order-Recursive Matching Pursuit (ORMP) via Cholesky decomposition [Gobbetti ‘12] 96
  • 92. E. Gobbetti, F. Marton and J. Diaz Quantization and encoding • Further reduce data size – Quantization – Variable/fixed length encoding • Variable-length encoding used in early compression algorithms – High compression rates / typically decoded on the CPU prior to GPU data upload – Good for leveraging network and disk access times • Fixed-length encoding is better suited to GPU algorithms – A regular memory access pattern allows better GPU parallelization • Fixed-length Huffman + RLE provide high decompression speed [Guthe ‘02, Shen ‘06, Ko ‘08] 97
  • 93. E. Gobbetti, F. Marton and J. Diaz Preprocessing of Time-varying Data • Data structures combining spatial/temporal or temporal/spatial domain grouping – Time Space Partitioning Tree [Ma ‘00] – Wavelet-based TSP [Wang ‘04, Shen ‘06] – Hierarchical structure for each time step + spatial grouping and encoding [Westermann ‘95, Wang ‘10] 98
  • 94. E. Gobbetti, F. Marton and J. Diaz Preprocessing of Time-varying Data • Video-based encoding – Delta encoding with respect to previous frame [Mensmann ‘10, She ‘11] (+Quantization/LZO or HVQ) – Intra coded frame or predictive frame (e.g. Haar wavelet / delta encoding + wavelet) [Ko ‘08, Jang ‘12] • Wavelet transform was frequently applied to time-varying volume data – [Westermann ‘95 and many others; see the paper for details] 99
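A minimal sketch of the video-style predictive step (Python/numpy, illustrative): each time step stores only its delta to the previous one, which is small and compresses well when the data varies smoothly in time:

    import numpy as np

    def encode_timestep(frame, prev):
        """Predictive frame: per-voxel delta to the previous time step
        (the deltas are then quantized and/or entropy-coded)."""
        return frame - prev

    def decode_timestep(delta, prev):
        return prev + delta

    rng = np.random.default_rng(7)
    t0 = rng.normal(size=(32, 32, 32)).astype(np.float32)   # intra-coded frame
    t1 = t0 + 0.01 * rng.normal(size=t0.shape).astype(np.float32)
    delta = encode_timestep(t1, t0)
    assert np.allclose(decode_timestep(delta, t0), t1)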
  • 95. E. Gobbetti, F. Marton and J. Diaz Preprocessing summary • Asymmetric codecs – Most of the heavy work should be done at preprocessing time • Specialized encoding techniques trade compression performance for reconstruction speed (e.g., partial projections for KLT decoding) • Global and local phases – The most challenging part is global processing (heavy, hard) • Recently coreset approaches in addition to online learning or splitting – Promising, needs more research • Many model-specific methods – Fixed bases relatively standard – Recently more focus on computation methods for learned bases (tensors, sparse coding) – GPU decoding constrains the type of transformations (e.g., variable length codes are still hard) 100
  • 96. E. Gobbetti, F. Marton and J. Diaz Next: Rendering 101
  • 97. www.crs4.it/vic/ Visual Computing Group A Survey of Compressed GPU-based Direct Volume Rendering Decoding and Rendering • Enrico Gobbetti • Jose Díaz • Fabio Marton June 2015
  • 98. E. Gobbetti, F. Marton and J. Diaz Decoding and Rendering • Introduction • Architectures • Decoding of compact representations • Conclusions
  • 99. E. Gobbetti, F. Marton and J. Diaz Introduction 105 (figure: preprocessing builds a compressed multiresolution data structure, levels 0…L; at runtime it is rendered by slicing [Guthe’04] or raycasting [Kaehler ‘06]; refinement criteria: transfer function, view dependence; caveat: small-blocks overhead)
  • 100. E. Gobbetti, F. Marton and J. Diaz Introduction • Out-of-core GPU methods – Adaptively traverse space-partitioned GPU data structures covering the full working set Flat-multiresolution blocking [Ljung’06]
  • 101. E. Gobbetti, F. Marton and J. Diaz Introduction • Out-of-core GPU methods – Adaptively traverse space-partitioned GPU data structures covering the full working set • Data structures – Grids • Uniform or non-uniform – Hierarchical grids • Pyramid of uniform grids • Bricked 2D/3D mipmaps Uniform grid Bricked mipmap
  • 102. E. Gobbetti, F. Marton and J. Diaz Introduction • Out-of-core GPU methods – Adaptively traverse space-partitioned GPU data structures covering the full working set • Data structures – Tree structures • KD-tree • Quadtree • Octree (figure: octree, wikipedia.org)
  • 103. E. Gobbetti, F. Marton and J. Diaz Introduction • Out-of-core GPU structures – Adaptively traverse space-partitioned GPU data structures covering the full working set Out-of-core GPU octree structure [Gobbetti’08, Crassin‘09]
  • 104. E. Gobbetti, F. Marton and J. Diaz Introduction • Per-block multiresolution – Each block is encoded using a few levels-of-detail • E.g. HVQ [Schneider’03] • Global and per-block multiresolution – FPGA Wavelet + Huffman encoding [Wetekam’05] • DVR requires interactive performance – We need decompression to run in real time • Asymmetric schemes oriented to work on the GPU – Interaction between decompression and rendering • Bottleneck: GPU memory and bandwidth 110
  • 105. E. Gobbetti, F. Marton and J. Diaz Adaptive out-of-core octree DVR 111
  • 106. E. Gobbetti, F. Marton and J. Diaz Compressed DVR Example 112 Chameleon: 1024^3@16bit (2.1GB); Supernova: 432^3×60timesteps@float (18GB) Compression-domain Rendering from Sparse-Coded Voxel Blocks (EUROVIS2012)
  • 107. E. Gobbetti, F. Marton and J. Diaz RENDERING ARCHITECTURES 113
  • 108. E. Gobbetti, F. Marton and J. Diaz Architecture 114 • Typical architecture for massive datasets
  • 109. E. Gobbetti, F. Marton and J. Diaz Working Set Determination 115 • Traditional methods – Global attribute-based culling (view-independent) • Cull against transfer function, iso value, enabled objects, etc. – View frustum culling (view-dependent) • Cull bricks outside the view frustum – Occlusion culling? • Modern methods – Visibility determined during ray traversal – Rays determine working set directly
  • 110. E. Gobbetti, F. Marton and J. Diaz Working Set Determination (Traditional) 116 • Global attribute-based culling – Cull bricks based on attributes; view-independent • Transfer function • Iso value • Enabled segmented objects – Often based on min/max bricks • Empty space skipping • Skip loading of ‘empty’ bricks • Speed up on-demand spatial queries
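A possible realization of min/max-based transfer-function culling is sketched below (helper and table names are our assumptions): a brick whose scalar range maps to zero opacity everywhere in the transfer function can be skipped entirely. The maxOpacityTable would be rebuilt whenever the transfer function changes.

```cuda
// Global attribute-based culling against the transfer function (sketch).
// maxOpacityTable[i] holds the maximum TF opacity over the i-th value bin;
// scalar values are assumed normalized to [0, 1].
bool brickVisible(float minVal, float maxVal,
                  const float* maxOpacityTable, int tableSize)
{
    int lo = (int)(minVal * (tableSize - 1));
    int hi = (int)(maxVal * (tableSize - 1));
    for (int i = lo; i <= hi; ++i)
        if (maxOpacityTable[i] > 0.0f)
            return true;   // some voxel in the brick may contribute
    return false;          // 'empty' brick: cull it, skip loading
}
```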
  • 111. E. Gobbetti, F. Marton and J. Diaz Working Set Determination (Traditional) 117 • View frustum, occlusion culling – Cull all bricks against view frustum – Cull all occluded bricks – View-dependent
  • 112. E. Gobbetti, F. Marton and J. Diaz Working Set Determination (Modern) 118 • Visibility determined during ray traversal – Implicit view frustum culling (no extra step required) – Implicit occlusion culling (no extra steps or occlusion buffers)
  • 113. E. Gobbetti, F. Marton and J. Diaz Working Set Determination (Modern) 119 • Rays determine working set directly – Each ray writes out the list of bricks it requires (intersects) front-to-back – Use modern OpenGL extensions (GL_ARB_shader_storage_buffer_object, …)
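In CUDA terms (the slide uses OpenGL shader storage buffers, so this is only an analogous sketch with assumed names), each ray thread can append the bricks it intersects to a global request list through an atomic counter; the host then compacts duplicates and streams in the missing bricks.

```cuda
#include <cstdint>

// Append one brick ID to the global working-set request list (sketch).
__device__ void appendBrickRequest(uint32_t brickId,
                                   uint32_t* requestList,
                                   uint32_t* requestCount,
                                   uint32_t  maxRequests)
{
    uint32_t slot = atomicAdd(requestCount, 1u); // reserve a slot
    if (slot < maxRequests)
        requestList[slot] = brickId;             // duplicates compacted later
}
```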
  • 114. E. Gobbetti, F. Marton and J. Diaz Architecture 120 • Rendering
  • 115. E. Gobbetti, F. Marton and J. Diaz Rendering Techniques 121 • Texture slicing [Cullip and Neumann ’93, Cabral et al. ’94, Rezk-Salama et al. ‘00] – (+) Minimal hardware requirements (can run on WebGL) – (−) Visual artifacts, less flexibility
  • 116. E. Gobbetti, F. Marton and J. Diaz Rendering Techniques 122 • GPU ray-casting [Röttger et al. ‘03, Krüger and Westermann ‘03] – (+) Standard image-order approach, embarrassingly parallel – (+) Supports many performance and quality enhancements
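The core of such a GPU ray-caster is the front-to-back compositing loop sketched below (sampleVolume and applyTF are assumed helpers for trilinear fetching and transfer-function lookup); early ray termination once opacity saturates is one of the performance enhancements mentioned above.

```cuda
#include <cuda_runtime.h>

__device__ float  sampleVolume(float3 p); // assumed: trilinear volume fetch
__device__ float4 applyTF(float s);       // assumed: TF color+opacity lookup

// One ray: sample, classify, and composite front-to-back (sketch).
__device__ float4 castRay(float3 pos, float3 dir, float tEnd, float dt)
{
    float4 dst = make_float4(0, 0, 0, 0);
    for (float t = 0; t < tEnd && dst.w < 0.99f; t += dt) { // early termination
        float3 p   = make_float3(pos.x + t * dir.x,
                                 pos.y + t * dir.y,
                                 pos.z + t * dir.z);
        float4 src = applyTF(sampleVolume(p));
        dst.x += (1 - dst.w) * src.w * src.x;  // "over" operator,
        dst.y += (1 - dst.w) * src.w * src.y;  //   front-to-back form
        dst.z += (1 - dst.w) * src.w * src.z;
        dst.w += (1 - dst.w) * src.w;
    }
    return dst;
}
```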
  • 117. E. Gobbetti, F. Marton and J. Diaz Rendering Techniques 123 • Large volume data rendering – Octree rendering based on texture-slicing • [LaMar et al. ’99, Weiler et al. ’00, Guthe et al. ’02] – Bricked single-pass ray-casting • [Hadwiger et al. ’05, Beyer et al. ’07] – Bricked multi-resolution single-pass ray-casting • [Ljung et al. ’06, Beyer et al. ’08, Jeong et al. ’09] – Optimized CPU ray-casting • [Knoll et al. ’11] – GPU octree traversal single-pass ray-casting • [Gobbetti et al. ’12]
  • 118. E. Gobbetti, F. Marton and J. Diaz Architecture 124 • Decompression on CPU before rendering
  • 119. E. Gobbetti, F. Marton and J. Diaz Decompression on CPU before rendering • Decompression (generally lossless) of data streams – E.g. Wavelet-based time-space partitioning [Shen et al. 2006] • (+) Reduces the storage needs • (+) Supports faster and remote access to data 125
  • 120. E. Gobbetti, F. Marton and J. Diaz Decompression on CPU before rendering • Decompression (generally lossless) of data streams – E.g. Wavelet-based time-space partitioning [Shen et al. 2006] • (−) Large memory resources needed • (−) Inefficient use of the GPU memory bandwidth • (−) CPU decompression is generally not sufficient for time-varying datasets • Common in hybrid architectures [Nagayasu’08, Mensmann’10] 126
  • 121. E. Gobbetti, F. Marton and J. Diaz Architecture 127 • Decompression on GPU before rendering
  • 122. E. Gobbetti, F. Marton and J. Diaz Decompression on GPU before rendering • (+) Speeds up decompression by using the GPU – Full working set decompression • [Wetekam’05, Mensmann’10, Suter’11] • (+) Works for time-varying data – … but each time step must fit in GPU memory and be decoded fast enough • (−) Limits the maximum working set size … – … to its size in uncompressed format 128
  • 123. E. Gobbetti, F. Marton and J. Diaz Architecture 129 • Decompression during rendering
  • 124. E. Gobbetti, F. Marton and J. Diaz Decompression during rendering • Decoding occurs on-demand during the rendering process – The entire dataset is never fully decompressed – Data travels in compressed format from the CPU to the GPU • (+) Makes the most of the available memory bandwidth • On-demand, fast and spatially independent decompression on the GPU: – Pure random-access – Local and transient decoding 130
  • 125. E. Gobbetti, F. Marton and J. Diaz Pure random-access • Data decompression is performed each time the renderer needs to access a single voxel – Directly supported by the GPU • Fixed-rate block-coding methods – E.g. OpenGL VTC [Craighead’04], ASTC [Nystad’12] • GPU implementations of per-block scalar quantization – [Yela’08, Iglesias’10] (a decode sketch follows) 131
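Per-block scalar quantization is attractive here precisely because decoding a voxel needs nothing beyond its own 8-bit code and its block's range, as in this sketch (layout and names are our assumptions):

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Pure random-access decode for per-block scalar quantization (sketch).
__device__ float decodeVoxel(const uint8_t* codes,       // one byte per voxel
                             const float2*  blockRange,  // per-block (min, max)
                             int voxelIdx, int blockId)
{
    float2 mm = blockRange[blockId];
    float  t  = codes[voxelIdx] * (1.0f / 255.0f); // normalized code
    return mm.x + t * (mm.y - mm.x);               // min + t * (max - min)
}
```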
  • 126. E. Gobbetti, F. Marton and J. Diaz Pure random-access • Data decompression is performed each time the renderer needs to access a single voxel – Not supported by the GPU but easily implemented over it • GPU implementations of VQ and HVQ – [Schneider’03, Fout’07] • (−) Costly in performance • (−) Rarely implement multisampling or HQ shading – For this they rely on deferred filtering architectures 132
  • 127. E. Gobbetti, F. Marton and J. Diaz Local and transient decoding • Partial working set decompression • Interleaving of decoding and rendering • (+) Better exploits the GPU memory resources 133
  • 128. E. Gobbetti, F. Marton and J. Diaz Deferred filtering • Deferred filtering solutions [Fout’05, Wang’10] – Exploit spatial coherence and support high-quality shading – Decompress groups of slices that can be reused during rendering for gradient and shading computations – Proposed as single resolution 134
  • 129. E. Gobbetti, F. Marton and J. Diaz Deferred filtering (single resolution) 135 Rendering from a Custom Memory Format Using Deferred Filtering [Kniss’05] Gradient computation using central differences [Fout’05]
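The central-differences gradient used for shading in deferred filtering amounts to six extra fetches into the decompressed slices, as in this sketch (sampleVolume is again an assumed helper):

```cuda
#include <cuda_runtime.h>

__device__ float sampleVolume(float3 p); // assumed fetch into decoded slabs

// Gradient by central differences with step h (sketch); normalize for shading.
__device__ float3 gradientCD(float3 p, float h)
{
    float3 g;
    g.x = (sampleVolume(make_float3(p.x + h, p.y, p.z)) -
           sampleVolume(make_float3(p.x - h, p.y, p.z))) / (2.0f * h);
    g.y = (sampleVolume(make_float3(p.x, p.y + h, p.z)) -
           sampleVolume(make_float3(p.x, p.y - h, p.z))) / (2.0f * h);
    g.z = (sampleVolume(make_float3(p.x, p.y, p.z + h)) -
           sampleVolume(make_float3(p.x, p.y, p.z - h))) / (2.0f * h);
    return g;
}
```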
  • 130. E. Gobbetti, F. Marton and J. Diaz Deferred filtering (multiresolution) • Extended deferred filtering to work with multiresolution data structures [Gobbetti’12] • Frame-buffer combination of partial portions of an octree 136 Deferred Filtering of an octree structure in GPU [Gobbetti’12]
  • 131. E. Gobbetti, F. Marton and J. Diaz Multiresolution deferred filtering in 3 passes • Octree: 1. Adaptive refinement • Multiresolution 2. Partitioning into subtrees • Constraint: GPU cache size • Bottom-up recombination of subtrees in GPU cache 3. Subtree decompression, raycasting and compositing • Front-to-back octree traversal – real-time 137 [Diagram: subtrees 1–7 decoded and composited into the framebuffer]
  • 132. E. Gobbetti, F. Marton and J. Diaz COVRA: multiresolution deferred filtering 138
  • 133. E. Gobbetti, F. Marton and J. Diaz Deblocking •Wide range of compressed GPU volume renderers based on block-transform coding – Blocks are compressed and decompressed individually – Allow adaptive data loading – Decoding performed in parallel for different volume parts – Real-time performance thanks to parallel computing •Main issue – High compression ratios may produce artifacts between adjacent blocks
  • 136. E. Gobbetti, F. Marton and J. Diaz Main idea • Filter decompressed data before rendering – Smooth transition across boundaries – Good balance between artifact removal and feature preservation – Low impact on performance (a minimal smoothing sketch follows)
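As a minimal illustration of the idea (this is not the filter of [Marton et al., VMV 2014], just a sketch under our own assumptions), one can smooth only the voxels adjacent to block boundaries, here along x; the same pass would be repeated along y and z.

```cuda
#include <cuda_runtime.h>

// Smooth a small 1D neighborhood only next to block seams along x (sketch);
// block interiors are left untouched, preserving features.
__global__ void deblockX(const float* in, float* out,
                         int nx, int ny, int nz, int B /* block size */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;
    int  i        = (z * ny + y) * nx + x;
    bool interior = (x > 0) && (x < nx - 1);
    bool seam     = interior && (x % B == 0 || x % B == B - 1);
    out[i] = seam ? 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1]
                  : in[i];
}
```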
  • 137. E. Gobbetti, F. Marton and J. Diaz Deblocking architecture •Compressed data – Volume octree of bricks divided into small blocks •At run-time – LOD selection and uploading of blocks to the GPU
  • 138. E. Gobbetti, F. Marton and J. Diaz Deblocking architecture •For each frame – Subdivision into slabs orthogonal to the main view direction – Decompression / deblocking / rendering of each slab – Accumulation of the result into the frame buffer • Ray casting algorithm - Front-to-back order
  • 139. E. Gobbetti, F. Marton and J. Diaz Results
  • 140. E. Gobbetti, F. Marton and J. Diaz Results Johns Hopkins Turbulence simulation (512 time steps) 512 × 512 × 512 @ 32 bit 256 GB uncompressed K-SVD [Gobbetti et al. 10] Compressed to 1.6 GB (0.2 bps) (block size = 8, K = 4, D = 1024)
  • 141. E. Gobbetti, F. Marton and J. Diaz Summary: Rendering Architectures • Decompression on CPU before rendering • Decompression on GPU before rendering • Decompression during rendering 147
  • 142. E. Gobbetti, F. Marton and J. Diaz Conclusions • Local and transient reconstruction is a key factor in compression-domain volume rendering – Requires design at different levels • Compact data model • Rendering architectures • Pure voxel-based random-access and online/transient decoders avoid decompressing the full working set • Efficient decompression is especially important on lighter clients – e.g. mobile, constrained network speeds 148
  • 143. E. Gobbetti, F. Marton and J. Diaz DECODING COMPACT REPRESENTATIONS 149
  • 144. E. Gobbetti, F. Marton and J. Diaz Decoding Compact Representations • Transform domain – Pure compression-domain – Operate directly in the transformed domain – Decoding: through inverse stages of the forward encoding • Per-voxel access – Pure random-access • Per-block accesses – Amortize decompression computations over groups of voxels 150
  • 145. E. Gobbetti, F. Marton and J. Diaz Transform Models (1/4) • Fourier Transform: projection slice theorem [Levoy’92] 151 Projection-slice theorem [Levoy’92] One-dimensional case of the Fourier forward transform (F) and its inverse (f)
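For reference, the standard statements the slide alludes to, in LaTeX form (the one-dimensional Fourier pair, plus the projection-slice theorem that makes rendering in the frequency domain cheap):

```latex
% 1D Fourier transform pair (the F / f of the slide):
F(\omega) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i \omega x}\, dx,
\qquad
f(x) = \int_{-\infty}^{\infty} F(\omega)\, e^{2\pi i \omega x}\, d\omega
% Projection-slice theorem: the 2D Fourier transform of an X-ray projection
% of the volume f equals the central slice of its 3D transform F
% perpendicular to the view direction. A view thus costs one plane
% extraction plus a 2D inverse FFT, O(n^2 log n) instead of O(n^3).
```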
  • 146. E. Gobbetti, F. Marton and J. Diaz Transform Models (1/4) • Fourier Transform: projection slice theorem [Levoy’92] – GPU slice-based rendering [Viola’04] – (−) Attenuation-only renderings (e.g. X-Ray modalities) 152 [Viola’04]
  • 148. E. Gobbetti, F. Marton and J. Diaz Transform Models (1/4) • Fourier Transform: projection slice theorem [Levoy’92] – GPU slice-based rendering [Viola’04] – (−) Attenuation-only renderings (e.g. X-Ray modalities) • The wavelet transform also has an efficient inverse transformation; it has been commonly combined with … – … multiresolution approaches • [Westermann’94, Guthe’02] – … other compression techniques • RLE [Lippert’97], Rice-coding [Wu’05] – … time-varying structures • Wavelet-based Time-Space Partitioning [Wang’04] 154
  • 149. E. Gobbetti, F. Marton and J. Diaz Transform Models (2/4) • Karhunen-Loeve Transform – [Fout’07] optimized the inverse transform computation • Precompute partial reprojections and store them in associated codebooks 155 [Fout’07]
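The precomputed-reprojection trick can be sketched as follows (names and layout are ours, a simplification of [Fout’07]): instead of multiplying quantized coefficients by the KLT basis at decode time, each codebook entry stores its contribution already projected back to the voxel domain, so reconstructing a voxel reduces to summing M table lookups.

```cuda
// Inverse KLT via precomputed partial reprojections (sketch).
// reproj: [M][numCodes][blockSize]; codes: M codeword indices for this block.
__device__ float decodeKLTVoxel(const float* reproj, const int* codes,
                                int voxelOffset, int blockSize,
                                int numCodes, int M)
{
    float v = 0.0f;
    for (int m = 0; m < M; ++m)
        v += reproj[(m * numCodes + codes[m]) * blockSize + voxelOffset];
    return v;
}
```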
  • 151. E. Gobbetti, F. Marton and J. Diaz Transform Models (3/4) • Tensor Approximation – [Wu’08] Hierarchical tensor-based transform. Summation of incomplete tensor approximations at multiple levels – [Suter’11] Combined with a multiresolution octree hierarchy. Reconstruction on the GPU using tensor-times-matrix multiplications 158
  • 152. E. Gobbetti, F. Marton and J. Diaz Transform Models (4/4) • Dictionaries – The decoding is much faster and simpler than the encoding – GPU implementations: • VQ, HVQ (M = number of dictionaries) – [Schneider’03, Fout’07] • RVQ [Parys’09], FCHVQ [Zhao’12] • Sparse coding (S = sparsity) – Parallel voxel decompression [Gobbetti’12] (see the sketch below) – Dictionaries can be uploaded to GPU memory • Just one additional memory indirection is needed 159
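A sparse-coded block decode, in the spirit of [Gobbetti’12] but with an assumed layout, shows why it fits GPUs so well: a voxel is reconstructed from only S dictionary atoms, with the single extra memory indirection noted above.

```cuda
#include <cstdint>

// Reconstruct one voxel of a block from S (atom index, coefficient) pairs
// against a dictionary D of K atoms, each of blockSize voxels (sketch).
__device__ float decodeSparseVoxel(const float*    D,       // K * blockSize
                                   const uint16_t* atomIdx, // S atom indices
                                   const float*    coeff,   // S coefficients
                                   int voxelOffset, int blockSize, int S)
{
    float v = 0.0f;
    for (int s = 0; s < S; ++s)
        v += coeff[s] * D[atomIdx[s] * blockSize + voxelOffset];
    return v;
}
```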
  • 153. E. Gobbetti, F. Marton and J. Diaz Thresholding and quantization • Per-block scalar quantization [Yela’08, Iglesias’10] – Compute min and max values per block • Difference-based compression scheme [Smelyanskiy’09] • Hybrid schemes [Mensmann’10] 160
  • 154. E. Gobbetti, F. Marton and J. Diaz Encoding Models • Fixed-rate block coding techniques – e.g. VTC, S3TC • Adaptive Scalable Texture Compression 162 Partitions and Multiple Color Spaces [Nystad’12]
  • 155. E. Gobbetti, F. Marton and J. Diaz Encoding Models • Adaptive Scalable Texture Compression 163 Procedural partition functions (HW-gen) [Nystad’12]
  • 156. E. Gobbetti, F. Marton and J. Diaz Encoding Models • General approach for applying sequential data compression schemes to blocks of data [Fraedrich’07] • Decoding using an index texture to the beginning of each stripe • Rendering using axis-aligned texture-based volume rendering 164 [Fraedrich’07]
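A stripe-indexed decoder in this style could look like the following sketch (the run-length coding and all names are our assumptions): the index gives the start offset of each compressed stripe, so stripes decode independently, one per thread.

```cuda
#include <cstdint>

// Each thread RLE-decodes one stripe, located via its start offset (sketch).
__global__ void decodeStripes(const uint32_t* stripeStart, // per-stripe offsets
                              const uint8_t*  stream,      // (count, value) pairs
                              uint8_t* out, int stripeLen, int numStripes)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numStripes) return;
    uint32_t src = stripeStart[s];
    int dst = s * stripeLen, written = 0;
    while (written < stripeLen) {
        uint8_t count = stream[src++];
        uint8_t value = stream[src++];
        for (int i = 0; i < count && written < stripeLen; ++i)
            out[dst + written++] = value;
    }
}
```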
  • 157. E. Gobbetti, F. Marton and J. Diaz Conclusions (1/2) • Rendering directly in the transformed domain – Reduced rendering complexity – Several limitations, since it does not support perspective rendering, transfer functions, or complex shading • Methods with real-time performance – Tensor and sparse coding will improve using variable per-block rank/sparsity (not yet available in GPU reconstructors) 165
  • 158. E. Gobbetti, F. Marton and J. Diaz Conclusions (2/2) • Quality – All block-based methods working at low bit rates • exhibit blocking artifacts between adjacent blocks • These can be reduced using real-time deblocking – Uncompressed multiresolution rendering also shows LOD artifacts • Data replication schemes [Guthe’04] • Interblock interpolation schemes [Hege’08] 166
  • 159. E. Gobbetti, F. Marton and J. Diaz Related Papers • Real-time deblocked GPU rendering of compressed volumes. – Marton et al., VMV 2014 • COVRA: A compression-domain output-sensitive volume rendering architecture based on a sparse representation of voxel blocks. – Gobbetti et al. in Eurovis 2012 • Interactive Multiscale Tensor Reconstruction for Multiresolution Volume Visualization. – Suter et al. in IEEE Visualization, 2011 167
  • 160. E. Gobbetti, F. Marton and J. Diaz Related Books and Surveys • Real-Time Volume Graphics. – Engel et al., 2006 • High-Performance Visualization. – Bethel et al. 2012 • Parallel Visualization. – Wittenbrink 1998, Bartz et al. 2000, Zhang et al. 2005 • Real Time Interactive Massive Model Visualization. – Kasik et al. 2006 • Vis and Visual Analysis of Multifaceted Scientific Data. – Kehrer and Hauser 2013 • Compressed GPU-Based Volume Rendering. – Balsa et al. 2013 • A Survey of GPU-Based Large-Scale Volume Visualization. – Beyer et al. 2014 168
  • 161. E. Gobbetti, F. Marton and J. Diaz Acknowledgments • Work done within the DIVA Project – Data Intensive Visualization and Analysis – EU ITN FP7/2007-2013/ REA grant agreement 290227 169
  • 162. www.crs4.it/vic/ Visual Computing Group STAR CGF’14 original authors CRS4 Visual Computing Group – Marcos Balsa Rodriguez – Enrico Gobbetti – Fabio Marton University of Zurich VMML – Renato Pajarola – Susanne K. Suter University of Zaragoza – José Antonio Iglesias Guitián
  • 163. www.crs4.it/vic/ Visual Computing Group • Acknowledgments: • EU FP7 People Programme (Marie Curie Actions) REA Grant Agreement • n°290227 (DIVA) • n°290227 (GOLEM) • Swiss National Science Foundation (SNSF) • n°200021_132521 171 •Contacts: •gobbetti@crs4.it •marton@crs4.it •jdiaz@crs4.it •http://www.crs4.it/vic