GPU Data Structures
for Graphics and Vision
Promotionskolloquium (doctoral defense colloquium), May 6th, 2011
Gernot Ziegler
Dept. of Computer Graphics
(3D Video and Vision-Based Graphics Group)
Outline
Graphics Hardware:
Original Purpose and Recent Development
Classical Usage in Visual Computing
 Free Viewpoint Video Compression
 Color and Depth Reprojection
 Hierarchical Image Processing
General Data Processing
 Data Compaction with the HistoPyramid
 Quadtree and Octree Generation
 Data Expansion with the HistoPyramid
Conclusion
Graphics Hardware: Original Purpose
 Graphics hardware accelerates typical data operations of
computer graphics (pixel moves, triangle rasterization)
 GPU is simpler in design than CPU, but massively parallel.
Graphics Hardware: Capabilities
 ~2003: Graphics Hardware becomes programmable:
GPU (Graphics Processing Unit)
Graphics Hardware: Capabilities
 Data can now be anything (floating point & integer)
 General Purpose Computing on GPU = GPGPU
"Classical Usage" in Visual Computing (still graphics-related)
– Computer Vision
– Video processing
– Volume analysis
General Data Processing
– PDE / ODE solver
– Spatial Data Structure Generation
– Database Ops
– Etc…
Game of Life
(Early GPGPU by S. Green)
Classical Usage in Visual Computing
Free Viewpoint Video Compression
(Chapter 3)
 Map video footage into texture domain via proxy 3D model
Free Viewpoint Video Compression
(Chapter 3)
 Obtain texture surface masking via shadow mapping
Free Viewpoint Video Compression: Publications
 G. Ziegler, H. Lensch, N. Ahmed, M. Magnor, H.-P. Seidel.
Multi-Video Compression in Texture Space.
11th IEEE Intl Conference on Image Processing (ICIP 2004),
Singapore, pp. 2467-2470, 2004.
 G. Ziegler, H. Lensch, M. Magnor, H.-P. Seidel. Multi-Video
Compression in Texture Space using 4D SPIHT.
6th IEEE Workshop on Multimedia Signal Processing, Siena,
Italy, pp. 39-42, 2004.
Color and Depth Reprojection
(Chapter 4)
 Depth-map "Projection" via proxy mesh & vertex shader
Novel View reconstruction
from partial depth camera views
Color and Depth Reprojection
(Chapter 4)
[Figure: blending by view angle vs. our per-pixel approach (purple: blended areas)]
Hierarchical Image Processing: Stereo reconstruction
(Chapter 5.1)
 Projective texturing in plane-sweep (GPU feedback, coarse-to-fine)
Hierarchical Image Processing: Reduction
(Chapter 5.2)
 Mipmap-like reduction: Dominant feature region, noise reduction
Hierarchical Image Processing: Reduction
(Thesis Chapter 5.3)
 Histogram of local gradients guides lens warp compensation
General Data Processing
Graphics Hardware: Capabilities
 GPU has massive computation and memory throughput
Graphics Hardware: Limitations
 GPU is connected to the CPU via a narrow data bus
(bandwidth bottleneck, approx. 4 GB/s)
 GPU is a stream processor:
– A workload of ~10K threads is needed to keep hundreds of cores busy
(data parallelization!)
– Thread switching is lightweight, but synchronization is expensive!
– Each thread can only write to a fixed output position
 Algorithms must be redesigned for GPU!
General Data Processing
Data Compaction
(Chapter 6)
Data-Parallel Algorithm Challenges
 Example from Computer Vision: List of all black pixels in an image
 Step 1: Detect black pixels:
 Step 2: Create a list of detected pixels
Previous approach to feature list generation
 Step 2 (List generation) was not possible on GPU!
 2a: GPU marks local features (e.g. thresholding, filtering)
 2b: CPU searches image and generates feature list
 But: Bus transfers expensive:
GPU useful only for complex feature isolation.
(e.g. large filter convolution & thresholding)
Our approach: Feature list generation on GPU
 We generate feature lists on the GPU using data compaction.
 Pixel/voxel/feature input is abstracted as a "data element stream".
 Compaction keeps only elements deemed relevant for output.
1D example (keep all elements that are blue):
 Data flow:
Massive speedup due to strongly reduced bus dataflow!
[Figure: input stream A B C D E F with 0/1 classifier flags; the kept elements A B D E form the compacted output]
Data Compaction: Problem task in 1D
 Keep a subset of the input elements, selected by a classifier:
 Implementation is trivial on CPU, single-thread.
 On GPU: Need to parallelize into 10k threads!
 First count number of output elements
using data-parallel reduction!
Data Compaction via HistoPyramid: Buildup
 First, count number of output elements,
e.g. 4:1 data-parallel reduction
 (Note that the reduction pyramid is retained: this is the HistoPyramid)
 Can now allocate compact output, no spill.
 But how are output elements generated?
Histogram pyramid /
HistoPyramid
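To make the buildup concrete, here is a minimal CPU reference sketch of the 1D case (illustrative names, not the thesis implementation); every iteration of the inner loops corresponds to one GPU thread. Level 0 holds the classifier output per input element (1 = keep, e.g. "pixel is black"), and each higher level sums four cells of the level below, so the single top cell contains the total output count.

// CPU reference sketch of HistoPyramid buildup (1D, 4:1 reduction).
#include <cstdint>
#include <vector>

// Level 0 holds one classifier value (0 or 1) per input element;
// every higher level sums groups of four cells of the level below.
using Level = std::vector<uint32_t>;

std::vector<Level> buildHistoPyramid(const Level& classifier)
{
    std::vector<Level> hp;
    hp.push_back(classifier);                    // level 0: 0/1 per element
    while (hp.back().size() > 1) {
        const Level& below = hp.back();
        Level above((below.size() + 3) / 4, 0);  // 4:1 reduction
        for (size_t parent = 0; parent < above.size(); ++parent) {
            uint32_t sum = 0;
            for (size_t c = 0; c < 4; ++c) {     // sum up to four children
                size_t child = 4 * parent + c;
                if (child < below.size()) sum += below[child];
            }
            above[parent] = sum;                 // partial element count
        }
        hp.push_back(above);
    }
    return hp;                                   // hp.back()[0] == output count
}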
Data Compaction via HistoPyramid: Traversal
 Output generation: start one thread per output element
 Each output thread traverses reduction pyramid (read-only)
 No read/write hazards = Data-parallel output writing!
 As many threads as output elements
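A matching CPU reference sketch of the traversal (again illustrative, one call per would-be GPU thread): each output element carries a key, starts at the top cell, and at every level subtracts child counts until its key falls into one child, descending until it reaches the level-0 cell of the input element it must copy.

// CPU reference sketch of HistoPyramid traversal for data compaction (1D).
// The pyramid 'hp' is read-only; no thread writes outside its own slot.
#include <cstdint>
#include <vector>

using Level = std::vector<uint32_t>;

// Returns the input index that output element 'key' (0 <= key < total count)
// maps to.
size_t traverse(const std::vector<Level>& hp, uint32_t key)
{
    size_t cell = 0;                                   // start at the top cell
    for (size_t level = hp.size() - 1; level > 0; --level) {
        const Level& below = hp[level - 1];
        size_t child = 4 * cell;                       // first of four children
        // Walk the children left to right, narrowing the key range.
        while (key >= below[child]) {
            key -= below[child];
            ++child;
        }
        cell = child;                                  // descend into this child
    }
    return cell;   // level-0 cell index == position of the kept input element
}

// Usage sketch: compact all kept elements into a tight output array.
// for (uint32_t k = 0; k < hp.back()[0]; ++k)         // one GPU thread per k
//     output[k] = input[traverse(hp, k)];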
HistoPyramid: 2D Data Compaction
 The 1D case was for illustration; the actual implementation is 2D!
 Dataflow diagram:
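The 2D descent step can be sketched as follows (assuming a row-major child order; the actual implementation may scan the 2x2 block in a different order): a parent cell at (x, y) covers the four children at (2x, 2y) through (2x+1, 2y+1), and the key is narrowed to exactly one of them.

// CPU reference sketch of one 2D HistoPyramid descent step (illustrative).
#include <cstdint>
#include <vector>

struct Level2D {
    size_t width = 0, height = 0;
    std::vector<uint32_t> cells;                        // row-major counts
    uint32_t at(size_t x, size_t y) const { return cells[y * width + x]; }
};

void descend(const Level2D& below, size_t& x, size_t& y, uint32_t& key)
{
    // Children of parent (x, y), visited in a fixed order (here: row-major).
    const size_t cx[4] = { 2 * x, 2 * x + 1, 2 * x,     2 * x + 1 };
    const size_t cy[4] = { 2 * y, 2 * y,     2 * y + 1, 2 * y + 1 };
    for (int i = 0; i < 4; ++i) {
        uint32_t n = below.at(cx[i], cy[i]);
        if (key < n) { x = cx[i]; y = cy[i]; return; }  // key falls in this child
        key -= n;                                       // skip this child's range
    }
}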
GPU Data Compaction: Publications
 Data Compaction fast enough for real-time volume analysis
 First application: Mesh-to-volume-to-point cloud in real-time!
 G. Ziegler, A. Tevs, C. Theobalt, H.-P. Seidel
On-the-fly Point Clouds through Histogram Pyramids
11th International Fall Workshop on Vision, Modeling and
Visualization 2006 (VMV2006), 2006, pp. 137-144.
GPU Data Compaction: Publications
 Vector Field Contours: view-dependent vector field analysis to
visualize contour lines throughout the volume
 Data Compaction delivers seed points for contour lines in milliseconds!
 T. Annen, H. Theisel, C. Rössl, G. Ziegler, H.-P. Seidel
Vector Field Contours
Graphics Interface 2008, Windsor/Canada, 2008, pp. 97-105
General Data Processing
Quadtree and Octree Generation
(Chapter 8 and 9)
GPU Quadtrees: Introduction
 2D Reduction follows a quadtree-like reduction pattern.
 By tracking feature similarity in reduction,
quadtrees can be created from the reduction pyramid!
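One plausible formulation of this idea (a sketch, not the thesis algorithm): during the 4:1 reduction, each cell additionally records whether its image block is uniform, and the counts become quadtree leaf counts; four uniform children with equal value merge into a single leaf.

// Sketch of the quadtree idea: track block uniformity during reduction and
// count quadtree leaves instead of plain elements.
#include <cstdint>

struct Cell {
    bool     uniform;   // block contains only one feature value
    uint32_t value;     // that value (valid if uniform)
    uint32_t leaves;    // number of quadtree leaves inside this block
};

Cell reduce4(const Cell c[4])
{
    bool merge = c[0].uniform && c[1].uniform && c[2].uniform && c[3].uniform
              && c[0].value == c[1].value
              && c[0].value == c[2].value
              && c[0].value == c[3].value;
    if (merge)
        return { true, c[0].value, 1 };                 // four blocks fuse into one leaf
    return { false, 0,
             c[0].leaves + c[1].leaves + c[2].leaves + c[3].leaves };
}
// Base level: every pixel is a uniform block with one leaf, { true, pixel, 1 }.
// The top cell's 'leaves' field tells how many quadtree leaves to extract,
// and a HistoPyramid-style traversal over the 'leaves' counts can list them.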
GPU QuadTree: Publications
 Millisecond runtimes enable real-time quadtree processing from video,
e.g. for compression and vision.
 G. Ziegler, R. Dimitrov, C. Theobalt, H-P. Seidel.
Real-time Quadtree Analysis using HistoPyramids.
SPIE Electronic Imaging conference, San Jose/USA, 2007.
GPU Octree
(Chapter 9)
 Feature Clustering extended to 3D volumes
 Octrees from Volume Data
 New algorithm, pointer octrees
(e.g. for spatial data structures)
 Real-time creation of
high-resolution octrees
from meshes possible!
General Data Processing
Data Expansion
(Chapter 7)
Data Expansion via HistoPyramid:
Problem task
 We have a function that determines, for each input element,
how many output copies to create:
 Implementation is trivial on CPU
 GPU: Input can be divided amongst threads, but:
Where shall each thread write its output?
 Insight:
HistoPyramid traversal works even here!
Data Expansion via HistoPyramid:
HP Buildup
 First, count number of output elements, e.g. via 4:1 reduction:
Data Expansion via HistoPyramid:
HP Traversal (single output copy)
 Traversal for single output elements:
 Exactly like data compaction, but mind the local key index k_L
Data Expansion via HistoPyramid:
HP Traversal (multiple output copies)
 Traversal for multiple output elements:
 k_L determines which copy is produced. Still: one thread for each copy!
Data Expansion via HistoPyramid:
HP Traversal (multiple output copies)
 Traversal for multiple output elements:
 k_L determines which copy is produced.
 Observation: a thread can modify its input element before write-out!
 Thus: the output can be a modified version of the input, based on k_L.
 e.g. Geometry Creation:
 (Generic algorithm…)
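A CPU reference sketch of the expansion traversal (illustrative; makeVariant is a hypothetical helper standing in for whatever per-copy processing the application needs): level 0 stores a multiplicity per input element, the descent is identical to compaction, and the key that remains at the bottom is the local copy index k_L.

// CPU reference sketch of HistoPyramid-based data expansion (1D).
// Level 0 stores how many output copies each input element produces;
// buildup is the same summing reduction as for compaction.
#include <cstdint>
#include <vector>

using Level = std::vector<uint32_t>;

struct Hit { size_t input; uint32_t kL; };   // source element + local copy index

Hit traverseExpand(const std::vector<Level>& hp, uint32_t key)
{
    size_t cell = 0;
    for (size_t level = hp.size() - 1; level > 0; --level) {
        const Level& below = hp[level - 1];
        size_t child = 4 * cell;
        while (key >= below[child]) { key -= below[child]; ++child; }
        cell = child;
    }
    return { cell, key };   // the leftover key is k_L, i.e. which copy this is
}

// Usage sketch: one GPU thread per output element.
// for (uint32_t k = 0; k < hp.back()[0]; ++k) {
//     Hit h = traverseExpand(hp, k);
//     output[k] = makeVariant(input[h.input], h.kL);  // hypothetical helper
// }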
Data Expansion: Eikonal Rendering (Publication I)
 Compute light transport through volume objects of varying refraction
 Both real-time rendering and precomputed lighting simulation
 Lighting simulation requires adaptive light wavefront simulation
 I. Ihrke, G. Ziegler, A. Tevs, C. Theobalt, M. Magnor, H.-P. Seidel
Eikonal Rendering: Efficient Light Transport in Refractive Objects
ACM Transactions on Graphics 26 (3): 59-1 - 59-8, 2007
http://www.mpi-inf.mpg.de/resources/EikonalRendering/
Eikonal Rendering: Lighting Simulation
 For given light-object position, precompute lighting inside
the volumetric object for real-time novel view rendering.
 Lighting simulation implements numerical ODE solver on GPU.
 Subdivide light's wavefront into a set of patches
 Patch corners move as GPU particle system
– Each particle follows ray optics
 During update, some patches:
– weaken too much (discard)
– leave volume (discard)
– grow too large (tessellate)
 Since patch list is on GPU:
– Discard: Data Compaction
– Tessellate: Data Expansion
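The patch-list update can be sketched as a single multiplicity function fed into the expansion HistoPyramid (a sketch only; the Patch fields, the thresholds and the 1-to-4 split are assumptions, not the parameters used in the paper):

// Sketch: discard and tessellation expressed as one multiplicity function.
#include <cstdint>

struct Patch {
    float energy;          // remaining wavefront energy carried by this patch
    float area;            // current patch area
    bool  insideVolume;
};

uint32_t copiesToEmit(const Patch& p, float minEnergy, float maxArea)
{
    if (!p.insideVolume)      return 0;   // left the volume: discard (compaction)
    if (p.energy < minEnergy) return 0;   // weakened too much: discard
    if (p.area > maxArea)     return 4;   // grew too large: tessellate (expansion)
    return 1;                             // keep as-is
}
// Feeding copiesToEmit() into the expansion HistoPyramid updates the whole
// patch list in one data-parallel pass; k_L then selects which of the four
// sub-patches a thread generates.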
Eikonal Rendering: Wavefront Propagation
Eikonal Rendering: Short Demo
Data Expansion: Marching Cubes (Publication II)
 Marching Cubes algorithm extracts iso-surfaces from volumes
 Reformulate: Stream of voxels ...
– is first compacted to the relevant iso-surface voxels
– then expanded, becoming a stream of triangle vertices
 C. Dyken, G. Ziegler, C. Theobalt, and H.-P. Seidel
High-speed Marching Cubes using HistoPyramids
Computer Graphics Forum 27 (8): 2028-2039, 2008
http://www.sintef.no/hpmc
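A sketch of the per-cell multiplicity behind this reformulation (illustrative; vertexCount and emitVertex stand in for the usual Marching Cubes case tables and are not the paper's actual interfaces):

// Step 1 (multiplicity): classify the cell against the iso-value and return
// how many triangle vertices it will emit (0, 3, 6, ...).
#include <cstdint>

uint32_t cellMultiplicity(const float corner[8], float iso,
                          const uint8_t vertexCount[256])
{
    uint8_t mcCase = 0;
    for (int i = 0; i < 8; ++i)
        if (corner[i] < iso) mcCase |= uint8_t(1u << i);   // 8-bit MC case code
    return vertexCount[mcCase];        // 0 for empty/full cells -> compacted away
}

// Step 2 (expansion traversal, one thread per output vertex): the HistoPyramid
// maps output index k to a cell plus a local index k_L in [0, vertexCount);
// k_L selects which edge-interpolated vertex of that cell's triangles to emit.
// vertex = emitVertex(cell, mcCase, kL, iso);   // hypothetical helper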
Performance of the OpenGL approach (2007):
 The geometry shader (GS), introduced e.g. with the NVIDIA GeForce 8, enabled
hardware data compaction & expansion for geometry and should have made
HistoPyramids obsolete - but HP-MC outperforms the geometry shader
variant (HP-GS)!
In 2007, HP-MC was the fastest known Marching Cubes algorithm.
[Chart: frames per second]
Conclusion and Outlook
(Chapter 10)
Conclusion and Outlook
 GPUs increasingly useful in general data processing
 Programming Model Restrictions not always bad
– Force programmer to change thought model
– E.g.: the fixed output location restriction led to the HistoPyramid traversal concept
– Can be more efficient, even on more capable hardware!
(atomic counters and geometry shaders perform worse)
 Data-Parallel Algorithm Design is hard
– But once done, parallelizable over any number of available cores
(if sufficient data available)
– Hard to imagine that auto-parallelization can achieve this
 Future work
– Connected components, distance transforms, SATs…
– Accelerate further using CUDA C and OpenCL
Other work based on presented algorithms
Quadtree
 C. N. Vasconcelos, A. Sá, P. C. Carvalho, M. Gattass.
QuadN4tree: A GPU-Friendly Quadtree Leaves Neighborhood Structure.
Proc. of Computer Graphics International Conference (CGI) 2008.
 C. N. Vasconcelos, A. Sá, P. C. Carvalho, M. Gattass.
Using Quadtrees for Energy Minimization Via Graph Cuts.
Proc. of VMV - 12th Vision, Modeling, and Visualization Workshop, pp. 71-80.
Data Expansion
 C. Dyken, M. Reimers, J. Seland.
Real-time GPU Silhouette Refinement using adaptively blended Bézier patches.
Computer Graphics Forum, Volume 27, number 1, pp. 1-12, 2007.
Data Compaction (Implementation)
 J. Fung, S. Mann.
OpenVIDIA: parallel GPU computer vision.
Proc. of 13th annual ACM international conference on Multimedia, pp. 849-852.
http://openvidia.sf.net
End of Presentation
Recent Work
San Jose (CA) | September 23rd, 2010
Christopher Dyken, SINTEF Norway
Gernot Ziegler, NVIDIA UK
GPU-accelerated data expansion
for the Marching Cubes algorithm
HistoPyramid performance
 Accelerated HistoPyramids using CUDA C
 HistoPyramid BuildUp
— Reduce 5-to-1, but store only the first four sums!
— Build several levels at once via on-GPU shared memory
(fewer video memory transactions)
 Marching Cubes specific
— Share scalar input data amongst neighbouring MC cells
(through shared memory)
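Why four sums suffice for a 5-to-1 reduction can be sketched as follows (illustrative C++, not the CUDA kernel): during traversal the fifth child is simply the fall-through case, so its count never needs to be read or stored.

// 'sums' holds the counts of children 0..3 of the current cell; 'key' is known
// to be smaller than the cell's total count. Returns the chosen child (0..4)
// and reduces 'key' to an offset local to that child.
#include <cstdint>

int descend5(const uint32_t sums[4], uint32_t& key)
{
    for (int child = 0; child < 4; ++child) {
        if (key < sums[child]) return child;   // key falls into a stored child
        key -= sums[child];
    }
    return 4;   // otherwise it must fall into the fifth, unstored child
}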
512³-ish 16-bit volume performance:

Dataset (iso-value, source)                 Size                    Triangles (tris/cell)   OpenGL HP4MC        CUDA-OpenGL HP5MC    Speedup
Backpack (iso=0.4, www.volvis.org)          512x512x373 (187 MB)    3 745 320 (0.039)       13 fps (1291 mvps)  43 fps (4129 mvps)   3.2x
Head aneurysm (iso=0.4, www.volvis.org)     512x512x512 (256 MB)    583 610 (0.004)         15 fps (2034 mvps)  78 fps (10399 mvps)  5.1x
Christmas tree (iso=0.05, TU Wien)          512x499x512 (250 MB)    5 629 532 (0.043)       10 fps (1358 mvps)  28 fps (3704 mvps)   2.7x
End of Presentation
