SBR07 Faster Ray Packets - Triangle Intersection through Vertex Culling

Faster ray packets
through
vertex culling
SBR07

Faster Ray Packets - Triangle Intersection through Vertex Culling
Faster ray packets
through
Alexander Reshetov
Intel Corporation

ABSTRACT early if either one of the five conditions is false. For implementations on
GPU [14] or Cell [3], branchless algorithms are more efficient.
Acceleration structures are used in ray tracing to sharply reduce number For most scenes, the great majority of ray- triangle intersection tests can be
of ray-triangle intersection tests at the expense of traversing such eliminated by creating special acceleration structures such as kd-trees,
structures. Bigger structures eliminate more tests, but their traversal grids, Bounding Volume Hierarchies (BVH), which exploit spatial
becomes less efficient, especially for ray packets, for which number of coherency of a scene ([2], [7], [22], [23]). During rendering, the

vertex culling
inactive rays increases at the lower levels of the acceleration structures. acceleration structure is traversed and all triangles in visited leaf nodes are
For dynamic scenes, building or updating acceleration structures is one tested for intersections. For structures which utilize spatial subdivision
of the major performance impediments. (kd-trees, grids), it is possible to create an instance of this type for which
We propose a new way to reduce the total number of tests by creating a total number of tests will be just slightly more than the number of ray-
special transient frustum every time a leaf is traversed by a packet of rays. segments (one test per ray), but this may not result in the best
This frustum contains intersections of active rays with a leaf node and performance. The optimal size of the acceleration structure is dependent
eliminates over 90% of all potential tests. It allows a tenfold reduction in on how fast it could be traversed compared with the average speed of the
size of acceleration structure whilst still achieving a better performance. used ray-triangle intersection test. For hierarchical acceleration structures,
!quot;#$%&'()!!quot;#$%&'&(quot;)'*++!quot;#$*,-(&./+0'.++&0'!()1/+0'. 2 &0(')1*,++()&,03,!&(quot;)!
+ + the higher levels of the hierarchy are usually the most effective in reducing
the number of potential tests. Going down in the spatial hierarchy, nodes
1 INTRODUCTION are becoming smaller and smaller and some of the rays in the packet may
Finding the intersection of a ray and a triangle is equivalent to solving miss them. This negatively affects the utilization of SIMD units [15].
the linear system of three equations In recent years, focus of ray tracing research has shifted to dynamic
!quot;#quot;$quot;%quot;4quot;&'quot;#quot;(quot;)&*quot;!quot;#'+quot;#quot;,quot;)&-quot;!quot;#'+ (1) scenes ([8], [12], [17], [19], [22], [23]), for which it is necessary to
optimize for the total execution time (build/update time plus rendering
with five additional requirements
time). Frequently, to improve the build time, only axis-aligned bounding
$quot;quot;%quot;quot;"quot;%quot;quot;"!.%quot; (2) boxes (AABB) of triangles are used even for kd-trees ([8], [17]). For
$quot;quot;%quot;quot;'(quot;quot;quot;$quot;quot;%quot;quot;)(quot;quot;quot;'*)quot;quot;%quot;quot;+

Alexander Reshetov
(3) this reason, it is important to analyze performance or ray-triangle
The left part of the system (1) defines a ray with the origin !quot; and the intersection tests in the situation when intersection could not be
direction %; the right part ! a point inside the triangle with vertices &'/quot;&*/quot; eliminated by testing a ray against the AABB of a triangle. This
andquot;&-. The unknown variables are: $ ! distance to the intersection point approach was chosen by Kensler and Shirley [10], in which different 3D
quot;#$%& '()& #*+,-& $#./.0, and (/quot; , ! Barycentric coordinates of the point versions of intersection algorithms were analyzed and the best one was
inside the triangle. It is required that the found intersection point be found via genetic optimization of the fitness function.
closer '$& '()& #*+,-&origin than the previously found one (2) and within Even when a ray does intersect the AABB, the probability of it
'()&'#.*0/1),-&2$304*#.)-&(3). intersecting the triangle is only about 20%. It is important to swiftly
One way to solve the system (1) is to use 5#*%)#,-&rule, which allows reject the remaining 80%. Amongst the approaches analyzed in the
finding numerical values directly for all three variables. This will result literature [4] are: use of interval arithmetic; testing the intersection of a
in the implementations similar to Möller-Trumbore [13]. On the other frustum containing rays (typically represented as 4 corner rays) with a
hand, the system (1) also could be solved using Gaussian elimination. It triangle [5]; and culling of the AABB of the individual triangle against
is computationally efficient only if the distance $ is calculated first the frustum. In these approaches per-packet data structures are computed
(which, of course, could also be computed directly by using the well- before the traversal and then used to eliminate unnecessary tests. The
known expression for the distance to a plane along a ray). By culling techniques could also be used for the individual rays as well, as
substituting the found value of $ in any two of the remaining equations, first proposed by Snyder and Barr in 1987 paper [18], in which a ray was
( and , variables could be found. This substitution geometrically first tested against the box formed by the intersection of the $2:);',-&
corresponds to projecting the triangle and the intersection point to the bounding box and a visited cell.

IEEE/EG Symposium
plane defined by the coordinates of the two chosen equations and leads Few years ago, ray-tracing researchers were mostly interested in
to algorithms comparable with one given by Badouel [1]. The S.S.E. rendering static scenes for which persistent data structures were created,
implementation for groups of four rays, proposed by Wald [21], is based optimized for all possible camera positions. Eventually, focus was shifted
on the equivalent 2D projection. Example of the S.S.E. based 3D to dynamic scenes, for which data structures are created (or updated)
approach (formulated in terms of Plücker coordinates) could be found in every frame. There are also initial studies in how to create kd-trees lazily,
6)0'(.0,-& 7(898 [2]. For vector implementations, conditions (2-3) are deeply subdividing only those areas of space that are visited during
usually converted to masks used to selectively update the $, (, and , rendering of a particular frame [9]. In a sense, we pursue this trend to the
values. It makes sense to use packets bigger than the intrinsic SIMD extreme, when transient data structures are created for every packet and

on
width of the targeted architecture as it amortizes per-triangle every visited leaf node. This allows very tight structures, optimized for a
computations and saves bandwidth. It also allows to make decisions given packet and a cell. Surprisingly, these fleeting structures could be
summarily for the whole packet without processing individual rays, for created and handled very efficiently, building on top of the clipping
example using frustum or interval arithmetic ([5], [7], [11], [22], [23]). algorithms, characteristic for the modern packet traversal techniques.
There is no one universal way of solving the system (1) which will be
suitable for all situations and architectures. In particular, CPU
implementations could use conditions (2-3) for exiting from the test

quot;*+,-.) ,.quot;/,0'quot;&1"(2quot;3%45-03quot;.16%+7

Interactive Ray Tracing
'!!,$&,5+"+6777879+:.#$quot;3(%#+quot;)+6)&,0'!&(;,+<'.+=0'!()1+>??@+ 8-9:"7 ;D+=A,+ &0')3(,)&+ H0%3&%#+ (3+!0,'&,5+,;,0.+ &(#,+ '+0'.+ $'!I,&+ ;(3(&3+ '+
A&&$B88CCCD%)(2%*#D5,80&?@8<=?@DA&#*+ *,'H+ )quot;5,+ J3quot;*(585'3A,5+ *(),3+ !quot;00,3$quot;)5+ "+ '!&(;,8()'!&(;,+ 0'.3KD+ L)*.+
E*#/+9,0#')./+:,$&,#F,0+G?2G>/+>??@7 '!&(;,+ 0'.3+ '0,+ %3,5+ "+ !quot;#$%&,+ &A,+ H0%3&%#D+ M,H&B+ 1,),0'*+ 0'.3D+ <(1A&B+
$0(#'0.+0'.3+J!quot;##quot;)+quot;0(1()KD+

2007
SBR07

Key contribution

SBR07

1/10 90 %
Spatial Data
Culling Rate
Struture Size
SBR07

Ray-triangle
intersection cost

SBR07

1 Cycle
(on average)

SBR07

It’s
extremely
fast!
SBR07

What VC do?
(VC = Vertex Culling)

SBR07

Every time
a packet enters
a leaf node,
do following
SBR07

Create packet

Precompute coeff

for each triangles
triangle-packet Cull
if failed

triangle-ray intersect
SBR07

Create packet
from active rays

Active ray

Inactive ray SBR07

Think about

triangle-packet
test

SBR07

Frustum plane

SBR07

v
o
(v - o) . n < 0
n = outside

SBR07

v any of dot(v, n) < 0
= outside of the frustum

SBR07

all of dot(v, n) >= 0
= inside of the frustum

v

SBR07

v0
all of (v .o) < 0
v1
= triangle is outside
of the plane
v2
o

n SBR07

v0
Finally, we got

v1 any of (v . n) > 0
= frustum and triangle
(possibly) intersects
v2

SBR07

1 vertex, 4 plane inside/
outside test computation
can be reduced to

d = [Vz, -Vy, -Vz, Vy]
+ [Vx, Vx, Vx, Vxx] q1 + q0

SBR07

d = [Vz, -Vy, -Vz, Vy]
+ [Vx, Vx, Vx, Vxx] q1 + q0

q1, q0 =
SIMD variable.
Pre-computable
when packet enters leaf node.
SBR07

Culling cost
per vertex.

SBR07

1 SIMD mul
+
2 SIMD add

SBR07

Very efﬁcient

SBR07

This is the core of
VC

SBR07

Just apply efﬁcient
vertex-frustum
culling for each
triangle vertex
SBR07

Culling cost per triangle
per packet =
3(vtx) *
3(mul,add) +
compare
SBR07

First do
corner rays and
triangle
test
SBR07

Such a case can not
be culled by VC.

SBR07

all(u) < 0 or
all(v) < 0 or
v u
all(u+v) > 1
=
All rays in the packet
does not intersect
triangle.SBR07

If all failed,
do
ray-triangle test
for each rays in
the packet. SBR07

Beneﬁt of VC

SBR07

less traversal much traversal
much VCs less VCs
SBR07

Right has 12x more
nodes than left,
but right is just 30%
faster than left. SBR07

spatial data structure
Coarse Fine
Sequential access Random access
more isects less isects
less traversals more traversals
less memory much memory
BVH
BIH
kd-tree
SBR07

spatial data structure
Coarse Fine
Sequential access Random access
more isects less isects
less traversals more traversals
less memory much memory

VC SBR07

The strongest
acceleration
structure
on the earth?
SBR07

Implementation

SBR07

Here is my
implementation of
VC + BVH
SBR07

C++, for VC C, for gcc

My impl.

Original SBR07

Performance
&
Proﬁling
SBR07

512x512
10k tris
SAH-BVH SBR07

Compile ﬂag
-O3
-ffast-math
-mfpmath=sse
-msse2 SBR07

4.7M rays/sec

@ 2.16 GHz
Core2 Mac
SBR07

= 460 Cycles/ray

Hmm...
SBR07

Statistics
nPackets : 8192
nTraversedLeafs(per packet) : 43066 (5.257080)
nTraversedInnerNodes(per packet) : 109922 (13.418213)
nPacketBBoxTests(per packet) : 228036 (27.836426)
nBVHAnyHits : 131968 (57.871564 %)
nBVHAllMisses : 61462 (26.952762 %)
nBVHLastResorts : 34606 (15.175674 %)
nBVHLastResortRayBBoxTests : 322288
(per last resort tests) : 9.313067
nTestedTriangles(per packet) : 314650 (38.409424)
nCulledByBeams(rate) : 223618 (71.068807 %)
nCulledByUV(rate) : 18536 (5.890990 %)
nActuallyTestedTrianglesPerRay : 2.060669

SBR07

Redundant calcs...

SBR07

Non SIMDized
codes...
SBR07

Yeah, I have to
optimize BVH
code before
optimizing VC :-)
SBR07

Implementation
period

SBR07

Vertex packet
Culling BVH

6 Days 2 Days
SBR07

Very easy to implement!

MLRTA requires a half
year in my case.

SBR07

Much more
optimization.

SBR07

Try to use VC
technique for
incoherent rays
SBR07

Here is my
VC + BVH
implementation

http://lucille.svn.sourceforge.net/svnroot/lucille/
angelina/eleonore/

SBR07

SBR07 Faster Ray Packets - Triangle Intersection through Vertex Culling

Recommended

Recommended

More Related Content

Similar to SBR07 Faster Ray Packets - Triangle Intersection through Vertex Culling

Similar to SBR07 Faster Ray Packets - Triangle Intersection through Vertex Culling (20)

More from Syoyo Fujita

More from Syoyo Fujita (7)

Recently uploaded

Recently uploaded (20)

SBR07 Faster Ray Packets - Triangle Intersection through Vertex Culling