Vertex Culling illustration at SBR07


Published on

Published in: Business, Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Vertex Culling illustration at SBR07

  1. 1. Faster ray packets through vertex culling SBR07
  2. 2. Faster Ray Packets - Triangle Intersection through Vertex Culling Faster ray packets through Alexander Reshetov Intel Corporation ABSTRACT early if either one of the five conditions is false. For implementations on GPU [14] or Cell [3], branchless algorithms are more efficient. Acceleration structures are used in ray tracing to sharply reduce number For most scenes, the great majority of ray- triangle intersection tests can be of ray-triangle intersection tests at the expense of traversing such eliminated by creating special acceleration structures such as kd-trees, structures. Bigger structures eliminate more tests, but their traversal grids, Bounding Volume Hierarchies (BVH), which exploit spatial becomes less efficient, especially for ray packets, for which number of coherency of a scene ([2], [7], [22], [23]). During rendering, the vertex culling inactive rays increases at the lower levels of the acceleration structures. acceleration structure is traversed and all triangles in visited leaf nodes are For dynamic scenes, building or updating acceleration structures is one tested for intersections. For structures which utilize spatial subdivision of the major performance impediments. (kd-trees, grids), it is possible to create an instance of this type for which We propose a new way to reduce the total number of tests by creating a total number of tests will be just slightly more than the number of ray- special transient frustum every time a leaf is traversed by a packet of rays. segments (one test per ray), but this may not result in the best This frustum contains intersections of active rays with a leaf node and performance. The optimal size of the acceleration structure is dependent eliminates over 90% of all potential tests. It allows a tenfold reduction in on how fast it could be traversed compared with the average speed of the size of acceleration structure whilst still achieving a better performance. used ray-triangle intersection test. For hierarchical acceleration structures, !quot;#$%&'()!!quot;#$%&'&(quot;)'*++!quot;#$*,-(&./+0'.++&0'!()1/+0'. 2 &0(')1*,++()&,03,!&(quot;)! + + the higher levels of the hierarchy are usually the most effective in reducing the number of potential tests. Going down in the spatial hierarchy, nodes 1 INTRODUCTION are becoming smaller and smaller and some of the rays in the packet may Finding the intersection of a ray and a triangle is equivalent to solving miss them. This negatively affects the utilization of SIMD units [15]. the linear system of three equations In recent years, focus of ray tracing research has shifted to dynamic !quot;#quot;$quot;%quot;4quot;&'quot;#quot;(quot;)&*quot;!quot;#'+quot;#quot;,quot;)&-quot;!quot;#'+ (1) scenes ([8], [12], [17], [19], [22], [23]), for which it is necessary to optimize for the total execution time (build/update time plus rendering with five additional requirements time). Frequently, to improve the build time, only axis-aligned bounding $quot;quot;%quot;quot;&quot;quot;%quot;quot;&quot;!.%quot; (2) boxes (AABB) of triangles are used even for kd-trees ([8], [17]). For $quot;quot;%quot;quot;'(quot;quot;quot;$quot;quot;%quot;quot;)(quot;quot;quot;'*)quot;quot;%quot;quot;+ Alexander Reshetov (3) this reason, it is important to analyze performance or ray-triangle The left part of the system (1) defines a ray with the origin !quot; and the intersection tests in the situation when intersection could not be direction %; the right part ! a point inside the triangle with vertices &'/quot;&*/quot; eliminated by testing a ray against the AABB of a triangle. This andquot;&-. The unknown variables are: $ ! distance to the intersection point approach was chosen by Kensler and Shirley [10], in which different 3D quot;#$%& '()& #*+,-& $#./.0, and (/quot; , ! Barycentric coordinates of the point versions of intersection algorithms were analyzed and the best one was inside the triangle. It is required that the found intersection point be found via genetic optimization of the fitness function. closer '$& '()& #*+,-&origin than the previously found one (2) and within Even when a ray does intersect the AABB, the probability of it '()&'#.*0/1),-&2$304*#.)-&(3). intersecting the triangle is only about 20%. It is important to swiftly One way to solve the system (1) is to use 5#*%)#,-&rule, which allows reject the remaining 80%. Amongst the approaches analyzed in the finding numerical values directly for all three variables. This will result literature [4] are: use of interval arithmetic; testing the intersection of a in the implementations similar to Möller-Trumbore [13]. On the other frustum containing rays (typically represented as 4 corner rays) with a hand, the system (1) also could be solved using Gaussian elimination. It triangle [5]; and culling of the AABB of the individual triangle against is computationally efficient only if the distance $ is calculated first the frustum. In these approaches per-packet data structures are computed (which, of course, could also be computed directly by using the well- before the traversal and then used to eliminate unnecessary tests. The known expression for the distance to a plane along a ray). By culling techniques could also be used for the individual rays as well, as substituting the found value of $ in any two of the remaining equations, first proposed by Snyder and Barr in 1987 paper [18], in which a ray was ( and , variables could be found. This substitution geometrically first tested against the box formed by the intersection of the $2:);',-& corresponds to projecting the triangle and the intersection point to the bounding box and a visited cell. IEEE/EG Symposium plane defined by the coordinates of the two chosen equations and leads Few years ago, ray-tracing researchers were mostly interested in to algorithms comparable with one given by Badouel [1]. The S.S.E. rendering static scenes for which persistent data structures were created, implementation for groups of four rays, proposed by Wald [21], is based optimized for all possible camera positions. Eventually, focus was shifted on the equivalent 2D projection. Example of the S.S.E. based 3D to dynamic scenes, for which data structures are created (or updated) approach (formulated in terms of Plücker coordinates) could be found in every frame. There are also initial studies in how to create kd-trees lazily, 6)0'(.0,-& 7(898 [2]. For vector implementations, conditions (2-3) are deeply subdividing only those areas of space that are visited during usually converted to masks used to selectively update the $, (, and , rendering of a particular frame [9]. In a sense, we pursue this trend to the values. It makes sense to use packets bigger than the intrinsic SIMD extreme, when transient data structures are created for every packet and on width of the targeted architecture as it amortizes per-triangle every visited leaf node. This allows very tight structures, optimized for a computations and saves bandwidth. It also allows to make decisions given packet and a cell. Surprisingly, these fleeting structures could be summarily for the whole packet without processing individual rays, for created and handled very efficiently, building on top of the clipping example using frustum or interval arithmetic ([5], [7], [11], [22], [23]). algorithms, characteristic for the modern packet traversal techniques. There is no one universal way of solving the system (1) which will be suitable for all situations and architectures. In particular, CPU implementations could use conditions (2-3) for exiting from the test quot;*+,-.) ,.quot;/,0'quot;&1&quot;(2quot;3%45-03quot;.16%+7 Interactive Ray Tracing '!!,$&,5+&quot;+6777879+:.#$quot;3(%#+quot;)+6)&,0'!&(;,+<'.+=0'!()1+>??@+ 8-9:&quot;7 ;D+=A,+ &0')3(,)&+ H0%3&%#+ (3+!0,'&,5+,;,0.+ &(#,+ '+0'.+ $'!I,&+ ;(3(&3+ '+ A&&$B88CCCD%)(2%*#D5,80&?@8<=?@DA&#*+ *,'H+ )quot;5,+ J3quot;*(585'3A,5+ *(),3+ !quot;00,3$quot;)5+ &quot;+ '!&(;,8()'!&(;,+ 0'.3KD+ L)*.+ E*#/+9,0#')./+:,$&,#F,0+G?2G>/+>??@7 '!&(;,+ 0'.3+ '0,+ %3,5+ &quot;+ !quot;#$%&,+ &A,+ H0%3&%#D+ M,H&B+ 1,),0'*+ 0'.3D+ <(1A&B+ $0(#'0.+0'.3+J!quot;##quot;)+quot;0(1()KD+ 2007 SBR07
  3. 3. Demo SBR07
  4. 4. Key contribution SBR07
  5. 5. 1/10 90 % Spatial Data Culling Rate Struture Size SBR07
  6. 6. Ray-triangle intersection cost SBR07
  7. 7. 1 Cycle (on average) SBR07
  8. 8. It’s extremely fast! SBR07
  9. 9. Algorithm SBR07
  10. 10. What VC do? (VC = Vertex Culling) SBR07
  11. 11. Every time a packet enters a leaf node, do following SBR07
  12. 12. Create packet Precompute coeff for each triangles triangle-packet Cull if failed triangle-ray intersect SBR07
  13. 13. Create packet Precompute coeff for each triangles triangle-packet Cull if failed triangle-ray intersect SBR07
  14. 14. Create packet from active rays Active ray Inactive ray SBR07
  15. 15. Create packet Precompute coeff for each triangles triangle-packet Cull if failed triangle-ray intersect SBR07
  16. 16. Think about triangle-packet test SBR07
  17. 17. Axis aligned SBR07
  18. 18. Frustum plane SBR07
  19. 19. v o (v - o) . n < 0 n = outside SBR07
  20. 20. v any of dot(v, n) < 0 = outside of the frustum SBR07
  21. 21. all of dot(v, n) >= 0 = inside of the frustum v SBR07
  22. 22. v0 all of (v .o) < 0 v1 = triangle is outside of the plane v2 o n SBR07
  23. 23. v0 Finally, we got v1 any of (v . n) > 0 = frustum and triangle (possibly) intersects v2 SBR07
  24. 24. 1 vertex, 4 plane inside/ outside test computation can be reduced to d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vxx] q1 + q0 SBR07
  25. 25. d = [Vz, -Vy, -Vz, Vy] + [Vx, Vx, Vx, Vxx] q1 + q0 q1, q0 = SIMD variable. Pre-computable when packet enters leaf node. SBR07
  26. 26. Culling cost per vertex. SBR07
  27. 27. 1 SIMD mul + 2 SIMD add SBR07
  28. 28. Very efficient SBR07
  29. 29. This is the core of VC SBR07
  30. 30. Create packet Precompute coeff for each triangles triangle-packet Cull if failed triangle-ray intersect SBR07
  31. 31. Just apply efficient vertex-frustum culling for each triangle vertex SBR07
  32. 32. Culling cost per triangle per packet = 3(vtx) * 3(mul,add) + compare SBR07
  33. 33. Create packet Precompute coeff for each triangles triangle-packet Cull if failed triangle-ray intersect SBR07
  34. 34. SBR07
  35. 35. First do corner rays and triangle test SBR07
  36. 36. Such a case can not be culled by VC. SBR07
  37. 37. SBR07
  38. 38. all(u) < 0 or all(v) < 0 or v u all(u+v) > 1 = All rays in the packet does not intersect triangle.SBR07
  39. 39. If all failed, do ray-triangle test for each rays in the packet. SBR07
  40. 40. Thats all! SBR07
  41. 41. Benefit of VC SBR07
  42. 42. 1/10 90 % Spatial Data Culling Rate Struture Size SBR07
  43. 43. less traversal much traversal much VCs less VCs SBR07
  44. 44. Right has 12x more nodes than left, but right is just 30% faster than left. SBR07
  45. 45. spatial data structure Coarse Fine Sequential access Random access more isects less isects less traversals more traversals less memory much memory BVH BIH kd-tree SBR07
  46. 46. spatial data structure Coarse Fine Sequential access Random access more isects less isects less traversals more traversals less memory much memory VC SBR07
  47. 47. The strongest acceleration structure on the earth? SBR07
  48. 48. Implementation SBR07
  49. 49. Here is my implementation of VC + BVH SBR07
  50. 50. C++, for VC C, for gcc My impl. Original SBR07
  51. 51. Performance & Profiling SBR07
  52. 52. 512x512 10k tris SAH-BVH SBR07
  53. 53. Compile flag -O3 -ffast-math -mfpmath=sse -msse2 SBR07
  54. 54. 4.7M rays/sec @ 2.16 GHz Core2 Mac SBR07
  55. 55. = 460 Cycles/ray Hmm... SBR07
  56. 56. Statistics nPackets : 8192 nTraversedLeafs(per packet) : 43066 (5.257080) nTraversedInnerNodes(per packet) : 109922 (13.418213) nPacketBBoxTests(per packet) : 228036 (27.836426) nBVHAnyHits : 131968 (57.871564 %) nBVHAllMisses : 61462 (26.952762 %) nBVHLastResorts : 34606 (15.175674 %) nBVHLastResortRayBBoxTests : 322288 (per last resort tests) : 9.313067 nTestedTriangles(per packet) : 314650 (38.409424) nCulledByBeams(rate) : 223618 (71.068807 %) nCulledByUV(rate) : 18536 (5.890990 %) nActuallyTestedTrianglesPerRay : 2.060669 SBR07
  57. 57. Core2 Mac SBR07
  58. 58. BVH VC SBR07
  59. 59. Redundant calcs... SBR07
  60. 60. Non SIMDized codes... SBR07
  61. 61. Yeah, I have to optimize BVH code before optimizing VC :-) SBR07
  62. 62. Implementation period SBR07
  63. 63. Vertex packet Culling BVH 6 Days 2 Days SBR07
  64. 64. Very easy to implement! MLRTA requires a half year in my case. SBR07
  65. 65. TODO SBR07
  66. 66. Much more optimization. SBR07
  67. 67. Try to use VC technique for incoherent rays SBR07
  68. 68. Here is my VC + BVH implementation angelina/eleonore/ SBR07
  69. 69. ? SBR07
  70. 70. Thank you! SBR07