2. AGENDA
Overview of Raytracing & KD Trees
Review of GCN Architecture
Mapping Raytracing to GPUs
Optimizing Raytracing using CodeXL
2 | Optimizing Raytracing on GCN with AMD Development Tools | NOVEMBER 2013
4. ACCELERATION STRUCTURES TRADE-OFFS
[Figure: Uniform Grid, Bounding Volume Hierarchies, and KD Tree placed on a spectrum between Construction Speed and Tracing Speed.]
5. HIERARCHICAL KD TREE – 2D
[Figure: a 2D scene subdivided into regions, shown together with the corresponding KD tree; the labels A through G appear in both.]
6. KD TREE – 3D
7. STACK BASED TRAVERSAL KD TREE – 2D
[Figure: a ray crossing the 2D KD tree scene (labels A through G), with the ray parameters tMin, t1, t2 and tMax marked along it.]
8. TRAVERSING KD TREES – PSEUDO CODE
stack.push(KDroot, sceneMin, sceneMax)
tHit = infinity
while !(stack.empty()):
    (node, tStart, tEnd) = stack.pop()
    # descend until a leaf is reached
    while !(node.isLeaf()):
        tSplit = (node.value - ray.origin[node.axis]) / ray.direction[node.axis]
        (near, far) = findNear(ray.origin[node.axis], node.left, node.right)
        if (tSplit >= tEnd or tSplit < 0)
            # the ray segment stays on the near side
            node = near
        else if (tSplit <= tStart)
            # the ray segment lies entirely on the far side
            node = far
        else
            # the segment crosses the plane: visit near now, far later
            stack.push(far, tSplit, tEnd)
            node = near
            tEnd = tSplit
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    # a hit inside the current cell is the closest possible hit
    if tHit < tEnd:
        return tHit
return tHit
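The pseudo code above can be turned into a small runnable sketch. This is an illustration under assumptions, not the presenters' implementation: the `Node` layout, the `hit_circle` helper, the circle primitives and the four-leaf test scene are all hypothetical, the ray direction is assumed to be normalized, and a ray parallel to a split plane is handled by treating tSplit as infinite. The function also reports the deepest stack it needed.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    axis: int = -1                 # split axis (0 = x, 1 = y); -1 marks a leaf
    split: float = 0.0             # split plane position along `axis`
    left: "Node" = None            # child below the split plane
    right: "Node" = None           # child above the split plane
    prims: list = field(default_factory=list)  # leaf circles as (cx, cy, r)

def hit_circle(circle, o, d):
    """Distance at which ray o + t*d (d normalized) enters the circle, else inf.
    Rays that start inside a circle are not handled in this sketch."""
    cx, cy, r = circle
    ox, oy = o[0] - cx, o[1] - cy
    b = ox * d[0] + oy * d[1]
    disc = b * b - (ox * ox + oy * oy - r * r)
    if disc < 0.0:
        return math.inf
    t = -b - math.sqrt(disc)
    return t if t >= 0.0 else math.inf

def trace_stack(root, o, d, t_min, t_max):
    stack = [(root, t_min, t_max)]
    max_depth = 1                  # deepest stack seen so far
    t_hit = math.inf
    while stack:
        node, t_start, t_end = stack.pop()
        while node.axis >= 0:      # descend to a leaf
            a = node.axis
            t_split = (node.split - o[a]) / d[a] if d[a] != 0.0 else math.inf
            if o[a] < node.split:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split < 0.0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                stack.append((far, t_split, t_end))
                max_depth = max(max_depth, len(stack))
                node, t_end = near, t_split
        for circle in node.prims:
            t_hit = min(t_hit, hit_circle(circle, o, d))
        if t_hit < t_end:
            return t_hit, max_depth
    return t_hit, max_depth

# Hypothetical scene: three split planes along x, one circle in the last leaf.
root = Node(0, 0.0,
            Node(0, -2.0, Node(), Node()),
            Node(0, 2.0, Node(), Node(prims=[(3.0, 0.0, 0.5)])))
result = trace_stack(root, (-5.0, 0.0), (1.0, 0.0), 0.0, 10.0)
print(result)  # (hit distance, deepest stack)
```

For this scene the ray reaches the circle at t = 7.5 with at most two entries on the stack; in general the stack can grow to the depth of the tree.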
10. GCN ARCHITECTURE
First introduced with the “Southern Islands” family of GPUs.
Available with the upcoming “Kaveri” APU.
Scalar architecture.
ECC support (on some models).
Double precision support.
Multiple concurrent queues for compute.
11. GPU SCALAR ARCHITECTURE VS CPU SSE EXTENSIONS
float x;
x = x + 1;
Scalar code does not utilize the SSE capabilities of the CPU.
[Figure: Thread 1 through Thread 16.]
12. HOW SCALAR CODE IS EXECUTED
float x;
x = x + 1;
[Figure: GCN executing threads T1 through T16.]
13. IMPLICATIONS FOR RAY TRACING
Ray packetization: a single thread traces several rays in one KD tree traversal to achieve better utilization of the SIMD units and the cache.
No explicit ray packetization is required on GCN.
The hardware implicitly packetizes every 64 threads: all 64 threads of a wavefront execute the same instruction together.
14. A SEQUENCER FOR EVERY COMPUTE UNIT
[Figure: four compute units, each paired with its own sequencer (SQ).]
A sequencer is a hardware block responsible for issuing program instructions.
A compute unit can run up to 40 wavefronts, each with a distinct program counter.
GPU under-utilization caused by rays with long traversals can therefore occur only at the wavefront level.
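To make the wavefront-level effect concrete, the sketch below estimates lane utilization from per-ray traversal step counts. It rests on a simplified model that is an assumption of this example, not a statement about the hardware scheduler: a wavefront keeps all 64 lanes occupied until its slowest ray finishes, and every step costs the same. The function name and the step counts are hypothetical.

```python
def wavefront_utilization(steps_per_ray, wavefront_size=64):
    """Fraction of lane-steps doing useful work under the simplified model.
    Expects a non-empty list of per-ray step counts."""
    useful = 0   # lane-steps that do real traversal work
    issued = 0   # lane-steps spent, idle lanes included
    for start in range(0, len(steps_per_ray), wavefront_size):
        wave = steps_per_ray[start:start + wavefront_size]
        useful += sum(wave)
        issued += max(wave) * wavefront_size
    return useful / issued

# All 64 rays need the same number of steps: no lane ever waits.
uniform = wavefront_utilization([10] * 64)
# One ray needs ten times more steps: the other 63 lanes idle while it finishes.
one_long_ray = wavefront_utilization([10] * 63 + [100])
print(uniform, one_long_ray)
```

In this model a single long ray drags its own wavefront down to roughly 11% utilization, while other wavefronts on the same compute unit are unaffected because each has its own program counter.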
15. HOW MUCH ON CHIP MEMORY DO WE HAVE?
HD 7970 – “Tahiti”
256 KB VGPR per CU × 32 = 8.192 MB
8 KB SGPR per CU × 32 = 0.256 MB
16 KB L1 V-Data cache per CU × 32 = 0.512 MB
16 KB L1 S-Data cache per 4 CUs × 8 = 0.128 MB
32 KB instruction cache per 4 CUs × 8 = 0.256 MB
L2 data cache = 768 KB
64 KB LDS per CU × 32 = 2.048 MB
Total: 12.16 MB
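The total can be re-derived with a few lines of arithmetic. The sketch below only restates the per-block sizes listed above; the dictionary keys and variable names are chosen for this example, and it follows the slide's convention of 1 MB = 1000 KB.

```python
CU_COUNT = 32  # compute units on the HD 7970, per the list above

on_chip_kb = {
    "VGPR": 256 * CU_COUNT,
    "SGPR": 8 * CU_COUNT,
    "L1 vector data cache": 16 * CU_COUNT,
    "L1 scalar data cache": 16 * (CU_COUNT // 4),  # shared by 4 CUs
    "instruction cache": 32 * (CU_COUNT // 4),     # shared by 4 CUs
    "L2 data cache": 768,
    "LDS": 64 * CU_COUNT,
}

total_kb = sum(on_chip_kb.values())
for name, size_kb in on_chip_kb.items():
    print(f"{name:22s} {size_kb:5d} KB")
print(f"{'total':22s} {total_kb:5d} KB")
```

The sum is 12,160 KB, which matches the 12.16 MB on the slide; the register files account for about two thirds of it.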
16. AMD CODEXL
Coherent, innovative and unified developer tools suite
‒ Debug, Profile, and Analyze applications
‒ Supports OpenCL™ and OpenGL
‒ AMD CPUs, GPUs and APUs
‒ Standalone and integrated into Microsoft® Visual Studio®
‒ Supported on Windows® and Linux®
‒ Does not require source code modifications
17. BE SURE YOUR KERNEL SIZE DOES NOT EXCEED INSTRUCTION CACHE SIZE
19. HOW CAN A GPU TRAVERSE A TREE?
[Figure: a tree of seven nodes.]
Pack all the nodes into a single buffer and wrap the buffer in an OpenCL memory object (cl_mem).
With HSA we can leverage the unified memory architecture and access the tree as-is.
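One way to pack a pointer-based tree into a single buffer is to replace child pointers with indices into an array of fixed-size records. The sketch below shows the host-side idea only. The 16-byte record layout, the `Node` fields and the `flatten` helper are assumptions made for this example, not the layout used in the talk; the resulting bytes are the kind of payload that would then be copied into an OpenCL buffer.

```python
import struct
from dataclasses import dataclass

# Assumed record: int32 axis, float32 split, int32 a, int32 b (16 bytes).
# Inner node: a/b are the indices of the left/right child records.
# Leaf (axis == -1): a/b are the first primitive index and the primitive count.
RECORD = struct.Struct("<ifii")

@dataclass
class Node:
    axis: int = -1
    split: float = 0.0
    left: "Node" = None
    right: "Node" = None
    first_prim: int = 0
    prim_count: int = 0

def flatten(root):
    records = []

    def visit(node):
        index = len(records)
        records.append(None)       # reserve the slot so a parent precedes its children
        if node.axis < 0:
            records[index] = (-1, 0.0, node.first_prim, node.prim_count)
        else:
            left = visit(node.left)
            right = visit(node.right)
            records[index] = (node.axis, node.split, left, right)
        return index

    visit(root)
    return b"".join(RECORD.pack(*rec) for rec in records)

root = Node(0, 0.5,
            Node(first_prim=0, prim_count=2),
            Node(first_prim=2, prim_count=1))
buffer = flatten(root)

# Walking the buffer needs only index arithmetic, no pointers.
root_record = RECORD.unpack_from(buffer, 0)
right_leaf = RECORD.unpack_from(buffer, root_record[3] * RECORD.size)
print(len(buffer), root_record, right_leaf)
```

Because children are addressed by index, the same bytes are valid in any address space, which is what makes the buffer usable from a kernel.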
20. HOW MUCH MEMORY DO WE NEED FOR THE STACK?
Per wavefront = maximal depth of the tree × size of a stack frame × 64 threads.
25 × 12 bytes × 64 = 19,200 bytes (~19 KB)
A stack this large leads to GPR spilling to local (per-thread scratch) memory or to low scheduling utilization.
Whether GPRs are spilled is decided by the OpenCL compiler at compile time.
GPRs spilled to memory are also known as scratch registers.
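The estimate is easy to reproduce. In the sketch below the depth of 25 and the 12-byte frame come from the slide; reading the frame as three 4-byte values (node index, tStart, tEnd) is an assumption of this example, as are the constant names.

```python
MAX_TREE_DEPTH = 25   # from the slide
FRAME_BYTES = 12      # from the slide; assumed to be node index + tStart + tEnd
WAVEFRONT_SIZE = 64   # threads per wavefront on GCN
REGISTER_BYTES = 4    # one 32-bit register value per thread

stack_bytes_per_thread = MAX_TREE_DEPTH * FRAME_BYTES
stack_bytes_per_wavefront = stack_bytes_per_thread * WAVEFRONT_SIZE
registers_per_thread = stack_bytes_per_thread // REGISTER_BYTES

print(stack_bytes_per_thread, "bytes per thread")
print(stack_bytes_per_wavefront, "bytes per wavefront")
print(registers_per_thread, "32-bit registers per thread for the stack alone")
```

Seen per thread, a full-depth stack costs 300 bytes, or 75 32-bit registers, before any other kernel state is counted.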
21. HOW TO DETECT SCRATCH REGISTERS USING CODEXL
22. STACKLESS TRACE – RESTART TRAVERSAL
[Figure: restart traversal on the 2D KD tree (labels A through G). The ray is marked with tMin, t1, t2, t3 and tMax, and the interval labels shown are (t1, tMax), (t1, t2), (t2, t3) and (t3, tMax).]
23. KD RESTART ALGORITHM
tStart = tEnd = sceneMin
tHit = infinity
while (tEnd < sceneMax):
    # restart from the root for the next segment of the ray
    node = root
    tStart = tEnd
    tEnd = sceneMax
    while (not node.isLeaf()):
        axis = node.axis
        tSplit = (node.PlanePos - ray.origin[axis]) / ray.direction[axis]
        (near, far) = findNear(ray.origin[axis], node.left, node.right)
        if (tSplit >= tEnd or tSplit <= 0)
            node = near
        else if (tSplit <= tStart)
            node = far
        else
            # no stack: only remember where the near cell ends
            node = near
            tEnd = tSplit
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    if tHit < tEnd:
        return tHit
return tHit
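A runnable sketch of the restart scheme follows. As before, this is an illustration under assumptions rather than the presenters' code: the `Node` layout, the `hit_circle` helper, the circle primitives and the test scene are hypothetical, the ray direction is assumed to be normalized, and a ray parallel to a split plane is handled by treating tSplit as infinite. The function counts how many times it descends from the root, which is the price paid for dropping the stack.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    axis: int = -1                 # split axis (0 = x, 1 = y); -1 marks a leaf
    split: float = 0.0             # split plane position along `axis`
    left: "Node" = None            # child below the split plane
    right: "Node" = None           # child above the split plane
    prims: list = field(default_factory=list)  # leaf circles as (cx, cy, r)

def hit_circle(circle, o, d):
    """Distance at which ray o + t*d (d normalized) enters the circle, else inf.
    Rays that start inside a circle are not handled in this sketch."""
    cx, cy, r = circle
    ox, oy = o[0] - cx, o[1] - cy
    b = ox * d[0] + oy * d[1]
    disc = b * b - (ox * ox + oy * oy - r * r)
    if disc < 0.0:
        return math.inf
    t = -b - math.sqrt(disc)
    return t if t >= 0.0 else math.inf

def trace_restart(root, o, d, t_min, t_max):
    t_hit = math.inf
    t_end = t_min
    descents = 0                   # number of root-to-leaf descents
    while t_end < t_max:
        node, t_start, t_end = root, t_end, t_max
        descents += 1
        while node.axis >= 0:
            a = node.axis
            t_split = (node.split - o[a]) / d[a] if d[a] != 0.0 else math.inf
            if o[a] < node.split:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split <= 0.0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                node, t_end = near, t_split   # the far side is found by a later restart
        for circle in node.prims:
            t_hit = min(t_hit, hit_circle(circle, o, d))
        if t_hit < t_end:
            return t_hit, descents
    return t_hit, descents

# Hypothetical scene: three split planes along x, one circle in the last leaf.
root = Node(0, 0.0,
            Node(0, -2.0, Node(), Node()),
            Node(0, 2.0, Node(), Node(prims=[(3.0, 0.0, 0.5)])))
result = trace_restart(root, (-5.0, 0.0), (1.0, 0.0), 0.0, 10.0)
print(result)  # (hit distance, root descents)
```

For this scene the ray crosses four leaves, so the sketch descends from the root four times before it finds the hit at t = 7.5. Each iteration strictly advances tStart, so the loop ends after at most one descent per leaf the ray visits.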
24. EFFECT ON GPR SPILLAGE
27. CAN THIS BE FURTHER REFINED?
Which on-chip memory aren’t we using?
LDS = Local Data Share.
Short stack algorithm: initialize a stack smaller than the maximum depth of the tree. If it overflows, fall back to the KD restart algorithm.
If we place the short stack in the LDS, what should the depth of the “short stack” be?
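The slide does not spell out how the fallback works, so the sketch below uses one common formulation as an assumption: when the bounded stack overflows, the oldest entry (the one farthest along the ray) is discarded, and whenever the stack runs empty before the ray has left the scene, traversal restarts from the root. The `Node` layout, `hit_circle`, the `capacity` parameter and the test scene are hypothetical, and the ray direction is assumed to be normalized.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    axis: int = -1                 # split axis (0 = x, 1 = y); -1 marks a leaf
    split: float = 0.0             # split plane position along `axis`
    left: "Node" = None            # child below the split plane
    right: "Node" = None           # child above the split plane
    prims: list = field(default_factory=list)  # leaf circles as (cx, cy, r)

def hit_circle(circle, o, d):
    """Distance at which ray o + t*d (d normalized) enters the circle, else inf.
    Rays that start inside a circle are not handled in this sketch."""
    cx, cy, r = circle
    ox, oy = o[0] - cx, o[1] - cy
    b = ox * d[0] + oy * d[1]
    disc = b * b - (ox * ox + oy * oy - r * r)
    if disc < 0.0:
        return math.inf
    t = -b - math.sqrt(disc)
    return t if t >= 0.0 else math.inf

def trace_short_stack(root, o, d, t_min, t_max, capacity):
    stack = []                     # never holds more than `capacity` entries
    descents = 0                   # number of restarts from the root
    t_hit = math.inf
    t_end = t_min
    while t_end < t_max:
        if stack:
            node, t_start, t_end = stack.pop()
        else:                      # nothing saved: restart from the root
            node, t_start, t_end = root, t_end, t_max
            descents += 1
        while node.axis >= 0:
            a = node.axis
            t_split = (node.split - o[a]) / d[a] if d[a] != 0.0 else math.inf
            if o[a] < node.split:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split < 0.0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                stack.append((far, t_split, t_end))
                if len(stack) > capacity:
                    del stack[0]   # overflow: forget the farthest entry
                node, t_end = near, t_split
        for circle in node.prims:
            t_hit = min(t_hit, hit_circle(circle, o, d))
        if t_hit < t_end:
            return t_hit, descents
    return t_hit, descents

# Hypothetical scene: three split planes along x, one circle in the last leaf.
root = Node(0, 0.0,
            Node(0, -2.0, Node(), Node()),
            Node(0, 2.0, Node(), Node(prims=[(3.0, 0.0, 0.5)])))
roomy = trace_short_stack(root, (-5.0, 0.0), (1.0, 0.0), 0.0, 10.0, capacity=8)
tight = trace_short_stack(root, (-5.0, 0.0), (1.0, 0.0), 0.0, 10.0, capacity=1)
print(roomy, tight)  # (hit distance, root descents) for each capacity
```

Both runs find the same hit at t = 7.5. With room for the whole stack a single descent from the root suffices, and with a one-entry stack the sketch needs one extra restart, so the stack depth trades memory against repeated descents.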
28. HOW MANY WAVEFRONTS ARE EXECUTED CONCURRENTLY
Use the CodeXL application trace to discover how many wavefronts execute concurrently with stackless traversal.
29. OCCUPANCY GRAPHS
30. WHAT SHOULD BE THE SIZE OF THE SHORT STACK?
64 KB / 12 wavefronts / 64 threads / sizeof(Frame) = 7
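The division can be checked directly. The sketch below assumes sizeof(Frame) is the 12 bytes used in the earlier stack-size estimate and that the 12 wavefronts share the LDS of one compute unit evenly; the constant names are chosen for this example.

```python
LDS_BYTES = 64 * 1024     # LDS per compute unit
WAVEFRONTS_PER_CU = 12    # concurrent wavefronts, from the slide
WAVEFRONT_SIZE = 64       # threads per wavefront
FRAME_BYTES = 12          # assumed sizeof(Frame), as in the stack-size estimate

bytes_per_stack_level = WAVEFRONTS_PER_CU * WAVEFRONT_SIZE * FRAME_BYTES
short_stack_depth = LDS_BYTES // bytes_per_stack_level
lds_bytes_left = LDS_BYTES - short_stack_depth * bytes_per_stack_level

print(short_stack_depth, "frames per thread")
print(lds_bytes_left, "bytes of LDS left over")
```

Each additional stack level costs 9,216 bytes across the compute unit, so seven levels fit and an eighth would not.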