PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi Cohen

OPTIMIZING RAYTRACING ON GCN WITH AMD DEVELOPMENT TOOLS
TZACHI COHEN
NOVEMBER 2013

AGENDA
Overview of Raytracing & KD Trees

Review of GCN Architecture

Mapping Raytracing to GPUs

Optimizing Raytracing using CodeXL


ACCELERATION STRUCTURES TRADE-OFFS

[Figure: Uniform Grid, Bounding Volume Hierarchies, and KD Tree placed between two axes, Construction Speed and Tracing Speed]


HIERARCHICAL KD TREE – 2D
[Figure: a 2D KD tree; the labels A to G appear both in the spatial subdivision and in the tree diagram]

KD TREE – 3D


STACK BASED TRAVERSAL KD TREE – 2D
[Figure: a ray crossing the 2D KD tree (labels A to G), with the ray parameters tMin, t1, t2, and tMax marked along it]

TRAVERSING KD TREES – PSEUDO CODE
stack.push(KDroot, sceneMin, sceneMax)
tHit = infinity
while not stack.empty():
    (node, tStart, tEnd) = stack.pop()
    while not node.isLeaf():
        # ray parameter at which the ray crosses this node's split plane
        tSplit = (node.value - ray.origin[node.axis]) / ray.direction[node.axis]
        (near, far) = findNear(ray.origin[node.axis], node.left, node.right)
        if tSplit >= tEnd or tSplit < 0:
            node = near                     # segment lies entirely on the near side
        else if tSplit <= tStart:
            node = far                      # segment lies entirely on the far side
        else:
            stack.push(far, tSplit, tEnd)   # visit the far child later
            node = near
            tEnd = tSplit
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    if tHit < tEnd:
        return tHit                         # nearest hit lies inside this leaf's segment
return tHit


REVIEW OF GCN ARCHITECTURE
• First introduced with the “Southern Islands” family of GPUs.
• Available in the upcoming “Kaveri” APU.
• Scalar architecture.
• ECC support (on some models).
• Double precision support.
• Multiple concurrent queues for compute.


GPU SCALAR ARCHITECTURE VS CPU SSE EXTENSIONS
float x;
x = x + 1;
• Scalar code does not utilize the SSE capabilities of the CPU.

[Figure: Thread 1 through Thread 16]


HOW SCALAR CODE IS EXECUTED
float x;
x = x + 1;

GCN
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16


IMPLICATIONS FOR RAY TRACING
• Ray packetization: a single thread traces several rays in one KD tree traversal to achieve better utilization of the SIMD units and the cache.
• No explicit ray packetization is required on GCN.
• The HW implicitly packetizes every 64 threads: all 64 threads of a wavefront execute the same instruction together.


A SEQUENCER FOR EVERY COMPUTE UNIT

[Figure: four compute units, each paired with its own sequencer (SQ)]

• A sequencer is a HW block responsible for issuing program instructions.
• A compute unit can run up to 40 wavefronts, each with a distinct program counter.
• GPU under-utilization caused by rays with long traversals can happen only at the wavefront level: the other lanes of that wavefront wait, while other wavefronts proceed independently.
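
A rough way to picture the last point is to model each ray as a number of traversal loop iterations and let every wavefront run until its slowest lane finishes. The sketch below is only an illustration: the function name and the step counts are made up, and a real kernel would have to be measured with a profiler.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront on GCN, as stated on the slides

def wavefront_utilization(steps_per_ray):
    """Fraction of issued lane-steps that do useful work, per wavefront.

    steps_per_ray holds hypothetical traversal iteration counts, one per ray.
    """
    results = []
    for i in range(0, len(steps_per_ray), WAVEFRONT_SIZE):
        lanes = steps_per_ray[i:i + WAVEFRONT_SIZE]
        # All lanes step together, so the wavefront runs for max(lanes) steps.
        issued = max(lanes) * WAVEFRONT_SIZE
        results.append(sum(lanes) / issued)
    return results

# Two wavefronts: the first contains one long ray, the second is uniform.
rays = [10] * 63 + [100] + [10] * 64
utilization = wavefront_utilization(rays)
print(utilization)
```

In this toy input the long ray drags down only the first wavefront; the second one is unaffected.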

HOW MUCH ON CHIP MEMORY DO WE HAVE?
HD 7970 – “Tahiti”
256 KB VGPR per CU X 32 = 8.192 MB
8 KB SGPR per CU X 32 = 0.256 MB
16 KB L1 V-Data cache per CU X 32 = 0.512 MB
16 KB L1 S-Data cache per 4 CUs X 8 = 0.128 MB
32 KB instruction cache per 4 CUs X 8 = 0.256 MB
L2 Data Cache = 768 KB
64 KB LDS per CU X 32 = 2.048 MB

Total: 12.16 MB
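
The total can be re-derived from the per-unit figures on this slide. The snippet below is a small arithmetic check, assuming 32 compute units and the slide's convention of 1 MB = 1000 KB; the variable names are illustrative.

```python
NUM_CU = 32  # compute units on the HD 7970, per the slide

# (name, size in KB per instance, number of instances)
on_chip = [
    ("VGPR", 256, NUM_CU),
    ("SGPR", 8, NUM_CU),
    ("L1 vector data cache", 16, NUM_CU),
    ("L1 scalar data cache", 16, NUM_CU // 4),  # shared by 4 CUs
    ("Instruction cache", 32, NUM_CU // 4),     # shared by 4 CUs
    ("L2 data cache", 768, 1),
    ("LDS", 64, NUM_CU),
]

total_kb = sum(size * count for _, size, count in on_chip)
for name, size, count in on_chip:
    print(f"{name:22s} {size * count:5d} KB")
print(f"Total: {total_kb} KB = {total_kb / 1000:.2f} MB")
```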


AMD CODEXL
• Coherent, innovative and unified developer tools suite
‒ Debug, profile, and analyze applications
‒ Supports OpenCL™ and OpenGL
‒ AMD CPUs, GPUs and APUs
‒ Standalone and integrated into Microsoft® Visual Studio®
‒ Supported on Windows® and Linux®
‒ Does not require source code modifications


BE SURE YOUR KERNEL SIZE DOES NOT EXCEED
INSTRUCTION CACHE SIZE


HOW CAN A GPU TRAVERSE A TREE?

[Figure: a tree of linked Node objects]

• Pack all the nodes into a single buffer (child pointers become offsets into that buffer) and wrap the buffer in a CL mem object.
• When using HSA, we can leverage the unified memory architecture and access the tree as-is.
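
As a host-side illustration of the first bullet, the sketch below flattens a pointer-based tree into fixed-size records whose child links are array indices. It is a minimal sketch under assumptions: the `Node` class, the 16-byte record layout, and the use of -1 to mark leaves are all hypothetical, and leaf primitive lists are left out.

```python
import struct
from dataclasses import dataclass

@dataclass
class Node:
    axis: int = -1            # -1 marks a leaf in this sketch
    split: float = 0.0
    left: "Node" = None
    right: "Node" = None

def flatten(root):
    """Depth-first layout: one (axis, split, left_index, right_index) record per node."""
    records = []

    def visit(node):
        index = len(records)
        records.append(None)                  # reserve this node's slot
        if node.left is None:                 # leaf: no children to link
            records[index] = (-1, 0.0, -1, -1)
        else:
            left = visit(node.left)
            right = visit(node.right)
            records[index] = (node.axis, node.split, left, right)
        return index

    visit(root)
    return records

def pack(records):
    """Serialize as little-endian int32, float32, int32, int32 (16 bytes per node)."""
    return b"".join(struct.pack("<ifii", *r) for r in records)

tree = Node(0, 0.0, Node(), Node(1, 1.5, Node(), Node()))
records = flatten(tree)
blob = pack(records)
print(records)
print(len(blob), "bytes")
```

The resulting bytes are what a host program would copy into the OpenCL buffer; the kernel then follows the integer indices instead of pointers.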


HOW MUCH MEMORY DO WE NEED FOR THE STACK?

Per wavefront = maximal depth of the tree X size of a stack frame X 64 threads.

25 X 12 bytes X 64 = 19,200 bytes (~19 KB)
A stack this large leads either to GPR spilling to local memory or to low scheduling utilization, because fewer wavefronts fit in the register file.
GPR spilling is decided by the OpenCL compiler at compile time.
GPRs spilled to local memory are also known as scratch registers.
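
The estimate above, spelled out. The 12-byte frame is taken from the slide; that it consists of a node index plus tStart and tEnd at 4 bytes each is an assumption made here for illustration.

```python
MAX_TREE_DEPTH = 25   # from the slide
FRAME_BYTES = 12      # from the slide; assumed layout: node index, tStart, tEnd
WAVEFRONT_SIZE = 64   # threads per wavefront

per_thread = MAX_TREE_DEPTH * FRAME_BYTES       # stack bytes for one ray
per_wavefront = per_thread * WAVEFRONT_SIZE     # stack bytes for 64 rays
print(per_thread, "bytes per thread =", per_thread // 4, "32-bit registers")
print(per_wavefront, "bytes per wavefront =", per_wavefront / 1024, "KiB")
```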


HOW TO DETECT SCRATCH REGISTERS USING CODEXL


STACKLESS TRACE – RESTART TRAVERSAL
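[Figure: a ray crossing the 2D KD tree (labels A to G), with the ray parameters tMin, t1, t2, t3, and tMax marked and the traversal interval shown for each successive restart]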

KD RESTART ALGORITHM
tStart = tEnd = sceneMin
tHit = infinity
while tEnd < sceneMax:
    node = root                  # restart from the root for each new segment
    tStart = tEnd
    tEnd = sceneMax
    while not node.isLeaf():
        axis = node.axis
        tSplit = (node.PlanePos - ray.origin[axis]) / ray.direction[axis]
        (near, far) = findNear(ray.origin[axis], node.left, node.right)
        if tSplit >= tEnd or tSplit <= 0:
            node = near
        else if tSplit <= tStart:
            node = far
        else:
            node = near
            tEnd = tSplit        # no push: the far side is reached by a later restart
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    if tHit < tEnd:
        return tHit
return tHit

EFFECT ON GPR SPILLAGE


Optimizing Raytracing using CodeXL

CAN THIS BE FURTHER REFINED?
• Which on-chip memory aren’t we using?
LDS = Local Data Store.
Short stack algorithm: initialize a stack smaller than the maximum depth of the tree. If it overflows, fall back to the KD restart algorithm.
If we place the short stack in the LDS, how deep should the “short stack” be?
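
The short stack idea can be tried out on the CPU with a few lines of Python. This is a sketch under stated assumptions, not the kernel from the talk: the `Sphere` and `Node` classes, the `trace` function, and the toy scene are hypothetical, and the overflow policy (silently dropping the oldest stack entry and restarting from the root once the stack runs empty) is one common way to realize the fallback. Degenerate cases such as a ray origin lying exactly on a split plane are not treated specially.

```python
import math
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Sphere:
    center: tuple
    radius: float

    def intersect(self, origin, direction):
        """Smallest positive ray parameter t at which the ray hits, or inf."""
        oc = [o - c for o, c in zip(origin, self.center)]
        a = sum(d * d for d in direction)
        b = 2.0 * sum(d * o for d, o in zip(direction, oc))
        c = sum(o * o for o in oc) - self.radius ** 2
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return math.inf
        root = math.sqrt(disc)
        for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
            if t > 0.0:
                return t
        return math.inf

@dataclass
class Node:
    axis: int = 0
    split: float = 0.0
    left: "Node" = None      # None on both children marks a leaf
    right: "Node" = None
    prims: list = field(default_factory=list)

def trace(root, origin, direction, scene_min, scene_max, stack_size):
    """Return (t_hit, restarts); stack_size=0 behaves like pure KD restart."""
    stack = deque(maxlen=stack_size)  # appending to a full deque drops the oldest entry
    t_hit = math.inf
    t_end = scene_min
    restarts = -1                     # the first descent from the root is not a restart
    while stack or t_end < scene_max:
        if stack:
            node, t_start, t_end = stack.pop()
        else:                         # stack exhausted: restart where the last leaf ended
            node, t_start, t_end = root, t_end, scene_max
            restarts += 1
        while node.left is not None:
            a, d = node.axis, direction[node.axis]
            t_split = (node.split - origin[a]) / d if d != 0.0 else math.inf
            if origin[a] < node.split:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split < 0.0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                stack.append((far, t_split, t_end))
                node, t_end = near, t_split
        for prim in node.prims:
            t_hit = min(t_hit, prim.intersect(origin, direction))
        if t_hit < t_end:
            return t_hit, restarts
    return t_hit, restarts

# Toy scene: three split planes along x, one sphere in the right-most leaf.
root = Node(0, 0.0,
            Node(0, -2.0, Node(), Node()),
            Node(0, 2.0, Node(), Node(prims=[Sphere((4.0, 0.0), 0.5)])))
origin, direction = (-5.0, 0.0), (1.0, 0.0)
results = {size: trace(root, origin, direction, 0.0, 10.0, size) for size in (0, 1, 8)}
print(results)
```

All three stack sizes find the same hit; what changes is how often the traversal has to restart from the root, which is the cost the short stack is meant to reduce.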


HOW MANY WAVEFRONTS ARE EXECUTED CONCURRENTLY
• Use the CodeXL application trace to discover how many wavefronts are executed concurrently with stackless traversal.


OCCUPANCY GRAPHS


WHAT SHOULD BE THE SIZE OF THE SHORT STACK?

64 KB / 12 wavefronts / 64 threads / sizeof(Frame) = 7 (rounded down)
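
The same division as a quick check. The 12-byte frame size is carried over from the earlier stack estimate (25 X 12 X 64), which is an assumption about what sizeof(Frame) refers to here; the constant names are illustrative.

```python
LDS_BYTES = 64 * 1024   # LDS per compute unit
WAVEFRONTS = 12         # concurrent wavefronts, as used on the slide
WAVEFRONT_SIZE = 64     # threads per wavefront
FRAME_BYTES = 12        # assumed sizeof(Frame), from the earlier estimate

depth = LDS_BYTES // (WAVEFRONTS * WAVEFRONT_SIZE * FRAME_BYTES)
used = depth * WAVEFRONTS * WAVEFRONT_SIZE * FRAME_BYTES
print("short stack depth:", depth)
print("LDS used:", used, "of", LDS_BYTES, "bytes")
```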


RESULTS
[Figure: bar chart comparing Full stack, Stackless, Short stack, and Short stack on LDS; vertical axis from 60 to 120]

• Results are in millions of rays per second on a Radeon™ HD 7970.


Questions?


DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to
time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
OpenCL™ is a trademark of Apple Inc. which is licensed to the Khronos organization. Linux™ is the trademark of Linus Torvalds.
Microsoft™ and Windows™ are the trademarks of Microsoft Corp. All other names used in this presentation are for
informational purposes only and may be trademarks of their respective owners.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO
EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be
trademarks of their respective owners.


REFERENCES
• Introduction to GCN
‒ http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf

• GCN white paper
‒ http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

• CodeXL home page
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

• AMD OpenCL Programming Guide
‒ http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

