OPTIMIZING RAYTRACING ON GCN WITH AMD DEVELOPMENT TOOLS
TZACHI COHEN
NOVEMBER 2013
AGENDA
Overview of Raytracing & KD Trees

Review of GCN Architecture

Mapping Raytracing to GPUs

Optimizing Raytracing using CodeXL

2 | Optimizing Raytracing on GCN with AMD Development Tools | NOVEMBER 2013
Overview of Raytracing
ACCELERATION STRUCTURES TRADE-OFFS

[Figure: Uniform Grid, Bounding Volume Hierarchies and KD Tree arranged along a spectrum from Construction Speed to Tracing Speed]

HIERARCHICAL KD TREE – 2D

[Figure: a 2D scene split by a KD tree, shown next to the tree itself; nodes are labeled A through G]

KD TREE – 3D

STACK BASED TRAVERSAL KD TREE – 2D

[Figure: a ray crossing the 2D KD tree scene (nodes A through G) from tMin to tMax, with intermediate split-plane crossings t1 and t2]

TRAVERSING KD TREES – PSEUDO CODE
stack.push(KDroot, sceneMin, sceneMax)
tHit = infinity
while not stack.empty():
    (node, tStart, tEnd) = stack.pop()
    while not node.isLeaf():
        # ray parameter at which the ray crosses the node's split plane
        tSplit = (node.value - ray.origin[node.axis]) / ray.direction[node.axis]
        (near, far) = findNear(ray.origin[node.axis], node.left, node.right)
        if tSplit >= tEnd or tSplit < 0:
            node = near        # plane is beyond the interval or behind the ray
        else if tSplit <= tStart:
            node = far         # interval lies entirely on the far side
        else:
            stack.push(far, tSplit, tEnd)   # visit the far child later
            node = near
            tEnd = tSplit
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    if tHit < tEnd:
        return tHit            # closest hit lies inside the current interval
return tHit
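
The pseudo code above can be turned into a small runnable sketch. Everything below is illustrative rather than the presenter's implementation: the `Node` layout, the `hit_circle` helper, the 2D circle primitives and the `scene` tree are assumptions made for the example, and rays parallel to a split plane are simply sent to the near child.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    axis: int = -1                  # 0 = x, 1 = y; -1 marks a leaf
    value: float = 0.0              # split plane position on `axis`
    left: Optional["Node"] = None   # child below the plane
    right: Optional["Node"] = None  # child above the plane
    prims: List[Tuple[float, float, float]] = field(default_factory=list)  # circles (cx, cy, r)

def hit_circle(origin, direction, circle):
    """Smallest t >= 0 at which the ray hits the circle, else infinity.
    Assumes `direction` has unit length."""
    cx, cy, r = circle
    ox, oy = origin[0] - cx, origin[1] - cy
    b = ox * direction[0] + oy * direction[1]
    disc = b * b - (ox * ox + oy * oy - r * r)
    if disc < 0:
        return math.inf
    s = math.sqrt(disc)
    for t in (-b - s, -b + s):
        if t >= 0:
            return t
    return math.inf

def trace(root, origin, direction, scene_min, scene_max):
    """Stack-based KD tree traversal; returns the closest hit distance."""
    stack = [(root, scene_min, scene_max)]
    t_hit = math.inf
    while stack:
        node, t_start, t_end = stack.pop()
        while node.axis >= 0:
            a, d = node.axis, direction[node.axis]
            # A ray parallel to the plane never crosses it.
            t_split = (node.value - origin[a]) / d if d != 0 else math.inf
            # An origin exactly on the plane counts as the side being left,
            # so the t_split == 0 case below moves on to the far child.
            below = origin[a] < node.value or (origin[a] == node.value and d > 0)
            near, far = (node.left, node.right) if below else (node.right, node.left)
            if t_split >= t_end or t_split < 0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                stack.append((far, t_split, t_end))
                node = near
                t_end = t_split
        for prim in node.prims:
            t_hit = min(t_hit, hit_circle(origin, direction, prim))
        if t_hit < t_end:
            return t_hit
    return t_hit

# Hypothetical scene: one split at x = 5 with one circle on each side.
scene = Node(axis=0, value=5.0,
             left=Node(prims=[(2.0, 0.0, 1.0)]),
             right=Node(prims=[(8.0, 3.0, 1.0)]))
print(trace(scene, (0.0, 0.0), (1.0, 0.0), 0.0, 10.0))  # 1.0, hit in the near leaf
```

A ray at y = 3 misses the circle in the near leaf, so the far child is popped from the stack and the hit is found there.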

GCN ARCHITECTURE
• First introduced with the “Southern Islands” family of GPUs.
• Also available in the upcoming “Kaveri” APU.
• Scalar architecture.
• ECC support (on some models).
• Double precision support.
• Multiple concurrent queues for compute.

GPU SCALAR ARCHITECTURE VS CPU SSE EXTENSIONS
• float x;
• x = x + 1;
• Scalar code does not utilize the SSE capabilities of the CPU.

[Figure: sixteen threads, Thread 1 through Thread 16]

HOW SCALAR CODE IS EXECUTED
• float x;
• x = x + 1;

[Figure: GCN, with threads T1 through T16 side by side]

IMPLICATIONS FOR RAY TRACING
• Ray packetization means having a single thread trace several rays in one KD tree traversal, to achieve better utilization of the SIMD units and the cache.
• No explicit ray packetization is required on GCN.
• The hardware implicitly packetizes every 64 threads: all 64 threads of a wavefront execute the same instruction together.

A SEQUENCER FOR EVERY COMPUTE UNIT

[Figure: four compute units, each paired with its own sequencer (SQ)]

• A sequencer is a hardware block responsible for issuing program instructions.
• A compute unit can run up to 40 wavefronts, each with a distinct program counter.
• GPU under-utilization due to rays with long traversals can therefore happen only at the wavefront level.
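
The last point can be illustrated with a deliberately simplified model: if a wavefront keeps all 64 lanes allocated until its longest ray finishes, lanes whose rays are already done sit idle. The function name and the per-ray step counts below are hypothetical, and the model ignores memory latency and scheduling.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront on GCN, as stated above

def wavefront_utilization(steps_per_ray):
    """Fraction of lane-iterations doing useful work in one wavefront.

    Simplified model (an assumption): the wavefront runs for as many
    traversal steps as its longest ray needs, and a lane is idle once
    its own ray has finished."""
    if len(steps_per_ray) != WAVEFRONT_SIZE:
        raise ValueError("expected one step count per lane")
    longest = max(steps_per_ray)
    if longest == 0:
        return 1.0
    return sum(steps_per_ray) / (WAVEFRONT_SIZE * longest)

uniform = wavefront_utilization([10] * 64)            # every ray takes 10 steps
one_long = wavefront_utilization([10] * 63 + [100])   # a single ray takes 100 steps
print(uniform, one_long)
```

In this model one long ray drags its own wavefront down to roughly 11% utilization, while other wavefronts on the same compute unit are unaffected because each has its own program counter.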

HOW MUCH ON-CHIP MEMORY DO WE HAVE?
HD 7970 – “Tahiti”
VGPR:               256 KB per CU    x 32 = 8.192 MB
SGPR:                 8 KB per CU    x 32 = 0.256 MB
L1 V-Data cache:     16 KB per CU    x 32 = 0.512 MB
L1 S-Data cache:     16 KB per 4 CUs x  8 = 0.128 MB
Instruction cache:   32 KB per 4 CUs x  8 = 0.256 MB
L2 Data cache:                              0.768 MB
LDS:                 64 KB per CU    x 32 = 2.048 MB

Total: 12.16 MB
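
As a quick check of the arithmetic above, the total can be recomputed from the per-unit sizes. The variable names are made up for this sketch, and it follows the slide's convention of 1 MB = 1000 KB.

```python
CU_COUNT = 32  # compute units on the HD 7970, per the "x 32" factors above

# Sizes in KB, taken from the list above.
per_cu_kb = {"VGPR": 256, "SGPR": 8, "L1 V-Data cache": 16, "LDS": 64}
per_4_cus_kb = {"L1 S-Data cache": 16, "Instruction cache": 32}
l2_kb = 768

total_kb = (sum(per_cu_kb.values()) * CU_COUNT
            + sum(per_4_cus_kb.values()) * (CU_COUNT // 4)
            + l2_kb)
total_mb = total_kb / 1000  # the slide treats 1 MB as 1000 KB

print(total_kb, total_mb)  # 12160 12.16
```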

AMD CODEXL
• Coherent, innovative and unified developer tools suite
‒ Debug, profile, and analyze applications
‒ Supports OpenCL™ and OpenGL
‒ Supports AMD CPUs, GPUs and APUs
‒ Standalone and integrated into Microsoft® Visual Studio®
‒ Supported on Windows® and Linux®
‒ Does not require source code modifications

BE SURE YOUR KERNEL SIZE DOES NOT EXCEED THE INSTRUCTION CACHE SIZE

Mapping Raytracing to GPUs
HOW CAN A GPU TRAVERSE A TREE?

[Figure: a tree of seven nodes]

• Pack all the nodes into a single buffer and wrap the buffer in an OpenCL memory object (cl_mem).
• When using HSA, we can leverage the unified memory architecture and access the tree as-is.
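
One way to pack a pointer-based tree into a single buffer is to replace child pointers with array indices. The sketch below is only an illustration: the 16-byte record layout (`NODE_FORMAT`), the `Node` class and the `flatten` helper are assumptions, not the layout used in the talk, and primitive lists are left out.

```python
import struct
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    axis: int = -1                  # split axis; -1 marks a leaf
    split: float = 0.0              # split plane position
    left: Optional["Node"] = None
    right: Optional["Node"] = None

# Hypothetical record: int32 axis, float32 split, int32 left index, int32 right index.
NODE_FORMAT = "<ifii"
NODE_SIZE = struct.calcsize(NODE_FORMAT)  # 16 bytes, no padding

def flatten(root):
    """Return the tree as bytes; children are referenced by record index."""
    records = []

    def visit(node):
        index = len(records)
        records.append(None)  # reserve the slot so a parent precedes its children
        left = visit(node.left) if node.left is not None else -1
        right = visit(node.right) if node.right is not None else -1
        records[index] = (node.axis, node.split, left, right)
        return index

    visit(root)
    return b"".join(struct.pack(NODE_FORMAT, *record) for record in records)

tree = Node(0, 5.0, Node(), Node(1, 2.0, Node(), Node()))
buffer = flatten(tree)
print(len(buffer) // NODE_SIZE)  # 5 records
```

The resulting bytes are what would be copied into the OpenCL buffer; a kernel then follows the integer indices instead of host pointers.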

HOW MUCH MEMORY DO WE NEED FOR THE STACK?

Per wavefront = maximal depth of the tree x size of a stack frame x 64 threads.

25 x 12 bytes x 64 = 19,200 bytes (~19 KB)
This leads to either GPR spilling to memory or low scheduling utilization.
The OpenCL compiler decides at compile time whether GPRs are spilled.
Spilled GPRs are also known as scratch registers.
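
The sizing above can be put next to the register file figures quoted earlier in the deck (256 KB of VGPR per compute unit, up to 40 wavefronts per compute unit). This is a back-of-the-envelope sketch with made-up variable names; it ignores every other register the kernel needs, so the real limit is lower.

```python
MAX_TREE_DEPTH = 25          # maximal depth of the tree, from the slide
FRAME_BYTES = 12             # size of one stack frame, from the slide
WAVEFRONT_SIZE = 64          # threads per wavefront
VGPR_BYTES_PER_CU = 256 * 1024
MAX_WAVEFRONTS_PER_CU = 40

stack_bytes_per_thread = MAX_TREE_DEPTH * FRAME_BYTES
stack_bytes_per_wavefront = stack_bytes_per_thread * WAVEFRONT_SIZE
stack_bytes_at_full_occupancy = stack_bytes_per_wavefront * MAX_WAVEFRONTS_PER_CU

# Upper bound on wavefronts per CU if the whole VGPR file held nothing but stacks.
wavefronts_that_fit = VGPR_BYTES_PER_CU // stack_bytes_per_wavefront

print(stack_bytes_per_wavefront)      # 19200
print(stack_bytes_at_full_occupancy)  # 768000, far more than the 262144-byte VGPR file
print(wavefronts_that_fit)            # 13
```

Even under this optimistic assumption, full-depth stacks cannot stay in registers at 40 wavefronts per compute unit, which is why the compiler either spills or occupancy drops.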

HOW TO DETECT SCRATCH REGISTERS USING CODEXL

STACKLESS TRACE – RESTART TRAVERSAL

[Figure: restart traversal of the 2D KD tree scene (nodes A through G); the ray is marked with tMin, t1, t2, t3 and tMax, and the figure is annotated with the intervals (t1, t2), (t1, tMax), (t2, t3) and (t3, tMax)]

KD RESTART ALGORITHM
tStart = tEnd = sceneMin
tHit = infinity
while tEnd < sceneMax:
    node = root                # restart from the root for the next interval
    tStart = tEnd
    tEnd = sceneMax
    while not node.isLeaf():
        axis = node.axis
        tSplit = (node.PlanePos - ray.origin[axis]) / ray.direction[axis]
        (near, far) = findNear(ray.origin[axis], node.left, node.right)
        if tSplit >= tEnd or tSplit <= 0:
            node = near
        else if tSplit <= tStart:
            node = far
        else:
            node = near        # far child is not stored; a later restart reaches it
            tEnd = tSplit
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    if tHit < tEnd:
        return tHit
return tHit
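
A runnable counterpart of the restart pseudo code is sketched below. As with any illustration of this kind, the `Node` layout, the `hit_circle` helper, the circle primitives and the `scene` tree are assumptions introduced for the example, not part of the talk.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    axis: int = -1                  # 0 = x, 1 = y; -1 marks a leaf
    value: float = 0.0              # split plane position on `axis`
    left: Optional["Node"] = None   # child below the plane
    right: Optional["Node"] = None  # child above the plane
    prims: List[Tuple[float, float, float]] = field(default_factory=list)  # circles (cx, cy, r)

def hit_circle(origin, direction, circle):
    """Smallest t >= 0 at which the ray hits the circle, else infinity.
    Assumes `direction` has unit length."""
    cx, cy, r = circle
    ox, oy = origin[0] - cx, origin[1] - cy
    b = ox * direction[0] + oy * direction[1]
    disc = b * b - (ox * ox + oy * oy - r * r)
    if disc < 0:
        return math.inf
    s = math.sqrt(disc)
    for t in (-b - s, -b + s):
        if t >= 0:
            return t
    return math.inf

def trace_restart(root, origin, direction, scene_min, scene_max):
    """Stackless KD restart traversal; returns the closest hit distance."""
    t_end = scene_min
    t_hit = math.inf
    while t_end < scene_max:
        node = root                       # restart from the root
        t_start, t_end = t_end, scene_max
        while node.axis >= 0:
            a, d = node.axis, direction[node.axis]
            # A ray parallel to the plane never crosses it.
            t_split = (node.value - origin[a]) / d if d != 0 else math.inf
            # An origin exactly on the plane counts as the side being entered,
            # matching the t_split <= 0 test below.
            below = origin[a] < node.value or (origin[a] == node.value and d < 0)
            near, far = (node.left, node.right) if below else (node.right, node.left)
            if t_split >= t_end or t_split <= 0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                node = near               # nothing is pushed
                t_end = t_split
        for prim in node.prims:
            t_hit = min(t_hit, hit_circle(origin, direction, prim))
        if t_hit < t_end:
            return t_hit
    return t_hit

# Hypothetical scene: one split at x = 5 with one circle on each side.
scene = Node(axis=0, value=5.0,
             left=Node(prims=[(2.0, 0.0, 1.0)]),
             right=Node(prims=[(8.0, 3.0, 1.0)]))
print(trace_restart(scene, (0.0, 3.0), (1.0, 0.0), 0.0, 10.0))  # 7.0, found after one restart
```

The only per-ray state is `t_start`, `t_end` and the current node, which is what removes the stack at the cost of re-descending from the root.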

EFFECT ON GPR SPILLAGE

Demo
Optimizing Raytracing using CodeXL
CAN THIS BE FURTHER REFINED?
• Which on-chip memory aren’t we using? The LDS (Local Data Share).
• Short stack algorithm: initialize a stack smaller than the maximum depth of the tree. If it overflows, fall back to the KD restart algorithm.
• If we place the short stack in the LDS, what should the depth of the “short stack” be?
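
One common way to combine a bounded stack with the restart fallback is sketched below: when the stack is full the oldest entry is dropped, and when the stack runs empty before the ray has left the scene the traversal restarts from the root. This is an assumed variant for illustration, not necessarily the one shown in the demo; `Node`, `hit_circle`, `scene` and `trace_short_stack` are hypothetical names.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    axis: int = -1                  # 0 = x, 1 = y; -1 marks a leaf
    value: float = 0.0              # split plane position on `axis`
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: List[Tuple[float, float, float]] = field(default_factory=list)  # circles (cx, cy, r)

def hit_circle(origin, direction, circle):
    """Smallest t >= 0 at which the ray hits the circle, else infinity.
    Assumes `direction` has unit length."""
    cx, cy, r = circle
    ox, oy = origin[0] - cx, origin[1] - cy
    b = ox * direction[0] + oy * direction[1]
    disc = b * b - (ox * ox + oy * oy - r * r)
    if disc < 0:
        return math.inf
    s = math.sqrt(disc)
    for t in (-b - s, -b + s):
        if t >= 0:
            return t
    return math.inf

def trace_short_stack(root, origin, direction, scene_min, scene_max, depth):
    stack = deque(maxlen=depth)   # appending to a full deque drops the oldest entry
    t_end = scene_min
    t_hit = math.inf
    while t_end < scene_max:
        if stack:
            node, t_start, t_end = stack.pop()
        else:                     # stack exhausted: restart from the root
            node, t_start, t_end = root, t_end, scene_max
        while node.axis >= 0:
            a, d = node.axis, direction[node.axis]
            t_split = (node.value - origin[a]) / d if d != 0 else math.inf
            below = origin[a] < node.value or (origin[a] == node.value and d < 0)
            near, far = (node.left, node.right) if below else (node.right, node.left)
            if t_split >= t_end or t_split <= 0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                stack.append((far, t_split, t_end))
                node = near
                t_end = t_split
        for prim in node.prims:
            t_hit = min(t_hit, hit_circle(origin, direction, prim))
        if t_hit < t_end:
            return t_hit
    return t_hit

# Hypothetical two-level tree over x in [0, 8]; the only circle is in the last leaf.
scene = Node(0, 4.0,
             left=Node(0, 2.0, Node(), Node()),
             right=Node(0, 6.0, Node(), Node(prims=[(7.0, 0.0, 0.5)])))
print(trace_short_stack(scene, (0.0, 0.0), (1.0, 0.0), 0.0, 8.0, depth=1))  # 6.5
```

With `depth=0` nothing is ever stored and the function behaves like the pure restart algorithm; with a depth at least as large as the tree depth it behaves like the full stack version.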

HOW MANY WAVEFRONTS ARE EXECUTED CONCURRENTLY?
• Use the CodeXL application trace to discover how many wavefronts are executed concurrently with stackless traversal.

OCCUPANCY GRAPHS

WHAT SHOULD BE THE SIZE OF THE SHORT STACK?

64 KB / 12 wavefronts / 64 threads / sizeof(Frame) = 7
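
The division can be checked with integer arithmetic. The 12-byte frame size is taken from the earlier stack sizing (25 x 12 x 64); the function name and parameter names are made up for this sketch.

```python
def short_stack_depth(lds_bytes, wavefronts, threads_per_wavefront, frame_bytes):
    """Largest per-thread stack depth that fits in the LDS for all resident threads."""
    return lds_bytes // (wavefronts * threads_per_wavefront * frame_bytes)

LDS_BYTES_PER_CU = 64 * 1024
depth = short_stack_depth(LDS_BYTES_PER_CU, wavefronts=12,
                          threads_per_wavefront=64, frame_bytes=12)
lds_bytes_used = depth * 12 * 64 * 12

print(depth, lds_bytes_used)  # 7 64512
```

A depth of 7 uses 64,512 of the 65,536 LDS bytes; a depth of 8 would need 73,728 bytes and no longer fit alongside 12 wavefronts.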

Demo
RESULTS

[Figure: bar chart of tracing throughput for four variants: full stack, stackless, short stack, and short stack on LDS; the vertical axis runs from 60 to 120]

• Results are in millions of rays per second on a Radeon™ HD 7970.

Questions?

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to
time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
OpenCL™ is a trademark of Apple Inc. which is licensed to the Khronos organization. Linux™ is the trademark of Linus Torvalds.
Microsoft™ and Windows™ are the trademarks of Microsoft Corp. All other names used in this presentation are for
informational purposes only and may be trademarks of their respective owners.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO
EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be
trademarks of their respective owners.

REFERENCES
• Introduction to GCN
‒ http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf

• GCN white paper
‒ http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

• CodeXL home page
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

• AMD OpenCL programmers guide
‒ http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi Cohen
