Computer Science Thesis Defense

1

GPU Ray Tracing
with CUDA
BY TOM PITKIN

Bill Clark, PhD
Stu Steiner, MS, PhC

Objectives


Develop a sequential CPU and parallel GPU ray tracer



Illustrate the difference in rendering speed and design of a CPU and
GPU ray tracer

2

Outline


Introduction to Ray Tracing



CUDA



Parallelization with CUDA / Results



Future Work



Questions

3

What is Ray Tracing?


Rendering technique used in computer graphics



Simulates the behavior of light



Can produce advanced optical effects

4

Light in the Physical World

5
Light Source

Film
Object with
Red Reflectivity

Pinhole

The Virtual Camera Model


Eye Position – camera location in 3D space



Reference Point – point in 3D space where the camera is pointing



Orientation Vectors (u, v, n) – camera orientation in 3D space



Image Plane – projected plane of the camera’s field of view

Reference Point
v (Up Vector)
n

u
Eye Position

6

Ray Generation


Map the physical screen to the image plane



Divide the image plane into a uniform grid of pixel locations



7

Send a ray through the center of each pixel location

𝐼𝑚𝑎𝑔𝑒 𝑃𝑙𝑎𝑛𝑒 𝐻𝑒𝑖𝑔ℎ𝑡
𝑆𝑐𝑟𝑒𝑒𝑛 𝐻𝑒𝑖𝑔ℎ𝑡

Pixel
Eye Position
𝐼𝑚𝑎𝑔𝑒 𝑃𝑙𝑎𝑛𝑒 𝑊𝑖𝑑𝑡ℎ
𝑆𝑐𝑟𝑒𝑒𝑛 𝑊𝑖𝑑𝑡ℎ

Ray Intersection Testing


Ray – Sphere Intersection



Ray – Triangle Intersection

8

Phong Reflection Model

Ambient

+

Diffuse

+

Specular

9

=

Phong Reflection

Specular Reflection


Recursive Ray Tracing

10

Outline





CUDA






Future Work



Questions

11

What is CUDA?


Compute Unified Device Architecture (CUDA)



Parallel computing platform



Developed by Nvidia

12

Kernel Functions


Specifies the code to be executed in parallel



Single Program, Multiple Data (SPMD)

13

Kernel Execution


Grids



Blocks



Threads

14

Memory Model


Global Memory



Constant Memory



Texture Memory



Registers



Local Memory



Shared Memory

15

Outline





CUDA






Future Work



Questions

16

Thread Organization


2D array of blocks



2D array of threads



17

Each thread represents
a ray

Block (0, 0)

Block (1, 0)

Block (2, 0)

Block (0, 1)

Block (1, 1)

Block (2, 1)

Image Plane

Testing Environment


OS – Ubuntu Gnome Remix 13.04



CPU – Core i7-920




Core Clock – 2.66 GHz

GPU – Nvidia GTX 570


Core Clock - 742 MHz



CUDA Core - 480



Memory Clock - 3800 MHz



Video Memory - GDDR5 1280MB

18

Test Objects


Teapot






Surfaces: 1

Triangles: 992

Al





Surfaces: 174
Triangles: 7,124

Crocodile


Surfaces: 6



Triangles: 34,404

19

Single Kernel

20
Single Kernel
Single Thread

160 (0.16 sec)

Teapot
(Surfaces: 1)
(Triangles: 992)

23,003 (23 sec)

411 (0.41 sec)

Al
(Surfaces: 174)
(Triangles: 7,124)

55,260 (55.26 sec)

5,867 (5.87 sec)

Crocodile
(Surfaces: 6)
(Triangles: 34,404)

1,617,160 (26.95 min)
1

10

100

1,000

10,000

Milliseconds

100,000

1,000,000

10,000,000

Kernel Complexity and Size


Driver timeout



Register Spilling

21

Replacing Recursion


Iterative Loop



Layer based stack




Layers store color values returned from rays

Final image from convex combination of layers

22

Multi-Kernel

23
Multi-Kernel
Single Kernel (Previous Kernel)

381 (0.38 sec)

Teapot
(Surfaces: 1)
(Triangles: 992)

160 (0.16 sec)

967 (0.97 sec)

Al
(Surfaces: 174)
(Triangles: 7,124)

411 (0.41 sec)

13,217 (13.22 sec)

Crocodile
(Surfaces: 6)
(Triangles: 34,404)

5,867 (5.87 sec)
1

10

100

1,000

Milliseconds

10,000

100,000

Multi-Kernel with Single-Precision Floating Points

24

Multi-Kernel with Single-Precision
Floating Points
Multi-Kernel (Previous Kernel)
46 (0.05 sec)

Teapot
(Surfaces: 1)
(Triangles: 992)

381 (0.38 sec)

118 (0.12 sec)

Al
(Surfaces: 174)
(Triangles: 7,124)

967 (0.97 sec)

1,556 (1.56 sec)

Crocodile
(Surfaces: 6)
(Triangles: 34,404)

13,217 (13.22 sec)
1

10

100

1,000

Milliseconds

10,000

100,000

Caching Surface Data


Object’s surface data stored on shared memory



All threads in same block have access to cached surface data



Removes duplicate memory requests



Data reuse

25

Multi-Kernel with Surface Caching

26

Multi-Kernel with Single-Precision
Floating Points (Previous Kernel)

30 (0.03 sec)

Teapot
(Surfaces: 1)
(Triangles: 992)

46 (0.05 sec)

133 (0.13 sec)

Al
(Surfaces: 174)
(Triangles: 7,124)

118 (0.12 sec)

1,007 (1.01 sec)

Crocodile
(Surfaces: 6)
(Triangles: 34,404)

1,556 (1.56 sec)
1

10

100

Milliseconds

1,000

10,000

Simplifying Mesh Data

27



Triangle data originally stored as three points (vertices)



Optimize data by storing triangles as one point (vertex) and two edges


Calculate edges on host before kernel call

0.5, 1

0, 0

0.5, 1

1, 0

Multi-Kernel with Mesh Optimization

28

Multi-Kernel with Mesh Optimization
(Previous Kernel)

27 (0.03 sec)

Teapot
(Surfaces: 1)
(Triangles: 992)

30 (0.03 sec)

127 (0.13 sec)

Al
(Surfaces: 174)
(Triangles: 7,124)

133 (0.13 sec)

873 (0.87 sec)

Crocodile
(Surfaces: 6)
(Triangles: 34,404)

1,007 (1.01 sec)
1

10

100

Milliseconds

1,000

10,000

Final Results

29
Multi-Kernel with Intersection
Optimization
Single Thread

27 (0.03 sec)

Teapot
(Surfaces: 1)
(Triangles: 992)

23,003 (23 sec)

127 (0.13 sec)

Al
(Surfaces: 174)
(Triangles: 7,124)

55,260 (55.26 sec)

873 (0.87 sec)

Crocodile
(Surfaces: 6)
(Triangles: 34,404)

1,617,160 (26.95 min)
1

10

100

1,000

10,000

Milliseconds

100,000

1,000,000

10,000,000

Outline





CUDA






Future Work



Questions

30

Future Work


Spatial partitioning



Multiple GPUs



Optimize code for different GPUs

31

Computer Science Thesis Defense

More Related Content

What's hot

Viewers also liked

Similar to Computer Science Thesis Defense

Recently uploaded

Computer Science Thesis Defense

Editor's Notes