GPU Ray Tracing with CUDA
BY TOM PITKIN

Bill Clark, PhD
Stu Steiner, MS, PhC

Objectives
• Develop a sequential CPU and parallel GPU ray tracer
• Illustrate the differences in rendering speed and design between a CPU and a GPU ray tracer

Outline
• Introduction to Ray Tracing
• CUDA
• Parallelization with CUDA / Results
• Future Work
• Questions

What is Ray Tracing?
• Rendering technique used in computer graphics
• Simulates the behavior of light
• Can produce advanced optical effects

Light in the Physical World
[Figure: pinhole camera diagram with labels Light Source, Object with Red Reflectivity, Pinhole, and Film]
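
The Virtual Camera Model
• Eye Position – camera location in 3D space
• Reference Point – point in 3D space where the camera is pointing
• Orientation Vectors (u, v, n) – camera orientation in 3D space
• Image Plane – plane onto which the camera's field of view is projected
[Figure: camera at the Eye Position with orientation vectors u, v (Up Vector), and n, aimed at the Reference Point]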
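
The camera model can be sketched in C++ as a function that builds the u, v, n frame from the eye position, the reference point, and an up vector. This is a minimal sketch and not the thesis code. `Vec3`, `CameraBasis`, and `makeCameraBasis` are hypothetical names. The sketch assumes a common right-handed convention in which n points from the reference point back toward the eye, u = up × n, and v = n × u.

```cpp
#include <cassert>
#include <cmath>

// Minimal 3D vector type (hypothetical; not taken from the thesis code).
struct Vec3 {
    double x, y, z;
};

inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline Vec3 normalize(Vec3 a) {
    double len = std::sqrt(dot(a, a));
    assert(len > 0.0);  // a zero-length vector has no direction
    return {a.x / len, a.y / len, a.z / len};
}

// Orthonormal camera frame.
struct CameraBasis {
    Vec3 u, v, n;
};

// Assumed convention: n points from the reference point back toward the eye,
// so the camera looks along -n; u points right and v points up.
CameraBasis makeCameraBasis(Vec3 eye, Vec3 reference, Vec3 up) {
    Vec3 n = normalize(eye - reference);
    Vec3 u = normalize(cross(up, n));  // normalize() asserts if up is parallel to n
    Vec3 v = cross(n, u);              // unit length already: n and u are orthonormal
    return {u, v, n};
}
```

For example, an eye at (0, 0, 5) looking at the origin with up vector (0, 1, 0) gives u = (1, 0, 0), v = (0, 1, 0), and n = (0, 0, 1).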
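
Ray Generation
• Map the physical screen to the image plane
• Divide the image plane into a uniform grid of pixel locations
• Send a ray through the center of each pixel location
[Figure: a pixel on the image plane as seen from the Eye Position; pixel width = $\frac{\text{Image Plane Width}}{\text{Screen Width}}$, pixel height = $\frac{\text{Image Plane Height}}{\text{Screen Height}}$]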
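
The ray generation steps can be sketched as follows. One pixel on the image plane is the image plane size divided by the screen size in pixels, and the ray passes through that pixel's center. `ViewParams` and `primaryRay` are hypothetical. The sketch also assumes two details that the slides do not state: the image plane sits `planeDist` units in front of the eye along -n, and pixel row 0 is the top of the screen.

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(double s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }

struct Ray { Vec3 origin, dir; };

// Hypothetical description of how the screen maps onto the image plane.
struct ViewParams {
    Vec3 eye, u, v, n;   // camera position and orthonormal frame
    double planeWidth;   // image plane width in world units
    double planeHeight;  // image plane height in world units
    double planeDist;    // distance from the eye to the image plane along -n
    int screenWidth;     // screen width in pixels
    int screenHeight;    // screen height in pixels
};

// Ray from the eye through the center of pixel (col, row); row 0 is the top row.
// The direction is not normalized.
Ray primaryRay(const ViewParams& p, int col, int row) {
    assert(col >= 0 && col < p.screenWidth && row >= 0 && row < p.screenHeight);
    // Size of one pixel on the image plane.
    double pixelW = p.planeWidth / p.screenWidth;
    double pixelH = p.planeHeight / p.screenHeight;
    // Image-plane coordinates of the pixel center, measured from the plane's center.
    double x = -p.planeWidth / 2.0 + (col + 0.5) * pixelW;
    double y = p.planeHeight / 2.0 - (row + 0.5) * pixelH;
    Vec3 dir = (-p.planeDist) * p.n + x * p.u + y * p.v;
    return {p.eye, dir};
}
```

Consider a 2 × 2 screen mapped to a 2 × 2 image plane one unit in front of an eye at the origin, with u, v, n aligned with x, y, z. Pixel (0, 0) then gets the direction (-0.5, 0.5, -1). In a GPU version, each thread would evaluate this for its own pixel.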

Ray Intersection Testing
• Ray – Sphere Intersection
• Ray – Triangle Intersection
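
A ray–sphere test reduces to solving a quadratic in the ray parameter t. The sketch below returns the nearest hit in front of the ray origin. It reports a miss when the discriminant is negative or when both roots lie behind the origin. The names are hypothetical, not the thesis code. The slides do not say what the renderer does after a complete miss (for example, returning a background color).

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Solves |origin + t*dir - center|^2 = radius^2, i.e. a*t^2 + b*t + c = 0.
// On a hit, stores the nearest t > eps in tHit and returns true; returns false on a miss.
bool intersectSphere(Vec3 origin, Vec3 dir, Vec3 center, double radius, double& tHit,
                     double eps = 1e-9) {
    assert(radius > 0.0);
    Vec3 oc = origin - center;
    double a = dot(dir, dir);
    assert(a > 0.0);  // direction must be non-zero
    double b = 2.0 * dot(oc, dir);
    double c = dot(oc, oc) - radius * radius;
    double disc = b * b - 4.0 * a * c;
    if (disc < 0.0) return false;              // no real roots: the ray misses
    double sq = std::sqrt(disc);
    double t0 = (-b - sq) / (2.0 * a);         // nearer root
    double t1 = (-b + sq) / (2.0 * a);         // farther root
    if (t0 > eps) { tHit = t0; return true; }
    if (t1 > eps) { tHit = t1; return true; }  // ray starts inside the sphere
    return false;                              // sphere lies behind the ray origin
}
```

A ray from the origin along -z hits a unit sphere centered at (0, 0, -5) at t = 4. The same ray turned toward +y misses it.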

Phong Reflection Model
Ambient + Diffuse + Specular = Phong Reflection
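
For a single light and a single color channel, the three Phong terms can be written as I = ka·Ia + kd·Il·max(0, N·L) + ks·Il·max(0, R·V)^shininess. Here Ia is the ambient intensity, Il is the light intensity, and R is L mirrored about N. The sketch below uses hypothetical names (`Material`, `phong`) and assumes all direction vectors are unit length. It omits shadows and multiple lights, which a full ray tracer would also handle.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(double s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Hypothetical material coefficients for one color channel.
struct Material {
    double ka;         // ambient reflectivity
    double kd;         // diffuse reflectivity
    double ks;         // specular reflectivity
    double shininess;  // specular exponent; larger values give a tighter highlight
};

// N: unit surface normal, L: unit vector toward the light, V: unit vector toward the viewer.
double phong(const Material& m, double ambientLight, double lightIntensity,
             Vec3 N, Vec3 L, Vec3 V) {
    assert(m.shininess >= 0.0);
    double ambient = m.ka * ambientLight;
    double nDotL = dot(N, L);
    if (nDotL <= 0.0) return ambient;  // light is behind the surface: ambient only
    double diffuse = m.kd * lightIntensity * nDotL;
    Vec3 R = (2.0 * nDotL) * N - L;    // L mirrored about N
    double specular = m.ks * lightIntensity * std::pow(std::max(0.0, dot(R, V)), m.shininess);
    return ambient + diffuse + specular;
}
```

Take ka = 0.1, kd = 0.6, ks = 0.3, unit light and ambient intensities, and N, L, and V all equal to (0, 0, 1). The three terms are then 0.1 + 0.6 + 0.3 = 1.0. Moving the light behind the surface leaves only the 0.1 ambient term.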

Specular Reflection
• Recursive Ray Tracing
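
Specular reflection sends a secondary ray along the mirror direction r = d - 2(d·n)n and shades it recursively. The sketch below is hypothetical: `Scene` stands in for whatever intersection routine the renderer uses. The blend (1 - k)·local + k·reflected, with k the surface reflectivity, is an assumption, not the thesis's stated formula.

```cpp
#include <cassert>
#include <functional>

struct Vec3 { double x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(double s, Vec3 a) { return {s * a.x, s * a.y, s * a.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Mirror direction d about unit normal n: r = d - 2 (d . n) n.
inline Vec3 reflect(Vec3 d, Vec3 n) { return d - (2.0 * dot(d, n)) * n; }

// Hypothetical hit record returned by a scene query.
struct Hit {
    bool found;
    Vec3 point, normal;   // hit position and unit normal facing the incoming ray
    double localColor;    // locally shaded color (one channel), e.g. from Phong
    double reflectivity;  // in [0, 1]
};
using Scene = std::function<Hit(Vec3 origin, Vec3 dir)>;

// Blend the local color with the color seen along the mirror direction until the
// depth limit is reached or the ray escapes the scene.
double trace(const Scene& scene, Vec3 origin, Vec3 dir, int depth, double background) {
    Hit h = scene(origin, dir);
    if (!h.found) return background;  // ray missed everything
    if (depth == 0 || h.reflectivity == 0.0) return h.localColor;
    assert(h.reflectivity > 0.0 && h.reflectivity <= 1.0);
    Vec3 r = reflect(dir, h.normal);
    Vec3 start = h.point + 1e-6 * h.normal;  // nudge off the surface to avoid self-hits
    double reflected = trace(scene, start, r, depth - 1, background);
    return (1.0 - h.reflectivity) * h.localColor + h.reflectivity * reflected;
}
```

Take a mirror plane at z = 0 with local color 0.2 and reflectivity 0.5. A ray fired straight down from (0, 0, 1) bounces straight back up, misses, and picks up a background of 1.0. The result is 0.5 · 0.2 + 0.5 · 1.0 = 0.6. The Replacing Recursion slide later moves away from this kind of recursion on the GPU.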

What is CUDA?
• Compute Unified Device Architecture (CUDA)
• Parallel computing platform
• Developed by Nvidia

Kernel Functions
• Specify the code to be executed in parallel
• Single Program, Multiple Data (SPMD)

Kernel Execution
• Grids
• Blocks
• Threads

Memory Model
• Global Memory
• Constant Memory
• Texture Memory
• Registers
• Local Memory
• Shared Memory

Thread Organization
• 2D array of blocks
• 2D array of threads
• Each thread represents a ray
[Figure: the image plane divided into a 3 × 2 grid of blocks, Block (0, 0) through Block (2, 1)]
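
The thread organization maps directly onto pixels: each thread computes its pixel from its block and thread indices and traces one ray. CUDA code cannot run without a GPU, so the sketch below emulates the launch with nested CPU loops. `Dim2`, `blocksNeeded`, and `launchEmulated` are hypothetical. The 16 × 16 block size in the example is an assumption, since the slides do not give the block size used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's dim3 (only x and y are needed here).
struct Dim2 {
    int x, y;
};

// Ceiling division: blocks needed to cover n pixels with blocks of size b,
// as the host would compute the grid dimensions before a kernel launch.
int blocksNeeded(int n, int b) {
    assert(n > 0 && b > 0);
    return (n + b - 1) / b;
}

// CPU emulation of a 2D grid of 2D blocks. Each (block, thread) pair maps to one
// pixel, i.e. one primary ray. Inside a CUDA kernel the same mapping would be
//   int x = blockIdx.x * blockDim.x + threadIdx.x;
//   int y = blockIdx.y * blockDim.y + threadIdx.y;
// Returns how many threads touched each pixel (row-major); it should be 1 everywhere.
std::vector<int> launchEmulated(int width, int height, Dim2 block) {
    Dim2 grid{blocksNeeded(width, block.x), blocksNeeded(height, block.y)};
    std::vector<int> visits(static_cast<std::size_t>(width) * height, 0);
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    int x = bx * block.x + tx;
                    int y = by * block.y + ty;
                    if (x >= width || y >= height) continue;  // guard: partial edge blocks
                    ++visits[static_cast<std::size_t>(y) * width + x];  // trace ray (x, y) here
                }
    return visits;
}
```

A 640 × 480 image with 16 × 16 blocks needs a 40 × 30 grid. The bounds check matters whenever the image size is not a multiple of the block size. For example, a 10 × 7 image with 4 × 4 blocks needs a 3 × 2 grid, and the guard keeps the extra edge threads from writing outside the image.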
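
Testing Environment
• OS – Ubuntu Gnome Remix 13.04
• CPU – Intel Core i7-920
  ◦ Core Clock – 2.66 GHz
• GPU – Nvidia GTX 570
  ◦ Core Clock – 742 MHz
  ◦ CUDA Cores – 480
  ◦ Memory Clock – 3800 MHz
  ◦ Video Memory – 1280 MB GDDR5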

Test Objects
• Teapot – Surfaces: 1, Triangles: 992
• Al – Surfaces: 174, Triangles: 7,124
• Crocodile – Surfaces: 6, Triangles: 34,404
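
Single Kernel
Render time in milliseconds

Model       Surfaces  Triangles  Single Kernel      Single Thread
Teapot      1         992        160 (0.16 sec)     23,003 (23 sec)
Al          174       7,124      411 (0.41 sec)     55,260 (55.26 sec)
Crocodile   6         34,404     5,867 (5.87 sec)   1,617,160 (26.95 min)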

Kernel Complexity and Size
• Driver timeout
• Register Spilling

Replacing Recursion
• Iterative Loop
• Layer-based stack
  ◦ Layers store color values returned from rays
  ◦ Final image from convex combination of layers
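
One way to read the layer-based approach is this: bounce k stores the color its ray returned in layer k, and the final pixel is a weighted sum of the layers whose weights are nonnegative and sum to 1. The sketch below assumes reflectivity-based weights. Layer k is weighted by (1 - r_k) times the reflectivities of the layers above it, and the deepest layer takes the remaining weight. The slides do not state the thesis's actual weights. `Layer` and `combineLayers` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per bounce ("layer"): the color the ray returned at that hit and the
// surface reflectivity that decides how much of the next layer shows through.
struct Layer {
    double color;
    double reflectivity;  // in [0, 1]
};

// Iterative replacement for the recursion c_k = (1 - r_k) * color_k + r_k * c_{k+1},
// where the deepest layer contributes its color at full remaining weight.
// The weights are nonnegative and sum to 1, so the result is a convex combination.
double combineLayers(const std::vector<Layer>& layers) {
    assert(!layers.empty());
    double result = 0.0;
    double throughput = 1.0;  // product of the reflectivities of the layers above
    for (std::size_t k = 0; k < layers.size(); ++k) {
        assert(layers[k].reflectivity >= 0.0 && layers[k].reflectivity <= 1.0);
        bool last = (k + 1 == layers.size());
        double weight = last ? throughput : throughput * (1.0 - layers[k].reflectivity);
        result += weight * layers[k].color;
        throughput *= layers[k].reflectivity;
    }
    return result;
}
```

Take two layers with colors 0.2 and 1.0 and a top-layer reflectivity of 0.5. They combine to 0.6, the same value the recursive formula gives, but computed with a loop instead of a call stack.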
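
Multi-Kernel
Render time in milliseconds

Model       Surfaces  Triangles  Multi-Kernel         Single Kernel (Previous Kernel)
Teapot      1         992        381 (0.38 sec)       160 (0.16 sec)
Al          174       7,124      967 (0.97 sec)       411 (0.41 sec)
Crocodile   6         34,404     13,217 (13.22 sec)   5,867 (5.87 sec)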
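
Multi-Kernel with Single-Precision Floating Points
Render time in milliseconds

Model       Surfaces  Triangles  Multi-Kernel with Single-Precision Floating Points   Multi-Kernel (Previous Kernel)
Teapot      1         992        46 (0.05 sec)                                        381 (0.38 sec)
Al          174       7,124      118 (0.12 sec)                                       967 (0.97 sec)
Crocodile   6         34,404     1,556 (1.56 sec)                                     13,217 (13.22 sec)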

Caching Surface Data
• Object's surface data stored in shared memory
• All threads in the same block have access to the cached surface data
• Removes duplicate memory requests
• Data reuse
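
A rough way to see why caching helps is to count global-memory triangle loads. Without caching, every thread fetches every triangle. With shared-memory caching, each block loads the surface once and its threads reuse that copy. In CUDA this is typically done in three steps: declare a `__shared__` array, have each thread copy part of the data, and call `__syncthreads()` before any thread reads it. The C++ model below is hypothetical and deliberately simplified. It assumes one thread per pixel, ignores the L1/L2 hardware caches, and ignores that a large surface must be loaded into shared memory in tiles.

```cpp
#include <cassert>
#include <cstdint>

// Global-memory triangle loads for one pass over a surface (rough model, not a measurement).
struct LoadCounts {
    std::uint64_t uncached;  // every thread fetches every triangle itself
    std::uint64_t cached;    // each block fetches every triangle once into shared memory
};

// Assumes one thread per pixel; idle threads in partial edge blocks are ignored in the
// uncached count, and hardware L1/L2 caching is ignored entirely.
LoadCounts globalTriangleLoads(std::uint64_t pixels, std::uint64_t threadsPerBlock,
                               std::uint64_t triangles) {
    assert(threadsPerBlock > 0);
    std::uint64_t blocks = (pixels + threadsPerBlock - 1) / threadsPerBlock;
    return {pixels * triangles, blocks * triangles};
}
```

Consider the teapot (992 triangles) at 640 × 480 with a hypothetical 256 threads per block. The model gives 304,742,400 loads without caching versus 1,190,400 with it, a ratio equal to the block size. In practice the benefit can be smaller, since the GPU's L1/L2 caches may already serve some of the duplicate requests.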
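
Multi-Kernel with Surface Caching
Render time in milliseconds

Model       Surfaces  Triangles  Multi-Kernel with Surface Caching   Multi-Kernel with Single-Precision Floating Points (Previous Kernel)
Teapot      1         992        30 (0.03 sec)                       46 (0.05 sec)
Al          174       7,124      133 (0.13 sec)                      118 (0.12 sec)
Crocodile   6         34,404     1,007 (1.01 sec)                    1,556 (1.56 sec)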
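
Simplifying Mesh Data
• Triangle data originally stored as three points (vertices)
• Optimize data by storing triangles as one point (vertex) and two edges
• Calculate edges on the host before the kernel call
[Figure: example triangle with labeled coordinates (0, 0), (1, 0), and (0.5, 1)]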
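
Storing a triangle as one vertex p0 and two edges, e1 = p1 - p0 and e2 = p2 - p0, suits intersection tests that work directly on edges. The sketch below pairs the host-side conversion with the Möller–Trumbore ray–triangle test, a standard algorithm that consumes exactly this layout. The slides do not say which triangle test the thesis uses. `TriangleVerts`, `TriangleEdges`, `toEdgeForm`, and `intersectTriangle` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Original layout: three vertices per triangle.
struct TriangleVerts { Vec3 p0, p1, p2; };

// Optimized layout: one vertex plus the two edges leaving it.
struct TriangleEdges { Vec3 p0, e1, e2; };

// Host-side conversion, done once before the mesh is copied to the GPU.
std::vector<TriangleEdges> toEdgeForm(const std::vector<TriangleVerts>& mesh) {
    std::vector<TriangleEdges> out;
    out.reserve(mesh.size());
    for (const TriangleVerts& t : mesh)
        out.push_back({t.p0, t.p1 - t.p0, t.p2 - t.p0});
    return out;
}

// Moller-Trumbore ray-triangle test on the edge layout.
// On a hit in front of the origin, stores the ray parameter in t and returns true.
bool intersectTriangle(Vec3 origin, Vec3 dir, const TriangleEdges& tri, double& t,
                       double eps = 1e-9) {
    assert(eps > 0.0);
    Vec3 pvec = cross(dir, tri.e2);
    double det = dot(tri.e1, pvec);
    if (std::fabs(det) < eps) return false;  // ray parallel to the triangle's plane
    double invDet = 1.0 / det;
    Vec3 tvec = origin - tri.p0;
    double u = dot(tvec, pvec) * invDet;     // first barycentric coordinate
    if (u < 0.0 || u > 1.0) return false;
    Vec3 qvec = cross(tvec, tri.e1);
    double v = dot(dir, qvec) * invDet;      // second barycentric coordinate
    if (v < 0.0 || u + v > 1.0) return false;
    double tCandidate = dot(tri.e2, qvec) * invDet;
    if (tCandidate <= eps) return false;     // intersection is behind the ray origin
    t = tCandidate;
    return true;
}
```

In this layout the record is still three vectors. The likely saving is therefore not memory but the two subtractions that would otherwise be repeated for every ray–triangle test. Place the slide's example triangle, (0, 0), (1, 0), (0.5, 1), in the z = 0 plane: a ray from (0.5, 0.25, 1) pointing down the -z axis hits it at t = 1.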
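
Multi-Kernel with Mesh Optimization
Render time in milliseconds

Model       Surfaces  Triangles  Multi-Kernel with Mesh Optimization   Multi-Kernel with Surface Caching (Previous Kernel)
Teapot      1         992        27 (0.03 sec)                         30 (0.03 sec)
Al          174       7,124      127 (0.13 sec)                        133 (0.13 sec)
Crocodile   6         34,404     873 (0.87 sec)                        1,007 (1.01 sec)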
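
Final Results
Render time in milliseconds

Model       Surfaces  Triangles  Multi-Kernel with Intersection Optimization   Single Thread
Teapot      1         992        27 (0.03 sec)                                 23,003 (23 sec)
Al          174       7,124      127 (0.13 sec)                                55,260 (55.26 sec)
Crocodile   6         34,404     873 (0.87 sec)                                1,617,160 (26.95 min)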
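
The speedups implied by the final chart are simple ratios of the two series. The snippet below only encodes the slide's numbers; `Timing`, `speedup`, and `kFinalResults` are hypothetical names. The GPU times are rounded to whole milliseconds, so the ratios are approximate.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Timing {
    std::string model;
    double singleThreadMs;  // "Single Thread" series
    double gpuMs;           // "Multi-Kernel with Intersection Optimization" series
};

// Render times from the Final Results slide, in milliseconds.
const std::vector<Timing> kFinalResults = {
    {"Teapot", 23003.0, 27.0},
    {"Al", 55260.0, 127.0},
    {"Crocodile", 1617160.0, 873.0},
};

// Speedup = single-thread time / GPU time.
double speedup(const Timing& t) {
    assert(t.gpuMs > 0.0);
    return t.singleThreadMs / t.gpuMs;
}
```

This gives roughly 852x for the teapot, 435x for Al, and 1852x for the crocodile, matching the 1852X figure in the speaker notes.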

Future Work
• Spatial partitioning
• Multiple GPUs
• Optimize code for different GPUs

Questions?

Computer Science Thesis Defense

For my thesis, I developed and compared a sequential CPU and parallel GPU implementation of a ray tracer written in C++ and CUDA respectively. Here are the presentation slides from my thesis defense.

Speaker Notes
  • Used C++ and CUDA
  • Forward Ray Tracing; Backward Ray Tracing
  • Pixel – picture element that represents one point on an image; consists of a single color
  • Don’t forget to mention what happens if a ray misses completely
  • Ambient Light – indirect light reflected off other objects in the scene; Diffuse Light – direct light reflected off the surface in all directions; Specular Light – direct light reflected off the surface in a single direction
  • Blocks and threads have unique identifiers
  • Register Memory – 50x faster than Global Memory; L2 Cache – LRU (Least Recently Used); L1 Cache – spatial locality (quick access to memory near the current memory reference), caches the per-thread stack and other local data structures
  • Logarithmic Scale; Single Pass, 640 x 480
  • 1852X speedup! (Crocodile: 1,617,160 ms vs. 873 ms)