FAKULTÄT FÜR INFORMATIK
DER TECHNISCHEN UNIVERSITÄT MÜNCHEN
Master’s Thesis in Informatics
Efficiency Optimization of Realtime GPU
Raytracing in Modeling of Car2Car
Communication
Alexander Zhdanov
Steigerung der Effizienz von Realtime GPU Raytracing
bei der Modellierung von Fahrzeug-zu-Fahrzeug
Kommunikation
Author: Alexander Zhdanov
Supervisor: Prof. Dr.-Ing. habil. Alois Knoll
Advisor: Dipl.-Ing. Manuel Schiller
Date: March 17, 2014
I confirm that this master’s thesis is my own work and I have documented all sources and
material used.
Munich, March 17, 2014 Alexander Zhdanov
Acknowledgments
I would like to thank Professor Knoll for the opportunity to work in his lab, my advisor Manuel Schiller, Christoph Reisinger for valuable advice, and my parents Nikolay and Olga for their support.
Abstract
This thesis is dedicated to the efficiency optimization of a software system designed to simulate Car-to-Car communication. Specifically, it aims to improve the part of the system responsible for modelling the propagation channel, which is implemented using realtime GPU raytracing. The research investigates a possible solution based on reordering of the ray data and utilization of frame coherence. First, a review was carried out of existing caching schemes exploiting intra-frame and inter-frame coherence, of techniques for ray reordering on the GPU, and of relevant GPU data structures. The conditions influencing the solution were considered. Algorithms are proposed for implementing ray sorting on the CPU using space-filling curves. A method is offered for caching tracing data for radiation sources that rapidly change their positions. A standard way to implement ray reordering on the GPU using Morton codes and radix sort is shown. An implementation of the caching method is proposed using data structures with different synchronization mechanisms. The system efficiency with ray sorting is analysed. An analysis of the system performance is given for both static and dynamic scenes, and the system error introduced by caching is calculated. The analysis shows that ray reordering is capable of significantly increasing the system efficiency. During the implementation stage, some limitations imposed by the third-party software used for GPU raytracing were revealed, and a work-around to overcome them is proposed. The proposed solution increases the initial performance with varying degrees of success for the different caching schemes. Nevertheless, evaluation of the system performance when the two methods (ray reordering and caching) interact shows that ray reordering prevails and currently nullifies the benefit of caching.
Contents

Acknowledgements
Abstract
Outline of the Thesis

I. Introduction
1. Introduction
1.1. Thesis Statement
1.2. Motivation
1.2.1. Car-to-Car Communication System
1.2.2. VANET Simulation
1.2.3. Simulation of Propagation Channel
1.3. Thesis Goals
1.3.1. Ray Reordering
1.3.2. Ray Cache
1.4. Contributions
1.5. Software System Overview

II. Literature Review and Problem Solution
2. Literature review
2.1. Ray caching and frame coherence
2.2. GPU data structures
2.3. GPU programming model and memory types
2.3.1. GPU programming model
2.3.2. GPU memory types
2.4. Ray Reordering
3. Problem and Solution
3.1. An experiment with dimensionality of context launches
3.1.1. Problem Description
3.1.2. Theory
3.1.3. Problem Solution
3.2. Application of frame coherence
3.2.1. Problem Description
3.2.2. Coherence
3.2.3. Formulation of Caching Scheme

III. Analysis and Implementation
4. Analysis and Modelling
4.1. Ray Reordering
4.1.1. Code Analysis
4.2. Frame Coherence
4.2.1. Code Analysis
4.2.2. Selection of Data Structure
4.2.3. Selection of Data Model
4.2.4. Selection of Hash Function
4.2.5. Selection of Mapping Function
4.2.6. Selection of Synchronization Mechanism
5. Implementation
5.1. Implementation of Ray Reordering
5.2. Implementation of Frame Coherence
5.2.1. Implementation of Data Model
5.2.2. Implementation of Data Structure

IV. Evaluation and Testing
6. Testing
6.1. System Configuration before Testing
6.2. Testing of Ray Reordering
6.2.1. Approach
6.2.2. Tests Description
6.2.3. Results
6.3. Testing of Frame Coherence
6.3.1. Static Scene
6.3.2. Dynamic Scene

V. Discussion and Conclusions
7. Discussion
7.1. Ray Reordering
7.2. Frame Coherence
7.2.1. Caching Method
7.2.2. Data Structure
7.2.3. Testing
8. Conclusions

Appendix
A. Space-Filling Curves
A.1. Morton Codes Generator
A.2. 2D Hilbert Curve Implementation
A.3. 3D Hilbert Curve Grammar
B. A flow diagram for the main tracing loop
C. Implementation of Hash Tables
C.1. Implementation of Cache Key
C.2. Implementation of the Chained Hash Table
C.3. Implementation of the Open-Addressed Hash Table
Bibliography
Outline of the Thesis
Part I: Introduction
CHAPTER 1: INTRODUCTION
The chapter introduces the reader to the area in which the research is performed and formulates the main goals of the thesis.
Part II: Literature Review and Problem Solution
CHAPTER 2: LITERATURE REVIEW
This chapter gives an overview of articles dedicated to ray reordering, ray caching and GPU data structures.
CHAPTER 3: PROBLEM AND SOLUTION
In this chapter, the tasks are formulated and an algorithmic or schematic solution is proposed for each.
Part III: Analysis and Implementation
CHAPTER 4: ANALYSIS AND MODELLING
This chapter presents an analysis of the existing code, the design decisions, and a UML diagram of the data model for caching.
CHAPTER 5: IMPLEMENTATION
The chapter presents the implementation of the solutions proposed in the second part. It identifies ways to perform ray reordering on the GPU and gives an implementation of the caching method.
Part IV: Evaluation and Testing
CHAPTER 6: TESTING
The chapter presents a description of the testing approaches and procedures, and also gives an overview and analysis of the testing results.
Part V: Discussion and Conclusions
CHAPTER 7: DISCUSSION
The chapter briefly discusses results of the research.
CHAPTER 8: CONCLUSION
The chapter articulates conclusions of the research.
Part I.
Introduction
1. Introduction
Software efficiency often refers to algorithmic efficiency, which is one of the central topics in computer science. According to the Oxford Dictionary of Computing, algorithm efficiency is “a measure of the average execution time necessary for an algorithm to complete work on a set of data. Algorithm efficiency is characterized by its order.” [22]. On the other hand, according to Robert Sedgewick [49], program optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources. The latter implies that there is a software system whose behaviour needs to be optimized. However, many experts in computer science (e.g. Donald Knuth [25]) believe that the critical code sections have to be thoroughly identified and verified before the optimization takes place. In our case, efficiency optimization means adding new functionality that helps to increase performance, rather than optimizing the existing code. The terms caching and performance analysis are closely connected with optimization.
1.1. Thesis Statement
The aim of the project is to modify an existing software system developed for modeling Car-to-Car communication in order to increase its efficiency (decrease the average time taken for data processing). The system uses realtime GPU raytracing to simulate wireless communication between cars. One hypothesis of the research is that the tracing process can be sped up by taking into account intra-frame and inter-frame coherence, i.e. by caching resulting data for future requests. It is also supposed that efficiency optimization can be achieved by altering tracing parameters, for example, by changing the order of the rays shot into the scene. Both hypotheses are tested using performance analysis (benchmarking).
1.2. Motivation
This section describes the general direction in which the current work is carried out. The first subsection briefly describes the Car-to-Car system and its general purpose. The second subsection gives some general information about virtual testing of such systems. The last subsection considers the impact of realtime GPU raytracing on the virtual drive system and on other active scientific areas.
1.2.1. Car-to-Car Communication System
A Car-to-Car Communication system is a wireless network “between vehicles and their environment in order to make the vehicles of different manufacturers interoperable and also enable them to communicate with road-side units” [8]. According to the Car2Car Communication Consortium, the system shall provide the following top-level features:
• automatic fast data transmission between vehicles and between vehicles and road
side units
• transmission of traffic information, hazard warnings and entertainment data
• support of ad hoc Car 2 Car Communications without need of a pre-installed net-
work infrastructure
• the Car 2 Car system is based on short range Wireless LAN technology
The Car 2 Car Communication System has the following goals:
• enable the cooperation between vehicles
• increase driver awareness
• extend driver’s horizon
• enable entirely new safety functions
• reduce accidents and their severity
• include active traffic management applications
• help to improve traffic flow
Thus, the main scenarios for which the system is designed include safety, traffic efficiency,
infotainment and some others.
1.2.2. VANET Simulation
In general, Car-to-Car communication systems represent a type of Vehicular Ad-hoc Network (VANET). The specifics of using a wireless connection in such networks require active development of new network protocols suitable for the task. However, the high costs of full-scale real tests make them disadvantageous. An important part of simulating such systems is a realistic motion model [12]. Another important issue in VANET simulation is realistic modelling of the propagation channel [26]. For the latter problem there are two possible solutions: statistical and deterministic channel models. The deterministic method uses ray tracing to model wave propagation [10] and provides a realistic simulation taking into account the geometrical and radio properties of the environment.
1.2.3. Simulation of Propagation Channel
There are different approaches to modelling the propagation channel using ray tracing. Some authors, for example, create a radio map using pre-calculation [15]. Others use a mixed (statistical and deterministic) approach for the channel simulation [6]. All authors, however, agree that statistical models are unable to provide the necessary precision in network simulation. On the other hand, ray tracing, which provides the sought accuracy, suffers from low performance. Thus, increasing the efficiency of realtime ray tracing becomes an important step towards building accurate VANET simulators. The problem is also important in other areas of computer science, for example, in computer graphics [16].
1.3. Thesis Goals
The main goal of the thesis is to increase the efficiency of realtime GPU raytracing in a system used for VANET simulation. The main functionality of the system had already been developed by the start of this work, so the design goals can be formulated as follows:
1. Increase the system efficiency using ray sorting
• Design of sorting methods (algorithms)
• Implementation of sorting methods (CPU)
• Testing of the system performance with ray reordering
2. Increase the system efficiency using caching
• Design of the caching method
• Implementation of caching (GPU)
• Testing of the system performance using caching
1.3.1. Ray Reordering
The GPU performs calculations by grouping threads into so-called “warps”. The main problem with warps is “divergence” [4]. It occurs, for example, in code containing branches: some threads inside a warp take one branch of the execution flow while the others suspend at the evaluation point. When the first group finishes its execution, the others take the other branch, and the first group of threads now suspends until the second group finishes. Presorting the rays helps to fully utilize the hardware by ensuring that threads in one warp take the same execution paths.
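The intended effect of presorting can be sketched in code: each ray is keyed by a Morton code computed from its quantized origin and the rays are then sorted, so that spatially adjacent rays, which tend to take the same execution paths, end up next to each other in the launch order. This is a minimal CPU-side sketch; the 10-bit quantization, the origin-only key, and the assumption that coordinates are normalized to [0, 1) are illustrative choices, not the system's actual parameters.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz; float dx, dy, dz; };

// Spread the lower 10 bits of v so that two zero bits separate each data bit.
static uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point with coordinates in [0, 1),
// each quantized to 10 bits.
static uint32_t morton3D(float x, float y, float z) {
    auto q = [](float f) {
        f = std::min(std::max(f * 1024.0f, 0.0f), 1023.0f);
        return static_cast<uint32_t>(f);
    };
    return (expandBits(q(x)) << 2) | (expandBits(q(y)) << 1) | expandBits(q(z));
}

// Sort rays by the Morton code of their origin so that spatially
// adjacent rays become adjacent in the launch order.
void reorderRays(std::vector<Ray>& rays) {
    std::stable_sort(rays.begin(), rays.end(), [](const Ray& a, const Ray& b) {
        return morton3D(a.ox, a.oy, a.oz) < morton3D(b.ox, b.oy, b.oz);
    });
}
```

On the GPU, the comparison sort would be replaced by a radix sort over the precomputed keys, as discussed later for the GPU implementation.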
1.3.2. Ray Cache
Caching, on the other hand, helps to reduce the computational load. The cache stores results that have already been computed in the system and serves them for future requests. The main problems here are the development of the caching scheme and its implementation on the GPU. Another important task is testing and the evaluation of the test results. Test development should be automated as much as possible.
1.4. Contributions
The main contribution of this work is the development of a caching method for a system simulating a VANET with low to medium inter-frame coherence. This means that the geometrical configuration of the rays changes from frame to frame while preserving a certain degree of correlation. Other techniques rely on high ray coherence between frames, i.e. the beam configuration partially stays stable between frames, which allows tracing results to be reused. In our case, the radiation sources change their positions with relatively high speed between frames.
The main problem solved in the implementation part is the selection of an efficient data structure allowing the cache to be built on the GPU side. Next, during testing of the data structure with dynamic scenes, an error occurred that could be attributed to OptiX bugs. The error was solved by designing a synchronization scheme for writing data to and reading data from the buffer. The error is described in more detail in the Testing chapter.
1.5. Software System Overview
This section gives a brief system overview. Roughly, the system contains the following modules:
Wavetracer This module is responsible for the ray tracing using the OptiX engine: reading the configuration file, creating the context, initializing parameters and tracing programs, launching the tracing, processing the output data, and writing the processed data to the output file. This is the main module that will be amended.
Sceneviewer This module displays results of the ray tracing, both statically and dynamically, using OpenSceneGraph [41].
Osgloader Extracts information out of loaded 3D models.
Optix wrapper This module is a wrapper around the OptiX API. The C++ API of OptiX does not meet the needs of the application, for example, for iterating the scene graph.
edgedetector Detection of diffraction edges.
adtf coupling This software component is responsible for the encapsulation of modules into ADTF [1]. The plugins are osgplugin, testdriverplugin, vtdplugin and wavetracer plugin. All plugins must inherit the basic interfaces of ADTF.
Part II.
Literature Review and Problem Solution
2. Literature review
The study was conducted in the following directions: ray caching techniques and frame coherence, GPU data structures, and ray reordering. A review of the GPU programming model and memory types has also been written. Ray caching and frame coherence are presented in one subsection, while GPU data structures and ray reordering are treated in separate ones.
2.1. Ray caching and frame coherence
Realtime ray tracing requires high computational power. One of the methods that can be used to reduce computational expenses is caching. Results of the tracing procedure can be stored in a ray cache, which reduces the response time for future requests. The main question is how to build an adequate and efficient caching strategy. Several works were selected that use caching techniques (Chan [7], Debattista [9], Popov [42], Ruff [46], Tole [50], Scherzer [48]).
The goals of the study:
1. find “postulates” for ray caching (how ray caching can be performed in general)
2. the sought strategy shall exploit ray coherence in rapidly changing environments
3. the sought strategy shall be implementable on the GPU
The following literature review attempts to find such a strategy.
In the research by Chan et al. [7], ray coherence is exploited to accelerate a sound rendering process in an interactive environment. The article postulates the following principles for ray caching:
1. Rays with the same geometric properties (starting point, direction) as rays contained in the cache do not have to be traced again.
2. To maintain the intersection history, objects are subdivided into discrete patches.
3. The cache represents a graph with object patches as nodes and rays as edges.
Once a ray hits a patch in the cache, the whole intersection history for the given patch can be taken, which replaces costly intersection tests. Since this article is important for the research, it is necessary to highlight some implementation details here.
Patch subdivision An object is subdivided into small patches so that two rays hitting the same patch are considered to have the same intersection points. Angular ray directions are quantized.
Coherence Intra-frame coherence occurs when several rays share the same path inside one frame. Inter-frame coherence occurs when several rays share parts of paths contained in the cache.
Ray cache The cache consists of a tree and a graph attached to the tree. Every node of the tree is identified by a complex index (object, patch, patch, ..., division).
Purging The cache is purged according to a Least Recently Used strategy using timestamps.
Source movement When a ray source changes its position, all cache entries connected with it are removed from the cache.
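Chan's patch subdivision and direction quantization suggest that a cache entry can be addressed by the patch a ray starts from together with its discretized direction, so that two rays with the same key share one intersection history. The following is a hypothetical sketch of such a key; the angular resolution (64 × 128 bins) and the field layout are arbitrary choices for illustration, not Chan's actual parameters.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical cache key in the spirit of Chan et al.: a ray is identified
// by the object and patch it starts from and by its quantized direction.
struct RayCacheKey {
    uint32_t objectId;   // scene object the ray starts from
    uint32_t patchId;    // discrete patch on that object
    uint16_t thetaBin;   // quantized polar angle
    uint16_t phiBin;     // quantized azimuthal angle

    bool operator==(const RayCacheKey& o) const {
        return objectId == o.objectId && patchId == o.patchId &&
               thetaBin == o.thetaBin && phiBin == o.phiBin;
    }
};

// Quantize a unit direction into (theta, phi) bins; 64 x 128 bins is an
// arbitrary resolution chosen for illustration.
RayCacheKey makeKey(uint32_t objectId, uint32_t patchId,
                    float dx, float dy, float dz) {
    const float kPi = 3.14159265358979f;
    float theta = std::acos(std::max(-1.0f, std::min(1.0f, dz))); // [0, pi]
    float phi   = std::atan2(dy, dx) + kPi;                       // [0, 2*pi]
    RayCacheKey k;
    k.objectId = objectId;
    k.patchId  = patchId;
    k.thetaBin = static_cast<uint16_t>(std::min(63.0f,  theta / kPi * 64.0f));
    k.phiBin   = static_cast<uint16_t>(std::min(127.0f, phi / (2.0f * kPi) * 128.0f));
    return k;
}
```

Two rays leaving the same patch in nearly the same direction then map to the same key, which is exactly the condition under which Chan's scheme reuses the cached intersection history.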
Chan obtained significant performance improvements using the ray cache. The method was beneficial in multi-user interactive environments with high ray coherence. However, it has several disadvantages. Firstly, it is implemented on the CPU using sophisticated data structures that will hardly be efficient on the GPU. Secondly, it is limited to step-by-step movements with high correlation between frames. Thirdly, the cache is purged when a source changes its position, which means that the cache cannot be reused in succeeding frames. Nevertheless, the article states fundamental ideas for a ray cache implementation.
Debattista et al. [9] used several techniques based on irradiance caching [54] for rendering dynamic scenes with global illumination [52]. The main contribution of the article is the detection of invalid ray paths after geometric transformations. The authors considered five cases for the invalidation of their instant cache:
Case 1 Occlusion of a light path by a moving object (occlusion of a secondary light source)
Case 2 Deocclusion of a light path by a moving object (deocclusion of a secondary light source)
Case 3 Occlusion of a visibility ray by a moving object
Case 4 Deocclusion of a visibility ray by a moving object
Case 5 A cache sample lies on a dynamic object
Figure 2.1 gives a summary of all cases. Overall, Debattista used caching for the calculation of a radiance integral, where the cache stores illumination instead of visibility. Also, the method is limited to static scenes with moving objects. However, some of the ideas for ray purging (ray invalidation) could be used in this work as well.
Popov et al. [42] exploited in their work the idea of lightcuts [53]. The authors introduced a fundamental point-to-point visibility caching algorithm which can be incorporated into any ray tracer. They also developed an adaptive quantization scheme which helps to control the trade-off between performance and quality. The algorithm was implemented on the GPU using a hash table, which is of particular interest for this research.
Figure 2.1.: The five cases that invalidate the instant cache [9]
One of the main entities in Popov's work is the binary visibility function. It is defined as

    V(X, Y) = \begin{cases} 1 & \text{if } X \text{ and } Y \text{ are mutually visible} \\ 0 & \text{otherwise} \end{cases}    (2.1)

The visibility function is approximated by quantizing the path domain with a mapping K(\bar{p}_e, \bar{p}_l) \to \mathbb{N} that relates a pair of surface points to a unique cluster. The quantized visibility function is defined as

    \bar{V}(\bar{p}_e, \bar{p}_l) \approx V^C(K(\bar{p}_e, \bar{p}_l))    (2.2)

The quantization error is controlled using the equation

    |A(\bar{p}_e)| \, |A(\bar{p}_l)| = \frac{(C_E)^2}{P(\bar{p}_e) \, P(\bar{p}_l) \, N_p}    (2.3)

Figure 2.2 summarizes the concept.
To define K(\cdot), the scene surface is divided into a set of virtual multi-resolution, overlapping and differently oriented voxel grids. For a vertex X with normal N(X), a quantized direction is defined as

    \omega_q = \left\lfloor \frac{N(X) + 1}{2} \, C_N \right\rfloor, \qquad d_z = \frac{2 \omega_q}{C_N} - 1    (2.4)

K(\cdot) returns a tuple of 14 integers: 3 for the orientation of X, 3 for the coordinates of the voxel containing X, and 1 for the grid resolution R(\bar{p}_e); the integers for Y are chosen analogously.
Figure 2.2.: Similar paths are grouped together and share the same visibility query [42]
Figure 2.3 illustrates the concept.
Figure 2.3.: Visibility domain quantization [42]
Results of the visibility queries are stored in a hash table. To select a particular bin, the researchers calculate a 32-bit key k(j) from j = K(\cdot) and use the modulo division k(j) mod C_T, where C_T is the hash table size. They employ a direct-mapped cache which does not resolve collisions but simply overwrites the data. One important implementation detail is that the algorithm uses a counter which controls how many threads in the warp need to trace rays. If the counter is less than some predefined threshold, the local state of each thread is saved in a small per-warp queue and the rays are not traced immediately. Whenever the number of threads in the queue exceeds 32, the tracing is performed. This helps to utilize the GPU and load it uniformly.
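The direct-mapped table described above can be sketched as follows: a 32-bit key selects a bin by modulo division, and a colliding entry simply overwrites the previous one. The class below is an illustrative CPU-side sketch, not Popov's GPU implementation; the table size, the empty-bin marker and the one-byte value type are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a direct-mapped visibility cache in the spirit of Popov et al.:
// a 32-bit key selects a bin by modulo division; collisions are not
// resolved, the newer entry simply overwrites the older one.
class DirectMappedCache {
public:
    explicit DirectMappedCache(std::size_t size)
        : keys_(size, 0xFFFFFFFFu),  // 0xFFFFFFFF marks an empty bin
          values_(size, 0) {}

    void insert(uint32_t key, uint8_t visible) {
        std::size_t bin = key % keys_.size();
        keys_[bin] = key;            // overwrite on collision
        values_[bin] = visible;
    }

    // Returns true on a hit and writes the cached visibility to 'visible'.
    bool lookup(uint32_t key, uint8_t& visible) const {
        std::size_t bin = key % keys_.size();
        if (keys_[bin] != key) return false;
        visible = values_[bin];
        return true;
    }

private:
    std::vector<uint32_t> keys_;
    std::vector<uint8_t> values_;
};
```

The overwrite-on-collision policy is what makes the scheme attractive on the GPU: a lookup and an insert are both a single memory access with no probing loop, at the price of occasional false misses after a collision.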
For the method assessment, Popov uses three metrics: a quality metric, a performance metric and a cache performance metric. As the quality metric, the authors employ the predictive QMOS proposed by Mantiuk et al. [30]. As the performance estimate, the shadow ray reduction and the total frame rendering time are taken. The cache performance was measured using the ratio
(1 − HitRatio) = MissRatio = 1/(RayDirection + 1) (2.5)
Results show a significant shadow ray reduction, up to 50×, while preserving image quality with QMOS above 77%. The total rendering speedup varies from 2.5× to 6.7× for different scenes.
All in all, the article is very valuable. Firstly, it gives a theoretical background for visibility caching. Secondly, the described approach could be used in rapidly changing environments such as car driving, since the quantization allows the inter-frame coherence to be exploited even when trace paths change entirely from frame to frame. Thirdly, the algorithm is efficiently implemented on the GPU using a hash table and direct mapping, which makes caching very fast. On the other hand, the quantization introduces many parameters (14), which makes hash table indexing somewhat complicated, involving digest, prepending and modulo operations. The higher performance also comes at the cost of higher memory utilization. Furthermore, fast-changing scenes impose much stress on GPU buffers, which could potentially cause memory corruption.
Ruff et al. [46] investigate the use of caching textures for realtime tracing in OptiX. For each reflective object of the scene, the researchers generate a set of six caching textures. Before tracing a ray leaving the object, the algorithm queries colour information for that ray in the caching textures. If the information is available, it is taken from the cache; otherwise the tracing procedure takes place. The authors introduce a geometrical scheme describing how reflection rays are saved in the textures. Results show that the method produces pictures that are visually equivalent to the reference images. The speedup achieved compared to conventional ray tracing varies from 2% to 168%, depending on the number of additional reflective objects.
The developers themselves mention in their article that their method is tailored to static scenes with convex objects and auto-reflection features. However, the idea of using a cubic box as the caching structure could be beneficial in this work as well.
Tole et al. [50] examine in their paper how to build a system for the interactive computation of global illumination [52] in dynamic scenes. The system stores illumination samples generated by pixel-based rendering algorithms and then applies interpolation between samples using graphics hardware. The shading cache represents a hierarchical patch tree, with every patch containing the last computed shading values for its vertices. A patch can be used in three ways: its value is used for interpolation, the patch is refined further, or its value is updated. If the cache grows above a threshold, patches which are no longer seen in the scene are removed together with their children using a “not recently used” strategy. When an object in the scene or a light moves, the patch values are recalculated using an “age priority”. Comparison with other cache rendering systems shows that the system suits best applications like interactive lighting design and modeling. Altogether, the system, like that of Debattista [9], uses an illumination cache; spatial coherence is exploited using interpolation, while temporal coherence is maintained by reusing patches from previous frames. All this makes the approach practically useless for this study.
The work by Scherzer [48] consists of notes for the course of the same name: “Exploiting Temporal Coherence in Real-Time Rendering”. He defines temporal (frame) coherence (TC) as “the existence of a correlation in time of the output of a given algorithm”. He further states that the coherence can be used to accelerate a given algorithm by making it incremental in time, and to improve quality by taking into account results obtained in previous frames. Next, Scherzer describes the Reverse Reprojection Cache, which reuses shading results from previously rendered frames. The basic idea of the method is to allow the renderer to use shading information which is available for a given point in the previous frame buffer. To do this, Scherzer introduces a reverse reprojection operator. For refreshing the cache, the screen is divided into n parts which are updated in a round-robin fashion. The method shows good acceleration results for a few pixel shaders. It is used for stereoscopic rendering, simulation of motion blur and depth-of-field effects, view frustum culling techniques and others. Again, this method is basically designed to be used as a shading/illumination cache; it could be used for exploiting temporal coherence in object space in culling techniques, but this does not help much in our task. The formulation of temporal coherence could be used to calculate how well frames correlate with each other.
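Scherzer's round-robin refresh amounts to splitting the screen into n parts and recomputing exactly one part per frame, so every cached value is at most n frames old. A minimal sketch, assuming a flat pixel buffer partitioned into equal stripes (the stripe layout is an illustrative assumption):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Minimal sketch of round-robin cache refreshing: the screen is split into
// numParts stripes and only one stripe is recomputed per frame, so every
// cached pixel is refreshed at least once every numParts frames.
class RoundRobinRefresher {
public:
    RoundRobinRefresher(std::size_t pixelCount, std::size_t numParts)
        : pixelCount_(pixelCount), numParts_(numParts), frame_(0) {}

    // Returns the [begin, end) pixel range to recompute this frame and
    // advances to the next stripe.
    std::pair<std::size_t, std::size_t> nextRange() {
        std::size_t part  = frame_ % numParts_;
        std::size_t len   = (pixelCount_ + numParts_ - 1) / numParts_;
        std::size_t begin = part * len;
        std::size_t end   = std::min(pixelCount_, begin + len);
        ++frame_;
        return {begin, end};
    }

private:
    std::size_t pixelCount_, numParts_, frame_;
};
```

The choice of n trades refresh latency against per-frame cost: a larger n amortizes the recomputation over more frames but lets cached values grow staler.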
On the whole, the objectives have been achieved. The main caching strategy proposed by Chan et al. [7] has been selected. Based on the work by Popov [42], the strategy can be extended using quantization (Chan also uses it by introducing divisions). The system can be efficiently implemented on the GPU using a hash table and bin indexing. Some of the policies for ray purging can be taken from the work by Debattista et al. [9].
2.2. GPU data structures
This subsection gives an overview of GPU data structures based on the book “GPU Gems 2” [31] and articles by Lefohn [28], Foley [13] and Prabhakar [34]. The main goal of the review is to find an efficient parallel GPU data structure for the ray cache implementation.
Lefohn, in chapter 33 of “GPU Gems 2” [31], explains how fundamental data structures are implemented using the GPU programming model. According to Lefohn, the GPU has the following data structures: multidimensional arrays, and static and dynamic sparse structures.
Multidimensional arrays 2-D textures with nearest-neighbour filtering are the substrate on which most GPGPU structures are built. All multidimensional arrays use address translation to convert an N-D array address to a 2-D texture address. The GPU implementation of the address translation suffers from limitations on floating-point addressing.
1-D array 1-D arrays are implemented by mapping the data to a 2-D texture. Currently, the maximum width for a 1-D texture is 2^27 = 134,217,728 [37]. Each time an element of a 1-D array is accessed by a GPU program, the address is translated into 2-D texture indices.
2-D array 2-D arrays are represented as 2-D textures. Their maximum size is limited by the GPU driver.
Figure 2.4.: Representation of 1D arrays on GPU
3-D array 3D arrays can be implemented in three ways: as 3D textures, as several
levels of 2D textures, or mapped directly to a single 2D texture. Every implementation
has its advantages and disadvantages. For example, the simplest 3D implementation
needs no address translation, and native GPU trilinear filtering can be
used with it to create high-quality data renderings. As a disadvantage,
this structure requires many passes to write to the whole array.
structures There are two possible layouts: a stream of structures and a structure of
streams. The stream of structures is problematic because every structure member
has a different stream index, so the members cannot be updated easily. Conversely,
in the structure of streams, a separate stream is created for every structure
member.
sparse data structures Implementing sparse data structures such as lists, trees or sparse
matrices is problematic on the GPU. Firstly, it involves writing to a computed
memory address (scattering). Secondly, traversing such data structures involves
an inhomogeneous number of pointer-dereferencing operations to access data,
which conflicts with the processing properties of the SIMD architecture: elements
processed by a single SIMD unit should execute exactly the same instructions.
static sparse structures Static sparse structures are not changed during the
GPU computation. All of these structures contain one or more levels of indirection.
There are two methods for handling the irregular access in these patterns:
the first is to divide the whole frame into blocks in which all elements
share the same access pattern and can be handled together; the second
is to have each stream process one member from its scheduled list per
render pass.
dynamic sparse structures Dynamic sparse data structures are a very active research
area. Two notable works are Purcell et al. 2003 [44] and Lefohn et al.
[17], [27].
2. Literature review
Figure 2.5.: Purcell et al. 2002 [45]. Sparse Ray-Tracing Data Structure
A photon map [44] is a cache which stores intersection points and incoming directions
for light packets called photons. There are two techniques for building
the photon map on the GPU. The first computes the addresses and data
to be written and then performs the scattering via parallel sorting operations
on the buffers. The second method uses the vertex processor.
Lefohn [27] created an efficient dynamic GPU data structure for implicit surface
deformation. He solves the scattering problem by sending small messages to
the CPU whenever the GPU structure needs to be updated. The structure uses
a blocking strategy.
Weber et al. [55] presented an efficient implementation of sparse matrices on
the GPU for solving sparse linear systems in dynamic applications.
performance considerations Dependent memory reads can create incoherent memory
access patterns. This can be prevented by creating coherent blocks of similar
computations, using small lookup tables, and minimizing the number of dependency
levels. Other important performance concerns include optimizing computational
frequency on the GPU, program specialization and the proper use of
pbuffers.
Foley and Sugerman [13] present a GPU implementation of a kd-tree traversal algorithm
suitable for raytracing, but they build the data structures on the CPU. This is of no interest
for the present work.
Lefohn et al. [28] presented an abstract generic template library for complex random-access
data structures on the GPU. The structures, such as a stack, an octree and a quadtree,
are built from standard library components. However, the library is unsuitable here.
Firstly, the PTX programs generated by the nvcc compiler must conform to restrictions
imposed by the OptiX API, which makes some CUDA libraries unusable. Secondly, we do
not need data structures as complicated as octrees; on the other hand, functions for the
construction and use of data primitives such as 1D, 2D and 3D arrays are built into the
OptiX API. All this makes usage of the library unreasonable.
Lock-free data structures are of particular interest for this work. Prabhakar and Chaudhuri
[34] evaluate their performance on the GPU. They consider a lock-free linked list [18],
hash table, skip list [18] and priority queue [18]. The data structures are evaluated using a
mix of add, delete and search operations over different key ranges. For the lock-free linked
list, the GPU implementation achieves a moderate speedup of up to 7.4× for small to medium
key ranges compared to the CPU implementation. The hash table on the GPU outperforms
the CPU implementation across all key ranges, with a maximum speedup of 11.3×.
The GPU realization of the lock-free skip list is beneficial for small and medium key ranges,
with a maximum speedup of 30.7×. For the lock-free priority queue, the GPU benchmarks
follow the same pattern as for the skip list, with a maximum speedup of 30.8×. The authors
close the discussion by comparing the GPU implementations of the hash table and the
linked list: the hash table is 36 to 538 times faster than the linked list. They conclude that
the GPU lets the hash table reveal its concurrent potential, making it the best of these data
structures for arbitrary key ranges.
To sum up, this subsection considered the problems of building data structures on
the GPU. Constructing sparse data structures is a challenging task; however, many
developers and researchers have already contributed to this area. Lock-free data
structures are of particular interest since they offer efficient GPU implementations.
The hash table proved to be the best of the aforementioned data structures due to its
consistent performance benefits and a design well suited for multithreaded GPU
applications.
2.3. GPU programming model and memory types
This section gives an overview of the CUDA C programming model [37] and considers
the different types of GPU memory.
2.3.1. GPU programming model
Kernels
CUDA C allows a programmer to write C functions (kernels) which, when invoked, are
executed in parallel by N different CUDA threads. A kernel is defined using the __global__
identifier. The number of CUDA threads that execute the kernel for a given call is specified
using the execution configuration syntax <<<...>>>. Each CUDA thread is given a unique
thread ID which is accessible in the kernel body through the built-in threadIdx variable.
Thread hierarchy
For convenience, threadIdx is a three-component vector, so every thread can have a 1-
dimensional, 2-dimensional or 3-dimensional index, forming a 1-dimensional, 2-dimensional
or 3-dimensional block.
The thread ID and the thread index inside the block are put in one-to-one correspondence
by the following equations:
for a 1D block: thread ID = x, where x is the thread index
for a 2D block of size (Dx, Dy): thread ID = x + y Dx, where (x, y) is the thread index
for a 3D block of size (Dx, Dy, Dz): thread ID = x + y Dx + z Dx Dy, where (x, y, z) is the thread index
There is a limit on the number of threads per block, since all threads of a block must be
processed by one processor core and share its memory. Presently, modern GPUs allow a
maximum of 1024 threads per block [37].
Nonetheless, a kernel can be executed by multiple blocks, so the total number of
threads executing the kernel equals the number of blocks multiplied by the number
of threads per block. The blocks are organized into one-dimensional, two-dimensional or
three-dimensional grids, as illustrated by figure 2.6.
Figure 2.6.: Grid of thread blocks [37]
The size of the grid is defined by the size of the data being processed or by the number of
processors in the system.
The number of threads per block and the number of blocks in the grid are specified in the
<<<...>>> syntax and can be of type int or dim3.
Each block in the grid is identified by a one-dimensional, two-dimensional or three-
dimensional index which can be accessed from within the kernel through the built-in
blockIdx variable. The dimensions of the block are accessible via the built-in blockDim variable.
Thread blocks are required to execute independently of each other: it must be possible
to execute them in any order, in parallel or in series. This independence requirement allows
blocks to be scheduled across any number of processors.
Threads within one block can cooperate by sharing data and by synchronizing their memory
accesses. In particular, synchronization points can be defined in the kernel
body by calling the intrinsic function __syncthreads(). The function acts as a barrier at which
all threads of the block wait until every one of them has arrived and is allowed to proceed.
Memory Hierarchy
During their execution, CUDA threads can access data from multiple memory spaces, as
shown in figure 2.7.
Every thread has its own private memory. All threads within one block share the same
shared memory, which has the same lifetime as the block. All threads can access the same
global memory.
There are two additional read-only memory spaces accessible to all threads: the constant
and texture memory spaces. The global, constant and texture memories are optimized
for different memory usages. The texture memory also offers a variety of addressing
modes as well as data filtering for some data formats.
Heterogeneous Programming
As illustrated by figure 2.8, the CUDA programming model assumes that kernels are executed
on a separate physical device which acts as a coprocessor to the host running the C
program. The CUDA programming model also assumes that the host and the device maintain
separate memory spaces in DRAM. The program therefore manages the global, constant
and texture memory spaces through calls to the CUDA runtime, which include allocation
and deallocation of device memory and transfer of data between the host and the device.
Figure 2.7.: Memory hierarchy [37]
Serial code is executed on the host while parallel code is executed on the device.
Compute capability
Compute capability is defined by major and minor revision numbers. Devices that share
the same major revision number are of the same core architecture. The minor revisions
represent incremental improvement of the core architecture, possibly including new fea-
tures.
2.3.2. GPU memory types
This subsection gives an overview of the different GPU memory types.
Figure 2.8.: Heterogeneous programming [37]
Device memory accesses
An instruction which accesses a memory address may be issued multiple times depending
on the distribution of the memory addresses across the threads within one warp.
How the distribution influences performance is peculiar to each memory type and is described
in the following subsections. For instance, for the global memory, the rule of thumb
is: the more scattered the addresses are, the lower the performance.
Global Memory
Global memory resides in device memory, which is accessed using 32-, 64- or 128-byte [37]
memory transactions. These transactions must be naturally aligned: only 32-, 64- or
128-byte segments of device memory that are aligned to their size (i.e. whose first address of
a segment is a multiple of its size) can be read or written by these transactions.
When a warp executes an instruction that accesses the global memory, it coalesces the memory
addresses of all threads within the warp into one or more memory transactions, depending
on the size of the word accessed by each thread and the distribution of the memory accesses across
the threads. The more transactions are necessary, the more unused words are transferred
in addition to the words actually accessed by the threads, reducing instruction
throughput.
How many transactions are necessary and what throughput the device achieves depend
entirely on the compute capability of the device. For devices of compute capability 1.0
and 1.1 [37], the requirements for achieving any coalescing are very strict. They are more relaxed for
devices of higher compute capability. For devices of compute capability 2.x and higher
[37], memory transactions are cached, so data locality reduces the impact
on the throughput.
In order to maximize the throughput of the global memory it is necessary to maximize
coalescing by:
• Following the optimal access patterns for devices of compute capability 1.x,
2.x and 3.x [37]
• Using data types which meet the size and alignment requirements
• Padding data in some cases, for example when accessing two-dimensional arrays
Size and Alignment Requirement
Global memory instructions support reading and writing words of size 1, 2, 4, 8 or
16 bytes [37]. An access to data in the global memory compiles to a single instruction in
global memory if and only if the data size is one of these values and the data is
naturally aligned. If these requirements are violated, the access compiles to multiple
instructions with interleaved access patterns, which hinders coalescing.
The alignment requirement is automatically fulfilled for built-in data types like char, short,
int, long, longlong, float and double, and for built-in vector types like float2 or float4 [37].
For structures, the size and alignment requirements can be fulfilled using the alignment
specifiers __align__(8) or __align__(16).
Any variable located in global memory that is returned by a driver routine for
memory allocation or by the runtime API is aligned to at least 256 bytes.
Reading 8-byte or 16-byte words [37] that are not naturally aligned can lead to incorrect
results, so special attention should be paid to maintaining the alignment of any value or
array of values of these types. A typical error of this kind occurs when memory allocation for
multiple arrays via separate calls to cudaMalloc or cuMemAlloc is replaced by the
allocation of a single large memory block partitioned into multiple arrays; in this case,
the starting addresses of the arrays are shifted with respect to the initial address of the
block.
Two-Dimensional Arrays
A common access pattern is that a thread with index (tx, ty) accesses an element of a
2D array of width width using the mapping BaseAddress + width ∗ ty + tx. For these
accesses to be fully coalesced, both the width of the thread block and the width
of the array must be a multiple of the warp size.
Local Memory
Only automatic variables could be placed to the local memory. The automatic variables
are:
• Arrays for which cannot be detemined that they are indexed with constant values
• Large structures of arrays which otherwise consume too much register memory
• Any variable if kernel uses more registers then available
The local memory resides in device memory and consequently has the same high latency
and low bandwidth as the global memory, and is subject to the same requirements for
memory coalescing. However, the local memory is organized such that consecutive
32-bit words are accessed by threads with consecutive IDs. The accesses are therefore
fully coalesced as long as all threads in one warp access the same relative address.
For devices of compute capability 2.x and higher [37], all accesses to the local memory
are cached in the same way as accesses to the global memory.
Shared Memory
Shared memory has much lower latency and much higher instruction throughput than local
or global memory because it is placed directly on the chip. To achieve high bandwidth,
the shared memory is divided into equally-sized memory modules called banks which can
be accessed simultaneously. Thus reads and writes that refer to locations residing in n
distinct memory banks can be served simultaneously, resulting in an n-fold increase of the
overall bandwidth.
If two threads access addresses in the same bank, serialization is necessary: the request
is divided into as many conflict-free requests as needed. If n such requests occur,
the initial memory request is said to cause an n-way bank conflict. In order to maximize
performance it is necessary to minimize bank conflicts. The mapping of memory addresses
to memory banks is specific to the device type.
Constant Memory
Constant memory resides in device memory and is cached in the constant cache on devices of
compute capability 1.x and 2.x [37]. For devices of compute capability 1.x [37], a
constant memory request for a warp is divided into two requests, one for each half-
warp, which are then served independently. These requests are further divided into subrequests
depending on the number of distinct memory addresses in the initial request, and the overall
throughput is reduced by a factor equal to the number of subrequests. The subrequests are served at the
cache bandwidth in case of a cache hit, or at the device memory bandwidth otherwise.
Texture and Surface Memory
Texture and surface memory reside in device memory and are cached in the texture cache;
thus a texture or surface memory access costs one read from device memory on a cache
miss, and one read from the texture cache otherwise. The texture cache is optimized for
2D spatial locality, so threads of the same warp that access addresses close together in 2D
achieve the best performance. The cache is designed for streaming fetches with constant
latency: cache hits reduce the DRAM bandwidth demand but not the fetch latency.
Reading device memory through texture or surface fetching has a number of benefits
which can make it an advantageous alternative to reading device memory through
global or constant memory:
• If the accesses do not follow the patterns required for good global or constant memory
performance, a higher bandwidth can be achieved, provided there is locality
in the texture fetches or surface reads
• Address calculations are performed outside the kernel by dedicated
units
• Packed data may be broadcast to separate variables in a single operation
• 8-bit and 16-bit integers may be converted to 32-bit floating-point values in the range [0.0, 1.0] or
[-1.0, 1.0] [37]
2.4. Ray Reordering
This subsection considers some ray reordering techniques, based on the articles by Garanzha
and Loop [14] and Moon et al. [35].
The goal here is to find ray reordering methods suitable for the task.
Garanzha and Loop [14] use ray sorting to increase the efficiency of ray tracing by revealing
coherence between rays and reducing the number of execution branches within the SIMD
processor. For the ray sorting they propose a method based on the compression of key-index
pairs; the compressed data is then sorted and decompressed.
The sequence of key-index pairs is generated by using the ray id as the index and a hash
value of the given ray as the key. The coordinates of the ray origins are quantized using a virtual
uniform 3D grid within the volume bounding the scene; the ray directions are likewise quantized
using a virtual uniform grid. Using these grids, the authors calculate cell indices which are then
merged into a 32-bit hash value. Rays with the same hash value are considered
coherent in space. The compressed data is then sorted using radix sort. After the data is decompressed,
packet ranges are extracted using the same compression procedure. Once the
ranges are extracted, the next step is to create a frustum for each packet. The frustums are
traversed using a breadth-first algorithm. Next, the active frustums are decomposed into
chunks of at most 32 rays, analogously to a CUDA warp, which eliminates execution branches
within a warp. Primary rays are indexed and sorted along a screen-space
Z-curve. A binary BVH is built on the CPU using a binning algorithm. The algorithm is compared
with a depth-first ray tracing algorithm. The authors obtain significant performance
improvements for soft shadow rays at 1024 × 768 × 16 samples; compared to the CPU, the GPU
implementation is 4× faster. However, the authors note that the memory consumption can
be substantial. A bad case for the algorithm is when one frustum captures all of the BVH
leaves, which can cause a very unbalanced workload.
Moon et al. [35] implemented ray tracing with cache-oblivious ray reordering. For
the ray sorting, the authors introduce a Hit Point Heuristic: a hit point is computed as the first
intersection point between the scene and a line starting from the ray origin and pointing in the
ray direction. The hit points are then reordered along a space-filling curve (Z-curve). During
the implementation stage, Hilbert curves were also considered, but they gave only slight
performance benefits (e.g. 2%) while requiring a much more complex implementation. The
ray tracer is implemented on the CPU. The method is tested for path tracing as well as
for photon mapping. For path tracing, the method achieves a significant 16.83×
performance improvement compared to tracing without reordering. For photon mapping, the
method in different configurations gives a 3.77× to 12.28× performance improvement.
Ray reordering also results in higher cache utilization, and reordering
based on the Hit Point Heuristic performs better than reordering based on origin
and direction. However, the authors mention that, because of the overhead, there is no guarantee that the
method will improve the performance of the ray tracing.
Altogether, the goal is accomplished: methods for ray reordering using different techniques
have been considered. Hash values for rays can be generated by quantizing the ray origin
and direction, or by quantizing spatial information such as hit point coordinates.
Rays can then be sorted by their hash values in different ways,
for example using radix sort or space-filling curves. The considered articles also
report significant performance improvements for tracing with ray sorting.
3. Problem and Solution
3.1. An experiment with dimensionality of context launches
The first problem considered is how the dimensionality of a computational problem
affects the efficiency of ray tracing.
3.1.1. Problem Description
The problem is described in Redmine ticket #155 ”Experiment with dimensionality of context
launches“. The ticket has the following content:
”Chapter 9. Performance Guidelines” of the OptiX Programming Guide [38] states that the
maximum coherence between threads of a tile is achieved by choosing an appropriate
dimensionality for the launch. For example, common problems on 2D images
have 2D complexity. Thus the problem reduces to determining the launch
dimensionality and investigating how it affects efficiency.
3.1.2. Theory
To describe the solution to the problem it is necessary to start from definitions for space-
filling and Hilbert curves.
Space-filling Curve
A space-filling curve is defined in the following way [3]:
Given a mapping f : I → Rn, the corresponding curve f∗(I) is called a space-filling
curve if the Jordan content of f∗(I) is larger than 0.
Hilbert Curve
The Hilbert curve is defined as [3]:
• each parameter t ∈ I = [0, 1] is contained in a sequence of intervals
I ⊃ [a1, b1] ⊃ ... ⊃ [an, bn] ⊃ ...
where each interval results from a division-by-four of the previous interval
• each such sequence of intervals can be uniquely mapped to a corresponding se-
quence of 2D intervals (subsquares)
• the 2D sequence of intervals converges to a unique point q ∈ Q = [0, 1] × [0, 1];
q is defined as h(t)
h : I → Q then defines a space-filling curve, the Hilbert curve.
27
3. Problem and Solution
Grammar for 2D Hilbert Curve
Grammar for 2D Hilbert curve can be constructed in the following way [3]:
• Non-terminal symbols: H, A, B, C; start symbol H
• terminal characters: ↑, ↓, ←, →
• productions:
H ← A ↑ H → H ↓ B
A ← H → H ↑ H ← B
B ← C ← H ↓ H → B
C ← B ↓ H ← H ↑ B
• replacement rule: in any word, all non-terminals have to be replaced at the same
time → L-system (Lindenmayer)
The arrows describe the iterations of the Hilbert curve in ”turtle graphics“ [43]. Figure 3.1
shows a sample 2D Hilbert curve generated using the grammar.
Figure 3.1.: An example of 2D Hilbert curve
Grammar for 3D Hilbert Curve
L-systems in three dimensions can be described using ”turtle graphics” [43]. The basic
idea is to represent the turtle orientation in 3D space by a set of vectors [\hat{H}, \hat{L}, \hat{U}]
representing the turtle's heading, left direction and upward direction, respectively. The vectors
[\hat{H}, \hat{L}, \hat{U}] form an orthonormal basis. Spatial rotations of the turtle can be described as

[\hat{H}', \hat{L}', \hat{U}'] = [\hat{H}, \hat{L}, \hat{U}]\,R    (3.1)

where R is a 3 × 3 rotation matrix. Rotations by an angle α around the vectors \hat{U}, \hat{L} and \hat{H} are
represented by the following matrices:
Figure 3.2.: Controlling turtle in 3D [43]
R_U(\alpha) = \begin{pmatrix} \cos\alpha & \sin\alpha & 0 \\ -\sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}
\quad
R_L(\alpha) = \begin{pmatrix} \cos\alpha & 0 & -\sin\alpha \\ 0 & 1 & 0 \\ \sin\alpha & 0 & \cos\alpha \end{pmatrix}
\quad
R_H(\alpha) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix}    (3.2)
The following symbols determine the turtle's orientation in space:
+ Turn left by angle δ, using rotation matrix RU(δ)
− Turn right by angle δ, using rotation matrix RU(−δ)
& Pitch down by angle δ, using rotation matrix RL(δ)
∧ Pitch up by angle δ, using rotation matrix RL(−δ)
\ Roll left by angle δ, using rotation matrix RH(δ)
/ Roll right by angle δ, using rotation matrix RH(−δ)
| Turn around, using rotation matrix RU(180◦)
An interested reader could find a grammar for 3D Hilbert curve in appendix A.3.
Z Curve
Z-curves are defined in terms of Morton codes [24]. In order to calculate Morton codes it is
necessary to consider the binary representation of the point coordinates in 3D space, as shown
by figure 3.3. Firstly, for each coordinate, the binary code is expanded by inserting two
additional ”gap“ bits after each bit. Secondly, the binary codes of all coordinates are joined
(interleaved) to form one binary number. If the resulting codes are sorted in ascending
order, this determines the sequence of the Z-curve in 3D space (the left part of figure 3.3).
The sorting can be performed using radix sort. The expansion and interleaving of the bits
can be done efficiently by exploiting the bit-swizzling properties of integer multiplication.
A curious reader will find a listing in appendix A.1.
Figure 3.3.: Generation of Morton codes [24]
3.1.3. Problem Solution
The problem solution can be sketched in the following way:
1. Sort the ray buffer according to the spatial ray coordinates
2. Initialize the context with the sorted buffer, depending on the dimensionality:
For 1D, use the sorted buffer directly
For 2D, map the 1D indices to a 2D array structure using a 2D Hilbert curve
For 3D, map the 1D indices to a 3D array structure using a 3D Hilbert curve
The points of this sketch are discussed in more detail below.
Sorting
Histogram and Hilbert Curve The first approach is to generate indices for every ray in
3D and sort them along a Hilbert curve.
Ray indices The rays generated for the wavetracer represent uniformly distributed
points on a unit sphere, so their coordinates can be quantized. The quantization
amounts to redistributing the rays over a three-dimensional spatial
data structure (a histogram) according to their directions. For each element of
the ray buffer, indices are generated depending on the number of bins
in the histogram; using these indices the ray is added to a bin of the
histogram. Pseudocode for the algorithm is shown by Algorithm 1, and a geometrical
interpretation of the algorithm is shown by figure 3.4.
Hilbert curve Once the data structure is obtained, it can be sorted along a 3D
Hilbert curve or, equivalently, mapped from a 3D to a 1D data structure. The
sorted buffer can be used directly for the context initialization. The Hilbert
curve generated for 16 × 16 × 16 bins is shown by figure 3.5.
Morton codes and Radix Sort The second approach is a logical continuation of the previous
one: Morton codes order rays according to their spatial neighborhood in Z-order,
and with Morton codes as keys the rays can be sorted using radix sort.
Algorithm 1 Algorithm for histogram generation
for all element in Buffer do
x0 ⇐ (element.x + 1)/2
y0 ⇐ (element.y + 1)/2
z0 ⇐ (element.z + 1)/2
x ⇐ floor(x0 ∗ bin_num/2)
y ⇐ floor(y0 ∗ bin_num/2)
z ⇐ floor(z0 ∗ bin_num/2)
list ⇐ histogram[x][y][z]
v.x ⇐ x0
v.y ⇐ y0
v.z ⇐ z0
list.pushBack(v)
histogram[x][y][z] ⇐ list
end for
Morton codes An efficient implementation of Morton code generation was shown
in the previous subsection. Figure 3.6 shows the pattern generated by the
algorithm when sorting rays.
Radix Sort CUDA already has an efficient implementation of this algorithm.
Transformation between Spatial Structures
To experiment with dimensionality, it is necessary to transform the spatial structures from 1D
to 2D or 3D data structures. This can also be achieved using Hilbert curves.
Mapping from 1D to 2D The sorted buffer could be mapped from 1D to 2D using 2D
Hilbert curve.
Mapping from 1D to 3D The mapping between 1D and 3D could be achieved using 3D
Hilbert curve.
Implementation details will be discussed in the appropriate chapter.
3.2. Application of frame coherence
The second problem is an investigation of the influence of frame coherence on the wavetracer
performance.
3.2.1. Problem Description
The task is formulated in Redmine ticket #218 ”Exploit frame coherence” and has
the following objectives:
1. Find types of coherence which exist in the system
a) Measure coherence
Figure 3.4.: Uniformely distributed rays & histogram bins
2. Find schemes (algorithms) which allow them to be exploited
3. Propose efficient implementation for the algorithms
4. Implement the proposed solution
5. Measure performance
6. What are the costs for coherence utilization in the system?
In the following subsections, the first two points will be considered.
3.2.2. Coherence
According to ”A Dictionary of Statistics” [51], coherence is a “term used to describe the
resemblance between the fluctuations displayed by two time series; an analogue of corre-
lation”.
Innerframe Coherence
In the context of the given work, the innerframe coherence means that there is a correlation
between the results of ray tracing inside one frame. Rays with high coherence can be combined
into groups; inside these groups it is necessary to trace only one ray [42].

Figure 3.5.: Hilbert curve generated for 16 × 16 × 16 bins

Questions
which could be posed here:
1. How to know what rays have to be coalesced into one group?
2. How are the results of the tracing to be stored in the cache?
3. How to calculate the error introduced by this approach?
Intraframe Coherence
The intraframe coherence means that there is a correlation between the results of the ray
tracing procedure for different frames. The tracing result for any ray can be
stored in the cache and reused in the next frames. Questions which can be posed here:
1. How to measure coherence between frames?
2. How to know what data could be reused in the next frames?
3. What caching strategy to choose to purge the cache?
Coherence Measurement
Both innerframe and intraframe coherence could be measured using the Pearson product-moment
correlation coefficient [32]:

r_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2 \sum_{i=1}^{N}(y_i - \bar{y})^2}}    (3.3)

where x and y are two random variables with N observations.
For the innerframe coherence could be used autocorrelation function [32].
r_k = \frac{\sum_{i=1}^{N-k} (x_i - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \qquad (3.4)
Figure 3.6.: Z-curve
The quantity rk is called the autocorrelation coefficient at lag k. Calculation of these values
will be covered in more detail in the chapter dedicated to testing.
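As a concrete illustration, both coefficients can be computed with a few lines of host-side code (a hypothetical CPU sketch; the function names are not part of the system):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson product-moment correlation of two equally long series, Eq. (3.3).
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double num = 0.0, sx = 0.0, sy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (x[i] - mx) * (y[i] - my);
        sx  += (x[i] - mx) * (x[i] - mx);
        sy  += (y[i] - my) * (y[i] - my);
    }
    return num / std::sqrt(sx * sy);
}

// Autocorrelation coefficient at lag k, Eq. (3.4).
double autocorr(const std::vector<double>& x, std::size_t k) {
    const std::size_t n = x.size();
    double mx = 0.0;
    for (double v : x) mx += v;
    mx /= n;
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < n; ++i)     den += (x[i] - mx) * (x[i] - mx);
    for (std::size_t i = 0; i + k < n; ++i) num += (x[i] - mx) * (x[i + k] - mx);
    return num / den;
}
```

For perfectly linearly related series pearson returns 1, for anti-correlated series −1, and autocorr at lag 0 is always 1.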
3.2.3. Formulation of Caching Scheme
In the Literature Review, a survey of caching schemas has already been carried out. Using
ideas stated by Chan [7], it is possible to build a caching system for the given task.
Cache Tree
In Chan’s [7] work, each object consists of several surfaces, each surface is divided into
several levels of patches and every patch is further quantized depending on angular values
of incident rays.
Objects In this work, there is only one type of object: a model instance. Models are
distinguished by their ids, which are defined in the configuration file. An environment
is also loaded as a model instance, which usually has -1 as its id. So the roots of the cache
trees can be distinguished using these identifiers.
Patches There is a natural division of such objects into patches, given by their primitive
indices. These primitive indices introduce patches of native precision for the objects, so
results cached for a given object can be discriminated using these indices.
Divisions Patches, or primitive indices, are responsible for spatial accuracy. However,
rays also have to be distinguished by angular precision. Angular values of incident rays
represent spatial coordinates on a sphere of unit radius. These coordinates are quantized
using some large number which defines the quantization precision. The quantized
coordinates introduce divisions which further distinguish incident rays.
Division Index Thus results of the tracing procedure are stored in the cache using a multi-
component division index. The index consists of the model instance id, the primitive index id
and three quantized angular coordinates. The concept is shown by figure 3.7.
Figure 3.7.: Cache Tree
Cache Construction
Initially the cache is empty. In general, a trace path is represented in the system as a
sequence of points. Data associated with the results of tracing a ray can be of different
types, for example, miss, reflection, diffraction, receiver hit, emitter launch. The per-ray data
carries all the information necessary for any of these types. The most important information
is the origin of the point where an intersection or some other tracing event occurs, a new ray
direction (i.e. the direction of the outgoing ray), an instance id (i.e. the identifier of the model
where the tracing event occurs) and a primitive index. When a ray is constructed, the results
of the previous tracing are used. Thus it can be observed that there is a one-to-one
association between the results of the previous tracing and the results of the current tracing.
Consequently, the results of the previous tracing can be used as a key, whereas the results of
the current tracing can be seen as the data. When there are no previous results, i.e. when a
ray is taken from the buffer, both the instance id and the primitive index id of the first key are
set to zero and the divisions are obtained by quantization of the initial ray direction. The
cache construction is illustrated by figure 3.8.
All subsequent rays query the cache using the multicomponent index. If such an entry exists,
the result stored for the entry is taken for the ray. Otherwise the ray is traced and the result
is saved to the cache using the multicomponent index. An algorithm for cache construction is
shown by Algorithm 2. Note that if the query returns false, the entry is overwritten with a
new value. For convenience of retrieval, cache entries which belong to the same path can be
linked in a list.
Figure 3.8.: Cache Construction
Cache Purging
To purge the cache it is necessary to know which entries are no longer valid. There are
three cases of changing a steady-state ray configuration: emitter movement, receiver
movement and (reflecting) object movement.
Emitter Movement In this paragraph, it is analysed how to purge the cache in the case of
emitter movement. There are two possible solutions: Position Purging and Precision Purging.
Position Purging We can define for what range of variation of the emitter position the change
will not produce new tracing results. To do this, emitter positions are quantized,
which introduces further divisions for a primitive index (patch). Thus a cache entry is
characterized by three additional coefficients which represent the emitter position for the
ray. These coefficients are stored as an additional key for the entry (the position stamp). Every
ray has its own position stamp. The cache is thus queried with a multicomponent key; if the
entry has an outdated position stamp, it should be purged. An algorithm for the cache query
is shown by Algorithm 3. This position stamp obtained by quantization of the emitter
Algorithm 2 Algorithm for cache construction
for all ray in rayStack do
    mcid ⇐ generate multicomponent index(ray)
    pk ⇐ generate position key(emitter position)
    query ⇐ cache.contains(pk, mcid)
    if query then
        result ⇐ cache.get entry(mcid)
    else
        result ⇐ trace(ray)
        save to cache(pk, mcid, result)
    end if
end for
Algorithm 3 Algorithm for cache query
Require: multicomponent index, position stamp
if cache.contains(multicomponent index) then
    node ⇐ cache.get(multicomponent index)
    if node.position stamp equals position stamp then
        return TRUE
    end if
end if
return FALSE
position introduces a degree of flexibility which makes it possible to reuse cached data
between frames as long as the variation of the emitter position stays within a certain
precision range.
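A minimal sketch of such a position stamp, assuming a uniform quantization step cell (the names PositionStamp and makeStamp are illustrative, not taken from the system):

```cpp
#include <cmath>
#include <cstdint>

// Illustrative position stamp: the emitter position is quantized with step
// `cell`; positions falling into the same cell produce equal stamps, so cache
// entries remain valid while the emitter stays inside one cell.
struct PositionStamp {
    std::int32_t px, py, pz;
    bool operator==(const PositionStamp& o) const {
        return px == o.px && py == o.py && pz == o.pz;
    }
};

PositionStamp makeStamp(float x, float y, float z, float cell) {
    return { static_cast<std::int32_t>(std::floor(x / cell)),
             static_cast<std::int32_t>(std::floor(y / cell)),
             static_cast<std::int32_t>(std::floor(z / cell)) };
}
```

A cached entry is then purged when its stored stamp no longer equals the stamp of the current emitter position.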
Precision Purging Precision purging is characterized by the fact that a change of the
emitter position hash does not necessarily mean that the cache has to be purged. Instead,
an estimate is calculated of how far the parameters of the requested element are from those
currently at this address in the cache. This is called the residual value. If this value exceeds
a threshold, the cache is purged at the requested address. The implementation will be
described in the implementation chapter.
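One plausible form of the residual test, assuming the residual is the Euclidean distance between the cached and the requested emitter position (the actual residual used in the implementation may differ; the names here are illustrative):

```cpp
#include <cmath>

// Hypothetical residual check for precision purging: the entry is purged only
// when the cached parameters deviate from the requested ones by more than a
// configurable threshold (cf. the cache residual value parameter later on).
struct Vec3f { float x, y, z; };

float residual(const Vec3f& cached, const Vec3f& requested) {
    const float dx = cached.x - requested.x;
    const float dy = cached.y - requested.y;
    const float dz = cached.z - requested.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

bool shouldPurge(const Vec3f& cached, const Vec3f& requested, float threshold) {
    return residual(cached, requested) > threshold;
}
```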
Receiver Movement Clearing the cache when the receiver moves can be performed as
follows. As the trace path is represented in the system by a sequence of points, there
are three cases:
1. the intersection point is on the environment
2. the intersection point is on another moving object
3. the intersection point is on the receiver
In case the intersection point is on the environment, it cannot be claimed that the associated
cache entry is invalid, since the ray path has not reached a receiver yet. So this entry stays
valid, as it is not associated with the receiver which changed its position. The second
case is the same as the first one. The third case can be checked by verifying the hit point
position against the receiver positions present in the scene. The hit point position should
be within the radius of the receiver antenna, which in this case is 1. The entry is invalidated
if the test is unsuccessful. The concept is illustrated by figure 3.9. A pseudocode for this test
Figure 3.9.: Receiver movement
is represented by algorithm 4.
Object Movement This is the most complex case. It happens when a ray reflects off
another moving object (not the environment), for example, another car. One possible approach
to the problem is to create a more complex position key. This means that the key
reflects the current positions of all moving objects in the scene, a sort of map. Once any
object in the map moves, a new key is generated. It will differ from the previous one if
the movement exceeds the quantization precision. However, in the current implementation the
number of such cases is negligible, since most of the virtual drive recordings have the same
speed for all cars. This means that they are moving with constant speed without overtaking
each other.
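The map-like key suggested above can be sketched as follows (a hypothetical illustration: the positions of all moving objects are quantized and folded into a single hash with FNV-1a; none of these names appear in the system):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical scene-wide position key: quantize the position of every moving
// object with step `cell` and fold the cells into one hash. Any object moving
// to a different quantization cell changes the key.
struct Pos { float x, y, z; };

std::uint64_t sceneKey(const std::vector<Pos>& objects, float cell) {
    std::uint64_t h = 1469598103934665603ull;      // FNV-1a offset basis
    auto mix = [&h](std::int64_t v) {
        h ^= static_cast<std::uint64_t>(v);
        h *= 1099511628211ull;                     // FNV-1a prime
    };
    for (const Pos& p : objects) {
        mix(static_cast<std::int64_t>(p.x / cell));
        mix(static_cast<std::int64_t>(p.y / cell));
        mix(static_cast<std::int64_t>(p.z / cell));
    }
    return h;
}
```

Entries stamped with an old key are then treated as outdated, exactly like an outdated emitter position stamp.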
Algorithm 4 Algorithm to check intersection with receiver
Require: node, radius
data ← node.data
for all antenna in antennaBuffer do
    if data.type equals RECEIVER HIT then
        recPos ← antenna.position
        pos ← data.origin
        dx ← abs(recPos.x − pos.x)
        dy ← abs(recPos.y − pos.y)
        dz ← abs(recPos.z − pos.z)
        dr ← sqrt(dx² + dy² + dz²)
        if dr ≤ radius then
            return TRUE
        end if
    end if
end for
return FALSE
Part III.
Analysis and Implementation
4. Analysis and Modelling
4.1. Ray Reordering
4.1.1. Code Analysis
The best place to put the ray sorting is where the ray buffer is created: in the class
RandomEmitterBuffer in the module antennageometry.cpp.
Existing Code
In the constructor of RandomEmitterBuffer, the ray buffer is created and formatted using
the OptiX context. The constructor calls the resize method. The method resize sets the size
of the buffer, fills the buffer elements by calculating uniformly distributed spherical
coordinates and calls rtBufferMarkDirty, which notifies OptiX that the content of the buffer
has changed.
Modification
For convenience, it is possible to create a variable reorder which indicates whether the buffer
is to be sorted. If the variable is set to true, the sort method is called. Listing 4.1 shows the
modification of the method resize.
Listing 4.1: Resize with calling sort method
void RandomEmitterBuffer::resize(const RTsize size)
{
    buffer->setSize1D(size);
    fill();
    if (reorder)
        sort();
    buffer->markDirty();
}
4.2. Frame Coherence
4.2.1. Code Analysis
Schematically the main tracing loop is represented by figure 4.1. It consists of three main
stages:
1. Analysis of the stack top element
2. Ray tracing
3. Storing the results on the top of the stack
Figure 4.1.: The main tracing loop
Analysis of the stack top element
This part of the algorithm is the most complex one, with many branches. Its purposes are:
1. Choosing the direction for the top element
2. Writing results of the tracing to the WayPointBuffer depending on configuration
parameters
3. Unwinding the stack
An interested reader will find a chart in appendix B.
Ray tracing
At this stage, data is taken from the stack, a ray is constructed and traced. This step
fits naturally for the cache implementation. An extended flowchart is shown by figure 4.2.
Here, a variable cacheEnabled is introduced for convenience of turning the cache on and off.
Figure 4.2.: Ray tracing with cache
If the cache is enabled, it is queried for an entry with a key defined by prev data. In case the
cache contains the key, the data is set directly in the function call. Otherwise the ray is
traced and the results are saved to the cache. The last step is assigning the value of data to
prev data so that it can be used in the next iteration.
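On the host side, this query-or-trace step can be sketched like this (a simplified stand-in: std::unordered_map replaces the GPU hash table, and trace is a placeholder for the actual OptiX launch; all names are illustrative):

```cpp
#include <unordered_map>

// Minimal sketch of the cached tracing step. KeyT stands for the
// multicomponent index derived from prev data, DataT for the per-ray result;
// trace() is a stand-in for the actual ray tracing call.
template <typename KeyT, typename DataT, typename TraceFn>
DataT traceWithCache(bool cacheEnabled,
                     const KeyT& key,                // built from prev data
                     std::unordered_map<KeyT, DataT>& cache,
                     TraceFn trace) {
    if (cacheEnabled) {
        auto it = cache.find(key);
        if (it != cache.end())
            return it->second;                       // cache hit: reuse result
    }
    DataT result = trace(key);                       // cache miss: trace the ray
    if (cacheEnabled)
        cache[key] = result;                         // store for later queries
    return result;
}
```

Calling it twice with the same key performs only one trace when the cache is enabled.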
Data saving
The last stage consists of pushing the data onto the stack and incrementing the stack counter.
The stage is shown by figure 4.3.
4.2.2. Selection of Data Structure
The following questions arise when selecting a data structure for the cache construction:
1. How to implement the tree described in the previous chapters for the cache construction?
Figure 4.3.: Saving data
2. If such trees can be efficiently constructed, how to maintain them?
3. How to implement a fast data search in such trees?
In our case, the cache trees cannot be implemented directly as described by Chan [7]:
firstly, because this requires dynamic allocation of memory, which is not available in OptiX;
secondly, because searching for elements in such data structures is challenging. To make the
search more convenient, it is necessary to maintain additional data structures performing
indexation. Given all this, construction of such trees would be difficult.
Binary Tree
The first idea that comes to mind is to use a binary tree. It is easy to construct and
maintain, and the complexity of search in a binary tree is O(log(n)). An object hierarchy could
be maintained using hashing of multicomponent indices. On the other hand, hash values could
be used to determine a total ordering of the elements of the tree. Selection of an appropriate
hash function is a separate issue which will be regarded later. The problem with dynamic
allocation could be solved using a buffer with predefined elements. The tree is illustrated
by figure 4.4.
However, the binary tree has one major defect: all elements of the tree have to be added
through the root element. The root element has to contain a counter which indicates the next
element in the buffer of predefined elements. In the case of massively parallel computations,
GPU streams have to access the counter sequentially in order to ensure data consistency.
This counter is the main bottleneck of the binary tree.
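The counter-based allocation can be sketched with a CPU analogue (std::atomic fetch_add standing in for CUDA's atomicAdd; NodePool is an illustrative name):

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the allocation scheme described above: nodes come from a fixed
// preallocated buffer, and a single shared counter hands out the next free
// slot. Every thread must go through this one counter, which is exactly the
// bottleneck noted in the text.
struct NodePool {
    explicit NodePool(std::size_t cap) : capacity(cap), next(0) {}

    std::size_t capacity;
    std::atomic<std::size_t> next;

    // Returns the index of a fresh node, or capacity if the pool is exhausted.
    std::size_t allocate() {
        std::size_t i = next.fetch_add(1);
        return i < capacity ? i : capacity;
    }
};
```

Because every inserting thread serializes on next, the counter dominates insertion cost under heavy contention.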
Chained Hash Table
A natural solution to the problem described in the previous paragraph is to construct trees
in parallel. This resolves collisions of streams trying to access the counter and reduces the
waiting time in the queue. The streams are distributed over various root elements depending
on the hash value of the added element. Such a data structure is called a chained hash table
[29]. It is preferable to the binary tree; however, performance analysis of the chained table
shows that accesses to the root elements (buckets) and to chained elements of the table
Figure 4.4.: Binary Tree
take different amounts of time. This difference reduces the overall performance of the table,
making it not very profitable. Also, since the element buffer has a fixed size, it is necessary
to reserve a certain number of elements for each bucket. This creates fragmentation of the
element buffer. The data structure is illustrated by figure 4.5.
Implementation of the chained hash table is described in Implementation of Data Struc-
ture 5.2.2. Performance analysis is described in the Testing chapter 6.
Open-Addressed Hash Table
A further improvement of the data structure is to permit direct access to elements of
the table. This solution has several benefits:
1. Further reduction of stream collisions
2. Absence of buffer fragmentation
3. Fast access to elements
Direct mapping further reduces collisions of GPU streams. The absence of tree construction
solves the problem of buffer fragmentation. Direct access to buffer elements makes a fast
implementation of read/write cache operations possible. Nevertheless, there exists an
implementation pitfall connected with the OptiX buffer which will be described in the
implementation subsection.
Implementation of the open-addressed hash table [29] is described in Implementation of
Data Structure subsection 5.2.2. Performance analysis is described in the Testing chapter
6.
Figure 4.5.: Chained Hash Table
4.2.3. Selection of Data Model
According to the theoretical justification given in the chapter Problem and Solution 3 and
to the selected data structure, this subsection describes the data model.
Cache Key
CacheKey is a structure which holds the multicomponent key parameters of a ray according to
the definition in 3.2.3. A cache key contains the following members:
primitiveIndex - identifier of the mesh triangle
instanceId - identifier of the loaded model
div x - x division of the direction angle
div y - y division of the direction angle
div z - z division of the direction angle
hash - hash value of the key
calc hash() - function which calculates the hash value using the members of the key
Position Key
A position key contains the following members:
pos x - x division of emitter position
pos y - y division of emitter position
pos z - z division of emitter position
phash - hash value of the key
pos hash() - function which calculates the hash value using the members of the key
Cache Node
A cache node or element of the hash table contains the following members:
ckey - cache key
pkey - position key
data - per ray load data
timestamp - timestamp of creation
used - usage marker
hit - number of hits
These are the basic elements of the node. The node member list will be extended depend-
ing on the task.
Per Ray Data
To get an overall view of the data model, it is necessary to describe PerRayData.
The struct PerRayData has the following members:
diffractionCnt - parameter from the main loop to the hit programs
type - The type of the waypoint. This value is set by the hit program.
nextOrigin - origin of the next launch
nextDirection - direction of the next launch
receiverDistance, diffEdge, diffStepSize, diffSteps - all these are used in the closest
hit program of the receiver
primitiveIndex - the same as in the cache key
instanceId - the same as in the cache key
normal - the normal of the triangle on which the ray was reflected
emitterSlot - the slot of the antenna which emitted this ray
emitterModelInstanceId - the same as modelInstanceId in the cache key
Data Model
Thus the overall data model is shown by figure 4.6. In reality, CacheNode does not contain
CacheKey and PositionKey; it contains only their hash values. This is done to minimize the
memory size of CacheNode and consequently of the hash table on the GPU.
Figure 4.6.: Data Model
4.2.4. Selection of Hash Function
Selection Criteria
Usually, when a hash function is selected, the following criteria are used:
1. The hash function should spread elements across the table in a random and uniform
manner.
2. Collisions of hash values for different elements should be minimal.
The first condition is necessary in order to distribute elements across all buckets in a
uniform manner, so that all buckets contain approximately equal numbers of elements. The
second condition ensures that any element is identified in a unique way.
However, based on the approach to the task solution, it is necessary to add another condition,
namely that keys with similar parameters should have close hash values. The question is
whether conditions one and two are compatible with it. This should not be a problem because
the rays are already uniformly distributed on a unit sphere.
Hash Function with Uniform Distribution
The key represents an array of integers, and it is necessary to generate a hash value based on
this array. The code for such a function can be taken, for example, from Morin [36]. Listing
4.2 shows the code.
Listing 4.2: Hash function for integer array
unsigned hashCode() {
    long p = (1L << 32) - 5;    // prime: 2^32 - 5
    long z = 0x64b6055aL;       // 32 bits from random.org
    int z2 = 0x5067d19d;        // random odd 32 bit number
    long s = 0;
    long zi = 1;
    for (int i = 0; i < x.length; i++) {
        // reduce to 31 bits
        long long xi = (ods::hashCode(x[i]) * z2) >> 1;
        s = (s + zi * xi) % p;
        zi = (zi * z) % p;
    }
    s = (s + zi * (p - 1)) % p;
    return (int) s;
}
In this listing, x is an array of integers. The integers are hashed using a multiplicative
hash function with d = 31 to reduce each hash code to a 31-bit representation. This is done
so that additions and multiplications can be carried out using 63-bit arithmetic. The
probability for two sequences to have the same hash code is defined as [36]

\frac{2}{2^{31}} + \frac{r}{2^{32} - 5} \qquad (4.1)
Hash Function Preserving Data Locality
Space-filling curves can again be used to generate codes for multicomponent keys. The
key represents an integer array with 5 components: instanceId, primitiveIndex, div x,
div y, div z. Listing 4.3 shows how the Morton code generator can be adapted to 5D
[20].
Listing 4.3: Morton Codes generator for 5D
unsigned int SeparateBy4(unsigned int x) {
    x &= 0x0000007f;
    x = (x ^ (x << 16)) & 0x0070000F;
    x = (x ^ (x << 8))  & 0x40300C03;
    x = (x ^ (x << 4))  & 0x42108421;
    return x;
}

MortonCode MortonCode5(unsigned int x,
                       unsigned int y,
                       unsigned int z,
                       unsigned int u,
                       unsigned int v) {
    return SeparateBy4(x) |
           (SeparateBy4(y) << 1) |
           (SeparateBy4(z) << 2) |
           (SeparateBy4(u) << 3) |
           (SeparateBy4(v) << 4);
}
SeparateBy4 inserts four blank bits between every two bits in the binary representation of
an integer. MortonCode5 interleaves the binary representations using shift and OR operations.
Double Hashing
In open-addressed hash tables, a combined hash function is also used [29]. The function for
double hashing is defined as:
h(k, i) = (h1(k) + i ∗ h2(k)) mod m (4.2)
where h1 and h2 are two auxiliary hash functions and i goes from 1 to m − 1, where m is
the number of positions in the table. In this work, however, a simpler equation is used
where i and m are set to 1. The justification for this is that probing of the hash table is not
used, i.e. the insertion code does not look for unoccupied places in the table. Instead, the
entire range of hash function values is mapped directly to a discrete set of buffer indices.
The mapping function is described in the next subsection.
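For reference, the general double-hashing probe of equation (4.2) looks as follows in code (a textbook sketch [29], not the simplified variant actually used in this work):

```cpp
#include <cstdint>

// Textbook double hashing, Eq. (4.2): h1 and h2 are the two auxiliary hash
// values of a key, i is the probe number and m the table size.
std::uint32_t doubleHash(std::uint32_t h1, std::uint32_t h2,
                         std::uint32_t i, std::uint32_t m) {
    return (h1 + i * h2) % m;
}
```

With probing disabled, as in this work, only a single combined value is computed and mapped straight to a buffer index.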
4.2.5. Selection of Mapping Function
A mapping function associates the range of hash function values with the set of buffer indices.
The following formula performs the mapping:
f(h) = (1.0 + h/INT MAX) ∗ m/2 (4.3)
where h is a hash value, INT MAX is a constant denoting the maximum integer in a
system and m is the hash table size. The size of INT MAX is defined by the ANSI standard;
for 32-bit Unix systems it is 2,147,483,647 [56]. Thus for m = 500000, the range of hash values
for one bucket is approximately 8590.
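Equation (4.3) translates directly into code; a host-side sketch (mapToBucket is an illustrative name) with a clamp for the boundary value h = INT_MAX:

```cpp
#include <climits>
#include <cstddef>

// Equation (4.3) in code: maps a signed 32-bit hash value onto [0, m).
// 1.0 + h/INT_MAX lies in [0, 2], so multiplying by m/2 yields a bucket index.
std::size_t mapToBucket(int h, std::size_t m) {
    double f = (1.0 + static_cast<double>(h) / INT_MAX) * (m / 2.0);
    std::size_t idx = static_cast<std::size_t>(f);
    return idx < m ? idx : m - 1;   // clamp h == INT_MAX onto the last bucket
}
```

A hash of 0 lands in the middle of the table, the extreme negative and positive hashes in the first and last buckets.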
4.2.6. Selection of Synchronization Mechanism
The selection of a synchronization mechanism depends on the data structure which is going to
be implemented. In the case of chained hash tables, it is possible to use lock-free
synchronization [34]. Synchronization of read/write access in open-addressed hash tables
can be implemented using atomic locks [47].
Lock-free Synchronization
In the lock-free style of programming [40], at least one thread always makes progress. All
threads try to write their results to the concurrent data structure; on failure, a thread
repeats the operation. For synchronization, atomic operations are usually used. The following
code shows the atomicCAS operation as it is defined in CUDA [40].
Listing 4.4: atomicCAS [40]
int atomicCAS(int *p, int cmp, int v) {
    int old;
    // executed exclusively by a single thread
    {
        old = *p;
        if (cmp == old) *p = v;
    }
    return old;
}
The next listing shows the insertion of an element into a lock-free linked list.
Listing 4.5: Insertion to lock-free linked list [40]
void insert(ListNode *mine, ListNode *prev)
{
    ListNode *old, *link = prev->next;
    do {
        old = link;
        mine->next = old;
        link = atomicCAS(&prev->next, link, mine);
    } while (link != old);
}
The idea behind lock-free data updates is that on every cycle a new value is generated based
on the current data. Then an atomicCAS operation is performed, trying to change the current
data to the new value. If the operation is unsuccessful, it is repeated.
Atomic Lock Synchronization
In the locking style of programming [40], all threads try to get the lock. One thread
acquires the lock, does its work and releases the lock, and so on. The next listing shows
mutex synchronization using atomic locking.
Listing 4.6: Addition using atomic lock [40]
int locked = 0;

bool try_lock()
{
    int prev = atomicExch(&locked, 1);
    if (prev == 0)
        return true;
    return false;
}

bool unlock()
{
    int prev = atomicExch(&locked, 0);
    if (prev == 1)
        return true;
    return false;
}

double atomicAdd(double *data, double val)
{
    while (try_lock() == false);
    double old = *data;
    *data = old + val;
    unlock();
    return old;
}
5. Implementation
5.1. Implementation of Ray Reordering
During the research stage, 2D and 3D Hilbert curves, the Z-curve and the ray histogram were
implemented. The implementation is performed on the CPU side, because the task is to explore
how ray sorting affects efficiency. On the basis of the investigation results, it was found
that the most efficient solution is a combination of the Z-curve and Radix sort. An
implementation of the Z-curve has been described in the previous chapter, and CUDA already
has an efficient implementation of the Radix sort algorithm. Thus there exists a standard
GPU implementation. Results of the benchmarking are described in the next chapter.
Hilbert Curves
2D Hilbert Curve The curve is implemented using turtle graphics with at most one
turn after a step [21]. An interested reader can find the implementation in appendix A.2.
3D Hilbert Curve The implementation of the 3D Hilbert curve on the CPU is straightforward.
It follows the syntax given in appendix A.3; showing the full implementation here would be
tedious for the reader.
Z Curve
The implementation of the Z-curve strictly follows the algorithm given in appendix A.1.
Ray Histogram
Ray histogramming is described by algorithm 1. The implementation corresponds exactly to
the algorithm.
Radix Sort
The Radix sort algorithm already has an efficient implementation in CUDA; see, for
example, [33].
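To illustrate the selected combination, the following CPU sketch quantizes each ray direction, computes its Z-curve (Morton) code and sorts by it; std::sort stands in for the CUDA radix sort, and all names are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Quantize each ray direction into a 3D grid cell, compute its Z-curve
// (Morton) code, and sort rays by that code so that rays pointing into
// nearby cells become adjacent in the buffer.
static std::uint32_t separateBy2(std::uint32_t x) {  // 10 bits -> every 3rd bit
    x &= 0x000003ff;
    x = (x ^ (x << 16)) & 0xff0000ff;
    x = (x ^ (x << 8))  & 0x0300f00f;
    x = (x ^ (x << 4))  & 0x030c30c3;
    x = (x ^ (x << 2))  & 0x09249249;
    return x;
}

std::uint32_t morton3(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return separateBy2(x) | (separateBy2(y) << 1) | (separateBy2(z) << 2);
}

struct Ray { float dx, dy, dz; };

void sortRaysByZCurve(std::vector<Ray>& rays, std::uint32_t div) {
    auto code = [div](const Ray& r) {
        // direction cosines lie in [-1, 1]; shift into [0, div] as in the text
        auto q = [div](float c) {
            return static_cast<std::uint32_t>((c + 1.0f) * div / 2);
        };
        return morton3(q(r.dx), q(r.dy), q(r.dz));
    };
    std::sort(rays.begin(), rays.end(),
              [&code](const Ray& a, const Ray& b) { return code(a) < code(b); });
}
```

After sorting, rays that point into nearby grid cells are adjacent in the buffer, which is exactly the coherence the GPU traversal benefits from.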
5.2. Implementation of Frame Coherence
This section describes the implementation of frame coherence according to the selected
data model 4.2.3 and data structure 4.2.2.
5.2.1. Implementation of Data Model
The implementation of data model includes implementation of CacheKey and CacheNode.
Cache Key
CacheKey is a data structure which contains the parameters of a multicomponent key 3.2.3.
Interface
setIndices
device void setIndices(uint16_t instanceId, uint32_t primitiveIndex, float3 inc dir, int
hash method, int div)
Return value void
Parameters
instanceId : identifier of the loaded model
primitiveIndex : identifier of the mesh triangle
inc dir : the incident ray direction
hash method : id of the hash function
div : quantization precision
Description Sets the key members and calculates the hash value for the key data.
equals
device bool equals(Key other)
Return value boolean
Parameters
other : a key for comparison
Description Returns true if the key is equal to the key provided
calc hash1
device unsigned int calc hash1()
Return value unsigned integer
Description Calculates a hash value from the key members using a uniform random
distribution [36]
calc hash2
device unsigned int calc hash2()
Return value unsigned integer
Description Calculates a hash value from the key members using Morton codes [20]
calc hash3
device unsigned int calc hash3()
Return value unsigned integer
Description Calculates a hash value using a mix of the two hashing functions [29]
separateBy4
device unsigned int separateBy4(unsigned int x)
Return value unsigned integer
Parameters
x : a number whose binary representation is to be spread apart
Description Separates bits by 4 bit places in the binary representation of the number
provided to the method [20]
mortonCode5
device unsigned int mortonCode5(unsigned int x, unsigned int y, unsigned int z, unsigned
int u, unsigned int v)
Return value unsigned integer
Parameters
x : x coordinate
y : y coordinate
z : z coordinate
u : u coordinate
v : v coordinate
Description Constructs Morton codes by interleaving x, y, z, u, v using OR and shift
operations [20]
Interface Implementation A curious reader can find the interface implementation of
the cache key in appendix C.1.
5.2.2. Implementation of Data Structure
This subsection describes the implementation of the data structures selected in the previous
subsections.
Chained Hash Table
Configuration Parameters Two parameters are added to the configuration file. These
parameters are:
1. cache buffer size
2. cache load factor
cache buffer size defines the size of the buffer whose elements are used to build the hash
table on the device side. The buffer is created on the CPU side using the OptiX context.
This buffer is filled with elements of the CacheNode type; the initial state of these
elements is set and then the buffer is passed to the device side. The primary reason why a
buffer is used for the hash table construction is that dynamic allocation of table elements
on the device side is impossible in OptiX [38]. There are also some efficiency considerations
why using such a buffer can be beneficial for the hash table construction [34]. The
following listing shows the buffer initialization on the CPU through the context.
Listing 5.1: Nodes buffer initialization
nodeBuffer = context->createBuffer(RT_BUFFER_INPUT_OUTPUT);
nodeBuffer->setElementSize(sizeof(CacheNode));
context->setBuffer(BufferVariable::CACHE_NODE_BUFFER)(nodeBuffer);
cache load factor defines the maximum number of elements which can be expected in a bucket.
The number of buckets for the hash table is determined using the following formula:

number of bins = cache buffer size / cache load factor    (5.1)
Key Parameters CacheKey has the same parameters as described in 4.2.3. Divisions are
calculated using the following code:
Listing 5.2: Divisions
div_x = (inc_dir.x + 1) * div / 2;
div_y = (inc_dir.y + 1) * div / 2;
div_z = (inc_dir.z + 1) * div / 2;
Here, inc dir is a variable of type float3 containing the direction cosines of a ray. The
directions are first aligned to positive floating-point numbers by adding one; after that the
floats are multiplied by a large integer variable denoted by div. This variable is responsible
for the quantization precision; larger values give greater accuracy.
Cache Node It is necessary to mention that in the current configuration of the hash table,
the CacheNode does not contain a PositionKey as in 4.2.3, because the table is mostly tested
on static scenes. The CacheNode also contains some additional components:
left - integer, index of the left element in the buffer
right - integer, index of the right element in the buffer
parent - integer, index of the parent in the buffer
index - integer, index of the given node in the buffer
queue - integer, index of the next element on the path in the buffer
Construction of Interface
writeToCache
inline device void writeToCache(PerRayData prev data, PerRayData data, CacheNode*
&cachedNodeWrite)
Return value void
Parameters
prev data - data structure with results of the previous tracing (key)
data - data structure with results of the current tracing (data)
cachedNodeWrite - a variable which is used to link elements of one trace
Description Adds data to cache with a key generated from prev data
getFromCache
inline device CacheNode* getFromCache(PerRayData data)
Return value Returns a pointer to requested element, NULL if the element is not there
Parameters
data - data structure with results of the previous tracing (key)
Description Gets an element by its key generated from data
hasKey
inline device bool hasKey(PerRayData data)
Return value Returns true if the requested element is in the cache, false otherwise
Parameters
data - data structure with results of the previous tracing (key)
Description Indicates an element’s existence in the cache by its key generated from data
get bucket index
inline device uint get bucket index(int hash)
Return value Returns the buffer (bucket) index corresponding to the given hash value
Parameters
hash - hash value of an element
Description Maps a hash value to a bucket index in the node buffer
Interface Implementation
writeToCache constructs a binary tree in the bucket determined by the hash value of the
node being inserted. The method uses the lock-free synchronization paradigm [19]. Before the
loop which seeks a vacant place starts, the method gets pointers to the root node of the tree
and to the node in the buffer which has to be inserted. Then the loop starts: in one of the
two subtrees the method tries to atomicCAS the index pointing to the left or right child. If
the operation is successful, the loop terminates; otherwise root is assigned a new value,
root->left or root->right, and the operation is repeated.
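A CPU analogue of this insertion loop, with std::atomic::compare_exchange_strong playing the role of CUDA's atomicCAS on the child indices (the buffer-of-nodes layout follows the text; the names are illustrative):

```cpp
#include <atomic>
#include <vector>

// CPU analogue of the writeToCache insertion loop: each node stores buffer
// indices of its children (-1 = empty); a compare-and-swap on the child slot
// plays the role of CUDA's atomicCAS. On CAS failure we descend and retry.
struct TreeNode {
    unsigned hash;
    std::atomic<int> left{-1}, right{-1};
};

void insertNode(std::vector<TreeNode>& buf, int rootIdx, int newIdx) {
    int cur = rootIdx;
    const unsigned h = buf[newIdx].hash;
    for (;;) {
        std::atomic<int>& slot = (h < buf[cur].hash) ? buf[cur].left
                                                     : buf[cur].right;
        int expected = -1;
        if (slot.compare_exchange_strong(expected, newIdx))
            return;               // claimed the empty child slot
        cur = expected;           // slot occupied: descend into that subtree
    }
}
```

On CAS failure the thread does not block; it simply descends into the subtree that won the slot and retries, so at least one thread always makes progress.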
getFromCache accesses the elements of the tree located in the bucket obtained from the hash
code of the search element. If the element is found, it is returned, otherwise NULL is
returned. No synchronization is necessary; the only modification that is done is a counter
incrementation.
hasKey returns true if getFromCache returns a value which is not NULL.
getBucketIndex maps hash values to buffer indices. It divides the hash value by the
maximum integer, adds one, and multiplies the resulting number by the number of elements
in the element buffer divided by 2.
The implementation of the chained hash table can be found in Appendix C.2.
5.2. Implementation of Frame Coherence
Open-Addressed Hash Table
Configuration Parameters The following parameters were added to the configuration
file:
cache readers number is the number of streams which can read successively from a table
bucket without update. After the number is exceeded, the data is purged and its
content is overwritten. This parameter is a part of a synchronization mechanism.
cache residual value is a threshold for residual calculation in precision purging 3.2.3.
use cache is a parameter for turning the cache on/off
use cache residual is a parameter to turn on/off precision purging 3.2.3.
hash method is a parameter for selecting a hash function 4.2.4.
Key Parameters CacheKey has the same parameters as described in 4.2.3. According to
diagram 4.6, CacheKey also has a second key, PositionKey. Both keys are stored using
their hash values.
Cache Node A cache element has the same parameters as described in 4.2.3.
Construction of Interface
writeToCache
inline device void writeToCache(int base index, PerRayData prev data, PerRayData
data, CacheNode∗ &cachedNodeWrite, int pos hash)
Return value void
Parameters
base index : specifies an offset of buffer indices for each model, defined as num of buckets×
modelInstanceId
prev data : data structure with results of the previous tracing (key)
data : data structure with results of the current tracing (data)
cachedNodeWrite : a variable which is used to link elements of one trace
pos hash : a hash value of the position key (emitter position)
Description Adds data to cache with a key generated from prev data
getFromCache
inline device bool getFromCache(int base index, PerRayData prev data, PerRayData
&data, int pos hash)
Return value boolean
Parameters
base index : specifies an offset of buffer indices for each model, defined as num of buckets×
modelInstanceId
prev data : data structure with results of the previous tracing (key)
data : data structure which is ready to be filled with cached data
pos hash : a hash value of the position key (emitter position)
Description Fills data with cached data and returns true if prev data (key) exists. Returns
false if pos hash provided in the method call does not coincide with that of the
element residing at the address for the key.
getFromCacheRes
inline device bool getFromCacheRes(int base index, PerRayData prev data, PerRayData
&data, int pos hash)
Return value boolean
Parameters
base index : specifies an offset of buffer indices for each model, defined as num of buckets×
modelInstanceId
prev data : data structure with results of the previous tracing (key)
data : data structure which is ready to be filled with cached data
pos hash : a hash value of the position key (emitter position)
Description Fills data with cached data and returns true if prev data (key) exists. Returns
false if the hash values of the element and of the key are not equal and the residual
value exceeds the threshold specified in the configuration file 5.2.2.
makeKey
inline device Key makeKey(Key key, PerRayData prd, bool debug, int division, int
hashmethod);
Return value CacheKey
Parameters
key : to be initialized
prd : data with parameters to initialize the key
debug : print debug information
division : precision of angle quantization
hashmethod : id of hashing function
Description Receives a key in the parameter list, fills it with data from prd and returns it.
getBucketIndex
inline device uint getBucketIndex(unsigned int hash)
Return value unsigned int
Parameters
hash : a key hash value
Description Returns an index in the node buffer for the hash value specified in the
method call
getBaseIndex
inline device int getBaseIndex(int modelInstanceId)
Return value int
Parameters
modelInstanceId : id of model
Description Returns a starting index in the node buffer for the model id provided in the
method call
searchInMap
inline device int searchInMap(int modelInstanceId)
Return value int
Parameters
modelInstanceId : id of model
Description Returns an offset index for modelInstanceId provided in the method call.
calculateResidual
inline device float calculateResidual(float3 origin, float3 otherOrigin)
Return value float
Parameters
origin : the first point
otherOrigin : the second point
Description Calculates L1 distance between two points provided in the function call.
checkIntersections
inline device bool checkIntersections(const CacheNode∗ node, float R)
Return value boolean
Parameters
node : node to be checked
R : radius of antenna
Description Returns true if the coordinates of the node provided in the method call are
within a unit sphere of any of the receivers currently present in the scene. This
method checks the validity of RECEIVER HIT 3.2.3
Interface Implementation
writeToCache writes data to the cache using prev data as a key. Synchronization of both
write and read accesses is performed using an atomic locking paradigm [47]. In the write
method, after the pointer to the bucket is obtained, a stream tries to lock the node for
writing using an atomicCAS operation on writeLock. If the operation is successful, the
stream changes the data inside the node and releases the readLock, which unlocks the
object for reading. If the operation is not successful, the stream simply leaves the section
without writing since, firstly, both results cannot be stored and, secondly, it can be faster
to trace the ray than to wait on the writeLock.
getFromCache reads data from the cache using prev data provided in the method call as
the key. The method uses the same synchronization paradigm [47] as writeToCache. A
stream gets a pointer to the bucket and then tries to acquire the readLock. Every stream
which acquires the readLock checks whether the position stamp of the bucket is valid. If
it is not, the method sets readN to 0, unlocks the bucket for writing by setting writeLock
to 0 and returns false. This means that the data of the given bucket will be automatically
overwritten. If the stamp is valid, the stream first increments the total number of reads for
the given bucket, denoted by the variable readN. If this number is less than
cache readers number defined in the configuration file, the stream releases the readLock
after reading the data. Otherwise, it sets the total number of reads readN to 0 and unlocks
the bucket for writing by setting writeLock to 0.
getFromCacheRes reads data from the cache using prev data provided in the method call
as the key. The method uses the same synchronization mechanism as getFromCache.
The only difference is that the method uses a different approach for purging 3.2.3. In case
the hashes of the requested element and of the bucket are not equal, the method calculates a
residual value. If the residual value is less than the threshold denoted by cache residual value
in the configuration file, the stream reads the data and returns true. Otherwise
it returns false, which means that the data in the bucket will be overwritten.
makeKey is an auxiliary method which constructs the key from prev data.
getBucketIndex returns an index in the node buffer for the hash provided in the method
call. The implementation differs from the chained hash table in that it uses UINT MAX
instead of INT MAX, because hash values now have unsigned int type.
searchInMap is an auxiliary method which determines the offset for a particular model.
The buffer cache emitter map has an associated offset index which is multiplied by
the number of elements in the buffer divided by the number of models loaded.
calculateResidual is an auxiliary method which computes the L1 [5] distance between
the two points provided in the method call. The method is used in getFromCacheRes to
calculate the residual value.
checkIntersections checks the coordinates of the node provided in the method call against
the coordinates of the receivers currently in the scene. If the node coordinates are within
the radius R provided in the method call, the method returns true, otherwise false.
The implementation can be found in Appendix C.3.
Part IV.
Evaluation and Testing
6. Testing
6.1. System Configuration before Testing
This section briefly describes the system on which the testing was performed.
System The operating system has the following parameters:
• Operating System: Linux-x86 64
• Release: Ubuntu 12.04 (precise)
• Kernel: 3.2.0-58-generic
CPU Parameters of the CPU are
• CPUs: 4
• Model Name: Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
• Frequency: 2494.001 MHz
• L2 cache: 3072 KB
Memory Parameters of the system memory are
• Memory total: 3888 MiB
NVIDIA The graphic card has the following parameters
• Graphics Processor: GeForce GTX 680
• CUDA Cores: 1536
• Total Memory: 2048 MB
• Memory Interface: 256-bit
• NVIDIA Driver Version: 331.38
6.2. Testing of Ray Reordering
6.2.1. Approach
The task of ray reordering is to increase the efficiency of tracing. Ray reordering does
not introduce errors, so no error estimation needs to be calculated. The approach is to
test all implemented types of ray reordering over a sufficiently large range of ray
counts and to compare the tracing times with the base version without reordering.
Results are presented as a diagram of tracing times over the number of rays.
6.2.2. Tests Description
The following types of ray reordering are used during the testing:
1. without sorting (mnemonic: WS)
2. sorting of 1D buffer using a histogram and bypassing of histogram using 3D Hilbert
curve (mnemonic: H3DH)
3. sorting of 1D buffer using Morton codes (mnemonic: Z)
4. sorting of 1D buffer using a histogram and bypassing the histogram using Morton
codes (mnemonic: HZ)
5. sorting of 1D buffer using a histogram and bypassing the histogram using 3D Hilbert
curve, mapping of the resulting buffer to 2D using 2D Hilbert curve (mnemonic: H3DH2DH)
6. sorting of 1D buffer using a histogram and bypassing the histogram using 3D Hilbert
curve, mapping of the resulting buffer to 3D using 3D Hilbert curve (mnemonic:
H3DH3DH)
In the following, mnemonics are used instead of the full method names. For all types of
reordering, between 50000 and 90000000 rays are generated.
6.2.3. Results
The abscissa of figure 6.1 shows the number of rays in millions. One unit on the ordinate
corresponds to 1 second. Red corresponds to WS in 6.2.2, green to H3DH, blue to Z,
pink to HZ, turquoise to H3DH2DH, yellow to H3DH3DH.
Overall, the results show that ray reordering has a significant impact on performance.
The best times are achieved by ray reordering using H3DH (green) and HZ (pink). Tracing
without reordering (WS) is the slowest for all ray counts. Ray reordering using Z (blue)
performs slightly worse than the methods based on histogramming and space-filling
curves (both H3DH and HZ). Mapping of the initial 1D buffer to a 2D buffer using
H3DH2DH is at least no worse than Z (blue), but performance degrades sharply after
50 million rays. This behaviour has not been investigated further. Results for H3DH3DH
(3D buffer) are no worse than for H3DH2DH (2D buffer) up to approximately 16 million rays.
Figure 6.1.: Results for Ray Reordering
Benchmarking for this type of reordering cannot proceed further because the depth of the
3D Hilbert curve is related to the number of rays by the following formula:

depth = log(size) / log(8)   (6.1)

where size is the size of the ray buffer. Thus the size of the ray buffer for depth = 8 is
16777216; the next size, for depth = 9, is 134217728, which is outside the range of ray
counts.
6.3. Testing of Frame Coherence
6.3.1. Static Scene
The following section describes the approach and tests for the static scene.
Analysis of Chained Table Performance
Analysis for the static scene starts with a performance study of the chained table. This
subsection dissects the performance of this type of table with different hash functions.
Performance Analysis of Write/Read Operations Analysis of read/write operations is
vital for the overall efficiency of the hash table. The study begins with a description of the
approach and testing procedures and finishes with a review of the results.
Approach The following operations influence overall performance of the hash table:
• write to cache
• read from cache
• buffer call
• hash generation
• trace time
The approach is to measure the performance of these operations during runtime and analyse
the data using plots to identify bottlenecks and potential pitfalls.
Test Description The application runs for the static scene with use cache and use benchmarking
configured to true. The node buffer is printed to a file in the location specified by the
configuration file. The output file contains lines with the following information:
element index | trace time | time to generate hash | write call time | buffer call time | get call time | number of hits
The data is processed using scripts for generation of gnuplot images. All the data
is generated for the following configuration parameters:
• ray cnt = 50000
• cache buffer size = 50000
• load factor = 3
• division = 200000
Write VS Buffer Call VS Read Figure 6.2 shows completion times for write, buffer
call and read operations. The X coordinate shows time in seconds, the Y coordinate shows
the number of elements in the cache. The main feature of the graph is the clear separation
of write/read access times for elements corresponding to buckets and to "ordinary" nodes.
Writing data to buckets is more expensive, with a maximum of 120 ms. Most of the write
operations for the "ordinary" nodes take less than 80 ms. The same pattern repeats for the
read operations, with approximately 50 ms maximum for buckets and 40 ms for "ordinary"
elements. Buffer calls are almost negligible.
Trace VS Write VS Buffer Call VS Read Figure 6.3 compares the cache operations
to the tracing time, with the same coordinates as the previous graph. The image essentially
repeats the "step" pattern of the previous one. Almost all tracing operations do not
exceed 2 seconds. On the other hand, tracing for buckets is on average more expensive,
due to the "step" pattern.
Hits Figure 6.4 shows the number of hits for the elements of the node buffer. The X
coordinate shows the number of elements in the cache; the Y coordinate shows the number of
hits for the cache elements. The hits are uniformly distributed across all elements; no
clusters are visible in the image. It can also be seen that the cache is fragmented,
because not all 50000 nodes are present on the graph. Some nodes also have 0 hits.
Figure 6.2.: Write VS Buffer Call VS Read
Performance of Tracing Figure 6.5 compares the performance of no caching,
caching with a uniform random hash (hash1), caching with z-curve hashing (hash2) and
caching with the mixed hashing function (hash3) 4.2.4. The testing was performed for:
1. division = 200000
2. cache load factor = 3
3. cache buffer size = 50000
The figure shows that caching with all hash functions is no better than tracing without
caching over almost the whole range of ray counts. An exception is the window between
2560 and 10240 thousand rays, where caching is beneficial. After 10240 thousand rays the
performance of tracing with caching degrades again.
Conclusion Several statements can be made based on the analysis of the
images:
1. Write/Read operations have unequal times for buckets and nodes
2. Buffer appears to be fragmented
3. Tracing with caching using the chained hash table does not give performance benefits
for any hashing function
4. Not all elements of the cache are used (there are elements with 0 hits)
Figure 6.3.: Trace VS Write VS Buffer Call VS Read
Analysis of Open-Addressed Table Performance
Next, the open-addressed hash table is analysed. The benchmarking for the
static scene is done using atomic mutex synchronization.
Performance Analysis of Write/Read Operations Approach and tests description for the
analysis of read/write operations are the same as for the chained hash table.
Write VS Buffer Call VS Read Figure 6.6 shows completion times for write, buffer
call and read operations. The X coordinate shows time in seconds, the Y coordinate shows
the number of elements in the cache. It can be seen from the image that most of the write
operations take 60 ms, while most of the read operations take approximately 20 ms. Execution
times for read operations have two peaks of approximately 60 ms each. For the write
operations, 16 ms peaks fall into the same range of buffer indices as the read peaks; the
first 16 ms peak has a small 19 ms outlier. The main feature of the graph is its patchwork
pattern. This is due to the fact that the cache is divided into parts for each vehicle; more
"busy" parts have greater times for write/read operations. Overall, as with the chained
hash table, the writing operations are more expensive than reading and the buffer calls are
negligible.
Trace VS Write VS Buffer Call VS Read Figure 6.7 compares the cache operations
to the tracing time. The figure shows that parts of the buffer with more expensive
read/write operations have smaller trace times. The tracing has two minima of
approximately 1 second each and three peaks ranging from 2 to 2.5 seconds which fall into
the parts with fast cache operations. Overall, cache operations are much faster than tracing.
Figure 6.4.: Hits
Tracing for models with a more "busy" cache also has greater performance, i.e. tracing for
them is faster.
Performance of Tracing The testing is done for synchronization mechanism described in
Implementation of open-addressed hash table 5.2.2.
Approach The main concern in testing the tracing performance is measuring the throughput
of the wavetracer for different ray numbers, without cache and with cache for the
different types of hash functions.
Tests The testing is done by iteratively changing the parameters of the configuration file
and launching a new tracing. The ray number is varied in the range from 1000 to 20280000;
on each iteration the ray number is multiplied by 2. The testing is done for the following
parameters:
1. cache buffer size = 300000
2. division = 200000
3. cache readers number = 5
Results Figure 6.8 shows the results obtained for no caching, caching using the random
uniform hash function (hash1), caching using Morton codes (hash2) and caching using the
mixed function (hash3). The X coordinate indicates the number of rays in thousands, the
Y coordinate gives time in seconds. The plots show that up to approximately 50000 rays
all types of tracing have roughly the same performance. After that number the trends
diverge and caching gives an advantage over no caching. The mixed function (hash3) has
the best performance, followed closely by the random function (hash1); the cache with
Morton codes is slightly worse.
Figure 6.5.: Performance of Tracing with Chained Hash Table
Influence of Ray Reordering Figure 6.9 shows the results of benchmarking the open-
addressed table with atomic mutex synchronization. The influence of ray reordering is
important since both techniques (caching and ray sorting) will be used together. It is also
important that the atomic mutex synchronization is used here, since it shows the potential
of the caching mechanism. The large tracing time for 1000 rays without caching is caused
by the initialization time for the data structures and does not need to be considered.
After approximately 16000 rays caching begins to prevail, with a maximum advantage
of 2 seconds at 2048000 rays.
Conclusion In general, based on the figures, the following statements can be made for
the open-addressed hash table:
• On average, writing/reading operations for the open-addressed table take less time
than for the chained table.
• Writing/reading operations have a patchwork pattern.
• The benefits of caching with the open-addressed table appear at considerably lower
ray counts than for the chained table and are more stable.
Figure 6.6.: Write VS Buffer Call VS Read
6.3.2. Dynamic Scene
This subsection discusses the testing for dynamic scenes. The structure analysed
is the open-addressed hash table, which has better performance than the chained
table.
Approach
For dynamic testing, the following parameters are important:
• Temporal tracing metrics
• Accuracy of tracing with cache
• Calculation of correlation between frames
The first is obvious; the second item is necessary to estimate what error caching introduces
compared to its performance benefits. The last item is essential to assess the coherence
between frames, because high coherence corresponds to high caching benefits.
Temporal Metrics As the temporal metric, the average tracing time over 500 working
cycles of the wavetracer is chosen. The average can be calculated incrementally using
the following formula [11]:

A_{n+1} = A_n + (v_{n+1} − A_n) / (n + 1)   (6.2)

where A_n is the average obtained on the previous cycle, v_{n+1} is the tracing time for the
(n+1)th cycle and n+1 is the current number of cycles.
77
6. Testing
Figure 6.7.: Trace VS Write VS Buffer Call VS Read
Accuracy of Tracing For correct error calculation, it is necessary to solve the following
issues:
• What output parameters of the ray tracing could be taken for error calculation?
• What is the ground truth in this assessment?
• How to calculate the error algorithmically?
Points on a trace path contained in the waypoint buffer can be taken as the parameters for
error calculation. The union Waypoint contains the following members:
• WP Reflection
• WP Diffraction
• WP Miss
• WP Hit
• WP Launch
The positions of WP Reflection and WP Hit can be taken as the tracing path. A trace path
obtained without caching can be used as the ground truth. The error can be calculated
using the following approach. The trace paths represent point clouds, and these clouds can
be compared with each other in many ways. One possible assessment is to calculate the
distance between the point cloud centres; this distance provides a measure of the difference
between two clouds. It can be calculated using the following formulas [2]:
D1 = L1(cA, cB) and D2 = L2(cA, cB) (6.3)
where L1 and L2 are the respective distance functions and cA and cB are the centroids of the point clouds.
Figure 6.8.: Performance of Tracing with Open-Addressed Hash Table
Statistical Correlation between Frames The correlation between frames can be estimated
using the same data (point clouds). Every working cycle of the wavetracer produces an
output written to the waypoint buffer. The correlation is calculated for the two
sets of x coordinates in the produced outputs; similar coefficients are calculated for the y
and z coordinates. The normalized sum of these coefficients is considered the correlation
coefficient between frames. If this coefficient is high, the coherence between frames is
also high. Mathematically this can be expressed as follows:
r_x1x2 = Σ_{i=1}^{N} (x1_i − x̄1)(x2_i − x̄2) / √( Σ_{i=1}^{N} (x1_i − x̄1)² · Σ_{i=1}^{N} (x2_i − x̄2)² )

r_y1y2 = Σ_{i=1}^{N} (y1_i − ȳ1)(y2_i − ȳ2) / √( Σ_{i=1}^{N} (y1_i − ȳ1)² · Σ_{i=1}^{N} (y2_i − ȳ2)² )

r_z1z2 = Σ_{i=1}^{N} (z1_i − z̄1)(z2_i − z̄2) / √( Σ_{i=1}^{N} (z1_i − z̄1)² · Σ_{i=1}^{N} (z2_i − z̄2)² )

c_12 = √( r_x1x2² + r_y1y2² + r_z1z2² ) / √3   (6.4)
where O1 = {x1, y1, z1} and O2 = {x2, y2, z2} are two outputs of the tracing procedures,
xi, yi, zi are the coordinate sets and rx1x2, ry1y2, rz1z2 are the correlation coefficients between
two coordinate sets. c12 is the normalized length of the correlation vector [rx1x2, ry1y2, rz1z2].
Figure 6.9.: Performance of Tracing with Chained Hash Table(Ray Reordering)
Figure 6.10 shows a geometrical interpretation of the error calculation. Two arrows
indicate the point cloud centroids, the variable distance displays the current distance for the
frame, and the variable avg error shows the average distance over the preceding frames
including the current one.
Description of Benchmarking Procedure The testing for dynamic scenes has the follow-
ing goals:
• Comparison of no caching VS caching with position purging VS precision purging
• Comparison of efficiency of hashing functions
• Comparison of caching VS caching with ray reordering
no caching VS position purging VS precision purging The testing is performed for the
range between 1000 and 90000 rays. For every 2000 rays, the testing is done for the follow-
ing parameters in the configuration file:
• benchmark file name is a file name where the output data is written.
• use cache is a parameter which turns on/off the caching.
• use cache residual is a parameter which indicates usage of precision purging.
• division is a parameter which defines a quantization accuracy.
Figure 6.10.: geometrical interpretation of error calculation
• ray cnt is a parameter which defines a number of rays.
• cache residual value is constant, defines residual threshold.
The output of tracing is written to folders with names generated on the basis of ray count
and caching parameters. The benchmarking data is written to special data file in a format:
ray cnt — tracing time
hashing functions In addition to the parameters changed in the previous testing procedure,
this type of testing also varies hash method from 1 to 3, where 1 indicates uniform
random hashing, 2 Morton code hashing and 3 the mixed hashing function. Recording of the
output data is the same as in the previous type of testing.
caching with ray reordering This type of testing is performed for the whole range of
caching methods and hash functions. The ray reordering parameter is set to 1. This testing
is done to estimate how ray reordering influences the overall performance of the tracing.
Benchmarking Automation The automation is performed using the Python bindings in ADTF.
The script opens ADTF, loads a configuration and runs the benchmarking. Before the testing,
it is necessary to generate a number of configuration files with the required parameters;
the path to the folder with the files is supplied to the launch script. After each iteration, the
application is shut down to provide equal starting conditions for all types of testing. The
output data produced by the testing is also processed using scripts.
Results The following paragraphs describe the results of the testing procedures for dynamic
scenes.
OptiX error It is necessary to mention that, while testing the system for the dynamic scene,
an exception occurred when copying data from the host to the device. An interested
reader will find the exception description in the following NVIDIA thread [39]. In order
to avoid the exception, a synchronization mechanism for saving/reading data to/from the
buffer was implemented. The mechanism considerably reduces the caching
performance both for static and dynamic scenes, but it cannot be omitted, since pure
mutex synchronization does not ensure safe execution of the program. Moreover, the
developers have not answered the question of why such an error can occur. It is possible
that the error occurs when there is heavy load on the buffer. It is to be hoped that the error
will be resolved in a future version of the software.
no caching VS position purging VS precision purging Figure 6.11 shows the results of
the benchmarking for no caching (green), caching with position stamp purging and division
200000 (red), and caching with precision purging, residual value 0.0125 and division 10000
(blue).
Figure 6.11.: no caching VS position purging VS precision purging
The X coordinate indicates the number of rays, while the Y coordinate shows the average
tracing time in seconds. Caching with position stamp purging is consistently better than
tracing with no caching; the difference becomes bigger for higher ray numbers. Caching
with precision purging performs better than both no caching and caching with position
stamp purging. Simple caching gives approximately 10% performance improvement
for small and medium ray numbers, increasing to 15% for high ray counts. Caching with
precision purging shows around 30% improvement over no caching for small and medium
ray numbers, decreasing to approximately 17% for higher ray counts.
Figure 6.12 shows the average error calculated for the testing procedure. The X coordinate
shows the number of rays, the Y coordinate indicates the distance calculated in
the measuring units of the system.
Figure 6.12.: error calculated for no caching VS position purging VS precision purging
Green displays the figures for no caching, red for caching with position stamp and division
200000, and blue for caching with residual 0.093 and division 10000. The error for no
caching is calculated to serve as the ground truth reference. Theoretically this error should
be 0, but it has a small value for the first iterations which tends to become smaller with
further iterations. Caching with position stamp purging has a large error of about 8 units
on the first iteration; the reasons for this have not been investigated. For subsequent
iterations, the error does not exceed 2 units. Is the error small or big? To answer this
question it is necessary to calculate the average length of the rays for which the error is
computed; an error within 5 to 10 percent of the average ray trace would be reasonable.
In this test, the average tracing path is not calculated, so the estimate cannot be given.
The error for caching with precision purging is on average three times higher than for
caching with position stamp. The error for this caching scheme does not exceed 7 units.
Figure 6.13 shows the correlation between frames calculated for 100 frames. The correlation
coefficient is calculated using two subsequent frames. It varies from high values (almost
50%) to very low (less than 5%). On average, the coherence between frames is
approximately 25%.
hashing functions This subparagraph gives the results of testing the caching for three
hashing functions. This time the ray number ranges from 1000 to 49000 rays. Figure 6.14
shows the benchmarking times for no caching, caching with position stamp (division 200000)
and caching with residual purging (division 10000, residual 0.0125). The caching schemas
are tested for three different types of hashing functions: a random uniformly distributed
function, a Morton code hashing function, and a mixed hashing function. Blue shows
no cache, green caching with position stamp and division 200000, turquoise the same
caching schema with Morton codes, and red the same caching schema with the mixed
function. Pink corresponds to caching with residual 0.0125 and division 10000, yellow to
the same caching schema with Morton codes and white to the same schema with the mixed
function.
Figure 6.13.: correlation between frames
Residual caching with Morton codes has the worst time, even worse than no caching at
all. The second worst time is no caching. Green, caching with position stamp purging,
outperforms no caching as described in the previous paragraph. The mixed hashing
achieves almost the same result for the same caching schema. Morton codes outperform
the random uniform hashing for this schema by approximately 15% from 20 to 49 thousand
rays. Residual caching with the uniform hash function and with the mixed hashing function
has approximately the same performance, competitive with position caching with Morton
codes. However, the trend is that the latter is better for bigger ray numbers.
Figure 6.15 shows the error calculated for all types of tracing in this test. Again, the error for no caching should theoretically be 0 for all frames; it is given as a reference value to show the possible variation of tracing errors from frame to frame for the same type of caching (systematic error). Caching with position-stamp purging shows the same accuracy trend as in the previous test. Morton codes, which give better performance than the uniform function for the same caching schema, have on average two times worse accuracy. The mixed function, which gives no performance benefits, has approximately the same accuracy as the random uniform function. In the case of residual caching, Morton
6.3. Testing of Frame Coherence
Figure 6.14.: average tracing times for caching with different hashing functions
codes are better in terms of accuracy, but this variant has the worst tracing time. However, both residual caching with the random uniform hashing function and residual caching with the mixed hashing function have the worst error (with about the same accuracy). Both types offer approximately the same performance benefits.
Caching with Ray Reordering Figure 6.16 shows the results of tracing for the same types of caching as in the previous test. The only difference is that the ray reordering parameter is set to 1 in the configuration file and the test is performed for a larger range of rays, from 1000 to 61000. The first observation is that ray reordering noticeably reduces the tracing time. For 49 thousand rays, the no-cache tracing with ray reordering is 2.5 times faster than without reordering. Secondly, for the given synchronization type almost all caching techniques provide no performance benefits. Exceptions are residual caching with the uniform random hash function and with the mixed hash function, which give performance benefits up to approximately 37 thousand rays. Beyond that number, the tracing time for these types of caching begins to grow, and at 61 thousand rays it noticeably exceeds the tracing time for no caching.
Figure 6.17 shows the errors obtained for all types of caching with ray reordering. For the types of caching which provide a performance increase, the error is rather high. It hovers around 5 units for both types of caching. The other methods, with lower errors, are not
Figure 6.15.: average errors calculated for caching with different hashing functions
of interest in terms of efficiency. The error for 43 thousand rays equals 0 because of a failure in the automatic testing; this value should be disregarded.
No Reordering vs. Reordering Figure 6.18 compares the tracing times for no caching with reordering against no caching without reordering. Blue corresponds to no cache with ray reordering and green to no cache without reordering. The dimensions are the same as in the previous figures. Overall, the diagram shows a considerable reduction of time for launches with ray reordering over the whole ray range. For 1000 rays the reduction amounts to 23%, while it is almost 60% for 49000 rays. The latter figure corresponds to almost a 2.5-fold increase in efficiency. The main trend is that the reduction coefficient grows with the number of rays.
Conclusion
1. Software limitations do not allow the use of atomic mutex synchronization in dynamic scenes, which considerably reduces cache performance for both static and dynamic scenes.
2. For the synchronization type described in the implementation of the open-addressed hash table (Section 5.2.2), caching gives certain performance benefits (up to 30% of tracing time). Caching with residual is more efficient than caching with position purging, but it also produces bigger tracing errors.
3. Caching with Morton codes as a hash function increases tracing efficiency for position purging. Regarding accuracy, Morton codes give more acceptable results for position purging than for caching with residual.
4. Ray sorting considerably influences the ray tracing time, reducing it depending on the ray number. However, the benefits provided by caching are leveled out
Figure 6.16.: average tracing times for caching with different hashing functions for ray
reordering
for this type of synchronization mechanism. Sorting also slightly reduces the caching error.
Figure 6.17.: average errors calculated for caching with different hashing functions for ray
reordering
Figure 6.18.: average tracing times for no cache with ray reordering VS no cache without
reordering
Part V.
Discussion and Conclusions
7. Discussion
In the project, the following tasks have been performed and problems solved.
7.1. Ray Reordering
The task of ray reordering has been successfully solved using space-filling curves. Space-filling curves are also used for BVH construction in ray tracing; see, for example, [23]. Several different approaches to the solution were tried:
1. Construction of a ray histogram and traversal of the histogram using the 3D Hilbert curve.
2. Construction of a ray histogram and traversal of the histogram using the Z-curve.
3. Sorting the initial ray coordinates along the Z-curve.
4. Mapping the sorted ray list to a 2D buffer using the 2D Hilbert curve.
5. Mapping the sorted ray list to a 3D buffer using the 3D Hilbert curve.
Launching the tracing procedure with 2D and 3D ray buffers gives no obvious benefits. Overall, the most efficient implementation turns out to be ray reordering using the Z-curve. The curve can easily be implemented on the GPU side, and the sorting can be implemented using radix sort. The results of the ray sorting are discussed in the Conclusions chapter.
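The Z-curve variant can be sketched on the CPU as follows. This simplified version (the function names and the quantization of directions to a 10-bit grid per axis are illustrative assumptions) sorts rays by the 30-bit Morton code of their directions with `std::stable_sort`; a production version would instead use radix sort on the GPU.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

// Spread the lower 10 bits of v so that two zero bits separate
// each input bit (standard Morton-code bit expansion).
static std::uint32_t expandBits(std::uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point in the unit cube.
static std::uint32_t morton3D(float x, float y, float z)
{
    auto q = [](float c) {          // quantize to [0, 1023]
        c = c * 1024.0f;
        if (c < 0.0f) c = 0.0f;
        if (c > 1023.0f) c = 1023.0f;
        return static_cast<std::uint32_t>(c);
    };
    return (expandBits(q(x)) << 2) | (expandBits(q(y)) << 1) | expandBits(q(z));
}

// Reorder rays along the Z-curve of their directions, mapped
// from [-1, 1] into the unit cube before quantization.
void sortRaysByZCurve(std::vector<Ray>& rays)
{
    std::stable_sort(rays.begin(), rays.end(),
        [](const Ray& a, const Ray& b) {
            return morton3D((a.dx + 1) / 2, (a.dy + 1) / 2, (a.dz + 1) / 2)
                 < morton3D((b.dx + 1) / 2, (b.dy + 1) / 2, (b.dz + 1) / 2);
        });
}
```

Sorting by the interleaved code places rays with nearby directions next to each other in the buffer, which is exactly what improves memory coherence of the subsequent traversal.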
7.2. Frame Coherence
7.2.1. Caching Method
The main task solved here is the construction of a caching method for the simulation of the propagation channel in a VANET simulation. The simulation is performed while practically all ray sources dynamically change their positions from frame to frame. Thus the problems of cache construction, reuse of data from previous launches, and cache purging are solved.
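The purging idea can be sketched as follows. This is a minimal host-side illustration, not the thesis implementation: each cached result carries a hash of the source position for which it was computed, and a lookup made with a different position stamp treats the entry as stale and purges it.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative position-stamp purging: a cached trace result is
// only valid for the source position it was written for.
struct CacheEntry {
    std::uint32_t posHash;   // position stamp of the writing launch
    float         result;    // cached tracing result
};

class PositionStampCache {
    std::unordered_map<std::uint32_t, CacheEntry> table_;
public:
    void put(std::uint32_t rayHash, std::uint32_t posHash, float result) {
        table_[rayHash] = {posHash, result};
    }
    // Returns true and fills `out` on a hit with a matching stamp;
    // a stale entry (different stamp) is purged and counts as a miss.
    bool get(std::uint32_t rayHash, std::uint32_t posHash, float& out) {
        auto it = table_.find(rayHash);
        if (it == table_.end())
            return false;
        if (it->second.posHash != posHash) {
            table_.erase(it);   // purge the stale entry
            return false;
        }
        out = it->second.result;
        return true;
    }
};
```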
7.2.2. Data Structure
In the implementation part, the problem of constructing cache trees from ray hashes is solved. A study of hashing functions reveals their influence on performance. A random uniform hashing function, hashing with Morton codes, and a mixed hash function have been studied. They influence the system performance differently; the results are discussed in the Conclusions chapter.
OptiX does not allow memory allocation on the device side. This problem is solved using a buffer with elements constructed on the CPU side using the context. On the device side, device functions use this buffer to construct the cache.
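The workaround can be sketched in plain C++ (the names are illustrative, and `std::atomic` stands in here for CUDA's `atomicAdd`): the host preallocates a flat pool of nodes once, so that "allocation" in device code reduces to an atomic bump of a shared counter.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Sketch of the workaround for missing device-side allocation:
// the pool of cache nodes is allocated once on the CPU side, and
// handing out a node is just an atomic increment of a counter.
struct CacheNode { int hash = 0; bool used = false; };

class NodePool {
    std::vector<CacheNode> nodes_;        // preallocated on the host
    std::atomic<std::size_t> next_{0};    // stand-in for atomicAdd
public:
    explicit NodePool(std::size_t capacity) : nodes_(capacity) {}
    // Returns a fresh node, or nullptr when the pool is exhausted
    // (the device code then simply skips caching for that ray).
    CacheNode* allocate() {
        std::size_t i = next_.fetch_add(1);
        return i < nodes_.size() ? &nodes_[i] : nullptr;
    }
};
```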
The study of data structures and the implementation of a synchronization mechanism are an important part of the project. During the research, two data structures have been developed: a chained hash table and an open-addressed hash table. The two data structures use different synchronization mechanisms: lock-free synchronization and atomic mutex synchronization.
7.2.3. Testing
Static Scene Both data structures are tested in the static scene. The performance of their write/read operations is evaluated and a comparative analysis is carried out.
Dynamic Scene The main problem that occurred during the testing of the open-addressed hash table in dynamic scenes is that the third-party tracing engine throws an exception when tracing with a cache using mutex synchronization for a sufficiently large number of rays. A work-around has been designed: a new synchronization mechanism based on a mutex with two locks which counts the number of readers (Section 5.2.1).
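The spirit of such a mechanism can be sketched as follows. This is an illustrative host-side analogue using `std::atomic` in place of CUDA's `atomicCAS`/`atomicExch`, not the exact scheme of Section 5.2.1: one flag guards writers, while a counter tracks active readers, and each side backs off when it observes the other.

```cpp
#include <atomic>

// Illustrative two-flag lock with a reader counter: writers are
// mutually exclusive and wait out readers; readers back off while
// a writer holds the node.
class ReadersWriterLock {
    std::atomic<int> writeLock_{0};  // 1 while a writer owns the node
    std::atomic<int> readers_{0};    // number of active readers
public:
    bool tryReadLock() {
        readers_.fetch_add(1);
        if (writeLock_.load() != 0) {   // writer active: back off
            readers_.fetch_sub(1);
            return false;
        }
        return true;
    }
    void readUnlock() { readers_.fetch_sub(1); }

    bool tryWriteLock() {
        int expected = 0;
        if (!writeLock_.compare_exchange_strong(expected, 1))
            return false;               // another writer holds the lock
        if (readers_.load() != 0) {     // readers active: back off
            writeLock_.store(0);
            return false;
        }
        return true;
    }
    void writeUnlock() { writeLock_.store(0); }
};
```

Because reads only contend on a counter, several threads can read the same node simultaneously; only writes are fully serialized.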
In the testing part, an overall approach to testing and the tests themselves are designed. An automated test suite for nightly tests has been developed. An error with the reproduction of tracing paths using Python bindings in ADTF is solved. A method for the calculation of system tracing errors with caching has been developed.
The last task solved in the testing part is the evaluation of the influence of ray reordering on the system working with the ray cache.
8. Conclusions
1. In general, the overall system efficiency is considerably increased.
2. A mechanism for ray sorting on the CPU is successfully implemented. The ray reordering increases the system performance depending on the ray number. For 50000 rays in dynamic testing, the efficiency increases by a factor of 2.5. The coefficient of tracing-time reduction grows with the number of rays (efficiency increases with the ray number).
3. A method for ray caching has been successfully developed and implemented.
4. The open-addressed hash table has turned out to be a more efficient data structure than the chained hash table.
5. During the testing, it has been found that some software limitations prevent the full use of the cache capabilities. A work-around has been developed which circumvents these limitations to a certain extent. With it, caching increases the system efficiency by up to 30%, depending on the hashing function.
6. Under the joint action of ray sorting and caching, the former prevails, and the caching does not increase the system efficiency while introducing a tracing error.
7. Thus, for successful use of the caching, it is necessary to overcome the limitations imposed by the third-party software system. If these limitations could be fully circumvented, it would be possible to weaken the read-access synchronization and allow multiple threads to read at the same time. This would considerably increase the overall cache performance, making it competitive with reordering.
Appendix
A. Space-Filling Curves
A.1. Morton Codes Generator
Listing A.1: Morton codes generator
unsigned int expandBits(unsigned int v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

unsigned int morton3D(float x, float y, float z) {
    x = min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = min(std::max(z * 1024.0f, 0.0f), 1023.0f);
    unsigned int xx = expandBits((unsigned int) x);
    unsigned int yy = expandBits((unsigned int) y);
    unsigned int zz = expandBits((unsigned int) z);
    return xx * 4 + yy * 2 + zz;
}
A.2. 2D Hilbert Curve Implementation
The 2D Hilbert curve is implemented using an algorithm in which the turtle turns at most once after each step [21].
Listing A.2: 2D Hilbert Curve Implementation
/* ------------------------- hilbert2D.h ------------------------- */

#ifndef HILBERT2D_H
#define HILBERT2D_H

/**
 * Maps a 1D buffer to a 2D buffer using the Hilbert curve and
 * turtle graphics.
 */
class Hilbert2D {
private:
    // resulting buffer
    optix::float3** ray_buffer2D;
    // initial buffer
    std::vector<Element> ray_buffer;
    // curve depth
    int depth;
    // size of one dimension of the 2D buffer
    int size;
    // current coordinates of the turtle
    int x, y;
    // index into the 1D buffer
    int miles;
    // variable used to calculate the turtle direction
    int turtle;

public:
    /**
     * Constructor:
     * param:
     *   ray_buffer: initial ray buffer
     *   size: size of the buffer
     */
    Hilbert2D(std::vector<Element> ray_buffer, int size);

    /**
     * Makes one step in the direction of the turtle heading.
     */
    void step();

    /**
     * Turns left.
     */
    void turn_left();

    /**
     * Turns right.
     */
    void turn_right();

    /**
     * Grammar:
     *
     *   H1 <- H2 H1 H5 H3
     *   H2 <- H1 H6 H3 H5
     *   H3 <- H1 H6 H3 H4
     *   H4 <- H6 H1 H5 H3
     *   H5 <- H6 H1 H5 H2
     *   H6 <- H4 H6 H3 H5
     */
    void H1(int depth);
    void H2(int depth);
    void H3(int depth);
    void H4(int depth);
    void H5(int depth);
    void H6(int depth);

    virtual ~Hilbert2D();

    int getDepth() const {
        return depth;
    }

    optix::float3** getRayBuffer2D() {
        return ray_buffer2D;
    }

    int getX() const {
        return x;
    }

    int getY() const {
        return y;
    }

    int getSize() const {
        return size;
    }

    /**
     * Calculates the curve depth from the buffer size.
     * param:
     *   size: buffer size
     */
    static double calc_depth(int size) {
        return log(size) / log(4);
    }

    /**
     * Calculates the dimension of the 2D buffer.
     * param:
     *   depth: curve depth
     */
    static double calc_size(int depth) {
        return pow(2, depth);
    }
};

#endif /* HILBERT2D_H */
/* ------------------------ hilbert2D.cpp ------------------------ */

/* ------------------------- turtle step ------------------------- */
void Hilbert2D::step() {
    // increment the buffer index to get a new element from the
    // initial buffer
    miles++;
    // depending on the turtle orientation we increment/decrement x or y;
    // x and y are in fact indices into the resulting buffer
    switch (turtle) {
    case 0: {
        y++;
        break;
    }
    case 1: {
        x++;
        break;
    }
    case 2: {
        y--;
        break;
    }
    case 3: {
        x--;
        break;
    }
    default:
        break;
    }
    // write the ray direction from the initial to the resulting buffer
    ray_buffer2D[x][y].x = ray_buffer[miles].v.x;
    ray_buffer2D[x][y].y = ray_buffer[miles].v.y;
    ray_buffer2D[x][y].z = ray_buffer[miles].v.z;
}

/* -------------------------- turn left -------------------------- */
void Hilbert2D::turn_left() {
    turtle = (turtle - 1 + 4) % 4;
}

/* ------------------------- turn right -------------------------- */
void Hilbert2D::turn_right() {
    turtle = (turtle + 1) % 4;
}

/* ------------------------- H1 (right) -------------------------- */
void Hilbert2D::H1(int depth) {
    if (depth >= 0)
    {
        depth--;
        H2(depth);
        step();
        H1(depth);
        step();
        H5(depth);
        step();
        H3(depth);
    } else
    {
        turn_right();
    }
}

/* ----------------------------- H2 ------------------------------ */
void Hilbert2D::H2(int depth) {
    if (depth >= 0)
    {
        depth--;
        H1(depth);
        step();
        H6(depth);
        step();
        H3(depth);
        step();
        H5(depth);
    }
}

/* -------------------------- H3 (left) -------------------------- */
void Hilbert2D::H3(int depth) {
    if (depth >= 0)
    {
        depth--;
        H1(depth);
        step();
        H6(depth);
        step();
        H3(depth);
        step();
        H4(depth);
    } else
    {
        turn_left();
    }
}

/* ----------------------------- H4 ------------------------------ */
void Hilbert2D::H4(int depth) {
    if (depth >= 0)
    {
        depth--;
        H6(depth);
        step();
        H1(depth);
        step();
        H5(depth);
        step();
        H3(depth);
    }
}

/* ------------------------- H5 (right) -------------------------- */
void Hilbert2D::H5(int depth) {
    if (depth >= 0)
    {
        depth--;
        H6(depth);
        step();
        H1(depth);
        step();
        H5(depth);
        step();
        H2(depth);
    } else
    {
        turn_right();
    }
}

/* -------------------------- H6 (left) -------------------------- */
void Hilbert2D::H6(int depth) {
    if (depth >= 0)
    {
        depth--;
        H4(depth);
        step();
        H6(depth);
        step();
        H3(depth);
        step();
        H5(depth);
    } else
    {
        turn_left();
    }
}
A.3. 3D Hilbert Curve Grammar
For the turtle's orientation in space, the same symbols are used as described earlier.
A → B + F − C + −FA − F + D − /F ∧ E&/F − A + +F − −F + + + F + G−
B → A&F ∧ C& ∧ FB ∧ F&D ∧ F − F + F ∧ B&&F ∧ ∧ E&&&F&N ∧
C → B + F − A + /F ∧ A&F + D − +FD + F − H + /F&H ∧ /F + P−
D → N ∧ F&G ∧ /F + G − /F ∧ C& ∧ FC ∧ F&M ∧ F − M + F ∧ O&
E → O&F ∧ P&F + E − F&N ∧ ∧ F − −O + + ∧ F ∧ E&/F − B + /F&N ∧
F → M + F − H + /F&F ∧ /F + G − −F + +M − − − F − F + F ∧ A&F + G−
G → ∧F&D ∧ &FG&F ∧ A&F + N − F&G ∧ ∧ F + +F − − ∧ F ∧ A&
H → M+; F − F + +F + +H − − + F + E − F&N ∧ F − C + −FC − F + P−
I → R + F − S + −FI − F + T − F&L ∧ F − I + +F + +U − − + F + V −
J → V − F + T − +FT + F − Z + F ∧ U&F + J − −F + +L − − − F − R+
K → J&F ∧ V &F + K − F&W ∧ &FJ&F ∧ K&/F − Y + /F&W ∧
L → W ∧ F&X ∧ /F + L − /F ∧ J&&F − −W + +&F&L ∧ F − R + F ∧ J&;
M → H ∧ F&F ∧ ∧ F − −M + + ∧ F ∧ O&/F − H + /F&M ∧ &FD&F ∧ O&
N → NG − F + D − +FN + F − B + /F&G ∧ /F + N − −F − −E + + − F − B+
O → P − F + E − −F − −O + + − F − M + F ∧ P&F + O − +FD + F − M+
P → O&F ∧ E&&F − −P + +&F&H ∧ /F + O − /F ∧ P& ∧ FC ∧ F&H ∧
R → I ∧ F&S ∧ &FR&F ∧ T&/F − U + /F&R ∧ ∧ F − −L + + ∧ F ∧ J&
S → I ∧ F&R ∧ F − R + F ∧ K& ∧ FV ∧ F&S ∧ /F + X − /F ∧ Z&
T → J&F ∧ V &F + T − F&X ∧ &FR&F ∧ T&/F − Y + /F&W ∧
U → Y + F − Z + F ∧ U&F + V − −F − −Y + +F − U + /F&I ∧ /F + V −
V → J&F ∧ K& ∧ FV ∧ F&S ∧ /F + L − /F ∧ V &&F − −U + +&F&I ∧
W → X − F + L − −F − −W + + − F − U + /F&I ∧ /F + K − +FK + F − Y +
X → W ∧ F&L ∧ ∧ F − −X + + ∧ F ∧ U&F + K − F&X ∧ &FS&F ∧ Z&
Y → Z&F ∧ U&&F + +Y − −&F&L ∧ F − S + F ∧ Y & ∧ FK ∧ F&W ∧
Z → Y + F − U + +F − −Z + + + F + X − /F ∧ Y &/F − Z + −FS − F + X−
B. A flow diagram for the main tracing loop
This appendix shows a flow diagram for the main tracing loop.
Figure B.1.: Ray tracing with cache
C. Implementation of Hash Tables
C.1. Implementation of Cache Key
Listing C.1: Implementation of cache key
/* ------------------------- setIndices -------------------------- */
__device__ void setIndices(uint16_t instanceId,
                           uint32_t primitiveIndex,
                           float3 inc_dir,
                           int hash_method,
                           int div = 200000)
{
    this->instanceId = instanceId;
    this->primitiveIndex = primitiveIndex;
    this->div_x = (inc_dir.x + 1) * div / 2;
    this->div_y = (inc_dir.y + 1) * div / 2;
    this->div_z = (inc_dir.z + 1) * div / 2;
    switch (hash_method)
    {
        case 1: this->hash = calc_hash1(); break;
        case 2: this->hash = calc_hash2(); break;
        case 3: this->hash = calc_hash3(); break;
        default: this->hash = calc_hash1(); break;
    }
}

/* --------------------------- equals ---------------------------- */
__device__ bool equals(CacheKey other)
{
    if (this->primitiveIndex != other.primitiveIndex)
        return false;
    if (this->instanceId != other.instanceId)
        return false;
    if (this->div_x != other.div_x)
        return false;
    if (this->div_y != other.div_y)
        return false;
    if (this->div_z != other.div_z)
        return false;
    return true;
}

/* ------------------------- calc_hash1 -------------------------- */
__device__ unsigned int calc_hash1() {
    int x[5] = { primitiveIndex, instanceId, div_x, div_y, div_z };
    long p = (1L << 32) - 5;
    long z = 0x64b6055aL;
    int z2 = 0x5067d19d;
    long s = 0;
    long zi = 1;
    for (int i = 0; i < 5; ++i) {
        long xi = (x[i] * z2) >> 1;
        s = (s + zi * xi) % p;
        zi = (zi * z) % p;
    }
    s = (s + zi * (p - 1)) % p;
    end_trace = clock();
    hash_gen = (float)(end_trace - start_trace) / CLOCKS_PER_SEC;
    return (unsigned int) s;
}

/* ------------------------- calc_hash2 -------------------------- */
__device__ unsigned int calc_hash2()
{
    return mortonCode5(div_x, div_y, div_z, instanceId,
                       primitiveIndex);
}

/* ------------------------- separateBy4 ------------------------- */
__device__ unsigned int separateBy4(unsigned int x)
{
    x &= 0x0000007f;
    x = (x ^ (x << 16)) & 0x0070000F;
    x = (x ^ (x << 8))  & 0x40300C03;
    x = (x ^ (x << 4))  & 0x42108421;
    return x;
}

/* ------------------------- mortonCode5 ------------------------- */
__device__ unsigned int mortonCode5(unsigned int x, unsigned int y,
                                    unsigned int z, unsigned int u,
                                    unsigned int v)
{
    return separateBy4(x)
         | (separateBy4(y) << 1)
         | (separateBy4(z) << 2)
         | (separateBy4(u) << 3)
         | (separateBy4(v) << 4);
}

/* ------------------------- calc_hash3 -------------------------- */
__device__ unsigned int calc_hash3()
{
    return calc_hash1() + calc_hash2();
}
C.2. Implementation of the Chained Hash Table
Listing C.2: Implementation of chained hash table interface
/* ------------------------ writeToCache ------------------------- */
inline __device__ void writeToCache(PerRayData prev_data,
                                    PerRayData data,
                                    CacheNode* &cachedNodeWrite)
{
    /* ------------------------ get bucket ----------------------- */
    int buf_s = node_buffer.size();
    Key key;
    key = makeKey(key, prev_data, false);
    int bucket_ind = get_bucket_index(key.hash);
    CacheNode* root = &node_buffer[bucket_ind];
    CacheNode* node = NULL;

    /* ----------------------- insert data ----------------------- */
    if (!root->used)
    {
        root->hash = key.hash;
        root->data = data;
        root->used = true;
        node = root;
        atomicAdd(&(root->counter), 1);
    } else
    {
        int counter = atomicAdd(&(root->counter), 1);
        int node_ind = bucket_ind + counter * num_of_buckets;
        if (node_ind >= buf_s)
            return;
        node = &node_buffer[node_ind];
        node->hash = key.hash;
        node->data = copyData(data, node->data);
        node->used = true;

        /* --------------- search for vacant position ------------ */
        while (true)
        {
            if (node->hash <= root->hash)
            {
                if (atomicCAS(&(root->left), -1, node_ind) == -1)
                {
                    atomicCAS(&(node->parent), -1, root->index);
                    return;
                } else
                    root = &node_buffer[root->left];
            } else
            {
                if (atomicCAS(&(root->right), -1, node_ind) == -1)
                {
                    atomicCAS(&(node->parent), -1, root->index);
                    return;
                }
                else
                    root = &node_buffer[root->right];
            }
        }
    }

    /* --------------- link elements of one path ----------------- */
    if (cachedNodeWrite != NULL)
    {
        cachedNodeWrite->queue = node->index;
    }
    cachedNodeWrite = node;
}

/* ------------------------ getFromCache ------------------------- */
inline __device__ CacheNode* getFromCache(PerRayData data, bool benchmark)
{
    /* ------------------------ get bucket ----------------------- */
    int buf_s = node_buffer.size();
    Key key;
    key = makeKey(key, data, false);
    int bucket_ind = get_bucket_index(key.hash);
    CacheNode* root = &node_buffer[bucket_ind];
    if (!root->used)
        return NULL;

    /* ------------ search for element with equal key ------------ */
    while (true) {
        if (root->hash == key.hash)
        {
            root->hit++;
            return root;
        }
        else if (key.hash <= root->hash)
            if (root->left == -1)
                return NULL;
            else
                root = &node_buffer[root->left];
        else
            if (root->right == -1)
                return NULL;
            else
                root = &node_buffer[root->right];
    }
}

/* --------------------------- hasKey ---------------------------- */
inline __device__ bool hasKey(PerRayData data)
{
    CacheNode* node = getFromCache(data, false);
    if (node == 0)
        return false;
    return true;
}
C.3. Implementation of the Open-Addressed Hash Table
Listing C.3: Implementation of open-addressed table interface
/ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
112
C.3. Implementation of the Open-Addressed Hash Table
 * -------------------- writeToCache -------------------------- *
 *                                                              *
 ****************************************************************/
inline __device__ void writeToCache( int base_index,
                                     PerRayData prev_data,
                                     PerRayData data,
                                     CacheNode* &cachedNodeWrite,
                                     unsigned int pos_hash )
{
    if ( cache_init )
    {
        /****************************************************************
         * ------------------ get bucket from table ------------------- *
         ****************************************************************/
        CacheKey key;
        key = makeKey( key, prev_data, false,
                       cache_division, hash_method );
        int iind = getBucketIndex( key.hash );
        int bucket_ind = base_index + iind;
        if ( bucket_ind < node_buffer.size() )
        {
            CacheNode* node = &node_buffer[bucket_ind];
            /****************************************************************
             * ------------ acquire write lock and write data ------------- *
             ****************************************************************/
            if ( atomicCAS( &(node->writeLock), 0, 1 ) == 0 )
            {
                node->hash          = key.hash;
                node->nextOrigin    = prev_data.nextOrigin;
                node->nextDirection = prev_data.nextDirection;
                node->data          = data;
                node->used          = true;
                node->traceTime     = trace_time;
                node->hashGen       = key.hash_gen;
                node->pos_hash      = pos_hash;
                node->timestamp     = time;
                /****************************************************************
                 * -------------------- release read lock --------------------- *
                 ****************************************************************/
                atomicExch( &(node->readLock), 0 );
            }
        }
    }
}
/****************************************************************
 * ---------------------- getFromCache ------------------------ *
 ****************************************************************/
inline __device__ bool getFromCache( int base_index,
                                     PerRayData prev_data,
                                     PerRayData &data,
                                     unsigned int pos_hash )
{
    if ( cache_init )
    {
        /****************************************************************
         * ------------------- get bucket by hash --------------------- *
         ****************************************************************/
        CacheKey key;
        key = makeKey( key, prev_data, false,
                       cache_division, hash_method );
        int iind = getBucketIndex( key.hash );
        int bucket_ind = base_index + iind;
        if ( bucket_ind < node_buffer.size() )
        {
            CacheNode* node = &node_buffer[bucket_ind];
            /****************************************************************
             * ----------------- acquire the read lock -------------------- *
             ****************************************************************/
            if ( atomicCAS( &(node->readLock), 0, 1 ) == 0 )
            {
                /****************************************************************
                 * ------- purge if node has a different position stamp ------- *
                 ****************************************************************/
                if ( node->pos_hash != pos_hash )
                {
                    atomicExch( &(node->readN), 0 );
                    atomicExch( &(node->writeLock), 0 );
                    return false;
                }
                /****************************************************************
                 * ------------- synchronization of read access --------------- *
                 ****************************************************************/
                int i = atomicInc( &(node->readN),
                                   cache_readers_number );
                node->hit++;
                data = node->data;
                if ( i < cache_readers_number - 1 )
                    atomicExch( &(node->readLock), 0 );
                else
                {
                    atomicExch( &(node->readN), 0 );
                    atomicExch( &(node->writeLock), 0 );
                }
                return true;
            }
        }
    }
    return false;
}
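The read-side protocol above is easiest to see in isolation: every admitted reader bumps `readN`, all but the last admitted reader re-release `readLock`, and only the last one resets the counter and releases `writeLock` so a writer may refresh the entry. The following is a minimal host-side sketch with `std::atomic`; the `Node` type and the `MAX_READERS` constant are assumptions standing in for `CacheNode` and `cache_readers_number`, and `fetch_add` stands in for CUDA's wrapping `atomicInc`, which behaves the same here because the counter is reset before it can reach its bound.

```cpp
#include <atomic>

// Hypothetical CPU model of the reader-counting protocol in getFromCache.
struct Node {
    std::atomic<int>      readLock{0};
    std::atomic<int>      writeLock{1};   // 1 while the entry is readable
    std::atomic<unsigned> readN{0};
};

const unsigned MAX_READERS = 4;  // stands in for cache_readers_number

// Returns true if this reader was admitted to the entry.
bool readEntry(Node& node)
{
    int expected = 0;
    if (!node.readLock.compare_exchange_strong(expected, 1))  // atomicCAS
        return false;                     // gate held by another reader

    unsigned i = node.readN.fetch_add(1); // atomicInc in the listing
    if (i < MAX_READERS - 1) {
        node.readLock.store(0);           // admit the next reader
    } else {
        node.readN.store(0);              // last reader: reset the counter...
        node.writeLock.store(0);          // ...and re-open the entry to writers
    }
    return true;
}
```

Note that after the last admitted reader, `readLock` stays held until a writer refreshes the entry, which is why `writeToCache` above releases `readLock` rather than `writeLock`.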
/****************************************************************
 * --------------------- getFromCacheRes ---------------------- *
 ****************************************************************/
inline __device__ bool getFromCacheRes( int base_index,
                                        PerRayData prev_data,
                                        PerRayData &data,
                                        unsigned int pos_hash )
{
    if ( cache_init )
    {
        /****************************************************************
         * ------------------- get bucket by hash --------------------- *
         ****************************************************************/
        CacheKey key;
        key = makeKey( key, prev_data, false,
                       cache_division, hash_method );
        int iind = getBucketIndex( key.hash );
        int bucket_ind = base_index + iind;
        CacheNode* node = &node_buffer[bucket_ind];
        /****************************************************************
         * ----------------- acquire the read lock -------------------- *
         ****************************************************************/
        if ( atomicCAS( &(node->readLock), 0, 1 ) == 0 )
        {
            /****************************************************************
             * ------- if hashes are not equal, calculate residual -------- *
             ****************************************************************/
            if ( node->hash != key.hash )
            {
                float res1 =
                    calculateResidual( node->data.nextDirection,
                                       prev_data.nextDirection );
                float res2 =
                    calculateResidual( node->data.nextOrigin,
                                       prev_data.nextOrigin );
                float res = res1 + res2;
                /****************************************************************
                 * ------ if the residual exceeds the threshold, purge -------- *
                 ****************************************************************/
                if ( res > cache_residual_value )
                {
                    atomicExch( &(node->readN), 0 );
                    atomicExch( &(node->writeLock), 0 );
                    return false;
                }
            }
            /****************************************************************
             * ------------- synchronization of read access --------------- *
             ****************************************************************/
            int i = atomicInc( &(node->readN), cache_readers_number );
            node->hit++;
            data = node->data;
            if ( i < cache_readers_number - 1 )
                atomicExch( &(node->readLock), 0 );
            else
            {
                atomicExch( &(node->readN), 0 );
                atomicExch( &(node->writeLock), 0 );
            }
            return true;
        }
    }
    return false;
}
/****************************************************************
 * -------------------------- makeKey ------------------------- *
 ****************************************************************/
inline __device__ CacheKey makeKey( CacheKey key,
                                    PerRayData prev_data,
                                    bool debug,
                                    int division,
                                    int hash_method )
{
    key.primitiveIndex = 0;
    key.instanceId = 0;
    key.div_x = 0;
    key.div_y = 0;
    key.div_z = 0;
    key.hash = 0;
    key.hash_gen = 0;
    key.setIndices( prev_data.instanceId, prev_data.primitiveIndex,
                    prev_data.nextDirection, hash_method, division );
    if ( debug ) {
        rtPrintf( "primitive index %d\n", key.primitiveIndex );
        rtPrintf( "instance id %d\n", key.instanceId );
        rtPrintf( "div_x %d\n", key.div_x );
        rtPrintf( "div_y %d\n", key.div_y );
        rtPrintf( "div_z %d\n", key.div_z );
    }
    return key;
}
/****************************************************************
 * ---------------------- getBucketIndex ---------------------- *
 ****************************************************************/
inline __device__ uint getBucketIndex( unsigned int hash )
{
    return ((float)hash / UINT_MAX) * (num_of_buckets - 1) / 2;
}
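`getBucketIndex` scales the 32-bit hash linearly into the lower half of the bucket range, i.e. indices 0 through (num_of_buckets − 1)/2, presumably leaving the upper half free for open-addressing probes. A hypothetical host-side equivalent (on the device `num_of_buckets` is a context variable; here it is passed explicitly):

```cpp
#include <climits>

// Hypothetical host-side equivalent of getBucketIndex.
unsigned getBucketIndex(unsigned hash, unsigned num_of_buckets)
{
    // Scale the 32-bit hash linearly into [0, (num_of_buckets - 1) / 2].
    return (unsigned)(((float)hash / UINT_MAX) * (num_of_buckets - 1) / 2);
}
```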
/****************************************************************
 * ------------------------ searchInMap ----------------------- *
 ****************************************************************/
inline __device__ int searchInMap( int modelInstanceId )
{
    int mapSize = cache_emitter_map.size();
    for ( int i = 0; i < mapSize; ++i )
    {
        EmitterMapEntry* entry = &cache_emitter_map[i];
        if ( entry->modelInstanceId == modelInstanceId )
            return entry->base_index;
    }
    return -1;   /* emitter not found */
}
/****************************************************************
 * --------------------- calculateResidual -------------------- *
 ****************************************************************/
inline __device__ float calculateResidual( float3 origin,
                                           float3 otherOrigin )
{
    float res = fabsf( origin.x - otherOrigin.x ) +
                fabsf( origin.y - otherOrigin.y ) +
                fabsf( origin.z - otherOrigin.z );
    return res;
}
/****************************************************************
 * -------------------- checkIntersections -------------------- *
 ****************************************************************/
inline __device__
bool checkIntersections( const CacheNode* node, float R )
{
    const PerRayData* data = &(node->data);
    for ( int i = 0; i < antenna_buffer.size(); ++i )
    {
        const AntennaBufferEntry* antenna = &antenna_buffer[i];
        if ( data->type == RECEIVER_HIT )
        {
            float3 recPos = antenna->position;
            float3 pos = data->nextOrigin;
            float dx = fabsf( recPos.x - pos.x );
            float dy = fabsf( recPos.y - pos.y );
            float dz = fabsf( recPos.z - pos.z );
            if ( dx < R && dy < R && dz < R )
                return true;
        }
    }
    return false;
}
Bibliography
[1] User's manual, EB Assist ADTF 2.9.0. Elektrobit Automotive GmbH.
[2] Luis A. Alexandre. Set distance functions for 3d object recognition. Progress in Pattern
Recognition, Image Analysis, Computer Vision, and Applications, pages 57–64, 2013.
[3] Michael Bader. Space-Filling Curves. Springer Berlin Heidelberg, 2013.
[4] James Balfour. Cuda threads and atomics, 25 April 2011. NVIDIA Research.
[5] Margherita Barile. Taxicab metric. MathWorld–A Wolfram Web Resource, 2014.
[6] Mate Boban, Joao Barros, and Ozan K. Tonguz. Geometry-based vehicle-to-vehicle
channel modeling for large-scale simulation. IEEE Transactions on Vehicular Technology.
[7] Ken Chan, Rynson W.H. Lau, and Jianmin Zhao. Dynamic sound rendering based on
ray-caching.
[8] CAR 2 CAR Communication Consortium. Car 2 car communication consortium man-
ifesto.
[9] Kurt Debattista, Piotr Dubla, Francesco Banterle, Luis Paulo Santos, and Alan
Chalmers. Instant caching for interactive global illumination.
[10] F. Escarieu, V. Degardin, and L. Aveneau. 3d modelling of the propagation in an
indoor environment : a theoretical and experimental approach. Proceedings of the Eu-
ropean Conference on Wireless Technologie, 2001.
[11] Tony Finch. Incremental calculation of weighted mean and variance, 2009. University
of Cambridge Computing Service.
[12] M. Fiore, J. Harri, F. Filali, and C. Bonnet. Vehicular mobility simulation for vanets.
Simulation Symposium, 2007. ANSS ’07. 40th Annual.
[13] Tim Foley and Jeremy Sugerman. Kd-tree acceleration structures for a gpu raytracer.
In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hard-
ware, 2005.
[14] Kirill Garanzha and Charles Loop. Fast ray sorting and breadth-first packet traversal
for gpu ray tracing. EUROGRAPHICS, 29, 2010.
[15] T. Gaugel, L. Reichardt, J. Mittag, T. Zwick, and H. Hartenstein. Accurate simulation
of wireless vehicular networks based on ray tracing and physical layer simulation.
Transactions of the High Performance Computing Center, Stuttgart (HLRS) 2011, pages
619–630.
[16] J. Gunther, S. Popov, and P. Slusallek. Realtime ray tracing on gpu with bvh-based
packet traversal. Interactive Ray Tracing, 2007. RT ’07. IEEE Symposium on, pages 113 –
118, 2007.
[17] M. Harris, D. Luebke, I. Buck, N. Govindaraju, J. Krüger, A. Lefohn, T. Purcell,
and J. Wooley. Gpgpu: General-purpose computation on graphics processing units.
SIGGRAPH 2004 Course Notes.
[18] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming, 1st Edition. Morgan
Kaufmann, 2008.
[19] Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Lan-
guages and Systems, Volume 13 Issue 1:124–149, 1991.
[20] Asger Hoedt. Morton codes, 2014.
[21] Denis Jarema. Grammars for space-filling curves. WWW. Solution for Worksheet 11.
Algorithms of Scientific Computing - Summer 2013. Technical University Munich.
[22] Edmund Wright John Daintith. A Dictionary of Computing (6 ed.). Oxford University
Press, 2008.
[23] Tero Karras. Maximizing parallelism in the construction of bvhs, octrees, and k-d
trees. High Performance Graphics, 2012.
[24] Tero Karras. Thinking parallel, part iii: Tree construction on the gpu, December 2012.
[25] Donald Knuth. The Art of Computer Programming. Addison-Wesley, 1995.
[26] Jonathan Ledy, Herve Boeglen, Anne-Marie Poussard, Benoit Hilt, and Rodolphe
Vauzelle. A semi-deterministic channel model for vanets simulations. International
Journal of Vehicular Technology, Volume 2012.
[27] A. E. Lefohn, J. M. Kniss, C. D. Hansen, and R. T. Whitaker. Interactive deformation
and visualization of level set surfaces using graphics hardware. IEEE Visualization,
pages 75–82, 2003.
[28] Aaron E. Lefohn, Shubhabrata Sengupta, Joe Kniss, Robert Strzodka, and John D.
Owens. Generic, efficient, random-access gpu data structures. ACM Transactions on
Graphics, 25 Issue 1:60–99, January 2006.
[29] Kyle Loudon. Mastering Algorithms with C. O’Reilly Media, 1999.
[30] R. Mantiuk, K. J. Kim, A. G. Rempel, and W. Heidrich. Hdr-vdp-2: a calibrated vi-
sual metric for visibility and quality predictions in all luminance conditions. ACM
Transactions on Graphics, 30, 4, 2011.
[31] Matt Pharr and Randima Fernando, editors. GPU Gems 2: Programming Techniques For High-
Performance Graphics And General-Purpose Computation. Pearson Addison Wesley Prof,
2005.
[32] David Meko. Applied time series analysis. Notes for lessons.
[33] Duane Merrill and Andrew Grimshaw. High performance and scalable radix sort-
ing: A case study of implementing dynamic parallelism for GPU computing. Parallel
Processing Letters, 21(02):245–272, 2011.
[34] Prabhakar Misra and Mainak Chaudhuri. Performance evaluation of concurrent lock-
free data structures on gpus. Parallel and Distributed Systems (ICPADS), pages 53 – 60,
2012.
[35] Bochang Moon, Byun Yongyoung, Kim Tae-Joon, Claudio Pio, Kim Hye-sun, Ban
Yun-ji, Nam Seung Woo, and Yoon Sung-eui. Cache-oblivious ray reordering. ACM
Transactions on Graphics, 29, 2010.
[36] Pat Morin. Open Data Structures (in C++).
[37] Nvidia. Cuda c programming guide, October 2012.
[38] Nvidia. Optix ray tracing engine. programming guide., November 2012.
[39] NVIDIA. Ray cache, 2014.
[40] Lars Nyland and Stephen Jones. Understanding and using atomic memory opera-
tions. In GPU Technology Conference. NVIDIA, 2013.
[41] OpenSceneGraph. The openscenegraph project website, 2014.
[42] Stefan Popov, Iliyan Georgiev, Philipp Slusallek, and Carsten Dachsbacher. Adaptive
quantization visibility caching. EUROGRAPHICS 2013, Volume 32, 2013.
[43] Przemyslaw Prusinkiewicz and Aristid Lindenmayer. The Algorithmic Beauty of Plants.
Springer-Verlag, New York, 1996.
[44] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon map-
ping on programmable graphics hardware. Proceedings of the SIGGRAPH/Eurographics
Workshop on Graphics Hardware, pages 41–50, 2003.
[45] T. J. Purcell, I. Buck, W. R. Mark, and P. Hanrahan. Ray tracing on programmable
graphics hardware. ACM Transactions on Graphics (Proceedings of SIGGRAPH),
21(3):703–712, 2002.
[46] Christian F. Ruff, Esteban W. G. Clua, and Leandro A. F. Fernandes. Dynamic per
object ray caching textures for real-time ray tracing. Graphics, Patterns and Images
(SIBGRAPI), 2013, pages 258 – 265, 2013.
[47] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction To General-
Purpose GPU Programming. Addison-Wesley, 2010.
[48] Daniel Scherzer, Lei Yang, and Oliver Mattausch. Exploiting temporal coherence in
real-time rendering. Proceedings of ACM SIGGRAPH, 2010.
[49] Robert Sedgewick. Algorithms (1st ed.). Addison-Wesley, 1983.
[50] Parag Tole, Fabio Pellacini, Bruce Walter, and Donald P. Greenberg. Interactive global
illumination in dynamic scenes.
[51] Graham Upton and Ian Cook, editors. A Dictionary of Statistics. Oxford University
Press, 2008.
[52] I. Wald, C. Benthin, and P. Slusallek. Interactive global illumination in complex and
highly occluded environments. Proceedings of the 14th Eurographics Workshop on Ren-
dering, 2003.
[53] B. Walter, S. Fernandez, A. Arbree, K. Bala, M. Donikian, and D. P. Greenberg.
Lightcuts: a scalable approach to illumination. ACM Transactions on Graphics, 24,
3:1098–1107, 2005.
[54] G. J. Ward, F. M. Rubinstein, and R. D. Clear. A ray tracing solution for diffuse inter-
reflection. SIGGRAPH, pages 85–92, 1988.
[55] Daniel Weber, Jan Bender, Markus Schnoes, Andre Stork, and Dieter Fellner. Efficient
gpu data structures and methods to solve sparse linear systems in dynamics applica-
tions. Computer Graphics Forum, 32, issue 1:16–26, February 2013.
[56] Wikipedia. Integer (computer science) — Wikipedia, the free encyclopedia, 2014.

Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Communication

  • 1.
    FAKULT ¨AT F¨UR INFORMATIK DER TECHNISCHEN UNIVERSIT ¨AT M ¨UNCHEN Master’s Thesis in Informatics Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Communication Alexander Zhdanov
  • 3.
    FAKULT ¨AT F¨UR INFORMATIK DER TECHNISCHEN UNIVERSIT ¨AT M ¨UNCHEN Master’s Thesis in Informatics Efficiency Optimization of Realtime GPU Raytracing in Modeling of Car2Car Communication Steigerung der Effizienz von Realtime GPU Raytracing bei der Modellierung von Fahrzeug-zu-Fahrzeug Kommunikation Author: Alexander Zhdanov Supervisor: Prof. Dr.-Ing. habil. Alois Knoll Advisor: Dipl.-Ing. Manuel Schiller Date: March 17, 2014
  • 5.
    I confirm thatthis master’s thesis is my own work and I have documented all sources and material used. M¨unchen, den 17. M¨arz 2014 Alexander Zhdanov
  • 7.
    Acknowledgments I would liketo thank Professor Knoll for the opportunity to work in his lab, my supervi- sor Manuel Schiller, Christoph Reisinger for valuable advice and also my parents Nikolay and Olga for their support. vii
  • 9.
    Abstract This thesis isdedicated to efficiency optimization of a software designed to simulate Car-2- Car communication system. Namely, it aims to improve the part of the system responsible for modelling of propagation channel implemented using realtime GPU raytracing. The research investigates a possible solution to the problem using reordering of the ray data and utilization of frame coherence. In the beginning, it has been carried out a review of the existing caching schemas exploiting an innerframe and intraframe coherence, tech- niques for the ray reordering on GPU and some of the GPU data structures. It have been considered conditions influencing the solution. It have been proposed algorithms for im- plementation of the ray sorting on the CPU using the space-filling curves. It is offered a method for caching of the tracing data for radiation sources rapidly changing its positions. It is shown a way for standard implementation of the ray reordering on the GPU using Morton codes and Radix sort. It is proposed an implementation of the caching method us- ing data structures utilizing different synchronization mechanisms. It has been analysed the system efficiency with the ray sorting. It is given an analisys of the system perfor- mance for both static and dynamic scenes and performed calculation of the system error for the caching. The system analysis shows that the ray reordering is capable to signifi- cantly increase the system efficiency. Also during the implementation stage, it have been revealed some limitations imposed by a third party software used for the GPU raytracing and proposed a work-around solution to overcome them. The proposed solution allows to increase the initial performance with varying degrees of success for different caching schemes. 
Nevertheless, evalution of the system performance in condition of interaction between two methods (ray reordering and caching) shows that the ray reordering prevails and currently nullify costs for the caching. ix
  • 10.
  • 11.
    Contents Acknowledgements vii Abstract ix Outlineof the Thesis xv I. Introduction 1 1. Introduction 3 1.1. Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1. Car-to-Car Communication System . . . . . . . . . . . . . . . . . . . 3 1.2.2. VANET Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.3. Simulation of Propagation Channel . . . . . . . . . . . . . . . . . . . 4 1.3. Thesis Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3.1. Ray Reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3.2. Ray Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5. Software System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 II. Literature Review and Problem Solution 7 2. Literature review 9 2.1. Ray caching and frame coherence . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2. GPU data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3. GPU programming model and memory types . . . . . . . . . . . . . . . . . . 17 2.3.1. GPU programming model . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.2. GPU memory types . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4. Ray Reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3. Problem and Solution 27 3.1. An experiment with dimensionality of context launches . . . . . . . . . . . 27 3.1.1. Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.2. Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.3. Problem Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2. Application of frame coherence . . . . . . . . 
. . . . . . . . . . . . . . . . . . 31 3.2.1. Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 xi
  • 12.
    Contents 3.2.2. Coherence .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.3. Formulation of Caching Scheme . . . . . . . . . . . . . . . . . . . . . 34 III. Analysis and Implementation 41 4. Analysis and Modelling 43 4.1. Ray Reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.1.1. Code Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2. Frame Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2.1. Code Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2.2. Selection of Data Structure . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2.3. Selection of Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.2.4. Selection of Hash Function . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2.5. Selection of Mapping Function . . . . . . . . . . . . . . . . . . . . . . 52 4.2.6. Selection of Synchronization Mechanism . . . . . . . . . . . . . . . . 53 5. Implementation 55 5.1. Implementation of Ray Reordering . . . . . . . . . . . . . . . . . . . . . . . . 55 5.2. Implementation of Frame Coherence . . . . . . . . . . . . . . . . . . . . . . . 55 5.2.1. Implementation of Data Model . . . . . . . . . . . . . . . . . . . . . . 56 5.2.2. Implementation of Data Structure . . . . . . . . . . . . . . . . . . . . 58 IV. Evaluation and Testing 67 6. Testing 69 6.1. System Configuration before Testing . . . . . . . . . . . . . . . . . . . . . . . 69 6.2. Testing of Ray Reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.2.1. Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.2.2. Tests Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.2.3. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.3. Testing of Frame Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.3.1. Static Scene . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.3.2. Dynamic Scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 V. Discussion and Conclusions 89 7. Discussion 91 7.1. Ray Reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2. Frame Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2.1. Caching Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2.2. Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.2.3. Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 8. Conclusions 93 xii
  • 13.
    Contents Appendix 97 A. Space-FillingCurves 97 A.1. Morton Codes Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 A.2. 2D Hilbert Curve Implementation . . . . . . . . . . . . . . . . . . . . . . . . 97 A.3. 3D Hilbert Curve Grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 B. A flow diagram for the main tracing loop 105 C. Implementation Hash Tables 107 C.1. Implementation of Cache Key . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 C.2. Implementation of the Chained Hash Table . . . . . . . . . . . . . . . . . . . 109 C.3. Implementation of the Open-Addressed Hash Table . . . . . . . . . . . . . . 112 Bibliography 121 xiii
  • 15.
    Contents Outline of theThesis Part I: Introduction CHAPTER 1: INTRODUCTION The chapter introduces to a reader the area in which the research is performed. In this chapter, the main goals of the thesis are formulated . Part II: Literature Review and Problem Solution CHAPTER 2: LITERATURE REVIEW In this chapter it is given an overview of articles dedicated to ray reordering, ray caching and GPU data structures. CHAPTER 3: PROBLEM AND SOLUTION In this chapter, tasks formulation are given and proposed an algorithmic or schematic solution. Part III: Analysis and Implementation CHAPTER 4: ANALYSIS AND MODELLING In this chapter, it is presented an analysis of the existing code, design decisions and UML diagram of data model for caching. CHAPTER 5: IMPLEMENTATION The chapter presents implementation of solutions presented in the second part. The chap- ter identifies ways for the ray reordering on GPU and gives implementation of the caching method. Part IV: Evaluation and Testing CHAPTER 6: TESTING The chapter presents a desription of testing approaches and procedures and also gives an overview and analysis of the testing results. xv
  • 16.
    Contents Part VI: Discussionand Conclusions CHAPTER 7: DISCUSSION The chapter briefly discusses results of the research. CHAPTER 8: CONCLUSION The chapter articulates conclusions of the research. xvi
  • 17.
  • 19.
    1. Introduction Software efficiencyoften refers to algorithmic efficiency which is one of the central topics in computer science. According to Oxford Dictionary of Computing, algorithm efficiency is “a measure of the average execution time necessary for an algorithm to complete work on a set of data. Algorithm efficiency is characterized by its order.” [22]. On the other hand, according to Robert Sedgewick [49], program optimization is a process of modifying a software system to make some aspect of it work more efficiently or use fewer resources. The latter implies that there is a software system that behaviour needs to be optimizied. However, many experts in computer science (e.g Donald Knuth [25]) believe that the crit- ical code sections have to be throughly verified and found before the optimization takes place. In our case, efficiency optimization means that it is rather adding a new functional- ity helping to increase performance then optimization of the existing code. With the term optimization closely connected terms caching and performance analysis. 1.1. Thesis Statement The aim of the project is modification of existing software system developed for modeling of Car-to-Car Communication to increase its efficiency (decrease average time taken for data processing). The system uses realtime GPU raytracing for simulation of wireless com- munication between cars. A hypothesis of the research is that the tracing process could be speeded up by taking into account an innerframe and outerframe coherence i.e. caching of resulting data for future requests. It is also supposed that the efficiency optimization could be achieved by altering tracing parameters, for example, by changing an order of rays shot on a scene. Both hypothesis are tested using performance analysis (benchmarking). 1.2. Motivation In the Motivation section, it is given a description of a general direction in which the cur- rent work is carried out. 
In the first subsection it is briefly described Car-to-Car system, its general purpose. In the second subsection, it is given some general information about virtual testing of such systems. The last subsection considers an impact of realtime gpu raytracing on the virtual drive system and on other urgent scientific areas. 1.2.1. Car-to-Car Communication System Car-to-Car Communication system is a wireless network “between vehicles and their en- vironment in order to make the vehicles of different manufactures interoperable and also enable them to communicate with road-side units.[8]”. According to the Car2Car Com- munication Consortium, the system shall provide the following top level features: 3
  • 20.
    1. Introduction • automaticfast data transmission between vehicles and between vehicles and road side units • transmission of traffic information, hazard warnings and entertainment data • support of ad hoc Car 2 Car Communications without need of a pre-installed net- work infrastructure • the Car 2 Car system is based on short range Wireless LAN technology The Car 2 Car Communication System has the following goals: • enable the cooperation between vehicles • increase driver awareness • extend driver’s horizon • enable entirely new safety functions • reduce accidents and their severity • include active traffic management applications • help to improve traffic flow Thus, the main scenarios for which the system is designed include safety, traffic efficiency, infotainment and some others. 1.2.2. VANET Simulation In general, Car-2-Car communication systems represent a type of Vehicular ad-hoc Net- works (VANET). Specifics of using a wireless connection in such networks requires an active development of new network protocols suitable for the task. However, high costs of full-scale real tests make them disadvantageous. An important part in simulation of such systems is realistic motion model [12]. Another important issue in VANET simulation is realistic modelling of propagation channel [26]. For the later problem there are two possi- ble solutions: statistical and deterministic channel models. The deterministic method uses a ray-tracing to model wave propagation [10]. The deterministic approach provides a real- istic simulation taking into account geometrical and radio properties of the environment. 1.2.3. Simulation of Propagation Channel In modelling of the propagation channel using the ray tracing, there are different ap- proaches. Some authors, for example, create a radio map using a pre-calculation [15]. Others use a mixed (statistical and deterministic) approach for the channel simulation [6]. 
All authors, however, agree on that statistical models are unable to provide a necessary precision in the network simulation. On the other hand, the ray tracing which provides the seek accuracy suffers from low performance. Thus, the problem of increasing efficiency of the realtime ray tracing becomes an important step towards building accurate VANET simulators. The problem is also important in other areas of computer science, for example, in computer graphics [16]. 4
1.3. Thesis Goals

The main goal of the thesis is to increase the efficiency of realtime GPU raytracing in a system used for VANET simulation. The main functionality of the system had been developed by the start of this work, so the design goals can be formulated as follows:

1. Increase the system efficiency using ray sorting
• Design of sorting methods (algorithms)
• Implementation of sorting methods (CPU)
• Testing of the system performance with ray reordering
2. Increase the system efficiency using caching
• Design of the caching method
• Implementation of caching (GPU)
• Testing of the system performance using caching

1.3.1. Ray Reordering

The GPU performs calculations by grouping threads into so-called "warps". The main problem with warps is "divergence" [4]. This problem occurs, for example, in code containing branches: some threads inside a warp take one branch of the execution flow while the others suspend at the evaluation point. When the first group finishes its execution, the others take the other branch, so the first group of threads now suspends until the second group finishes. Presorting the rays helps to fully utilize the hardware, such that threads in one warp take the same execution paths.

1.3.2. Ray Cache

Caching, on the other hand, helps to reduce the computational load. A cache stores results which have already been computed in the system and serves them for future requests. The main problems here are the development of the caching schema and its implementation on the GPU. Another important task is testing and the evaluation of the testing results. Test development should use automation tools as much as possible.

1.4. Contributions

The main contribution of this work is the development of a caching method for a system simulating a VANET with low and medium interframe coherence. This means that the geometrical configuration of the rays changes completely from frame to frame while preserving a certain degree of correlation.
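The ray reordering idea of Section 1.3.1 can be sketched on the host side as follows. This is a minimal illustration, not the thesis implementation: the `Ray` struct and its `branchKey` field are hypothetical stand-ins for whatever property determines the execution path in the tracing kernel (hit material, ray type, and so on).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical ray record: 'branchKey' stands for the property that decides
// which branch the tracing kernel takes for this ray.
struct Ray {
    int branchKey;
    int payload;
};

// Group rays so that consecutive rays (and therefore threads of one warp)
// tend to take the same branch. A stable sort keeps the relative order of
// rays that share a key.
void reorderForWarpCoherence(std::vector<Ray>& rays) {
    std::stable_sort(rays.begin(), rays.end(),
                     [](const Ray& a, const Ray& b) {
                         return a.branchKey < b.branchKey;
                     });
}
```

After such a reordering, the 32 threads of a warp are far more likely to process rays with the same key and thus follow the same control flow.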
Other techniques rely on high ray coherence between frames, meaning that the ray configuration stays partially stable from frame to frame, which allows tracing results to be reused between frames. In our case, in contrast, the radiation sources change their positions relatively quickly between frames.
The main problem solved in the implementation part is the selection of an efficient data structure allowing the cache to be built on the GPU side. Next, during testing of the data structure on dynamic scenes, an error occurred that could be attributed to OptiX bugs. The error was solved by designing a synchronization schema for writing/reading data to/from the buffer. The error is described in more detail in the Testing chapter.

1.5. Software System Overview

This section gives a brief system overview. Roughly, the system contains the following modules:

Wavetracer This module is responsible for the ray tracing using the OptiX engine: reading the configuration file, creating the context, initializing parameters and tracing programs, launching the tracing, processing the output data, and writing the processed data to the output file. This is the main module which will be amended.

Sceneviewer This module displays results of the ray tracing both statically and dynamically using OpenSceneGraph [41].

Osgloader Extracts information out of loaded 3D models.

Optix wrapper The module is a wrapper of the OptiX API. The C++ API of OptiX does not meet the needs of the application, for example, for iterating the scene graph.

edgedetector Detection of diffraction edges.

adtf coupling This software component is responsible for the encapsulation of modules into ADTF [1]. The plugins are osgplugin, testdriverplugin, vtdplugin and wavetracerplugin. All the plugins inherit the basic interfaces of ADTF.
Part II. Literature Review and Problem Solution
2. Literature review

The study was conducted in the following directions: ray caching techniques and frame coherence, GPU data structures, and ray reordering. A review of the GPU programming model and memory types has also been written. Ray caching and frame coherence are presented in one subsection, while GPU data structures and ray reordering are covered in separate ones.

2.1. Ray caching and frame coherence

Realtime ray tracing requires high computational power. One of the methods which can be used to reduce computational expenses is caching. Results of the tracing procedure can be stored in a ray cache, which reduces the response time for future requests. The main question is how to build an adequate and efficient caching strategy. Several authors were selected who used caching techniques in their work (Chan [7], Debattista [9], Popov [42], Ruff [46], Tole [50], Scherzer [48]). The goals of the study:

1. find "postulates" for ray caching (how ray caching can be performed in general)
2. the sought strategy shall exploit ray coherence in rapidly changing environments
3. the sought strategy shall be efficiently implementable on the GPU

The following literature review attempts to find such a strategy.

In a research work by Chan et al. [7], ray coherence is exploited to accelerate a sound rendering process in an interactive environment. The article postulates the following principles for ray caching:

1. Rays with the same geometric properties (starting points, directions) as rays contained in the cache do not have to be traced again.
2. To maintain the intersection history, objects are subdivided into discrete patches.
3. The cache represents a graph with object patches as nodes and rays as edges. Once a ray hits a patch in the cache, the whole intersection history for the given patch can be taken, which replaces costly intersection tests.

Since this article is important for the research, it is necessary to highlight some implementation details here.
Patch subdivision An object is subdivided into small patches so that two rays hitting the same patch are considered to have the same intersection points. Angular ray directions are quantized.
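The angular quantization just mentioned can be sketched as follows. This is a minimal illustration assuming directions are given as spherical angles theta in [0, pi] and phi in [0, 2*pi); the bin counts are illustrative and not taken from Chan's paper. Two rays falling into the same (thetaBin, phiBin) cell would be treated as identical for cache lookups.

```cpp
#include <cassert>
#include <utility>

const double kPi = 3.14159265358979323846;

// Map a spherical direction to a discrete (thetaBin, phiBin) cell.
// thetaBins and phiBins control the quantization resolution.
std::pair<int, int> quantizeDirection(double theta, double phi,
                                      int thetaBins, int phiBins) {
    int t = static_cast<int>(theta / kPi * thetaBins);
    int p = static_cast<int>(phi / (2.0 * kPi) * phiBins);
    if (t >= thetaBins) t = thetaBins - 1;  // clamp theta == pi
    if (p >= phiBins)   p = phiBins - 1;    // clamp phi -> 2*pi
    return {t, p};
}
```

Coarser bins raise the cache hit rate at the cost of accuracy, which is exactly the trade-off the patch subdivision controls.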
Coherence Intra-frame coherence occurs when several rays share the same path inside one frame. Inter-frame coherence occurs when several rays share parts of paths contained in the cache.

Ray cache The cache consists of a tree and a graph attached to the tree. Every node of the tree is identified by a complex index (object, patch, patch, ..., division).

Purging The cache is purged according to a Least Recently Used strategy using timestamps.

Source movement When a ray source changes its position, all cache entries connected with it are removed from the cache.

Chan obtained significant performance improvements using the ray cache. The method was beneficial in multi-user interactive environments with high ray coherence. However, the method has several disadvantages. Firstly, it is implemented on the CPU using sophisticated data structures which would hardly be efficient on the GPU. Secondly, it is limited to step-by-step movements with high correlations between frames. Thirdly, the cache is purged when a source changes its position, which means that the cache cannot be reused in succeeding frames. Nevertheless, the article states fundamental ideas for a ray cache implementation.

Debattista et al. [9] used several techniques based on irradiance caching [54] in rendering dynamic scenes with global illumination [52]. The main contribution of the article is the detection of invalid ray paths after geometric transformations. The authors considered five cases for the invalidation of their instant cache:

Case 1 Occlusion of a light path by a moving object (occlusion of a secondary light source)
Case 2 Deocclusion of a light path by a moving object (deocclusion of a secondary light source)
Case 3 Occlusion of a visibility ray by a moving object
Case 4 Deocclusion of a visibility ray by a moving object
Case 5 The cache sample lies on a dynamic object

Figure 2.1 gives a summary of all cases.
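The timestamp-based LRU purging mentioned for Chan's cache can be sketched as follows. This is a minimal host-side illustration with a hypothetical entry type, not the structure from the paper; a production version would use a doubly-linked list for O(1) eviction.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical cache entry: each hit updates 'lastUsed' with the current
// frame or query timestamp.
struct CacheEntry {
    std::uint64_t lastUsed;  // timestamp of the most recent hit
    int value;               // cached payload (placeholder)
};

// Evict least-recently-used entries until the cache fits 'capacity'.
void purgeLRU(std::map<int, CacheEntry>& cache, std::size_t capacity) {
    while (cache.size() > capacity) {
        auto victim = cache.begin();
        for (auto it = cache.begin(); it != cache.end(); ++it)
            if (it->second.lastUsed < victim->second.lastUsed)
                victim = it;
        cache.erase(victim);
    }
}
```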
Overall, Debattista used caching for the calculation of a radiance integral, where the cache stores illumination instead of visibility. Also, the method is limited to static scenes with moving objects. However, some of the ideas for ray purging (ray invalidation) can be used in this work as well.

Popov et al. [42] in their work exploited the idea of lightcuts [53]. The authors introduced a fundamental point-to-point visibility caching algorithm which can be incorporated into any ray tracer. They also developed an adaptive quantization scheme which helps to control the trade-off between performance and quality. The algorithm was implemented on the GPU using a hash table, which is of particular interest for this research.
Figure 2.1.: The five cases that invalidate the instant cache [9]

One of the main entities in Popov's work is a binary visibility function. It is defined as

    V(X, Y) = 1 if X and Y are mutually visible, 0 otherwise    (2.1)

The visibility function is approximated using a quantization of the path domain and a mapping K(p̄e, p̄l) → N which relates a pair of surface points to a unique cluster. The quantized visibility function is defined as

    V̄(p̄e, p̄l) ≈ V_C(K(p̄e, p̄l))    (2.2)

The quantization error is controlled using the equation

    |A(p̄e)| |A(p̄l)| = (C_E)² / (P(p̄e) P(p̄l) N_p)    (2.3)

Figure 2.2 summarizes the concept.

To define K(.), the scene surface is divided into a set of virtual multi-resolution, overlapping and differently oriented voxel grids. For a vertex X with normal N(X), a quantized direction is defined as

    ω_q = ⌊(N(X) + 1) / 2 · C_N⌋,    d_z = 2 ω_q / C_N − 1    (2.4)

K(.) returns a tuple of 14 integers: 3 for the orientation of X, 3 for the coordinates of the voxel containing X, and 1 for the grid resolution R(p̄e); the integers for Y are chosen analogously.
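A worked example of one plausible reading of such a per-component direction quantization, assuming a formula of the form ω_q = ⌊(N(X) + 1)/2 · C_N⌋ with the component recovered as d = 2·ω_q/C_N − 1, where N(X) is one normal component in [−1, 1] and C_N is the number of quantization steps. This is a reconstruction from the garbled extraction above, not a verified transcription of Popov et al.

```cpp
#include <cassert>
#include <cmath>

// Quantize a normal component n in [-1, 1] into CN steps.
int quantizeComponent(double n, int CN) {
    int wq = static_cast<int>(std::floor((n + 1.0) / 2.0 * CN));
    if (wq >= CN) wq = CN - 1;  // clamp n == 1.0
    return wq;
}

// Recover the representative value of a quantized component.
double dequantizeComponent(int wq, int CN) {
    return 2.0 * wq / CN - 1.0;
}
```

For C_N = 8 and n = 0.3, the quantized index is 5 and the recovered value is 0.25, i.e. the quantization error stays below one step width 2/C_N.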
Figure 2.2.: Similar paths are grouped together and share the same visibility query [42]

The concept is illustrated by figure 2.3.

Figure 2.3.: Visibility domain quantization [42]

Results of the visibility queries are stored in a hash table. To select a particular bin, the researchers calculate a 32-bit key k(j) from j = K(.) and use modulo division, k(j) mod C_T, where C_T is the hash table size. They employ a direct-mapped cache which does not resolve collisions but simply overwrites the data. One important implementation detail is that the algorithm uses a counter which controls how many threads in the warp need to trace rays. If the counter is less than some predefined threshold, the local state of each thread is saved in a small per-warp queue and the rays are not traced immediately. Whenever the number of threads in the queue exceeds 32, the tracing is performed. This helps to utilize the GPU and load it uniformly.

For the assessment of the method, Popov uses three metrics: a quality metric, a performance metric and a cache performance metric. As the quality metric, the authors employ the predictive QMOS proposed by Mantiuk et al. [30]. As the performance estimation, the shadow ray reduction and the total frame rendering time are taken. The cache performance was measured using the ratio

    (1 − HitRatio) = MissRatio = 1/(RayDirection + 1)    (2.5)

Results show a significant shadow ray reduction, up to 50×, preserving image quality with QMOS above 77%. The total rendering speedup varies from 2.5× to 6.7× for different scenes.
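The direct-mapped scheme just described, a 32-bit key, modulo bin selection, and overwrite on collision, can be sketched as follows. The derivation of the key from the 14-integer tuple is elided (any 32-bit hash of it would do here), and the class and its member names are illustrative, not from the paper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One bin of the direct-mapped visibility cache.
struct Bin {
    std::uint32_t key = 0;
    bool valid = false;
    bool visible = false;
};

class VisibilityCache {
public:
    explicit VisibilityCache(std::size_t size) : bins_(size) {}

    // Returns true and fills 'visible' on a hit; false on a miss.
    bool lookup(std::uint32_t key, bool& visible) const {
        const Bin& b = bins_[key % bins_.size()];
        if (b.valid && b.key == key) { visible = b.visible; return true; }
        return false;
    }

    // Collisions simply overwrite the stored entry (direct mapping).
    void store(std::uint32_t key, bool visible) {
        Bin& b = bins_[key % bins_.size()];
        b.key = key;
        b.valid = true;
        b.visible = visible;
    }

private:
    std::vector<Bin> bins_;
};
```

Not resolving collisions keeps every lookup and store O(1) with a single memory access, which is what makes the scheme attractive for a GPU.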
All in all, the article is very valuable. Firstly, it gives a theoretical background for visibility caching. Secondly, the approach described can be used in rapidly changing environments such as car driving, since the quantization allows exploiting inter-frame coherence even when traced paths change entirely from frame to frame. Thirdly, the algorithm is efficiently implemented on the GPU using a hash table with direct mapping, which makes caching very fast. On the other hand, the quantization introduces rather many parameters (14), which makes the hash table indexing somewhat complicated, involving digest, prepending and modulo operations. The higher performance also increases memory utilization. Furthermore, fast changing scenes impose much stress on GPU buffers, which could potentially cause memory corruption.

Ruff et al. [46] investigate the question of using caching textures for realtime tracing in OptiX. For each reflective object of the scene, the researchers generate a set of 6 caching textures. Before tracing a ray leaving the object, the algorithm queries color information for that ray in the caching textures. If the information is available, it is taken from the cache; otherwise the tracing procedure takes place. The authors introduce a geometrical scheme for how reflection rays are saved in the textures. Results show that the method produces pictures that are visually equivalent to the reference images. The speedup achieved compared to conventional ray tracing varies from 2% to 168% depending on the number of additional reflective objects. The developers themselves mention in their article that their method is tailored to static scenes with convex objects and auto-reflection features. However, the idea of using a cubic box as the caching structure could be beneficial in this work as well.

Tole et al. [50] examine in their paper how to build a system for interactive computation of global illumination [52] in dynamic scenes.
The system stores illumination samples generated by pixel-based rendering algorithms and then applies interpolation between samples using graphics hardware. The shading cache represents a hierarchical patch tree, with every patch containing the last computed shading values for its vertices. A patch can be used in three ways: its value is used for interpolation, the patch is refined further, or its value is updated. If the cache grows above a threshold, patches which are no longer seen in the scene are removed together with their children using a "not recently used" strategy. When an object in the scene or a light moves, the patch values are recalculated using an "age priority". Comparison with other cache rendering systems shows that the system suits best for applications like interactive lighting design and modeling. Altogether, the system, like that of Debattista [9], uses an illumination cache; spatial coherence is exploited using interpolation, while temporal coherence is maintained by the reuse of patches from previous frames. All this makes the approach practically useless for this study.

Work by Scherzer [48] consists of notes for the course with the same name: "Exploiting Temporal Coherence in Real-Time Rendering". He defines temporal (frame) coherence (TC) as "the existence of a correlation in time of the output of a given algorithm". He further states that the coherence can be used to accelerate a given algorithm by making it incremental in time, and for quality improvement by taking into account results obtained in the
previous frames. Next, Scherzer describes the Reverse Reprojection Cache, which reuses shading results from previously rendered frames. The basic idea of the method is to allow the renderer to use shading information which is available for a given point in the previous frame buffer. In order to do this, Scherzer introduces a reverse reprojection operator. For refreshing the cache, the screen is divided into n parts which are updated in a round-robin fashion. The method shows good acceleration results for a few pixel shaders. It is used for stereoscopic rendering, simulation of motion blur and depth-of-field effects, view frustum culling techniques and others. Again, this method is basically designed to be used as a shading/illumination cache; it could be used for exploiting temporal coherence in object space in culling techniques, but this does not help much in our task. The formulation of temporal coherence could be used to calculate how well frames correlate with each other.

On the whole, the objectives are achieved. The main strategy for caching, proposed by Chan et al. [7], has been selected. Based on the work by Popov [42], the strategy can be extended using quantization (Chan also uses it, introducing divisions). Also, the system can be efficiently implemented on the GPU using a hash table and bin indexing. Some of the policies for ray purging can be taken from the work by Debattista et al. [9].

2.2. GPU data structures

In this subsection, an overview of GPU data structures is given using the book "GPU Gems 2" [31] and articles by Lefohn [28], Foley [13] and Prabhakar [34]. The main goal of the review is to find an efficient parallel GPU data structure for the ray cache implementation.

Lefohn, in chapter 33 of "GPU Gems 2" [31], explains how fundamental data structures are implemented using the GPU programming model. According to Lefohn, the GPU has the following data structures: multidimensional arrays, and static and dynamic sparse structures.
Multidimensional arrays 2-D textures with nearest-neighbour filtering are the substrate on which most GPGPU structures are built. All multidimensional arrays use address translation to convert an N-D array address to a 2-D texture address. The GPU implementation of the address translation suffers from limitations on floating-point addressing.

1-D arrays 1-D arrays are implemented by mapping the data to a 2-D texture. Currently, the maximum width of a 1-D texture is 2^27 = 134,217,728 [37]. Each time an element of a 1-D array is accessed by a GPU program, the address is translated into 2-D texture indices.

2-D arrays 2-D arrays are represented as 2-D textures. Their maximum size is limited by the GPU driver.
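The address translation described for 1-D arrays amounts to the following index arithmetic, where texWidth is the width of the backing 2-D texture:

```cpp
#include <cassert>
#include <utility>

// Convert a 1-D array index into (x, y) texel coordinates of a 2-D texture
// of width 'texWidth'. Elements are laid out row by row.
std::pair<int, int> addressTranslate1Dto2D(int index, int texWidth) {
    return {index % texWidth, index / texWidth};
}
```

The same idea generalizes to N-D arrays: flatten the N-D index to a linear offset first, then apply this 1-D-to-2-D translation.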
Figure 2.4.: Representation of 1D arrays on GPU

3-D arrays 3-D arrays can be implemented in three ways: as 3-D textures, as several levels of 2-D textures, or directly mapped to a single 2-D texture. Every implementation has its pros and cons. For example, the simplest 3-D implementation does not need an address translation, and native GPU trilinear filtering can be used for this structure to create high-quality data renderings. As a disadvantage, this structure requires many passes to write to the whole array.

Structures There are two possible solutions: a stream of structures and a structure of streams. The stream of structures is a problematic solution because every member of a structure has a different stream index and they cannot be easily updated. Contrariwise, in the structure of streams, a separate stream is created for every structure member.

Sparse data structures The implementation of sparse data structures such as lists, trees or sparse matrices is problematic on the GPU. Firstly, because it involves writing to a computed memory address (scattering). Secondly, traversing such data structures involves an inhomogeneous number of pointer dereferencing operations to access data, which causes difficulties due to the processing properties of the SIMD architecture: elements which are processed by a single SIMD unit should execute exactly the same instructions.

Static sparse structures Static sparse structures are not changed during the GPU computation. All of these structures contain one or more levels of indirection. There are two methods for solving the problem of irregular access patterns: the first is to divide the whole frame into blocks where all blocks have the same random access model and can be handled together. The second is to have a stream process one member from its scheduled list per render pass.

Dynamic sparse structures Dynamic sparse data structures are a very active research area.
Two notable works are Purcell et al. 2003 [44] and Lefohn et al. [17], [27].
Figure 2.5.: Purcell et al. 2002 [45]. Sparse Ray-Tracing Data Structure

A photon map [44] is a cache which stores intersection points and incoming directions for light packets called photons. There are two techniques which allow building the photon map on the GPU. The first computes addresses and data for writing and then performs scattering by executing parallel sorting operations on the buffers. The second method uses the vertex processor.

Lefohn [27] created an efficient dynamic data structure on the GPU for implicit surface deformation. He solves the scattering problem by sending small messages to the CPU when the GPU data needs to be updated. The structure uses a blocking strategy.

Weber et al. [55] presented an efficient implementation of sparse matrices on the GPU for solving sparse linear systems in dynamic applications.

Performance considerations In the case of dependent memory reads there is a danger of creating memory-incoherent accesses. This can be prevented by the creation of coherent blocks of similar computations, small lookup tables and the minimization of dependency levels. Other important performance concerns include the optimization of computational frequency on the GPU, program specialization and a proper use of pbuffers.

Foley and Sugerman [13] present a GPU implementation of a kd-tree traversal algorithm suitable for raytracing, but they build the data structures on the CPU. This is of no interest for this work.

Lefohn et al. [28] presented an abstract generic template library for complex random-access data structures on the GPU. The structures, such as a stack, an octree and a quadtree, are built using standard library components. Firstly, PTX programs generated by the nvcc compiler have to conform to restrictions imposed by the OptiX API. This makes it impossible to use
some CUDA libraries. Secondly, we do not need data structures as complicated as octrees; on the other hand, functions for the construction and utilization of data primitives such as 1-D, 2-D and 3-D arrays are built into the OptiX API. All this makes the usage of the library unreasonable.

Lock-free data structures are of a certain interest in this work. Prabhakar and Chaudhuri [34] evaluate their performance on the GPU. They consider a lock-free linked list [18], hash table, skip list [18] and priority queue [18]. The data structures are evaluated using a mix of add, delete and search operations for different key ranges. For the lock-free linked list, the GPU implementation has a moderate speedup of up to 7.4 times for small to medium key ranges compared to the CPU implementation. The hash table on the GPU outperforms the CPU implementation of the same structure in all key ranges, with a maximum speedup of 11.3. The GPU realization of the lock-free skip list is beneficial for small and medium key ranges, with a maximum speedup of 30.7. For the lock-free priority queue, the GPU benchmarks show the same pattern as for the skip list, with a maximum speedup of 30.8. The authors close the discussion by comparing the performance of the GPU implementations of the hash table and the linked list: the hash table is 36 to 538 times better than the linked list. They conclude that the GPU helps the hash table to reveal its concurrent potential, making it the best data structure for arbitrary key ranges.

To sum up, this subsection considered the problems of building data structures on the GPU. It has been noted that the construction of sparse data structures is a challenging task. However, many developers and researchers have already contributed to this area. Lock-free data structures are of particular interest since they offer efficient GPU implementations.
The hash table proved to be the best data structure of the aforementioned, due to its consistent performance benefits and a design well-suited for use in multithreaded GPU applications.

2.3. GPU programming model and memory types

In this section, an overview of the CUDA C programming model [37] is given and the different types of GPU memory are considered.

2.3.1. GPU programming model

Kernels

CUDA C allows a programmer to write C functions (kernels) which, when invoked, are executed in parallel by N different CUDA threads. A kernel is defined using the __global__ identifier. The number of CUDA threads that execute the kernel for a given call is specified using the syntax <<<...>>>. Each CUDA thread is given a unique thread ID which is accessible in the kernel body through the built-in threadIdx variable.

Thread hierarchy

For convenience, threadIdx is a three-component vector, so that every thread can have a 1-dimensional, 2-dimensional or 3-dimensional index to form a 1-dimensional, 2-dimensional
or 3-dimensional block. The thread ID and the thread index inside the block are put in one-to-one correspondence using the following equations:

for a 1D block: thread ID = x, where x is the thread index

for a 2D block of size (Dx, Dy): thread ID = x + y·Dx, where (x, y) is the thread index

for a 3D block of size (Dx, Dy, Dz): thread ID = x + y·Dx + z·Dx·Dy, where (x, y, z) is the thread index

There is a limit on the number of threads combined into one block, since all the threads of a block are processed by one processor core and thus share its memory. Presently, modern GPUs allow blocks with a maximum of 1024 threads per block [37].

Nonetheless, a kernel can be executed by multiple blocks, so the total number of threads executing the kernel equals the number of blocks multiplied by the number of threads per block. The blocks are organized into one-dimensional, two-dimensional or three-dimensional grids, as illustrated by figure 2.6.

Figure 2.6.: Grid of thread blocks [37]
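The thread-ID equations above can be written out directly as host code; the 2D and 1D cases are the same formula with z (and y) set to zero:

```cpp
#include <cassert>

// Linear thread ID inside a block of dimensions (Dx, Dy, Dz), given the
// thread index (x, y, z). Dz does not appear in the formula itself.
int threadId3D(int x, int y, int z, int Dx, int Dy) {
    return x + y * Dx + z * Dx * Dy;  // reduces to x + y*Dx for 2D, x for 1D
}
```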
The size of the grid is defined by the amount of data being processed or by the number of processors in the system. The number of threads per block and the number of blocks in the grid are specified via the syntax <<<...>>> and can be of int or dim3 type.

Each block in the grid is identified by a one-dimensional, two-dimensional or three-dimensional index which can be accessed from within the kernel using the built-in blockIdx variable. The dimension of the block is accessed via the built-in blockDim variable.

The thread blocks have to be executed independently from each other: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows blocks to be scheduled on any number of processors.

Threads in one block can cooperate using shared variables and synchronization of memory accesses. Namely, it is possible to define synchronization points in the kernel body by calling the intrinsic function __syncthreads(). The function acts as a barrier at which the threads of the block wait until all of them have arrived before any is allowed to proceed.

Memory Hierarchy

CUDA threads can access data in multiple memory spaces during their execution, as shown in figure 2.7. Every thread has its own private memory. All threads within one block share the same shared memory, which has the same lifetime as the block. All threads can access the same global memory.

There exist two additional read-only memory spaces accessible to all threads: the constant and texture memory spaces. The global, constant and texture memories are optimized for different memory usages. The texture memory offers a variety of addressing models as well as data filtering for some data formats.

Heterogeneous Programming

As illustrated by figure 2.8, the CUDA programming model assumes that kernels are executed on a separate physical device which acts as a coprocessor to the host running the C program.
The CUDA programming model also assumes that the host and the device maintain separate memory spaces in DRAM. Thereby, the program manages the global, constant and texture memory spaces through calls to the CUDA runtime. This includes the allocation and deallocation of device memory and the transfer of data between the host and the device.
Figure 2.7.: Memory hierarchy [37]

Serial code is executed on the host while parallel code is executed on the device.

Compute capability

The compute capability is defined by a major and a minor revision number. Devices that share the same major revision number are of the same core architecture. The minor revisions represent incremental improvements of the core architecture, possibly including new features.

2.3.2. GPU memory types

In this subsection, an overview of the different GPU memory types is given.
Figure 2.8.: Heterogeneous programming [37]

Device memory accesses

An instruction which accesses a memory address may be performed as multiple memory transactions depending on the distribution of the memory addresses across the threads of a warp. How the distribution influences performance is specific to each memory type and is described in the following subsections. For instance, for the global memory the rule of thumb is: the more scattered the addresses, the lower the performance.

Global Memory

Global memory resides in device memory, which is accessed via 32-, 64- and 128-byte [37] memory transactions. These transactions must be naturally aligned: only 32-, 64- and 128-byte segments of device memory that are aligned to their size (i.e. the first address of a segment is a multiple of its size) can be read or written by these transactions.

When a warp executes an instruction that accesses the global memory, it coalesces the memory addresses of all threads within the warp into one or more memory transactions, depending
on the size of the word accessed by each thread and the distribution of the memory accesses across the threads. The more transactions are necessary, the more unused words are transferred in addition to the words actually accessed by the threads, reducing the instruction throughput. How many transactions are necessary and what the resulting throughput is depends entirely on the compute capability of the device. For devices of compute capability 1.0 and 1.1 [37], the requirements to get any coalescing are very strict. They are more relaxed for devices of higher compute capability. For devices of compute capability 2.x and higher [37], the memory transactions are cached, so that data locality is exploited to reduce the impact on throughput.

In order to maximize the global memory throughput it is necessary to maximize coalescing by:

• Following the most optimal access patterns for devices of compute capability 1.x, 2.x and 3.x [37]
• Using data types which meet the size and alignment requirements
• Padding data in some cases, for example, when accessing two-dimensional arrays

Size and Alignment Requirement

Global memory instructions support reading and writing words of size 1, 2, 4, 8 or 16 bytes [37]. An access to data in global memory compiles to a single global memory instruction if and only if the data size is one of these numbers and the data is naturally aligned. If these requirements are violated, the access compiles to multiple instructions with interleaved access patterns which hinder full coalescing.

The alignment requirement is automatically fulfilled for built-in data types like char, short, int, long, longlong, float, double and vector types like float2 or float4 [37].

For structures, the size and alignment requirements can be fulfilled using the alignment specifiers __align__(8) or __align__(16).
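On the host side, the effect of such alignment specifiers can be illustrated with standard C++ alignas (the CUDA equivalent being __align__): three floats occupy 12 bytes of data, and the 16-byte alignment pads the struct so it can be moved with a single 16-byte transaction.

```cpp
#include <cassert>
#include <cstddef>

// 12 bytes of data, padded to a 16-byte size and alignment.
struct alignas(16) Float3Aligned {
    float x, y, z;
};

static_assert(alignof(Float3Aligned) == 16, "alignment specifier applied");
static_assert(sizeof(Float3Aligned) == 16, "padded to a 16-byte multiple");
```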
Any variable located in global memory that is returned by a driver routine for memory allocation or by the runtime API is aligned to at least 256 bytes.

Reading non-naturally-aligned 8-byte or 16-byte words [37] can lead to incorrect results, so special attention should be paid to maintaining the alignment of any value or array of values of these types. This is a typical error which occurs when memory allocation for multiple arrays using separate calls to cudaMalloc or cuMemAlloc is replaced by the allocation of a single large memory block partitioned into multiple arrays. In this case the starting addresses of the sub-arrays are shifted relative to the initial address of the block.
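A sketch of the usual fix for this pitfall: when partitioning one large block, round each sub-array offset up to the alignment the allocator guarantees (at least 256 bytes per the text above). The helper name is illustrative:

```cpp
#include <cassert>
#include <cstddef>

// Round 'offset' up to the next multiple of 'alignment'.
std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}
```

Placing each sub-array at alignUp(previousEnd, 256) keeps every starting address as well aligned as a separate allocation would be.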
Two-Dimensional Arrays

A common access pattern is when a thread with index (tx, ty) accesses an element of a 2D array of width 'width' using the mapping BaseAddress + width * ty + tx. For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.

Local Memory

Only automatic variables can be placed in local memory. These are:

• Arrays for which it cannot be determined that they are indexed with constant values
• Large structures or arrays which would otherwise consume too much register space
• Any variable if the kernel uses more registers than available

Local memory resides in device memory and consequently has the same high latency and low bandwidth as global memory, and is subject to the same requirements for memory coalescing. However, local memory is organized such that consecutive 32-bit words are accessed by threads with consecutive IDs. Therefore the accesses are fully coalesced as long as all threads in a warp access the same relative address. For devices of compute capability 2.x and higher [37], all accesses to local memory are cached in the same way as accesses to global memory.

Shared Memory

Shared memory has a much lower latency and a much higher instruction throughput than local or global memory because it is placed directly on the chip. To achieve a high bandwidth, shared memory is divided into equally-sized memory modules called banks which can be accessed simultaneously. Thus reads and writes which refer to locations falling in n distinct memory banks can be served simultaneously, resulting in an n-fold overall bandwidth increase.

If two threads access addresses in the same bank, serialization is necessary: the requests are split into as many conflict-free requests as necessary.
If n such requests occur, the initial memory request is said to cause an n-way bank conflict. To maximize performance, it is necessary to minimize bank conflicts. The details are specific to the device type, because the mechanisms mapping memory addresses to memory banks differ.

Constant Memory

Constant memory resides in device memory and is cached in the constant cache for devices with compute capabilities 1.x and 2.x [37]. For devices with compute capability 1.x [37], a constant memory request for a warp is divided into two requests, one for each half-warp, which are then served independently. These requests are further divided into subrequests depending on the number of memory addresses contained in the initial request. The overall
throughput is reduced by the number of subrequests. These subrequests are served at the cache bandwidth in the case of a cache hit, or at device bandwidth otherwise.

Texture and Surface Memory

Texture and surface memory reside in device memory and are cached in the texture cache. The cost of a texture or surface memory access therefore equals the cost of reading from device memory if the data is not in the cache, and the cost of reading from the texture cache otherwise. The texture cache is optimized for 2D spatial locality: threads of the same warp that read locations close to each other in 2D space achieve the best performance. The cache is designed for streaming fetches with constant latency; a cache hit thus reduces the DRAM bandwidth demand but not the fetch latency. Reading device memory through texture or surface fetches has a number of benefits which make it an advantageous alternative to reading device memory through global or constant memory:
• If accesses to global or constant memory violate the performance rules, a higher bandwidth can be achieved, provided there is locality in the texture fetches or surface reads.
• Address calculations are performed outside the kernel by dedicated units.
• Packed data can be transferred to separate variables in a single operation.
• 8-bit or 16-bit integers can be converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] [37].

2.4. Ray Reordering

In this subsection, some ray reordering techniques are considered, based on the articles by Garanzha and Loop [14] and Moon et al. [35]. The goal is to find ray reordering methods suitable for the task at hand.

Garanzha and Loop [14] use ray sorting to boost the efficiency of ray tracing by revealing coherence between rays and reducing the number of execution branches within a SIMD processor. For the ray sorting they propose a method based on compression of key-index pairs.
The compressed data is then sorted and decompressed. The sequence of key-index pairs is generated by using the ray id as the index and a hash value for the given ray as the key. The coordinates of the ray origins are quantized using a virtual uniform 3D grid within the volume bounding the scene. Ray directions are quantized using a virtual uniform grid as well. Using these grids, the authors calculate ray ids which are then merged into a 32-bit hash value. Rays which have the same hash value are considered to
be coherent in space. The compressed data is then sorted using radix sort. After the data is decompressed, packet ranges are extracted using the same compression procedure. Once the ranges are extracted, the next step is to create a frustum for each packet. The frustums are traversed using a breadth-first algorithm. Next, the active frustums are decomposed into chunks of at most 32 rays, analogously to a CUDA warp. This eliminates execution branches within a CUDA warp. Primary rays are indexed and sorted according to a screen-space Z-curve. A binary BVH is built on the CPU using a binning algorithm. The algorithm is compared with a depth-first algorithm for the ray tracing. They obtain significant performance improvements for soft shadow rays at 1024 × 768 × 16 samples. Compared to the CPU, the GPU implementation is 4× faster. However, the authors note that memory consumption could be significant. A bad case for the algorithm also arises if one frustum captures all of the BVH leaves, which can cause a very unbalanced workload.

Moon et al. [35] implemented ray tracing with cache-oblivious ray reordering. For the ray sorting, the authors introduce a Hit Point Heuristic. A hit point is computed as the first intersection point between the scene and a line starting from the ray origin and pointing in the ray direction. After this, the points are reordered using a space-filling curve (Z-curve). During the implementation stage, Hilbert curves were also considered, but they gave only slight performance benefits (about 2%) while having a much more complex implementation. The ray tracer is implemented on the CPU. The method is tested for path tracing as well as for photon mapping. For path tracing, their method achieves a significant 16.83× performance improvement compared to tracing without reordering. For photon mapping, the method in different configurations gives a 3.77× to 12.28× performance improvement. Ray reordering also results in higher cache utilization.
Furthermore, ray reordering based on the Hit Point Heuristic shows better performance than reordering based on origin and direction. However, the authors mention that there is no guarantee that the method will improve the performance of ray tracing, because of its overhead.

Altogether, the goal is accomplished: methods for ray reordering using different techniques have been considered. Hash values for rays can be generated by quantizing the ray origin and direction, or based on quantization of spatial information such as hit point coordinates. According to their hash values, rays can be sorted in different ways, for example using radix sort or space-filling curves. Moreover, in the considered articles the authors report significant performance improvements for tracing with ray sorting.
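As a sketch of the origin-and-direction hashing just summarized: the snippet below quantizes a ray's origin on a virtual grid over the scene bounds and its direction components on a grid over [-1, 1], then packs the cell indices into one 32-bit key. The grid resolutions (16³ for origins, 64 cells per direction axis) and the bit layout are assumptions for illustration, not the exact scheme of Garanzha and Loop [14].

```cpp
#include <cassert>
#include <cstdint>

struct Ray { float ox, oy, oz, dx, dy, dz; };

// Quantize a value from [lo, hi] onto a grid with `cells` cells.
static std::uint32_t quantize(float v, float lo, float hi, std::uint32_t cells) {
    float t = (v - lo) / (hi - lo);      // normalize to [0, 1]
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    std::uint32_t c = static_cast<std::uint32_t>(t * cells);
    return c < cells ? c : cells - 1;    // clamp the v == hi case
}

// Pack quantized origin (4 bits per axis) and quantized direction
// (6 bits per axis) into a single 32-bit hash. Rays with equal hashes
// are treated as spatially coherent.
std::uint32_t rayHash(const Ray& r, float sceneLo, float sceneHi) {
    std::uint32_t ox = quantize(r.ox, sceneLo, sceneHi, 16);
    std::uint32_t oy = quantize(r.oy, sceneLo, sceneHi, 16);
    std::uint32_t oz = quantize(r.oz, sceneLo, sceneHi, 16);
    std::uint32_t dx = quantize(r.dx, -1.0f, 1.0f, 64);
    std::uint32_t dy = quantize(r.dy, -1.0f, 1.0f, 64);
    std::uint32_t dz = quantize(r.dz, -1.0f, 1.0f, 64);
    return (ox << 28) | (oy << 24) | (oz << 20)
         | (dx << 14) | (dy << 8) | (dz << 2);
}
```

Rays whose origins fall into the same grid cell and whose directions quantize identically receive the same key, so a subsequent radix sort over these keys groups coherent rays together.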
3. Problem and Solution

3.1. An experiment with dimensionality of context launches

The first problem considered is how the dimensionality of a computational problem affects the efficiency of ray tracing.

3.1.1. Problem Description

The problem is described in Redmine ticket #155, "Experiment with dimensionality of context launches". The ticket has the following content: Chapter 9, "Performance Guidelines", of the OptiX Programming Guide [38] states that the maximum coherence between threads of a tile is achieved by choosing an appropriate dimensionality for the launch. For example, common problems with 2D images have 2D complexity. Thus the problem is reduced to determining the launch dimension and investigating how it affects efficiency.

3.1.2. Theory

To describe the solution to the problem, it is necessary to start with definitions of space-filling curves and the Hilbert curve.

Space-filling Curve

A space-filling curve is defined in the following way [3]: given a mapping f : I → Rn, the corresponding curve f∗(I) is called a space-filling curve if the Jordan content of f∗(I) is larger than 0.

Hilbert Curve

The Hilbert curve is defined as [3]:
• each parameter t ∈ I = [0, 1] is contained in a sequence of intervals I ⊃ [a1, b1] ⊃ ... ⊃ [an, bn] ⊃ ... where each interval results from a division-by-four of the previous interval
• each such sequence of intervals can be uniquely mapped to a corresponding sequence of 2D intervals (subsquares)
• the sequence of 2D intervals converges to a unique point q ∈ Q = [0, 1] × [0, 1]; q is defined as h(t)

h : I → Q then defines a space-filling curve, the Hilbert curve.
Grammar for 2D Hilbert Curve

A grammar for the 2D Hilbert curve can be constructed in the following way [3]:
• non-terminal symbols: H, A, B, C; start symbol H
• terminal characters: ↑, ↓, ←, →
• productions:
  H ← A ↑ H → H ↓ B
  A ← H → H ↑ H ← B
  B ← C ← H ↓ H → B
  C ← B ↓ H ← H ↑ B
• replacement rule: in any word, all non-terminals have to be replaced at the same time → L-system (Lindenmayer)

The arrows describe the iterations of the Hilbert curve in "turtle graphics" [43]. Figure 3.1 shows a sample 2D Hilbert curve generated using the grammar.

Figure 3.1.: An example of a 2D Hilbert curve

Grammar for 3D Hilbert Curve

L-systems in three dimensions can be described using "turtle graphics" [43]. The basic idea is to represent the turtle orientation in 3D space by a set of vectors $[\hat{H}, \hat{L}, \hat{U}]$ representing the turtle's heading, left direction and upward direction accordingly. The vectors $[\hat{H}, \hat{L}, \hat{U}]$ form an orthonormal basis. Spatial rotations of the turtle can then be described as

$$[\hat{H}', \hat{L}', \hat{U}'] = [\hat{H}, \hat{L}, \hat{U}]\,R \qquad (3.1)$$

where R is a 3 × 3 rotation matrix. Rotations by angle α around the vectors $\hat{U}$, $\hat{L}$ and $\hat{H}$ are represented by the following matrices:
Figure 3.2.: Controlling the turtle in 3D [43]

$$R_U(\alpha) = \begin{pmatrix} \cos\alpha & \sin\alpha & 0 \\ -\sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad R_L(\alpha) = \begin{pmatrix} \cos\alpha & 0 & -\sin\alpha \\ 0 & 1 & 0 \\ \sin\alpha & 0 & \cos\alpha \end{pmatrix} \quad R_H(\alpha) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix} \qquad (3.2)$$

The following symbols determine the turtle's orientation in space:

+ Turn left by angle δ, using rotation matrix $R_U(\delta)$
− Turn right by angle δ, using rotation matrix $R_U(-\delta)$
& Pitch down by angle δ, using rotation matrix $R_L(\delta)$
∧ Pitch up by angle δ, using rotation matrix $R_L(-\delta)$
\ Roll left by angle δ, using rotation matrix $R_H(\delta)$
/ Roll right by angle δ, using rotation matrix $R_H(-\delta)$
| Turn around, using rotation matrix $R_U(180°)$

An interested reader can find a grammar for the 3D Hilbert curve in appendix A.3.

Z-Curve

Z-curves are defined in terms of Morton codes [24]. To calculate Morton codes, it is necessary to consider the binary representation of point coordinates in 3D space, as shown in figure 3.3. First, for each coordinate, the binary code is expanded by inserting two additional "gaps" after each bit. Second, the binary codes of all coordinates are joined (interleaved) to form one binary number. If the resulting codes are sorted in ascending order, this determines the sequence of the Z-curve in 3D space (the left part of figure 3.3). The sorting can be performed using radix sort. The expansion and interleaving of bits can be done efficiently by utilizing the arcane bit-swizzling properties of integer multiplication. A curious reader will find a listing in appendix A.1.
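As a sketch of that bit expansion, the snippet below uses the well-known multiply-and-mask constants for interleaving three 10-bit coordinates into a 30-bit Morton code; it is not the listing from appendix A.1.

```cpp
#include <cassert>
#include <cstdint>

// Expand a 10-bit integer so that its bits occupy every third position,
// inserting two zero "gaps" after each bit via integer multiplication
// and masking (unsigned arithmetic wraps by design here).
std::uint32_t expandBits(std::uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Interleave three 10-bit grid coordinates into one 30-bit Morton code;
// sorting these codes in ascending order yields the 3D Z-curve order.
std::uint32_t morton3D(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}
```

For example, expandBits turns the two-bit pattern 0b11 into 0b1001001... truncated to the two occupied positions, so neighboring cells receive numerically close codes.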
Figure 3.3.: Generation of Morton codes [24]

3.1.3. Problem Solution

The problem solution can be sketched in the following way:
1. Sort the ray buffer according to spatial ray coordinates
2. Initialize the context with the sorted buffer, depending on the dimensionality:
 – For 1D, use the sorted buffer directly
 – For 2D, map 1D indices to a 2D array structure using a 2D Hilbert curve
 – For 3D, map 1D indices to a 3D array structure using a 3D Hilbert curve

The points of this sketch are discussed in more detail next.

Sorting

Histogram and Hilbert Curve The first approach is to generate indices for every ray in 3D and sort them according to a Hilbert curve.

Ray indices Rays generated for the wavetracer represent uniformly distributed points on a unit sphere. Their coordinates can be quantized. The quantization is equivalent to a redistribution of the rays over a three-dimensional spatial data structure (a histogram) according to their directions. For each element in the ray buffer, indices can be generated depending on the number of bins in the histogram. Using these indices, the ray can be added to a bin of the histogram. A pseudocode for the algorithm is shown by Algorithm 1. A geometrical interpretation of the algorithm is shown by figure 3.4.

Hilbert curve Once the data structure is obtained, it can be sorted using a 3D Hilbert curve or, which is the same, mapped from a 3D to a 1D data structure. The sorted buffer can be used directly for the context initialization. The Hilbert curve generated for 16 × 16 × 16 bins is shown by figure 3.5.

Morton codes and Radix Sort The second approach is a logical continuation of the previous one. Morton codes order rays according to their spatial neighborhood in Z-order. Using Morton codes, rays can be sorted with radix sort.
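The 1D-to-2D Hilbert mapping mentioned in the solution sketch above can be illustrated with the classic public-domain algorithm below; this is a sketch of the mapping step, not the thesis' implementation.

```cpp
#include <cassert>
#include <utility>

// Rotate/flip a quadrant appropriately (helper for the Hilbert mapping).
static void rot(int n, int* x, int* y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        std::swap(*x, *y);
    }
}

// Map a 1D index d to 2D coordinates (x, y) on an n x n grid (n a power
// of two), following the Hilbert curve.
void d2xy(int n, int d, int* x, int* y) {
    int rx, ry, t = d;
    *x = *y = 0;
    for (int s = 1; s < n; s *= 2) {
        rx = 1 & (t / 2);
        ry = 1 & (t ^ rx);
        rot(s, x, y, rx, ry);
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}
```

Walking d from 0 upward visits grid cells so that consecutive indices are always neighbors in 2D, which is exactly the locality property the experiment relies on.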
Algorithm 1 Algorithm for histogram generation
for all element in Buffer do
  x0 ⇐ (element.x + 1)/2
  y0 ⇐ (element.y + 1)/2
  z0 ⇐ (element.z + 1)/2
  x ⇐ floor(x0 ∗ bin_num/2)
  y ⇐ floor(y0 ∗ bin_num/2)
  z ⇐ floor(z0 ∗ bin_num/2)
  list ⇐ histogram[x][y][z]
  v.x ⇐ x0
  v.y ⇐ y0
  v.z ⇐ z0
  list.pushBack(v)
  histogram[x][y][z] ⇐ list
end for

Morton codes An efficient implementation for the generation of Morton codes was shown in the previous subsections. Figure 3.6 shows the pattern generated by the algorithm when sorting rays.

Radix Sort CUDA already provides an efficient implementation of this algorithm.

Transformation between Spatial Structures

To experiment with dimensionality, it is necessary to transform spatial structures from 1D to 2D or 3D data structures. This can also be achieved using Hilbert curves.

Mapping from 1D to 2D The sorted buffer can be mapped from 1D to 2D using a 2D Hilbert curve.

Mapping from 1D to 3D The mapping between 1D and 3D can be achieved using a 3D Hilbert curve.

Implementation details will be discussed in the appropriate chapter.

3.2. Application of frame coherence

The second problem is an investigation of the influence of frame coherence on the wavetracer performance.

3.2.1. Problem Description

The task is formulated in Redmine ticket #218, "Exploit frame coherence". The task has the following objectives:
1. Find types of coherence which exist in the system
 a) Measure coherence
Figure 3.4.: Uniformly distributed rays and histogram bins

2. Find schemas (algorithms) which allow exploiting them
3. Propose an efficient implementation for the algorithms
4. Implement the proposed solution
5. Measure performance
6. What are the costs of coherence utilization in the system?

In the following subsections, the first two points will be considered.

3.2.2. Coherence

According to "A Dictionary of Statistics" [51], coherence is a "term used to describe the resemblance between the fluctuations displayed by two time series; an analogue of correlation".

Innerframe Coherence

In the context of the given work, innerframe coherence means that there is a correlation between the results of ray tracing inside one frame. Rays with high coherence can be combined into groups, inside which it is necessary to trace only one ray [42]. Questions which can be posed here:
1. How to know which rays have to be coalesced into one group?
2. How are the results of the tracing to be stored in the cache?
3. How to calculate the error which this approach introduces?

Figure 3.5.: Hilbert curve generated for 16 × 16 × 16 bins

Intraframe Coherence

Intraframe coherence means that there is a correlation between the results of the ray tracing procedure for different frames. A result of the tracing procedure for any ray can be stored in the cache and used in the following frames. Questions which can be posed here:
1. How to measure the coherence between frames?
2. How to know which data can be reused in the next frames?
3. Which caching strategy to choose for purging the cache?

Coherence Measurement

Both innerframe and intraframe coherence can be measured using the Pearson product-moment correlation coefficient [32]:

$$r_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2 \sum_{i=1}^{N}(y_i - \bar{y})^2}} \qquad (3.3)$$

where x and y are two random variables with N observations. For the innerframe coherence, the autocorrelation function can be used [32]:

$$r_k = \frac{\sum_{i=1}^{N-k}(x_i - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=1}^{N}(x_i - \bar{x})^2} \qquad (3.4)$$
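A direct host-side computation of the correlation coefficient (3.3) and the autocorrelation (3.4) can be sketched as follows; function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson product-moment correlation of two equally long series, eq. (3.3).
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double num = 0.0, dx = 0.0, dy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (x[i] - mx) * (y[i] - my);
        dx  += (x[i] - mx) * (x[i] - mx);
        dy  += (y[i] - my) * (y[i] - my);
    }
    return num / std::sqrt(dx * dy);
}

// Autocorrelation coefficient of a series at lag k, eq. (3.4).
double autocorr(const std::vector<double>& x, std::size_t k) {
    const std::size_t n = x.size();
    double m = 0.0;
    for (double v : x) m += v;
    m /= n;
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < n; ++i) den += (x[i] - m) * (x[i] - m);
    for (std::size_t i = 0; i + k < n; ++i) num += (x[i] - m) * (x[i + k] - m);
    return num / den;
}
```

Perfectly correlated series yield r = 1, anti-correlated series r = -1, and an alternating series has a negative autocorrelation at lag 1.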
Figure 3.6.: Z-curve

The quantity r_k is called the autocorrelation coefficient at lag k. The calculation of these values will be covered in more detail in the chapter dedicated to testing.

3.2.3. Formulation of Caching Scheme

In the Literature Review, a survey of caching schemas has already been carried out. Using the ideas stated by Chan [7], it is possible to build a caching system for the given task.

Cache Tree

In Chan's [7] work, each object consists of several surfaces, each surface is divided into several levels of patches, and every patch is further quantized depending on the angular values of incident rays.

Objects In this work, there is only one type of object: a model instance. Models are distinguished by their ids, which are defined in the configuration file. The environment is also loaded as a model instance, which usually has -1 as its id. The roots of the cache trees can therefore be characterized by these identifiers.

Patches There is a natural division of such objects into patches, given by their primitive indices. These primitive indices provide patches of native precision for the objects, so results cached for a given object can be discriminated using these indices.

Divisions Patches, or primitive indices, are responsible for spatial accuracy. However, rays also have to be distinguished by angular precision. The angular values of incident rays represent spatial coordinates on a sphere of unit length. These coordinates are also quantized using some large number defining the quantization precision. The quantized coordinates introduce divisions which further distinguish the incident rays.
Division Index The results of the tracing procedure are thus stored in the cache using a multicomponent division index. The index consists of the model instance id, the primitive index and the three quantized angular coordinates. The concept is shown by figure 3.7.

Figure 3.7.: Cache Tree

Cache Construction

Initially the cache is empty. In general, a trace path is represented in the system as a sequence of points. The data associated with the results of tracing for any ray can be of different types, for example: miss, reflection, diffraction, receiver hit, emitter launch. The per-ray data carries all the information necessary for any of these types. The most important information is the origin of the point where an intersection or some other tracing event occurs, the new ray direction (i.e. the direction of the outgoing ray), the instance id (i.e. the identifier of the model where the tracing event occurs) and the primitive index. When a ray is constructed, the results of the previous tracing are used. It can thus be observed that there is a one-to-one association between the results of the previous tracing and the results of the current tracing. Consequently, the results of the previous tracing can be used as a key, whereas the results of the current tracing can be seen as the data. In the case when there are no previous results, i.e. when a ray is taken from the buffer, both the instance id and the primitive index of the first key are set to zero and the divisions are obtained by quantization of the initial ray direction. The cache construction is illustrated by figure 3.8. All subsequent rays query the cache using the multicomponent index. If such an entry exists, the result stored for the entry is taken for the ray. Otherwise the ray is traced and the result is saved to the cache under the multicomponent index. An algorithm for cache construction is shown by Algorithm 2. Note that if the query returns false, the entry is overwritten with a new value.
For convenience of retrieval, cache entries which belong to the same path can be linked in a list.
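A host-side illustration of this query-or-trace step (the core of Algorithm 2) is sketched below. The member names follow the data model of this thesis, but the use of std::unordered_map and the helper names are illustrative only; the GPU version cannot allocate dynamically.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Multicomponent division index: model instance, patch, quantized direction.
struct CacheKey {
    std::int32_t instanceId;         // model instance (environment uses -1)
    std::int32_t primitiveIndex;     // patch: the mesh triangle index
    std::uint32_t divX, divY, divZ;  // quantized direction angles

    bool operator==(const CacheKey& o) const {
        return instanceId == o.instanceId && primitiveIndex == o.primitiveIndex
            && divX == o.divX && divY == o.divY && divZ == o.divZ;
    }
};

// A simple illustrative hash combining all key components.
struct CacheKeyHash {
    std::size_t operator()(const CacheKey& k) const {
        std::size_t h = static_cast<std::size_t>(k.instanceId);
        h = h * 31 + static_cast<std::size_t>(k.primitiveIndex);
        h = h * 31 + k.divX;
        h = h * 31 + k.divY;
        return h * 31 + k.divZ;
    }
};

struct TraceResult { float ox, oy, oz; int type; };

using TraceCache = std::unordered_map<CacheKey, TraceResult, CacheKeyHash>;

// Reuse a cached result if the key of the previous tracing step is
// present; otherwise trace and store the new result under that key.
TraceResult lookupOrTrace(TraceCache& cache, const CacheKey& key,
                          TraceResult (*trace)(const CacheKey&)) {
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;
    TraceResult r = trace(key);
    cache.emplace(key, r);
    return r;
}
```

The second query with the same key is served from the cache without invoking the trace function again.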
Figure 3.8.: Cache Construction

Cache Purging

To purge the cache, it is necessary to know which entries are no longer valid. There are three cases of change of a steady-state ray configuration: emitter movement, receiver movement and (reflecting) object movement.

Emitter Movement

In this paragraph, it is analysed how to purge the cache in the case of emitter movement. There are two possible solutions: position purging and precision purging.

Position Purging We can define for what range of variation of the emitter position a change will not produce new tracing results. To do this, emitter positions are quantized, which introduces further divisions for a primitive index (patch). A cache entry is thus characterized by three additional coefficients which represent the emitter position for the ray. These coefficients are stored as an additional key for the entry (the position stamp). Every ray has its own position stamp. The cache is thus queried with a multicomponent key; if the entry has an outdated position stamp, it should be purged. An algorithm for the cache query is shown by Algorithm 3. This position stamp obtained by quantization of the emitter
Algorithm 2 Algorithm for cache construction
for all ray in rayStack do
  mcid ⇐ generate_multicomponent_index(ray)
  pk ⇐ generate_position_key(emitter_position)
  query ⇐ cache.contains(pk, mcid)
  if query then
    result ⇐ cache.get_entry(mcid)
  else
    result ⇐ trace(ray)
    save_to_cache(pk, mcid, result)
  end if
end for

Algorithm 3 Algorithm for cache query
Require: multicomponent_index, position_stamp
if cache.contains(multicomponent_index) then
  node ⇐ cache.get(multicomponent_index)
  if node.position_stamp equals position_stamp then
    return TRUE
  end if
end if
return FALSE

position introduces a degree of flexibility which allows cache data to be used between frames if the variation of the emitter position stays within a certain precision range.

Precision Purging Precision purging is characterized by the fact that a change of the emitter position hash does not necessarily mean that the cache has to be purged. Instead, an estimation is calculated of how far the parameters of the requested element are from what is currently stored at this address in the cache. This is called the residual value. If this value exceeds a threshold, the cache is purged at the requested address. The implementation will be described in the implementation chapter.

Receiver Movement

Clearing the cache when the receiver moves can be performed as follows. As the trace path is represented in the system by a sequence of points, there are three cases:
1. the intersection point is on the environment
2. the intersection point is on another moving object
3. the intersection point is on the receiver

In the case where the intersection point is on the environment, it cannot be claimed that the associated cache entry is invalid, since the ray path has not reached a receiver yet. This entry is therefore valid, since it is not associated with the receiver which changed its position. The second
case is the same as the first one. The third case can be checked by verifying the hit point position against the receiver positions present in the scene. The hit point position should be within the radius of the receiver antenna, which in this case is 1. The entry is invalidated if the test is unsuccessful. The concept is illustrated by figure 3.9.

Figure 3.9.: Receiver movement

A pseudocode for this test is represented by Algorithm 4.

Object Movement

This is the most complex case. It happens when a ray reflects from another moving object (not the environment), for example another car. One possible approach to this problem is to create a more complex position key, meaning that the key reflects the current positions of all moving objects in the scene, a sort of map. Once any object in the map makes a movement, a new key is generated. It will differ from the previous one if the movement exceeds the quantization precision. However, in the current implementation the number of such cases is negligible, since most of the virtual drive recordings use the same speed for all cars. This means that the cars move with constant speed without overtaking each other.
Algorithm 4 Algorithm to check intersection with receiver
Require: node, radius
data ← node.data
for all antenna in antennaBuffer do
  if data.type equals RECEIVER_HIT then
    recPos ← antenna.position
    pos ← data.origin
    dx ← abs(recPos.x − pos.x)
    dy ← abs(recPos.y − pos.y)
    dz ← abs(recPos.z − pos.z)
    dr ← sqrt(dx² + dy² + dz²)
    if dr ≤ radius then
      return TRUE
    end if
  end if
end for
return FALSE
Part III. Analysis and Implementation
4. Analysis and Modelling

4.1. Ray Reordering

4.1.1. Code Analysis

The best place to put the ray sorting is where the ray buffer is created, in class RandomEmitterBuffer in module antennageometry.cpp.

Existing Code

In the constructor of RandomEmitterBuffer, the ray buffer is created and formatted using the OptiX context. The constructor calls the resize method. The method resize sets the size of the buffer, fills the buffer elements by calculating uniformly distributed spherical coordinates, and calls rtBufferMarkDirty, which notifies OptiX that the content of the buffer has changed.

Modification

For convenience, it is possible to create a variable reorder which indicates whether the buffer is to be sorted or not. If the variable is set to true, the sort method is called. Listing 4.1 shows the modification of the method resize.

Listing 4.1: Resize with a call to the sort method

void RandomEmitterBuffer::resize(const RTsize size) {
    buffer->setSize1D(size);
    fill();
    if (reorder)
        sort();
    buffer->markDirty();
}

4.2. Frame Coherence

4.2.1. Code Analysis

Schematically, the main tracing loop is represented by figure 4.1. It consists of three main stages:
1. Analysis of the stack top element
2. Ray tracing
3. Storing the results on the top of the stack

Figure 4.1.: The main tracing loop

Analysis of the stack top element

This part of the algorithm is the most complex, with a lot of branches. Its purposes are:
1. Choosing the direction for the top element
2. Writing the results of the tracing to the WayPointBuffer, depending on the configuration parameters
3. Unwinding the stack

An interested reader will find a chart in appendix B.
Raytracing

At this stage, data is taken from the stack, a ray is constructed and traced. This step fits naturally for the cache implementation. An extended flowchart is shown by figure 4.2. Here, the variable cacheEnabled is introduced for convenience of turning the cache on and off. If the cache is enabled, the cache is queried as to whether it contains an entry with the key defined by prev_data. If the cache contains the key, the data is set directly in the function call. Otherwise the ray is traced and the results are saved to the cache. The last step is setting the value of data to prev_data, so as to use it in the next iteration.

Figure 4.2.: Ray tracing with cache

Data saving

The last stage consists of pushing data onto the stack and incrementing the stack counter. The stage is shown by figure 4.3.

4.2.2. Selection of Data Structure

The following questions arise when selecting a data structure for the cache construction:
1. How to implement the tree described in the previous chapters for the cache construction?
Figure 4.3.: Saving data

2. If such trees can be efficiently constructed, how to maintain them?
3. How to implement a fast data search in such trees?

In our case, the cache trees cannot be implemented directly as described by Chan [7]: firstly, because this requires dynamic allocation of memory, which is disabled in OptiX; secondly, because searching for elements in such data structures is challenging. To make the search more convenient, it would be necessary to maintain additional data structures performing indexation. Given all this, the construction of such trees would be difficult.

Binary Tree

The first idea that comes to mind is to use a binary tree. It is easy to construct and maintain, and the complexity of search in a binary tree is O(log(n)). An object hierarchy can be maintained by hashing the multicomponent indices. On the other hand, the hash values can be used to determine a total ordering of the elements of the tree. The selection of an appropriate hash function is a separate issue which will be regarded later. The problem with dynamic allocation can be solved using a buffer of predefined elements. The tree is illustrated by figure 4.4. However, the binary tree has one major defect: all elements of the tree have to be added through the root element. The root element has to contain a counter which indicates the next element in the buffer of predefined elements. In the case of massively parallel computations, GPU streams have to access the counter successively in order to preserve data consistency. This counter is the main bottleneck of the binary tree.

Chained Hash Table

A natural solution to the problem described in the previous paragraph is to construct trees in parallel. This resolves collisions of streams trying to access the counter and reduces the waiting time in the queue. The streams are distributed over various root elements depending on the hash value of the added element. Such a data structure is called a chained hash table [29].
It is preferable to the binary tree; however, performance analysis of the chained table shows that accesses to the root elements (buckets) and to the chained elements of the table
Figure 4.4.: Binary Tree

have different access times. This difference reduces the overall performance of the table, making it not very profitable. Also, since the element buffer has a fixed size, it is necessary to reserve a certain number of elements for each bucket. This creates a fragmentation of the element buffer. The data structure is illustrated by figure 4.5. The implementation of the chained hash table is described in Implementation of Data Structure 5.2.2. The performance analysis is described in the Testing chapter 6.

Open-Addressed Hash Table

A further improvement of the data structure is to permit direct access to the elements of the table. This solution has several benefits:
1. Further reduction of stream collisions
2. Absence of buffer fragmentation
3. Fast access to elements

Direct mapping allows collisions of GPU streams to be reduced further. The absence of tree construction solves the problem of buffer fragmentation. Direct access to buffer elements makes a fast implementation of the cache read/write operations possible. Nevertheless, there exists an implementation pitfall connected with the OptiX buffer, which will be described in the implementation subsection. The implementation of the open-addressed hash table [29] is described in the Implementation of Data Structure subsection 5.2.2. The performance analysis is described in the Testing chapter 6.
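A minimal host-side sketch of such a fixed-capacity open-addressed table with linear probing is shown below. The names are illustrative, and a GPU version would replace the probe loop's key write with an atomic compare-and-swap; this is not the thesis' OptiX implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel marking an unoccupied slot.
constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;

class OpenTable {
public:
    // Fixed capacity up front: no dynamic growth, mirroring the
    // preallocated-buffer constraint described in the text.
    explicit OpenTable(std::size_t capacity)
        : keys_(capacity, kEmpty), data_(capacity, 0) {}

    // Insert (or overwrite) a key; returns false if the table is full.
    bool insert(std::uint32_t key, std::uint32_t value) {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t slot = (key + i) % keys_.size();  // linear probing
            if (keys_[slot] == kEmpty || keys_[slot] == key) {
                keys_[slot] = key;
                data_[slot] = value;
                return true;
            }
        }
        return false;
    }

    // Find a key; returns true and writes the stored value on a hit.
    bool find(std::uint32_t key, std::uint32_t* value) const {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t slot = (key + i) % keys_.size();
            if (keys_[slot] == kEmpty) return false;  // hole: key absent
            if (keys_[slot] == key) { *value = data_[slot]; return true; }
        }
        return false;
    }

private:
    std::vector<std::uint32_t> keys_;
    std::vector<std::uint32_t> data_;
};
```

Because every slot is addressed directly from the key, there is no per-bucket reservation and hence no fragmentation of the element buffer.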
Figure 4.5.: Chained Hash Table

4.2.3. Selection of Data Model

According to the theoretical justification described in chapter Problem and Solution 3 and the selected data structure, the data model is described in this subsection.

Cache Key

CacheKey is a structure holding the multicomponent key parameters of a ray according to the definition in 3.2.3. A cache key contains the following members:

primitiveIndex - identifier of the mesh triangle
instanceId - identifier of the loaded model
div_x - x division of the direction angle
div_y - y division of the direction angle
div_z - z division of the direction angle
hash - hash value of the key
calc_hash() - function which calculates the hash value using the members of the key

Position Key

A position key contains the following members:

pos_x - x division of the emitter position
    4.2. Frame Coherence posy - y division of emitter position pos z - z division of emitter position phash - hash value of the key pos hash() - function which calculate hash value using the members of the key Cache Node A cache node or element of the hash table contains the following members: ckey - cache key pkey - position key data - per ray load data timestamp - timestamp of creation used - usage marker hit - number of hits These are the basic elements of the node. The node member list will be extended depend- ing on the task. Per Ray Data To get the overall view of the data model, it is necessary to give a description of PerRay- Data. A struct PerRayData has the following members: diffractionCnt - parameter from the main loop to the hit programs type - The type of the waypoint. This value is set by the hit program. nextOrigin - origin of the next launch nextDirection - direction of the next launch receiverDistance, diffEdge, diffStepSize, diffSteps - all these are used in the closest hit program of the receiver primitiveIndex - the same of the cache key instanceId - the same of the cache key normal - the normal of the triangle on which the ray was reflected emitterSlot - the slot of the antenna which emitted this ray emitterModelInstanceId - the same as modelInstanceId in the cache key 49
4. Analysis and Modelling

Data Model

The overall data model is shown in figure 4.6. In reality, CacheNode does not contain CacheKey and PositionKey; it only contains their hash values. This is done to minimize the memory size of CacheNode and, consequently, of the hash table on the GPU.

Figure 4.6.: Data Model

4.2.4. Selection of Hash Function

Selection Criteria

Usually, when a hash function is selected, the following criteria are used:

1. The hash function should spread elements across the table in a random and uniform manner.
2. Collisions of hash values for different elements should be minimal.
4.2. Frame Coherence

The first condition is necessary in order to distribute elements across all buckets uniformly, so that all buckets contain approximately the same number of elements. The second condition ensures that every element is identified in a unique way. However, based on the chosen solution approach, it is necessary to add a third condition, namely that keys with similar parameters should have close hash values. The question is whether conditions one and two are compatible with it. This should not be a problem, because the rays are already uniformly distributed on a unit sphere.

Hash Function with Uniform Distribution

The key is represented as an array of integers, and a hash value has to be generated from this array. The code for such a function can be taken, for example, from Morin [36]. Listing 4.2 shows the code.

Listing 4.2: Hash function for an integer array

unsigned hashCode() {
    long p  = (1L << 32) - 5;   // prime: 2^32 - 5
    long z  = 0x64b6055aL;      // 32 bits from random.org
    int  z2 = 0x5067d19d;       // random odd 32-bit number
    long s  = 0;
    long zi = 1;
    for (int i = 0; i < x.length; i++) {
        // reduce to 31 bits
        long xi = (ods::hashCode(x[i]) * z2) >> 1;
        s  = (s + zi * xi) % p;
        zi = (zi * z) % p;
    }
    s = (s + zi * (p - 1)) % p;
    return (int) s;
}

In this listing, x is an array of integers. The integers are hashed using a multiplicative hash function with d = 31 to reduce each hash code to a 31-bit representation. This is done so that additions and multiplications can be carried out using 63-bit arithmetic. The probability for two different sequences to have the same hash code is bounded by [36]

2/2^31 + r/(2^32 - 5)    (4.1)

where r is the length of the sequences.

Hash Function Preserving Data Locality

Space-filling curves can again be used to generate codes for multicomponent keys. The key is an integer array with five components: instanceId, primitiveIndex, div_x, div_y, div_z. Listing 4.3 shows how the Morton code generator can be altered for 5D [20].
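A self-contained C version of the scheme in Listing 4.2 might look as follows. This is a sketch under two assumptions: the per-element hash is taken to be the integer value itself (Morin's ods::hashCode may differ), and the final term is rewritten as (p - zi), which equals zi*(p-1) mod p but avoids 64-bit overflow:

```c
#include <stdint.h>

/* Multiplicative array hash after Morin [36]; per-element hash assumed
   to be the element itself. All arithmetic fits in signed 64 bits. */
unsigned array_hash(const int32_t *x, int n) {
    const int64_t  p  = (1LL << 32) - 5;  /* prime: 2^32 - 5 */
    const int64_t  z  = 0x64b6055aLL;     /* fixed 32-bit random value */
    const uint32_t z2 = 0x5067d19d;       /* random odd 32-bit number */
    int64_t s = 0, zi = 1;
    for (int i = 0; i < n; i++) {
        /* multiplicative hash wrapped mod 2^32, then reduced to 31 bits
           so that the product zi * xi stays below 2^63 */
        int64_t xi = (uint32_t)((uint32_t)x[i] * z2) >> 1;
        s  = (s + zi * xi) % p;
        zi = (zi * z) % p;
    }
    s = (s + (p - zi)) % p;  /* = (s + zi*(p-1)) mod p, overflow-free */
    return (unsigned)s;
}
```

Because the per-element hash depends on the position weight zi, reordering the elements changes the result, which is the desired behaviour for sequence hashing.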
4. Analysis and Modelling

Listing 4.3: Morton code generator for 5D

unsigned int SeparateBy4(unsigned int x) {
    x &= 0x0000007f;
    x = (x ^ (x << 16)) & 0x0070000F;
    x = (x ^ (x <<  8)) & 0x40300C03;
    x = (x ^ (x <<  4)) & 0x42108421;
    return x;
}

MortonCode MortonCode5(unsigned int x, unsigned int y, unsigned int z,
                       unsigned int u, unsigned int v) {
    return SeparateBy4(x)
         | (SeparateBy4(y) << 1)
         | (SeparateBy4(z) << 2)
         | (SeparateBy4(u) << 3)
         | (SeparateBy4(v) << 4);
}

SeparateBy4 inserts four blank bits between every two bits in the binary representation of an integer. MortonCode5 interleaves the five binary representations using shift and or operations.

Double Hashing

In open-addressed hash tables, a mixed hash function is also used [29]. The function for double hashing is defined as

h(k, i) = (h1(k) + i * h2(k)) mod m    (4.2)

where h1 and h2 are two auxiliary hash functions and i goes from 1 to m - 1, m being the number of positions in the table. In this work, however, a simpler equation is used, with i and m set to 1. The justification is that probing of the hash table is not used, i.e. the insertion code does not look for unoccupied places in the table. Instead, the entire range of hash function values is mapped directly to a discrete set of buffer indices. The mapping function is described in the next subsection.

4.2.5. Selection of Mapping Function

A mapping function associates the range of hash function values with the set of buffer indices. The following formula performs the mapping:

f(h) = (1.0 + h/INT_MAX) * m/2    (4.3)

where h is a hash value, INT_MAX is a constant denoting the maximum integer of the system and m is the hash table size. The value of INT_MAX is defined by the ANSI standard; for 32-bit Unix systems it is 2,147,483,647 [56]. Thus, for m = 500000, the range of hash values mapped to one bucket is approximately 8590.
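The generator of Listing 4.3 can be exercised on the host to confirm that bit i of each input lands at position 5i of the code. The following is a plain C transcription of the same routines:

```c
/* Insert four zero bits between consecutive bits of a 7-bit input,
   so that bit i of the input moves to bit 5i of the output. */
static unsigned SeparateBy4(unsigned x) {
    x &= 0x0000007f;
    x = (x ^ (x << 16)) & 0x0070000F;
    x = (x ^ (x <<  8)) & 0x40300C03;
    x = (x ^ (x <<  4)) & 0x42108421;
    return x;
}

/* Interleave five 7-bit coordinates into one Morton code. */
static unsigned MortonCode5(unsigned x, unsigned y, unsigned z,
                            unsigned u, unsigned v) {
    return SeparateBy4(x) | (SeparateBy4(y) << 1) | (SeparateBy4(z) << 2)
         | (SeparateBy4(u) << 3) | (SeparateBy4(v) << 4);
}
```

For instance, SeparateBy4(0x7f) yields 0x42108421, i.e. all seven input bits spread to positions 0, 5, 10, 15, 20, 25 and 30, and MortonCode5(1,1,1,1,1) yields 0x1f.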
4.2. Frame Coherence

4.2.6. Selection of Synchronization Mechanism

The selection of a synchronization mechanism depends on the data structure which is going to be implemented. In the case of chained hash tables, it is possible to use lock-free synchronization [34]. Synchronization of read/write access in open-addressed hash tables can be implemented using atomic locks [47].

Lock-free Synchronization

In the lock-free style of programming [40], at least one thread always makes progress. All threads try to write their results to the concurrent data structure; on failure, a thread repeats the operation. Atomic operations are usually used for synchronization. The following code shows the atomicCAS operation as it is defined in CUDA [40].

Listing 4.4: atomicCAS [40]

int atomicCAS(int *p, int cmp, int v) {
    exclusive single thread {
        int old = *p;
        if (cmp == old)
            *p = v;
    }
    return old;
}

The next listing shows the insertion of an element into a lock-free linked list.

Listing 4.5: Insertion into a lock-free linked list [40]

void insert(ListNode mine, ListNode prev) {
    ListNode old, link = prev->next;
    do {
        old = link;
        mine->next = old;
        link = atomicCAS(&prev->next, link, mine);
    } while (link != old);
}

The idea behind lock-free data updates is that on every new cycle a new value is generated based on the current data. Then an atomicCAS operation is performed, trying to change the current data to the new value. If the operation is unsuccessful, it is repeated.

Atomic Lock Synchronization

In the locking style of programming [40], all threads try to take the lock. One thread acquires the lock, does its work, releases the lock, and so on. The next listing shows mutex synchronization using atomic locking.
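The same CAS loop can be reproduced on the host with C11 atomics. This is a sketch for illustration only: CUDA's atomicCAS is replaced by atomic_compare_exchange_weak, which on failure reloads the observed value into `old` so the retry uses fresh data:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct ListNode {
    int value;
    struct ListNode *_Atomic next;
} ListNode;

/* Insert `mine` directly after `prev`, retrying on concurrent updates. */
void list_insert(ListNode *mine, ListNode *prev) {
    ListNode *old = atomic_load(&prev->next);
    do {
        /* link the new node to the currently observed successor */
        mine->next = old;
        /* try to swing prev->next from old to mine; on failure, `old`
           is refreshed with the value another thread installed */
    } while (!atomic_compare_exchange_weak(&prev->next, &old, mine));
}
```

At least one competing inserter always succeeds per round, which is exactly the progress guarantee described above.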
4. Analysis and Modelling

Listing 4.6: Addition using an atomic lock [40]

int locked = 0;

bool try_lock() {
    int prev = atomicExch(&locked, 1);
    if (prev == 0)
        return true;
    return false;
}

bool unlock() {
    int prev = atomicExch(&locked, 0);
    if (prev == 1)
        return true;
    return false;
}

double atomicAdd(double *data, double val) {
    while (try_lock() == false);
    double old = *data;
    *data = old + val;
    unlock();
    return old;
}
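On the host, the same spinlock pattern maps onto C11's atomic_exchange. The sketch below mirrors Listing 4.6; the function name locked_add is chosen here to avoid clashing with CUDA's built-in atomicAdd:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int locked = 0;

/* Returns true if the lock was free and is now held by the caller. */
static bool try_lock(void) { return atomic_exchange(&locked, 1) == 0; }
static void unlock(void)   { atomic_store(&locked, 0); }

/* Spin until the lock is acquired, then add under mutual exclusion. */
double locked_add(double *data, double val) {
    while (!try_lock())
        ;                   /* busy wait, as in the thesis listing */
    double old = *data;
    *data = old + val;
    unlock();
    return old;             /* previous value, matching atomicAdd */
}
```

Unlike the lock-free variant, only one thread makes progress at a time here, which is why the thesis later measures both schemes against each other.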
5. Implementation

5.1. Implementation of Ray Reordering

During the research stage, 2D and 3D Hilbert curves, the Z-curve and the ray histogram were implemented. The implementation is performed on the CPU side, because the task is to explore how ray sorting affects efficiency. Based on the investigation results, the most efficient solution is found to be the combination of Z-curve and radix sort. The implementation of the Z-curve has been described in the previous chapter, and CUDA already provides an efficient implementation of the radix sort algorithm, so a standard GPU implementation exists. The results of the benchmarking are described in the next chapter.

Hilbert Curves

2D Hilbert Curve. The curve is implemented using turtle graphics with at most one turn after a step [21]. An interested reader can find the implementation in appendix A.2.

3D Hilbert Curve. The implementation of the 3D Hilbert curve on the CPU is straightforward. It follows the syntax given in appendix A.3; reproducing it here would be tedious for the reader.

Z Curve

The implementation of the Z-curve strictly follows the algorithm given in appendix A.1.

Ray Histogram

Ray histogramming is described by algorithm 1. The implementation corresponds exactly to the algorithm.

Radix Sort

The radix sort algorithm already has an efficient implementation in CUDA, see, for example, [33].

5.2. Implementation of Frame Coherence

This section describes the implementation of frame coherence according to the selected data model 4.2.3 and data structure 4.2.2.
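The Z-curve/radix-sort combination can be sketched on the CPU as follows: each ray direction is quantized, a 3D Morton code is computed, and the rays are sorted by that code. In this sketch, qsort stands in for the GPU radix sort and the 10-bit quantization is an assumption:

```c
#include <stdlib.h>

/* Spread the lower 10 bits of x so that bit i moves to bit 3i. */
static unsigned spread3(unsigned x) {
    x &= 0x3ff;
    x = (x | (x << 16)) & 0x030000FF;
    x = (x | (x <<  8)) & 0x0300F00F;
    x = (x | (x <<  4)) & 0x030C30C3;
    x = (x | (x <<  2)) & 0x09249249;
    return x;
}

typedef struct { float dir[3]; unsigned code; } Ray;

/* Quantize direction cosines from [-1,1] to [0,1023] and interleave. */
static void morton_code(Ray *r) {
    unsigned q[3];
    for (int i = 0; i < 3; i++)
        q[i] = (unsigned)((r->dir[i] + 1.0f) * 511.5f);
    r->code = spread3(q[0]) | (spread3(q[1]) << 1) | (spread3(q[2]) << 2);
}

static int by_code(const void *a, const void *b) {
    unsigned ca = ((const Ray *)a)->code, cb = ((const Ray *)b)->code;
    return (ca > cb) - (ca < cb);
}

/* Reorder rays along the Z-curve; qsort stands in for GPU radix sort. */
void reorder_rays(Ray *rays, size_t n) {
    for (size_t i = 0; i < n; i++) morton_code(&rays[i]);
    qsort(rays, n, sizeof(Ray), by_code);
}
```

After sorting, rays with nearby direction cosines sit next to each other in the buffer, which is the coherence property the benchmarks in chapter 6 exploit.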
5. Implementation

5.2.1. Implementation of Data Model

The implementation of the data model includes the implementation of CacheKey and CacheNode.

Cache Key

CacheKey is a data structure which contains the parameters of a multicomponent key 3.2.3.

Interface

setIndices
  __device__ void setIndices(uint16_t instanceId, uint32_t primitiveIndex, float3 inc_dir, int hash_method, int div)
  Return value: void
  Parameters:
    instanceId : identifier of the loaded model
    primitiveIndex : identifier of the mesh triangle
    inc_dir : the ray direction angle
    hash_method : id of the hash function
    div : quantization precision
  Description: Sets the key members and calculates the hash value for the key data.

equals
  __device__ bool equals(Key other)
  Return value: boolean
  Parameters:
    other : a key for comparison
  Description: Returns true if the key is equal to the key provided.

calc_hash1
  __device__ unsigned calc_hash1()
  Return value: unsigned integer
  Description: Calculates a hash value from the key members using a uniform random distribution [36].
5.2. Implementation of Frame Coherence

calc_hash2
  __device__ unsigned calc_hash2()
  Return value: unsigned integer
  Description: Calculates a hash value from the key members using Morton codes [20].

calc_hash3
  __device__ unsigned calc_hash3()
  Return value: unsigned integer
  Description: Calculates a hash value using a mix of two hashing functions [29].

separateBy4
  __device__ unsigned int separateBy4(unsigned int x)
  Return value: unsigned integer
  Parameters:
    x : the number whose binary representation is to be spread
  Description: Separates the bits of the given number by four bit places in its binary representation [20].

mortonCode5
  __device__ unsigned int mortonCode5(unsigned int x, unsigned int y, unsigned int z, unsigned int u, unsigned int v)
  Return value: unsigned integer
  Parameters:
    x, y, z, u, v : the five coordinates
  Description: Constructs a Morton code by interleaving x, y, z, u and v using or-ing and shifting [20].
5. Implementation

Interface Implementation

A curious reader can find the interface implementation of the cache key in appendix C.1.

5.2.2. Implementation of Data Structure

The present subsection describes the implementation of the data structures selected in the previous subsections.

Chained Hash Table

Configuration Parameters

Two parameters are added to the configuration file:

1. cache_buffer_size
2. cache_load_factor

cache_buffer_size defines the size of the buffer whose elements are used to build the hash table on the device side. The buffer is created on the CPU side using the OptiX context and is filled with elements of type CacheNode. The initial state of these elements is set, and then the buffer is passed to the device side. The primary reason why a buffer is used for the hash table construction is that dynamic allocation of table elements on the device side is impossible with OptiX [38]. There are also efficiency considerations why using such a buffer can be beneficial for the hash table construction [34]. The following listing shows the buffer initialization on the CPU through the context.

Listing 5.1: Node buffer initialization

nodeBuffer = context->createBuffer(RT_BUFFER_INPUT_OUTPUT);
nodeBuffer->setElementSize(sizeof(CacheNode));
context->setBuffer(BufferVariable::CACHE_NODE_BUFFER)(nodeBuffer);

cache_load_factor defines the maximum number of elements which can be expected in a bucket. The number of buckets of the hash table is determined by the following formula:

number_of_bins = cache_buffer_size / cache_load_factor    (5.1)

Key Parameters

CacheKey has the same parameters as described in 4.2.3. The divisions are calculated using the following code:

Listing 5.2: Divisions

div_x = (inc_dir.x + 1) * div / 2;
div_y = (inc_dir.y + 1) * div / 2;
div_z = (inc_dir.z + 1) * div / 2;

Here, inc_dir is a variable of type float3 containing the direction cosines of a ray.
The direction components are first shifted to non-negative values by adding one; after that, the floats are multiplied by a large integer denoted by div. This variable controls the quantization precision: larger values yield greater accuracy.
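The quantization of Listing 5.2 can be checked in isolation. A plain C sketch, with the float3 components passed as separate floats:

```c
/* Quantize one direction cosine from [-1, 1] to an integer division
   in [0, div], following Listing 5.2. */
int quantize(float c, int div) {
    return (int)((c + 1.0f) * div / 2);
}
```

For div = 200000, the value used in the benchmarks, a component of -1 maps to 0, 0 maps to 100000 and 1 maps to 200000.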
5.2. Implementation of Frame Coherence

Cache Node

It is necessary to mention that, in the current configuration of the hash table, CacheNode does not contain PositionKey as in 4.2.3, because the table is mostly tested on static scenes. The CacheNode also contains some additional members:

left - integer, index of the left element in the buffer
right - integer, index of the right element in the buffer
parent - integer, index of the parent in the buffer
index - integer, index of the given node in the buffer
queue - integer, index of the next element on the path in the buffer

Construction of Interface

writeToCache
  inline __device__ void writeToCache(PerRayData prev_data, PerRayData data, CacheNode* &cachedNodeWrite)
  Return value: void
  Parameters:
    prev_data - data structure with the results of the previous tracing (key)
    data - data structure with the results of the current tracing (data)
    cachedNodeWrite - a variable which is used to link elements of one trace
  Description: Adds data to the cache under a key generated from prev_data.

getFromCache
  inline __device__ CacheNode* getFromCache(PerRayData data)
  Return value: a pointer to the requested element, or NULL if the element is not present
  Parameters:
    data - data structure with the results of the previous tracing (key)
  Description: Gets an element by its key generated from data.
5. Implementation

hasKey
  inline __device__ bool hasKey(PerRayData data)
  Return value: true if the requested element is in the cache, false otherwise
  Parameters:
    data - data structure with the results of the previous tracing (key)
  Description: Indicates the existence of an element in the cache by its key generated from data.

get_bucket_index
  inline __device__ uint get_bucket_index(int hash)
  Return value: the index of the bucket in the node buffer
  Parameters:
    hash - a key hash value
  Description: Maps the given hash value to an index in the node buffer.

Interface Implementation

writeToCache constructs a binary tree in the bucket determined by the hash value of the node being inserted. The method uses the lock-free synchronization paradigm [19]. Before the loop which seeks a vacant place starts, the method obtains pointers to the root node of the tree and to the node in the buffer which has to be inserted. Then the loop starts; in both subtrees the method tries to atomicCAS the indices pointing to the left and right elements. If the operation is successful, the loop terminates; otherwise root is assigned a new value, root->left or root->right, and the operation is repeated.

getFromCache accesses the elements of the tree located in the bucket obtained from the hash code of the searched element. If the element is found, it is returned; otherwise NULL is returned. No synchronization is necessary; the only modification performed is the incrementation of a counter.

hasKey returns true if getFromCache returns a value which is not NULL.

getBucketIndex maps hash values to buffer indices. It first divides the hash value by the maximum integer, adds one and multiplies the resulting number by half the number of elements in the element buffer.

A curious reader can find the implementation of the chained hash table in appendix C.2.
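The mapping performed by getBucketIndex follows equation (4.3) from 4.2.5 and can be sketched on the host as follows; the bucket count of 500000 in the example matches the one used there:

```c
#include <limits.h>

/* Map a signed 32-bit hash value onto bucket indices, following
   f(h) = (1.0 + h / INT_MAX) * m / 2  (equation 4.3). */
int bucket_index(int hash, int m) {
    return (int)((1.0 + (double)hash / INT_MAX) * m / 2);
}
```

A hash of 0 lands in the middle of the buffer, while the most negative hash values map to index 0.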
5.2. Implementation of Frame Coherence

Open-Addressed Hash Table

Configuration Parameters

The following parameters were added to the configuration file:

cache_readers_number is the number of streams which can read successively from a table bucket without an update. After this number is exceeded, the data is purged and its content is overwritten. This parameter is part of the synchronization mechanism.
cache_residual_value is the threshold for the residual calculation in precision purging 3.2.3.
use_cache is a parameter for turning the cache on/off.
use_cache_residual is a parameter for turning precision purging 3.2.3 on/off.
hash_method is a parameter for selecting a hash function 4.2.4.

Key Parameters

CacheKey has the same parameters as described in 4.2.3. According to diagram 4.6, CacheKey is accompanied by a second key, PositionKey. Both keys are stored via their hash values.

Cache Node

A cache element has the same parameters as described in 4.2.3.

Construction of Interface

writeToCache
  inline __device__ void writeToCache(int base_index, PerRayData prev_data, PerRayData data, CacheNode* &cachedNodeWrite, int pos_hash)
  Return value: void
  Parameters:
    base_index : specifies an offset of buffer indices for each model, defined as num_of_buckets × modelInstanceId
    prev_data : data structure with the results of the previous tracing (key)
    data : data structure with the results of the current tracing (data)
    cachedNodeWrite : a variable which is used to link elements of one trace
    pos_hash : the hash value of the position key (emitter position)
  Description: Adds data to the cache under a key generated from prev_data.
5. Implementation

getFromCache
  inline __device__ bool getFromCache(int base_index, PerRayData prev_data, PerRayData &data, int pos_hash)
  Return value: boolean
  Parameters:
    base_index : specifies an offset of buffer indices for each model, defined as num_of_buckets × modelInstanceId
    prev_data : data structure with the results of the previous tracing (key)
    data : data structure to be filled with cached data
    pos_hash : the hash value of the position key (emitter position)
  Description: Fills data with cached data and returns true if prev_data (the key) exists. Returns false if the pos_hash provided in the method call does not coincide with that of the element residing at the address for the key.

getFromCacheRes
  inline __device__ bool getFromCacheRes(int base_index, PerRayData prev_data, PerRayData &data, int pos_hash)
  Return value: boolean
  Parameters:
    base_index : specifies an offset of buffer indices for each model, defined as num_of_buckets × modelInstanceId
    prev_data : data structure with the results of the previous tracing (key)
    data : data structure to be filled with cached data
    pos_hash : the hash value of the position key (emitter position)
  Description: Fills data with cached data and returns true if prev_data (the key) exists. Returns false if the hash values of the element and of the key are not equal and the residual value exceeds the threshold specified in the configuration file 5.2.2.

makeKey
  inline __device__ Key makeKey(Key key, PerRayData prd, bool debug, int division, int hashmethod)
  Return value: CacheKey
5.2. Implementation of Frame Coherence

  Parameters:
    key : the key to be initialized
    prd : data with the parameters to initialize the key
    debug : print debug information
    division : precision of the angle quantization
    hashmethod : id of the hashing function
  Description: Receives a key in the parameter list, fills it with data from prd and returns it.

getBucketIndex
  inline __device__ uint getBucketIndex(unsigned int hash)
  Return value: unsigned int
  Parameters:
    hash : a key hash value
  Description: Returns the index in the node buffer for the hash value specified in the method call.

getBaseIndex
  inline __device__ int getBaseIndex(int modelInstanceId)
  Return value: int
  Parameters:
    modelInstanceId : id of the model
  Description: Returns the starting index in the node buffer for the model id provided in the method call.

searchInMap
  inline __device__ int searchInMap(int modelInstanceId)
  Return value: int
  Parameters:
    modelInstanceId : id of the model
  Description: Returns the offset index for the modelInstanceId provided in the method call.
5. Implementation

calculateResidual
  inline __device__ float calculateResidual(float3 origin, float3 otherOrigin)
  Return value: float
  Parameters:
    origin : the first point
    otherOrigin : the second point
  Description: Calculates the L1 distance between the two points provided in the function call.

checkIntersections
  inline __device__ bool checkIntersections(const CacheNode* node, float R)
  Return value: boolean
  Parameters:
    node : the node to be checked
    R : radius of the antenna
  Description: Returns true if the coordinates of the node provided in the method call lie within the sphere of radius R around any of the receivers currently present on the scene. This method checks the validity of RECEIVER_HIT 3.2.3.

Interface Implementation

writeToCache writes data to the cache using prev_data as the key. Synchronization of both write and read accesses is performed using the atomic locking paradigm [47]. In the write method, after the pointer to the bucket is obtained, a stream tries to lock the node for writing using the atomicCAS operation and writeLock. If the operation is successful, the stream changes the data inside the node and releases the readLock, which makes the object available for reading. If the operation is not successful, the stream simply leaves the section without writing, since, firstly, both results cannot be stored and, secondly, it can be faster to trace the ray than to wait on the writeLock.

getFromCache reads data from the cache using the data provided in the method call as the key. The method uses the same synchronization paradigm [47] as writeToCache. A stream gets a pointer to the bucket and then tries to acquire the readLock. Every stream which acquires the readLock checks whether the position stamp of the bucket is valid. If it is not, the method sets readN to 0, unlocks the bucket for writing by setting writeLock to 0 and
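The distance metric used by calculateResidual can be sketched as plain C, with float3 replaced by an array; this is a minimal illustration of the L1 norm [5]:

```c
#include <math.h>

/* L1 (Manhattan) distance between two 3D points, as used for the
   residual check in getFromCacheRes. */
float l1_distance(const float a[3], const float b[3]) {
    return fabsf(a[0] - b[0]) + fabsf(a[1] - b[1]) + fabsf(a[2] - b[2]);
}
```

For example, the points (0, 0, 0) and (1, 2, 3) have an L1 distance of 6; a residual below cache_residual_value allows the cached element to be reused.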
5.2. Implementation of Frame Coherence

returns false. This means that the data of the given bucket will be automatically overwritten. If the stamp is valid, the stream first increments the total number of reads for the given bucket, denoted by the variable readN. If this number is less than the cache_readers_number defined in the configuration file, the stream releases the readLock after reading the data. Otherwise, it sets the total number of reads readN to 0 and unlocks the bucket for writing by setting writeLock to 0.

getFromCacheRes reads data from the cache using the data variable provided in the method call as the key. The method uses the same synchronization mechanism as getFromCache. The only difference is that the method uses a different approach to purging 3.2.3. In case the hashes of the requested element and of the bucket are not equal, the method calculates a residual value. If the residual value is less than the threshold denoted by cache_residual_value in the configuration file, the stream reads the data and returns true. Otherwise it returns false, which means that the data in the bucket will be overwritten.

makeKey is an auxiliary method which serves for the construction of a key from prev_data.

getBucketIndex returns the index in the node buffer for the hash provided in the method call. The implementation differs from the one for the chained hash table in that it uses UINT_MAX instead of INT_MAX, because hash values now have type unsigned int.

searchInMap is an auxiliary method helping to determine the offset for a particular model. The buffer cache_emitter_map has an associated offset index, which is multiplied by the number of elements in the buffer divided by the number of loaded models.

calculateResidual is an auxiliary method which determines the L1 [5] distance between the two points provided in the method call. It is used in getFromCacheRes to calculate the residual value.
checkIntersections checks the coordinates of the node provided in the method call against the coordinates of the receivers currently on the scene. If the node coordinates are within the radius R provided in the method call, the method returns true; otherwise it returns false. A curious reader can find the implementation in appendix C.3.
6. Testing

6.1. System Configuration before Testing

This section briefly describes the system on which the testing is performed.

System

The operating system has the following parameters:
• Operating System: Linux-x86_64
• Release: Ubuntu 12.04 (precise)
• Kernel: 3.2.0-58-generic

CPU

The parameters of the CPU are:
• CPUs: 4
• Model Name: Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
• Frequency: 2494.001 MHz
• L2 cache: 3072 KB

Memory

The parameters of the system memory are:
• Memory total: 3888 MiB

NVIDIA

The graphics card has the following parameters:
• Graphics Processor: GeForce GTX 680
• CUDA Cores: 1536
• Total Memory: 2048 MB
• Memory Interface: 256-bit
• NVIDIA Driver Version: 331.38
6. Testing

6.2. Testing of Ray Reordering

6.2.1. Approach

The task of ray reordering is to increase the efficiency of tracing. Ray reordering does not introduce errors, which is why no error estimation needs to be calculated. The approach comprises testing all considered ways of ray reordering over a sufficiently large range of ray counts and comparing the tracing times with the base version without reordering. The results are presented as a diagram of tracing times over the number of rays.

6.2.2. Tests Description

The following types of ray reordering are used during the testing:

1. without sorting (mnemonic: WS)
2. sorting of the 1D buffer using a histogram and traversal of the histogram using the 3D Hilbert curve (mnemonic: H3DH)
3. sorting of the 1D buffer using Morton codes (mnemonic: Z)
4. sorting of the 1D buffer using a histogram and traversal of the histogram using Morton codes (mnemonic: HZ)
5. sorting of the 1D buffer using a histogram and traversal of the histogram using the 3D Hilbert curve, then mapping of the resulting buffer to 2D using the 2D Hilbert curve (mnemonic: H3DH2DH)
6. sorting of the 1D buffer using a histogram and traversal of the histogram using the 3D Hilbert curve, then mapping of the resulting buffer to 3D using the 3D Hilbert curve (mnemonic: H3DH3DH)

In the following, the mnemonics are used instead of the full method names. For all types of reordering, between 50000 and 90000000 rays are generated.

6.2.3. Results

The abscissa of figure 6.1 shows the number of rays in millions; one unit on the ordinate corresponds to 1 second. Red corresponds to WS in 6.2.2, green to H3DH, blue to Z, pink to HZ, turquoise to H3DH2DH and yellow to H3DH3DH. Overall, the results show that ray reordering has a significant impact on performance. The best times are achieved by the ray reordering using H3DH (green) and also HZ (pink). The worst time is shown by tracing without reordering (WS) for all ray counts.
The ray reordering using Z (blue) has slightly worse performance than the methods based on histogramming and space-filling curves (both H3DH and HZ). The mapping of the initial 1D buffer to a 2D buffer using H3DH2DH is at least not worse than Z (blue), but its performance degrades sharply after 50 million rays; a study of this behaviour has not been performed. The results for H3DH3DH (3D buffer) are not worse than for H3DH2DH (2D buffer) up to approximately 16 million rays.
Figure 6.1.: Results for Ray Reordering

Benchmarking for this type of reordering cannot proceed further, because the depth of the 3D Hilbert curve is related to the number of rays by the formula

depth = log(size) / log(8)    (6.1)

where size is the size of the ray buffer. Thus the size of the ray buffer for depth = 8 is 16777216, while the next size, for depth = 9, is 134217728, which lies outside the tested range of ray counts.

6.3. Testing of Frame Coherence

6.3.1. Static Scene

The following section describes the approach and the tests for the static scene.

Analysis of Chained Table Performance

The analysis for the static scene starts with a performance study of the chained table. This subsection gives a performance anatomy of this type of table with different hash functions.

Performance Analysis of Write/Read Operations

The analysis of read/write operations is vital for the overall efficiency of the hash table. The study begins with a description of the approach and the testing procedures and finishes with a review of the results.
6. Testing

Approach

The following operations influence the overall performance of the hash table:
• write to cache
• read from cache
• buffer call
• hash generation
• trace time

The approach is to measure the performance of these operations at runtime and to analyse the data using plots, in order to identify bottlenecks and potential pitfalls.

Test Description

The application runs on the static scene with use_cache and use_benchmarking set to true. The node buffer is printed to a file at a location specified by the configuration file. The output file contains lines with the following information:

element index | trace time | time to generate hash | write call time | buffer call time | get call time | number of hits

The data is processed using scripts which generate gnuplot images. All data is generated for the following configuration parameters:
• ray_cnt = 50000
• cache_buffer_size = 50000
• load_factor = 3
• division = 200000

Write VS Buffer Call VS Read

Figure 6.2 shows the completion times of the write, buffer call and read operations. The X coordinate shows the time in seconds; the Y coordinate shows the number of elements in the cache. The main feature of the graph is the clear separation of write/read access times for elements corresponding to buckets and to "ordinary" nodes. Writing data to buckets is more expensive, with a maximum of 120 ms. Most of the write operations for "ordinary" nodes take less than 80 ms. The same pattern repeats for the read operations, with a maximum of approximately 50 ms for buckets and 40 ms for "ordinary" elements. Buffer calls are almost negligible.

Trace VS Write VS Buffer Call VS Read

Figure 6.3 shows a comparison of the cache operations with the tracing time, using the same coordinates as the previous graph. The image basically repeats the "step" pattern of the previous figure. Almost all tracing operations do not exceed 2 seconds. On the other hand, tracing for buckets is on average more expensive, as shown by the "step" pattern.
Hits

Figure 6.4 shows the number of hits for the elements of the node buffer. The X coordinate shows the number of elements in the cache; the Y coordinate shows the number of hits for the cache elements. The hits are uniformly distributed across all elements; no clusters are visible in the image. It can also be seen that the cache is fragmented, because not all 50000 nodes are present on the graph, and some nodes have 0 hits.
6.3. Testing of Frame Coherence

Figure 6.2.: Write VS Buffer Call VS Read

Performance of Tracing

Figure 6.5 compares the performance of no caching, caching with the uniform random hash (hash1), caching with Z-curve hashing (hash2) and caching with the mixed hashing function (hash3) 4.2.4. The testing was performed for:

1. division = 200000
2. cache_load_factor = 3
3. cache_buffer_size = 50000

The figure shows that caching is not better than tracing without caching for any of the hash functions over almost the whole range of ray counts. An exception is the window between 2560 and 10240 thousand rays, where caching is beneficial. After 10240 thousand rays, the performance of tracing with caching degrades again.

Conclusion

Several statements can be made based on the analysis of the images:

1. Write/read operations have unequal times for buckets and nodes.
2. The buffer appears to be fragmented.
3. Tracing with caching using the chained hash table does not give performance benefits for any of the hash functions.
4. Not all elements of the cache are used (there are elements with 0 hits).
6. Testing

Figure 6.3.: Trace VS Write VS Buffer Call VS Read

Analysis of Open-Addressed Table Performance

Next, the analysis for the open-addressed hash table is given. The benchmarking for the static scene is done using atomic mutex synchronization.

Performance Analysis of Write/Read Operations

The approach and the test description for the analysis of read/write operations are the same as for the chained hash table.

Write VS Buffer Call VS Read

Figure 6.6 shows the completion times of the write, buffer call and read operations. The X coordinate shows the time in seconds; the Y coordinate shows the number of elements in the cache. It can be seen from the image that most of the write operations take 60 ms, while most of the read operations take approximately 20 ms. The execution times of the read operations have two peaks of approximately 60 ms each. For the write operations, 16 ms peaks fall into the same range of buffer indices as the read peaks; the first 16 ms peak has a small 19 ms outlier. The main feature of the graph is its patchwork pattern. This is due to the fact that the cache is divided into parts, one for each vehicle; "busier" parts have greater times for the write/read operations. Overall, as with the chained hash table, the write operations are more expensive than the reads and the buffer calls are negligible.

Trace VS Write VS Buffer Call VS Read

Figure 6.7 shows a comparison of the cache operations with the tracing time. The figure shows that parts of the buffer with more expensive read/write operations have smaller trace times. The tracing has two minima of approximately 1 second each and three peaks ranging from 2 to 2.5 seconds, which fall into the parts with fast cache operations. Overall, cache operations are much faster than tracing.
Figure 6.4.: Hits

Also, tracing for models with a more "busy" cache has greater performance, i.e. tracing for them is faster.

Performance of Tracing The testing is done for the synchronization mechanism described in Implementation of the open-addressed hash table 5.2.2.

Approach The main concern in testing the tracing performance is testing the productivity of the wavetracer for different ray numbers, without cache and with cache for different types of hash functions.

Tests The testing is done by iteratively changing parameters of the configuration file and launching a new tracing. The ray number is changed in the range from 1000 to 20280000; on each iteration the ray number is multiplied by 2. The testing is done for the following parameters:

1. cache buffer size = 300000
2. division = 200000
3. cache readers number = 5

Results Figure 6.8 shows the results obtained for no caching, caching using the random uniform hash function (hash1), caching using Morton codes (hash2) and caching using the mixed function (hash3). The X coordinate indicates the number of rays in thousands, the Y coordinate gives time in seconds. The plots display that up to approximately 50000 rays all types of
Figure 6.5.: Performance of Tracing with Chained Hash Table

tracing have roughly the same performance. After that number the trends diverge and caching gives an advantage over no caching. The mixed function (hash3) has the best performance, followed immediately by the random function (hash1); the cache with Morton codes is a little worse.

Influence of Ray Reordering Figure 6.9 shows the results of benchmarking the open-addressed table with atomic mutex synchronization. The influence of the ray reordering is important since both techniques (caching and ray sorting) will be used together. It is also important that the atomic mutex synchronization is used here, since it shows some of the potential of the caching mechanism. The large tracing time at 1000 rays for the no-cache case is caused by the initialization time of the data structures. This time does not need to be considered. Approximately after 16000 rays caching begins to prevail, with a maximum advantage of 2 seconds at 2048000 rays.

Conclusion In general, based on the figures, the following statements can be made for the open-addressed hash table:

• On average, write/read operations for the open-addressed table take less time than for the chained table.
• Write/read operations have a patchwork pattern.
• The benefits of caching with the open-addressed table appear at considerably lower ray counts than for the chained table and have a more stable character.
Figure 6.6.: Write VS Buffer Call VS Read

6.3.2. Dynamic Scene

In this subsection, testing for dynamic scenes is discussed. The type of structure analysed is the open-addressed hash table, which has better performance than the chained table.

Approach For dynamic testing, the following parameters are important:

• Temporal tracing metrics
• Accuracy of tracing with cache
• Calculation of correlation between frames

The first is obvious; the second item is necessary to estimate what error caching introduces compared to its performance benefits. The last item is essential to assess the coherence between frames, because high coherence corresponds to high caching benefits.

Temporal Metrics As a temporal metric, the average tracing time over 500 working cycles of the wavetracer is chosen. The average can be calculated incrementally using the following formula [11]:

\[ A_{n+1} = A_n + \frac{v_{n+1} - A_n}{n + 1} \quad (6.2) \]

where $A_n$ is the average obtained on the previous cycle, $v_{n+1}$ is the tracing time of the $(n+1)$-th cycle and $n+1$ is the current number of cycles.
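The incremental update in Eq. (6.2) can be sketched as a small CPU-side helper (the class name is illustrative and not taken from the thesis code):

```cpp
#include <cstddef>

// Incremental (running) average of tracing times, following Eq. (6.2):
// A_{n+1} = A_n + (v_{n+1} - A_n) / (n + 1)
class RunningAverage {
    double avg = 0.0;   // A_n, the average after n samples
    std::size_t n = 0;  // number of samples seen so far
public:
    void add(double v) {
        // ++n first, so the divisor is the new sample count (n + 1)
        avg += (v - avg) / static_cast<double>(++n);
    }
    double value() const { return avg; }
};
```

The advantage over summing all samples is that only two values need to be stored per metric, regardless of how many wavetracer cycles are averaged.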
Figure 6.7.: Trace VS Write VS Buffer Call VS Read

Accuracy of Tracing For a correct error calculation, it is necessary to resolve the following issues:

• Which output parameters of the ray tracing can be taken for the error calculation?
• What is the ground truth in this assessment?
• How can the error be calculated algorithmically?

As parameters for the error calculation, the points on a trace path contained in the waypoint buffer can be taken. The union Waypoint contains the following members:

• WP Reflection
• WP Diffraction
• WP Miss
• WP Hit
• WP Launch

As a tracing path, the positions of WP Reflection and WP Hit can be taken. As the ground truth, a trace path obtained without caching can be used. The error can be calculated using the following approach. The trace paths represent point clouds, and these clouds can be compared with each other in many ways. One possible assessment is to calculate the distance between the point cloud centers; this distance introduces a measure of the difference between the two clouds. The distance can be calculated using the following formula [2]:

\[ D_1 = L_1(c_A, c_B) \quad \text{and} \quad D_2 = L_2(c_A, c_B) \quad (6.3) \]
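The centroid-distance comparison described above can be sketched as follows (the Point struct and function names are illustrative; the actual implementation operates on the waypoint buffer):

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y, z; };

// Centroid of a point cloud (assumes a non-empty cloud).
Point centroid(const std::vector<Point>& cloud) {
    Point c{0.0, 0.0, 0.0};
    for (const Point& p : cloud) { c.x += p.x; c.y += p.y; c.z += p.z; }
    const double n = static_cast<double>(cloud.size());
    c.x /= n; c.y /= n; c.z /= n;
    return c;
}

// Euclidean (L2) distance between the centroids of two clouds,
// i.e. D2 = L2(cA, cB) from Eq. (6.3).
double centroidDistance(const std::vector<Point>& a,
                        const std::vector<Point>& b) {
    const Point ca = centroid(a), cb = centroid(b);
    const double dx = ca.x - cb.x, dy = ca.y - cb.y, dz = ca.z - cb.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}
```

Comparing only centroids is cheap but coarse: two clouds with the same centroid but different spread yield a zero distance, which is acceptable here because the cached and uncached paths are expected to differ mainly by a shift.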
Figure 6.8.: Performance of Tracing with Open-Addressed Hash Table

where $L_1$ and $L_2$ are distances and $c_A$ and $c_B$ are the centroids of the point clouds.

Statistical Correlation between Frames The correlation between frames can be estimated using the same data (point clouds). Every working cycle of the wavetracer produces an output written to the waypoint buffer. For example, the correlation is calculated for the two sets of x coordinates in the produced outputs. Similar coefficients are calculated for the y and z coordinates. The normalized sum of these coefficients is considered as the correlation coefficient between frames. If this correlation coefficient is high, it means that the coherence between frames is also high. Mathematically this can be expressed as follows:

\[ r_{x_1 x_2} = \frac{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{N} (x_{1i} - \bar{x}_1)^2 \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)^2}} \]

\[ r_{y_1 y_2} = \frac{\sum_{i=1}^{N} (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2)}{\sqrt{\sum_{i=1}^{N} (y_{1i} - \bar{y}_1)^2 \sum_{i=1}^{N} (y_{2i} - \bar{y}_2)^2}} \]

\[ r_{z_1 z_2} = \frac{\sum_{i=1}^{N} (z_{1i} - \bar{z}_1)(z_{2i} - \bar{z}_2)}{\sqrt{\sum_{i=1}^{N} (z_{1i} - \bar{z}_1)^2 \sum_{i=1}^{N} (z_{2i} - \bar{z}_2)^2}} \]

\[ c_{12} = \frac{\sqrt{r_{x_1 x_2}^2 + r_{y_1 y_2}^2 + r_{z_1 z_2}^2}}{\sqrt{3}} \quad (6.4) \]

where $O_1 = \{x_1, y_1, z_1\}$ and $O_2 = \{x_2, y_2, z_2\}$ are two outputs of the tracing procedures, $x_i, y_i, z_i$ are the coordinate sets and $r_{x_1 x_2}, r_{y_1 y_2}, r_{z_1 z_2}$ are the correlation coefficients between two coordinate sets. $c_{12}$ is the normalized length of the correlation vector $[r_{x_1 x_2}, r_{y_1 y_2}, r_{z_1 z_2}]$.
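The per-axis Pearson coefficient and the normalized vector length from Eq. (6.4) can be sketched as follows (function names are illustrative; dividing by sqrt(3) is what maps the vector length into [0, 1]):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two equally sized samples,
// as used per coordinate axis in Eq. (6.4).
double pearson(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    double ma = 0.0, mb = 0.0;
    for (std::size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= static_cast<double>(n);
    mb /= static_cast<double>(n);
    double num = 0.0, da = 0.0, db = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        num += (a[i] - ma) * (b[i] - mb);
        da  += (a[i] - ma) * (a[i] - ma);
        db  += (b[i] - mb) * (b[i] - mb);
    }
    return num / std::sqrt(da * db);
}

// Normalized length of the correlation vector [r_x, r_y, r_z];
// the sqrt(3) denominator normalizes the result into [0, 1].
double frameCoherence(double rx, double ry, double rz) {
    return std::sqrt(rx * rx + ry * ry + rz * rz) / std::sqrt(3.0);
}
```

With perfectly correlated coordinates on all three axes the coefficient reaches 1; uncorrelated frames give values near 0.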
Figure 6.9.: Performance of Tracing with Chained Hash Table (Ray Reordering)

Figure 6.10 shows a geometrical interpretation of the error calculation. The two arrows indicate the point cloud centroids, the variable distance displays the current distance for the frame, and the variable avg error shows the average distance over the preceding frames including the current one.

Description of the Benchmarking Procedure The testing for dynamic scenes has the following goals:

• Comparison of no caching VS caching with position purging VS precision purging
• Comparison of the efficiency of the hashing functions
• Comparison of caching VS caching with ray reordering

no caching VS position purging VS precision purging The testing is performed for the range between 1000 and 90000 rays. For every 2000 rays, the testing is done for the following parameters in the configuration file:

• benchmark file name is the file name the output data is written to.
• use cache is a parameter which turns the caching on/off.
• use cache residual is a parameter which indicates the usage of precision purging.
• division is a parameter which defines the quantization accuracy.
Figure 6.10.: Geometrical interpretation of the error calculation

• ray cnt is a parameter which defines the number of rays.
• cache residual value is constant and defines the residual threshold.

The output of the tracing is written to folders with names generated on the basis of the ray count and the caching parameters. The benchmarking data is written to a special data file in the format: ray cnt — tracing time

hashing functions In addition to the parameters changed in the previous testing procedure, this type of testing also alters the hash method, setting it from 1 to 3: 1 indicates uniform random hashing, 2 Morton code hashing and 3 the mixed hashing function. Recording of the output data is the same as in the previous type of testing.

caching with ray reordering This type of testing is performed for the whole range of caching methods and hash functions. The ray reordering parameter is set to 1. This testing is done to estimate how the ray reordering influences the overall performance of the tracing.

Benchmarking Automation The automation is performed using the Python bindings in ADTF. The script opens ADTF, loads a configuration and runs the benchmarking. Before the testing, it is necessary to generate a number of configuration files with the required parameters; the path to the folder with the files is supplied to the launch script. After each iteration, the application is shut down to provide equal starting conditions for all types of testing. The output data produced by the testing is also processed using scripts.
Results In the following paragraphs, the results of the testing procedures for dynamic scenes are described.

OptiX error It is necessary to mention that while testing the system for the dynamic scene, an exception occurred when copying data from the host to the device. The interested reader will find the exception description in the following NVIDIA thread [39]. In order to avoid the exception, a synchronization mechanism for saving/reading data to/from the buffer has been implemented. The mechanism considerably reduces the caching performance both for static and dynamic scenes, but it cannot be omitted since pure mutex synchronization does not ensure safe execution of the program. Moreover, the developers have not answered the question of why such an error can occur. It is possible that it occurs when there is a high load on the buffer. It is to be hoped that the error will be resolved in a future version of the program.

no caching VS position purging VS precision purging Figure 6.11 shows the results of the benchmarking for no caching (green), caching with position stamp purging and division 200000 (red) and caching with precision purging, residual value 0.0125 and division 10000 (blue). The X coordinate indicates the number of rays, while the Y coordinate shows the average tracing time in seconds.

Figure 6.11.: no caching VS position purging VS precision purging

Caching with position stamp purging is consistently better than tracing with no caching. The difference becomes bigger for higher ray numbers. Caching with precision purging has better performance than both no caching and caching with position stamp purging. Simple caching gives approximately a 10% performance improvement for small and medium ray numbers, increasing to 15% for high ray counts. Caching with residual gives around a 30% improvement over no caching for small and medium ray numbers, decreasing to approximately 17% for higher ray counts.
Figure 6.12 shows the average error calculated for the testing procedure. The X coordinate shows the number of rays, the Y coordinate indicates the distance calculated in the measuring units of the system. The green color displays the figures for no caching, red for caching with position stamp and division 200000 and blue for caching with residual 0.093 and division 10000.

Figure 6.12.: error calculated for no caching VS position purging VS precision purging

The error is also calculated for no caching in order to have it as a ground truth value. Theoretically, this error should be 0, but it has some small value for the first iterations which tends to become smaller with further iterations. Concerning the caching with position stamp purging, it has a large error of about 8 units on the first iteration. The reasons for this have not been investigated. For subsequent iterations, the error does not exceed 2 units. Is the error small or big? In order to answer this question, it is necessary to calculate the average length of the rays for which the error is calculated. An error of 5-10 percent of the average ray trace would be reasonable. In this test, the average tracing path is not calculated, so this estimate cannot be given. The error for the caching with precision purging is on average three times higher than for caching with position stamp. The error for this caching scheme does not exceed 7 units.

Figure 6.13 shows the correlation between frames calculated for 100 frames. The correlation coefficient is calculated using two subsequent frames. It varies from high values (almost 50%) to very low ones (less than 5%). On average, the coherence between frames is approximately 25%.

hashing functions In this subparagraph, the results of testing the caching for three hashing functions are given. This time the ray number ranges from 1000 to 49000 rays.
Figure 6.14 shows the benchmarking times for no caching, caching with position stamp (division 200000) and caching with residual purging (division 10000, residual 0.0125). The caching schemas
Figure 6.13.: correlation between frames

are tested for three different types of hashing functions: a uniformly distributed random function, a Morton code hashing function, and a mixed hashing function. Blue shows no cache, green caching with division 200000, turquoise the same caching schema with Morton codes, and red the same caching schema with the mixed function. Pink corresponds to caching with residual 0.0125 and division 10000, yellow to the same caching schema with Morton codes and white to the same schema with the mixed function.

Residual caching with Morton codes has the worst time, even worse than no caching at all. The second worst time is no caching. Green, caching with position stamp purging, outperforms no caching as described in the previous paragraph. The mixed hashing achieves almost the same result for the same caching schema. Morton codes outperform the random uniform hashing for this schema by approximately 15% from 20 to 49 thousand rays. Residual caching with the uniform hash function and with the mixed hashing function has approximately the same performance, competitive with position caching with Morton codes. However, the trend is that the latter is better for bigger ray numbers.

Figure 6.15 shows the error calculated for all types of tracing in this test. Again, the error for no caching should theoretically be 0 for all frames; it is given as a reference value to show the possible variation of the tracing errors from frame to frame for the same type of caching (systematic error). Caching with position stamp purging has the same accuracy trend as in the previous test. Morton codes, which give better performance compared to the uniform function for the same caching schema, have on average two times worse accuracy. The mixed function, which does not give any performance benefits, has approximately the same accuracy as the random uniform function. In the case of residual caching, Morton
Figure 6.14.: average tracing times for caching with different hashing functions

codes are better in terms of accuracy, but have the worst tracing time. However, both residual caching with the random uniform hashing function and residual caching with the mixed hashing function have the worst error (about the same accuracy). Both types offer approximately the same performance benefits.

Caching with Ray Reordering Figure 6.16 shows the results of the tracing for the same types of caching as in the previous testing. The only difference is that the parameter ray reordering is set to 1 in the configuration file, and the testing is performed for a bigger range of rays, from 1000 to 61000. The first observation that can be made is that the ray reordering noticeably reduces the tracing time. For 49 thousand rays, the no-cache tracing with ray reordering is 2.5 times faster than without reordering. Secondly, for the given synchronization type almost all caching techniques provide no performance benefits. Exceptions are residual caching with the uniform random hash function and with the mixed hash function, which give performance benefits up to approximately 37 thousand rays. After that number, the tracing time for these types of caching begins to grow, and for 61 thousand rays it noticeably exceeds the tracing time for no caching.

Figure 6.17 shows the errors obtained for all types of caching with the ray reordering. For the types of caching which provide a performance increase, the error is rather high: it hovers around 5 units for both types of caching. The other methods with lower errors are not
Figure 6.15.: average errors calculated for caching with different hashing functions

of interest in terms of efficiency. The error for 43 thousand rays equals 0 because of a failure in the automatic testing; this value should be disregarded.

No Reordering VS Reordering Figure 6.18 gives a comparison of the tracing times for no caching with reordering VS no caching without reordering. The blue color corresponds to no cache with ray reordering and the green shows no cache without reordering. The dimensionality is the same as in the previous images. Overall, the diagram shows a considerable reduction of the time for launches with ray reordering over the whole ray range. For 1000 rays, the reduction amounts to 23%, while it is almost 60% for 49000 rays. The latter figure corresponds to an almost 2.5-fold increase in efficiency. The main trend is that the reduction coefficient tends to increase with the number of rays.

Conclusion

1. Software limitations do not allow the use of atomic mutex synchronization in the dynamic scene, which considerably reduces the cache performance for both static and dynamic scenes.

2. For the given type of synchronization, described in the Implementation of the open-addressed hash table 5.2.2, caching gives certain performance benefits (up to 30% of the tracing time). Caching with residual is more efficient than caching with position purging but also makes bigger tracing errors.

3. Caching using Morton codes as a hash function allows the tracing efficiency to be increased for position purging. Regarding accuracy, Morton codes give more acceptable results for position purging than for caching with residual.

4. The ray sorting considerably influences the ray tracing time, reducing it depending on the ray number. However, the benefits given by the caching are leveled out
Figure 6.16.: average tracing times for caching with different hashing functions for ray reordering

for this type of synchronization mechanism. Sorting also insignificantly reduces the caching error.
Figure 6.17.: average errors calculated for caching with different hashing functions for ray reordering

Figure 6.18.: average tracing times for no cache with ray reordering VS no cache without reordering
Part V.

Discussion and Conclusions
7. Discussion

In the project, the following tasks have been performed and problems solved.

7.1. Ray Reordering

The task of ray reordering has been successfully solved using space-filling curves. Space-filling curves are also used for BVH construction in ray tracing; for example, refer to [23]. Different versions of the solution have been tried:

1. Construction of a ray histogram and traversal of the histogram using the 3D Hilbert curve.
2. Construction of a ray histogram and traversal of the histogram using the Z-curve.
3. Sorting the initial ray coordinates using the Z-curve.
4. Mapping of the obtained sorted ray list to a 2D buffer using the 2D Hilbert curve.
5. Mapping of the obtained sorted ray list to a 3D buffer using the 3D Hilbert curve.

Launching the tracing procedure with the 2D and 3D ray buffers gives no obvious results. Overall, the most efficient implementation turns out to be the ray reordering using the Z-curve. The curve can easily be implemented on the GPU side, and the sorting can be implemented using radix sort. The results of the ray sorting are discussed in the Conclusions chapter.

7.2. Frame Coherence

7.2.1. Caching Method

The main task solved here is the construction of a caching method for the simulation of the propagation channel in a VANET simulation. The simulation is performed while practically all ray sources dynamically change their positions from frame to frame. Thus, the problems of cache construction, reuse of data from previous launches and cache purging are solved.

7.2.2. Data Structure

In the implementation part, the problem of cache tree construction using ray hashes is solved. A study of hashing functions allows their influence on the performance to be detected. A random uniform hashing function, hashing with Morton codes and a mixed hash function have been studied. They all influence the system performance differently; the results are discussed in the Conclusions chapter.
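The Z-curve reordering named above can be sketched as follows. This is a simplified CPU-side stand-in: std::sort replaces the radix sort used in the actual implementation, and the Ray struct is illustrative (the Morton-code helpers follow the listing in Appendix A.1):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray { float x, y, z; };  // illustrative: coordinates assumed in [0, 1)

// Spread the lower 10 bits of v so that two zero bits separate
// consecutive bits (classic Morton-code bit interleaving).
static std::uint32_t expandBits(std::uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point in the unit cube.
static std::uint32_t morton3D(float x, float y, float z) {
    const auto q = [](float f) {
        return static_cast<std::uint32_t>(
            std::min(std::max(f * 1024.0f, 0.0f), 1023.0f));
    };
    return expandBits(q(x)) * 4 + expandBits(q(y)) * 2 + expandBits(q(z));
}

// Reorder rays along the Z-curve: sorting by Morton key places
// spatially close rays next to each other in the buffer.
void reorderRays(std::vector<Ray>& rays) {
    std::sort(rays.begin(), rays.end(), [](const Ray& a, const Ray& b) {
        return morton3D(a.x, a.y, a.z) < morton3D(b.x, b.y, b.z);
    });
}
```

On the GPU the comparison sort would be replaced by a radix sort over the precomputed Morton keys, which is the formulation that parallelizes well.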
OptiX does not allow memory to be allocated on the device side. The problem is solved using a buffer with elements constructed on the CPU side using the context. On the device side, device functions use the buffer to construct the cache.

A study of data structures and the implementation of a synchronization mechanism is an important part of the project. During the research, two data structures have been developed: a chained hash table and an open-addressed hash table. The data structures use different synchronization mechanisms: lock-free synchronization and atomic mutex synchronization.

7.2.3. Testing

Static Scene Both data structures are tested on the static scene. The performance of their write/read operations is evaluated and a comparative analysis is carried out.

Dynamic Scene The main problem that occurred during the testing of the open-addressed hash table in dynamic scenes is that the third-party tracing engine throws an exception when tracing with cache using mutex synchronization for a sufficiently large number of rays. A work-around solution has been designed: a new synchronization mechanism based on a mutex with two locks which counts the number of readers 5.2.1.

In the testing part, an overall approach for testing and the tests themselves are also designed. An automated test suite for nightly tests has been developed. An error with the reproduction of tracing paths using the Python bindings in ADTF is solved. A method for the calculation of the system tracing errors with caching has been developed. The last task solved in the testing part is the influence of the ray reordering on the system working with the ray cache.
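The two-lock, reader-counting mechanism mentioned above resembles the classic readers-writer pattern built from two mutexes; a minimal CPU-side sketch of the idea (names are illustrative, not the actual 5.2.1 implementation):

```cpp
#include <mutex>

// Reader-counting lock built from two mutexes: any number of readers
// may hold the lock at once; a writer waits until the last reader leaves.
class ReadersWriterLock {
    std::mutex counterLock;  // protects readerCount
    std::mutex writeLock;    // held while any reader or a writer is active
    int readerCount = 0;
public:
    void lockRead() {
        std::lock_guard<std::mutex> g(counterLock);
        if (++readerCount == 1) writeLock.lock();  // first reader blocks writers
    }
    void unlockRead() {
        std::lock_guard<std::mutex> g(counterLock);
        if (--readerCount == 0) writeLock.unlock();  // last reader admits writers
    }
    void lockWrite()   { writeLock.lock(); }
    void unlockWrite() { writeLock.unlock(); }
    int readers() {
        std::lock_guard<std::mutex> g(counterLock);
        return readerCount;
    }
};
```

Allowing concurrent readers is exactly what makes the scheme attractive for the cache, where reads greatly outnumber writes; the trade-off is that writers can starve if readers keep arriving.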
8. Conclusions

1. In general, the overall system efficiency is considerably increased.

2. A mechanism for ray sorting on the CPU is successfully implemented. The ray reordering increases the system performance depending on the ray number. For 50000 rays in the dynamic testing, the efficiency increases by a factor of 2.5. The coefficient of tracing time reduction increases with the number of rays (efficiency increases with the ray number).

3. A method for ray caching has been successfully developed and implemented.

4. It has been revealed that the open-addressed hash table is a more efficient data structure than the chained hash table.

5. During the testing, it has been found that some software limitations do not allow the cache capabilities to be used fully. A work-around solution has been developed which allows these limitations to be circumvented to a certain extent. Even so, caching increases the system efficiency by up to 30%, depending on the hashing function.

6. Under the joint action of ray sorting and caching, the former prevails: the caching does not increase the system efficiency while still introducing a tracing error.

7. Thus, for successful usage of the caching, it is necessary to overcome the limitations imposed by the third-party software system. If such limitations could be totally circumvented, it would be possible to weaken the read-access synchronization and allow multiple threads to read at the same time. This would considerably increase the overall cache performance, making it competitive with reordering.
A. Space-Filling Curves

A.1. Morton Codes Generator

Listing A.1: Morton codes generator

unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

unsigned int morton3D(float x, float y, float z)
{
    x = min(std::max(x * 1024.0f, 0.0f), 1023.0f);
    y = min(std::max(y * 1024.0f, 0.0f), 1023.0f);
    z = min(std::max(z * 1024.0f, 0.0f), 1023.0f); // fixed: the original assigned this clamp to x
    unsigned int xx = expandBits((unsigned int)x);
    unsigned int yy = expandBits((unsigned int)y);
    unsigned int zz = expandBits((unsigned int)z);
    return xx * 4 + yy * 2 + zz;
}

A.2. 2D Hilbert Curve Implementation

The 2D Hilbert curve is implemented using an algorithm in which the turtle turns at most once after doing a step [21].

Listing A.2: 2D Hilbert Curve Implementation

/*----------------------------- hilbert2D.h -----------------------------*/

#ifndef HILBERT2D_H
#define HILBERT2D_H

/**
 * Maps a 1D buffer to a 2D buffer using the Hilbert curve and turtle graphics.
 */
class Hilbert2D {
private:
    // resulting buffer
    optix::float3** ray_buffer2D;
    // initial buffer
    std::vector<Element> ray_buffer;
    // curve depth
    int depth;
    // size of one dimension of the 2D buffer
    int size;
    // current coordinates of the turtle
    int x, y;
    // index of the 1D buffer
    int miles;
    // variable used to calculate the turtle direction
    int turtle;

public:
    /**
     * Constructor:
     * param:
     *   ray_buffer: initial ray buffer
     *   size: size of the buffer
     */
    Hilbert2D(std::vector<Element> ray_buffer, int size);

    /**
     * Makes one step in the direction of the turtle heading
     */
    void step();

    /**
     * Turns left
     */
    void turn_left();

    /**
     * Turns right
     */
    void turn_right();

    /**
     * Grammar:
     *
     * H1 <- H2 H1 H5 H3
     * H2 <- H1 H6 H3 H5
     * H3 <- H1 H6 H3 H4
     * H4 <- H6 H1 H5 H3
     * H5 <- H6 H1 H5 H2
     * H6 <- H4 H6 H3 H5
     */
    void H1(int depth);
    void H2(int depth);
    void H3(int depth);
    void H4(int depth);
    void H5(int depth);
    void H6(int depth);

    virtual ~Hilbert2D();

    int getDepth() const { return depth; }

    optix::float3** getRayBuffer2D() { return ray_buffer2D; }

    int getX() const { return x; }

    int getY() const { return y; }

    int getSize() const { return size; }

    /**
     * Calculates the depth from the buffer size
     * param:
     *   size: buffer size
     */
    static double calc_depth(int size) {
        return log(size) / log(4);
    }

    /**
     * Calculates the dimension of the 2D buffer
     * param:
     *   depth: curve depth
     */
    static double calc_size(int depth) {
        return pow(2, depth);
    }
};

#endif /* HILBERT2D_H */

/*----------------------------- hilbert2D.cpp ----------------------------*/

/*------------------------------ turtle step -----------------------------*/
void Hilbert2D::step() {
    // increment the buffer index to get a new element from the initial buffer
    miles++;
    // depending on the turtle orientation, increment/decrement x or y;
    // x and y are in fact indices of the resulting buffer
    switch (turtle) {
        case 0: y++; break;
        case 1: x++; break;
        case 2: y--; break;
        case 3: x--; break;
        default: break;
    }
    // write the ray direction from the initial to the resulting buffer
    ray_buffer2D[x][y].x = ray_buffer[miles].v.x;
    ray_buffer2D[x][y].y = ray_buffer[miles].v.y;
    ray_buffer2D[x][y].z = ray_buffer[miles].v.z;
}
/*------------------------------- turn left ------------------------------*/
void Hilbert2D::turn_left() {
    turtle = (turtle - 1 + 4) % 4;
}

/*------------------------------ turn right ------------------------------*/
void Hilbert2D::turn_right() {
    turtle = (turtle + 1) % 4;
}

/*------------------------------ H1 (right) ------------------------------*/
void Hilbert2D::H1(int depth) {
    if (depth >= 0) {
        depth--;
        H2(depth); step();
        H1(depth); step();
        H5(depth); step();
        H3(depth);
    } else {
        turn_right();
    }
}

/*---------------------------------- H2 ----------------------------------*/
void Hilbert2D::H2(int depth) {
    if (depth >= 0) {
        depth--;
        H1(depth); step();
        H6(depth); step();
        H3(depth); step();
        H5(depth);
    }
}

/*------------------------------- H3 (left) ------------------------------*/
void Hilbert2D::H3(int depth) {
    if (depth >= 0) {
        depth--;
        H1(depth); step();
        H6(depth); step();
        H3(depth); step();
        H4(depth);
    } else {
        turn_left();
    }
}

/*---------------------------------- H4 ----------------------------------*/
void Hilbert2D::H4(int depth) {
    if (depth >= 0) {
        depth--;
        H6(depth); step();
        H1(depth); step();
        H5(depth); step();
        H3(depth);
    }
}

/*------------------------------ H5 (right) ------------------------------*/
void Hilbert2D::H5(int depth) {
    if (depth >= 0) {
        depth--;
        H6(depth); step();
        H1(depth); step();
        H5(depth); step();
        H2(depth);
    } else {
        turn_right();
    }
}

/*------------------------------- H6 (left) ------------------------------*/
void Hilbert2D::H6(int depth) {
    if (depth >= 0) {
        depth--;
        H4(depth); step();
        H6(depth); step();
        H3(depth); step();
        H5(depth);
    } else {
        turn_left();
    }
}
A.3. 3D Hilbert Curve Grammar

For the turtle's orientation in space the same symbols are used as described earlier:

A → B+F−C+−FA−F+D−/F∧E&/F−A++F−−F+++F+G−
B → A&F∧C&∧FB∧F&D∧F−F+F∧B&&F∧∧E&&&F&N∧
C → B+F−A+/F∧A&F+D−+FD+F−H+/F&H∧/F+P−
D → N∧F&G∧/F+G−/F∧C&∧FC∧F&M∧F−M+F∧O&
E → O&F∧P&F+E−F&N∧∧F−−O++∧F∧E&/F−B+/F&N∧
F → M+F−H+/F&F∧/F+G−−F++M−−−F−F+F∧A&F+G−
G → ∧F&D∧&FG&F∧A&F+N−F&G∧∧F++F−−∧F∧A&
H → M+F−F++F++H−−+F+E−F&N∧F−C+−FC−F+P−
I → R+F−S+−FI−F+T−F&L∧F−I++F++U−−+F+V−
J → V−F+T−+FT+F−Z+F∧U&F+J−−F++L−−−F−R+
K → J&F∧V&F+K−F&W∧&FJ&F∧K&/F−Y+/F&W∧
L → W∧F&X∧/F+L−/F∧J&&F−−W++&F&L∧F−R+F∧J&
M → H∧F&F∧∧F−−M++∧F∧O&/F−H+/F&M∧&FD&F∧O&
N → NG−F+D−+FN+F−B+/F&G∧/F+N−−F−−E++−F−B+
O → P−F+E−−F−−O++−F−M+F∧P&F+O−+FD+F−M+
P → O&F∧E&&F−−P++&F&H∧/F+O−/F∧P&∧FC∧F&H∧
R → I∧F&S∧&FR&F∧T&/F−U+/F&R∧∧F−−L++∧F∧J&
S → I∧F&R∧F−R+F∧K&∧FV∧F&S∧/F+X−/F∧Z&
T → J&F∧V&F+T−F&X∧&FR&F∧T&/F−Y+/F&W∧
U → Y+F−Z+F∧U&F+V−−F−−Y++F−U+/F&I∧/F+V−
V → J&F∧K&∧FV∧F&S∧/F+L−/F∧V&&F−−U++&F&I∧
W → X−F+L−−F−−W++−F−U+/F&I∧/F+K−+FK+F−Y+
X → W∧F&L∧∧F−−X++∧F∧U&F+K−F&X∧&FS&F∧Z&
Y → Z&F∧U&&F++Y−−&F&L∧F−S+F∧Y&∧FK∧F&W∧
Z → Y+F−U++F−−Z+++F+X−/F∧Y&/F−Z+−FS−F+X−
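Before the turtle interprets such a grammar, the productions are expanded by plain string rewriting. A small host-side sketch of that expansion step follows; the two toy rules used in it are illustrative only, not the 3D productions above:

```cpp
#include <map>
#include <string>

// Expand an L-system: rewrite every nonterminal by its production,
// copy turtle commands (F, +, -, &, ^, /) through unchanged.
std::string expand(const std::map<char, std::string>& rules,
                   std::string s, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::string next;
        for (char c : s) {
            auto it = rules.find(c);
            next += (it != rules.end()) ? it->second : std::string(1, c);
        }
        s = std::move(next);
    }
    return s;
}
```

Applied to the grammar above with a sufficient iteration count, the resulting symbol string drives the turtle through every cell of the 3D block.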
B. A Flow Diagram for the Main Tracing Loop

This appendix shows the flow diagram of the main tracing loop.

Figure B.1.: Ray tracing with cache
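The loop in Figure B.1 follows the usual lookup-or-trace pattern. The host-side sketch below is only a plausible reading of the diagram: `traceRay`, the integer key, and the `std::unordered_map` stand in for the OptiX trace call and the GPU hash table of Appendix C.

```cpp
#include <unordered_map>

// Hypothetical per-ray result record (names echo Appendix C's PerRayData).
struct PerRayData { float nextOrigin[3]; float nextDirection[3]; };

std::unordered_map<unsigned int, PerRayData> cache;  // stand-in for the GPU table
int trace_calls = 0;                                 // counts full traces

PerRayData traceRay(unsigned int key) {              // stand-in for the real trace
    ++trace_calls;
    return PerRayData{{0.f, 0.f, 0.f}, {1.f, 0.f, 0.f}};
}

// One iteration of the main loop: try the cache, otherwise trace and store.
PerRayData traceWithCache(unsigned int key) {
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;         // hit: reuse the cached bounce
    PerRayData d = traceRay(key);  // miss: full intersection test
    cache.emplace(key, d);         // write back for subsequent rays/frames
    return d;
}
```

Repeated queries with the same key then cost one full trace at most, which is the saving the caching scheme targets.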
C. Implementation of Hash Tables

C.1. Implementation of Cache Key

Listing C.1: Implementation of cache key

/*-------------------------- setIndices --------------------------*/
__device__ void setIndices(uint16_t instanceId, uint32_t primitiveIndex,
                           float3 inc_dir, int hash_method, int div = 200000)
{
    this->instanceId = instanceId;
    this->primitiveIndex = primitiveIndex;
    this->div_x = (inc_dir.x + 1) * div / 2;
    this->div_y = (inc_dir.y + 1) * div / 2;
    this->div_z = (inc_dir.z + 1) * div / 2;

    switch (hash_method) {
    case 1:  this->hash = calc_hash1(); break;
    case 2:  this->hash = calc_hash2(); break;
    case 3:  this->hash = calc_hash3(); break;
    default: this->hash = calc_hash1(); break;
    }
}

/*---------------------------- equals ----------------------------*/
__device__ bool equals(CacheKey other)
{
    if (this->primitiveIndex != other.primitiveIndex) return false;
    if (this->instanceId != other.instanceId) return false;
    if (this->div_x != other.div_x) return false;
    if (this->div_y != other.div_y) return false;
    if (this->div_z != other.div_z) return false;
    return true;
}

/*-------------------------- calc_hash1 --------------------------*/
__device__ unsigned int calc_hash1()
{
    int x[5] = { primitiveIndex, instanceId, div_x, div_y, div_z };
    long p = (1L << 32) - 5;
    long z = 0x64b6055aL;
    int z2 = 0x5067d19d;
    long s = 0;
    long zi = 1;
    for (int i = 0; i < 5; ++i) {
        long xi = (x[i] * z2) >> 1;
        s = (s + zi * xi) % p;
        zi = (zi * z) % p;
    }
    s = (s + zi * (p - 1)) % p;
    // benchmark instrumentation: time spent generating the hash
    end_trace = clock();
    hash_gen = (float)(end_trace - start_trace) / CLOCKS_PER_SEC;
    return (unsigned int)s;
}

/*-------------------------- calc_hash2 --------------------------*/
__device__ unsigned int calc_hash2()
{
    return mortonCode5(div_x, div_y, div_z, instanceId, primitiveIndex);
}

/*-------------------------- separateBy4 -------------------------*/
__device__ unsigned int separateBy4(unsigned int x)
{
    x &= 0x0000007f;
    x = (x ^ (x << 16)) & 0x0070000F;
    x = (x ^ (x << 8))  & 0x40300C03;
    x = (x ^ (x << 4))  & 0x42108421;
    return x;
}

/*-------------------------- mortonCode5 -------------------------*/
__device__ unsigned int mortonCode5(unsigned int x, unsigned int y,
                                    unsigned int z, unsigned int u,
                                    unsigned int v)
{
    return separateBy4(x) | (separateBy4(y) << 1) | (separateBy4(z) << 2)
         | (separateBy4(u) << 3) | (separateBy4(v) << 4);
}

/*-------------------------- calc_hash3 --------------------------*/
__device__ unsigned int calc_hash3()
{
    return calc_hash1() + calc_hash2();
}

C.2. Implementation of the Chained Hash Table

Listing C.2: Implementation of chained hash table interface

/*-------------------------- writeToCache ------------------------*/
inline __device__ void writeToCache(PerRayData prev_data, PerRayData data,
                                    CacheNode* &cachedNodeWrite)
{
    /*------------------------ get bucket ------------------------*/
    int buf_s = node_buffer.size();
    Key key;
    key = makeKey(key, prev_data, false);
    int bucket_ind = get_bucket_index(key.hash);
    CacheNode* root = &node_buffer[bucket_ind];
    CacheNode* node = NULL;

    /*------------------------ insert data -----------------------*/
    if (!root->used) {
        root->hash = key.hash;
        root->data = data;
        root->used = true;
        node = root;
        atomicAdd(&(root->counter), 1);
    } else {
        int counter = atomicAdd(&(root->counter), 1);
        int node_ind = bucket_ind + counter * num_of_buckets;
        if (node_ind >= buf_s)
            return;
        node = &node_buffer[node_ind];
        node->hash = key.hash;
        node->data = copyData(data, node->data);
        node->used = true;

        /*--------------- search for vacant position -------------*/
        while (true) {
            if (node->hash <= root->hash) {
                if (atomicCAS(&(root->left), -1, node_ind) == -1) {
                    atomicCAS(&(node->parent), -1, root->index);
                    return;
                } else
                    root = &node_buffer[root->left];
            } else {
                if (atomicCAS(&(root->right), -1, node_ind) == -1) {
                    atomicCAS(&(node->parent), -1, root->index);
                    return;
                } else
                    root = &node_buffer[root->right];
            }
        }
    }

    /*---------------- link elements of one path -----------------*/
    if (cachedNodeWrite != NULL) {
        cachedNodeWrite->queue = node->index;
    }
    cachedNodeWrite = node;
}

/*-------------------------- getFromCache ------------------------*/
inline __device__ CacheNode* getFromCache(PerRayData data, bool benchmark)
{
    /*------------------------ get bucket ------------------------*/
    int buf_s = node_buffer.size();
    Key key;
    key = makeKey(key, data, false);
    int bucket_ind = get_bucket_index(key.hash);
    CacheNode* root = &node_buffer[bucket_ind];
    if (!root->used)
        return NULL;

    /*------------ search for element with equal key -------------*/
    while (true) {
        if (root->hash == key.hash) {
            root->hit++;
            return root;
        } else if (key.hash <= root->hash) {
            if (root->left == -1)
                return NULL;
            else
                root = &node_buffer[root->left];
        } else {
            if (root->right == -1)
                return NULL;
            else
                root = &node_buffer[root->right];
        }
    }
}

/*----------------------------- hasKey ---------------------------*/
inline __device__ bool hasKey(PerRayData data)
{
    CacheNode* node = getFromCache(data, false);
    if (node == 0)
        return false;
    return true;
}

C.3. Implementation of the Open-Addressed Hash Table

Listing C.3: Implementation of open-addressed table interface
/*-------------------------- writeToCache ------------------------*/
inline __device__ void writeToCache(int base_index, PerRayData prev_data,
                                    PerRayData data,
                                    CacheNode* &cachedNodeWrite,
                                    unsigned int pos_hash)
{
    if (cache_init) {
        /*---------------- get bucket from table -----------------*/
        CacheKey key;
        key = makeKey(key, prev_data, false, cache_division, hash_method);
        int iind = getBucketIndex(key.hash);
        int bucket_ind = base_index + iind;
        if (bucket_ind < node_buffer.size()) {
            CacheNode* node = &node_buffer[bucket_ind];

            /*----------- acquire write lock and write data ------*/
            if (atomicCAS(&(node->writeLock), 0, 1) == 0) {
                node->hash = key.hash;
                node->nextOrigin = prev_data.nextOrigin;
                node->nextDirection = prev_data.nextDirection;
                node->data = data;
                node->used = true;
                node->traceTime = trace_time;
                node->hashGen = key.hash_gen;
                node->pos_hash = pos_hash;
                node->timestamp = time;

                /*---------------- release read lock -------------*/
                atomicExch(&(node->readLock), 0);
            }
        }
    }
}

/*-------------------------- getFromCache ------------------------*/
inline __device__ bool getFromCache(int base_index, PerRayData prev_data,
                                    PerRayData &data, unsigned int pos_hash)
{
    if (cache_init) {
        /*------------------ get bucket by hash ------------------*/
        CacheKey key;
        key = makeKey(key, prev_data, false, cache_division, hash_method);
        int iind = getBucketIndex(key.hash);
        int bucket_ind = base_index + iind;
        if (bucket_ind < node_buffer.size()) {
            CacheNode* node = &node_buffer[bucket_ind];

            /*--------------- acquire the read lock --------------*/
            if (atomicCAS(&(node->readLock), 0, 1) == 0) {
                /*---- purge if node has different position stamp ----*/
                if (node->pos_hash != pos_hash) {
                    atomicExch(&(node->readN), 0);
                    atomicExch(&(node->writeLock), 0);
                    return false;
                }

                /*---------- synchronization of read access ----------*/
                int i = atomicInc(&(node->readN), cache_readers_number);
                node->hit++;
                data = node->data;
                if (i < cache_readers_number - 1)
                    atomicExch(&(node->readLock), 0);
                else {
                    atomicExch(&(node->readN), 0);
                    atomicExch(&(node->writeLock), 0);
                }
                return true;
            }
        }
    }
    return false;
}

/*------------------------ getFromCacheRes -----------------------*/
inline __device__ bool getFromCacheRes(int base_index, PerRayData prev_data,
                                       PerRayData &data, unsigned int pos_hash)
{
    if (cache_init) {
        /*------------------ get bucket by hash ------------------*/
        CacheKey key;
        key = makeKey(key, prev_data, false, cache_division, hash_method);
        int iind = getBucketIndex(key.hash);
        int bucket_ind = base_index + iind;
        CacheNode* node = &node_buffer[bucket_ind];

        /*----------------- acquire the read lock ----------------*/
        if (atomicCAS(&(node->readLock), 0, 1) == 0) {
            /*------ if hashes are not equal, calculate residual ------*/
            if (node->hash != key.hash) {
                float res1 = calculateResidual(node->data.nextDirection,
                                               prev_data.nextDirection);
                float res2 = calculateResidual(node->data.nextOrigin,
                                               prev_data.nextOrigin);
                float res = res1 + res2;

                /*---- if residual exceeds the threshold, purge ----*/
                if (res > cache_residual_value) {
                    atomicExch(&(node->readN), 0);
                    atomicExch(&(node->writeLock), 0);
                    return false;
                }
            }

            /*------------ synchronization of read access ------------*/
            int i = atomicInc(&(node->readN), cache_readers_number);
            node->hit++;
            data = node->data;
            if (i < cache_readers_number - 1)
                atomicExch(&(node->readLock), 0);
            else {
                atomicExch(&(node->readN), 0);
                atomicExch(&(node->writeLock), 0);
            }
            return true;
        }
    }
    return false;
}

/*----------------------------- makeKey --------------------------*/
inline __device__ CacheKey makeKey(CacheKey key, PerRayData prev_data,
                                   bool debug, int division, int hash_method)
{
    key.primitiveIndex = 0;
    key.instanceId = 0;
    key.div_x = 0;
    key.div_y = 0;
    key.div_z = 0;
    key.hash = 0;
    key.hash_gen = 0;
    key.setIndices(prev_data.instanceId, prev_data.primitiveIndex,
                   prev_data.nextDirection, hash_method, division);
    if (debug) {
        rtPrintf("primitive index %d\n", key.primitiveIndex);
        rtPrintf("instance id %d\n", key.instanceId);
        rtPrintf("div_x %d\n", key.div_x);
        rtPrintf("div_y %d\n", key.div_y);
        rtPrintf("div_z %d\n", key.div_z);
    }
    return key;
}

/*-------------------------- getBucketIndex ----------------------*/
inline __device__ uint getBucketIndex(unsigned int hash)
{
    return ((float)hash / UINT_MAX) * (num_of_buckets - 1) / 2;
}

/*--------------------------- searchInMap ------------------------*/
inline __device__ int searchInMap(int modelInstanceId)
{
    int mapSize = cache_emitter_map.size();
    for (int i = 0; i < mapSize; ++i) {
        EmitterMapEntry* entry = &cache_emitter_map[i];
        if (entry->modelInstanceId == modelInstanceId)
            return entry->base_index;
    }
    return -1;  // no matching emitter found
}

/*------------------------ calculateResidual ---------------------*/
inline __device__ float calculateResidual(float3 origin, float3 otherOrigin)
{
    float res = abs(origin.x - otherOrigin.x)
              + abs(origin.y - otherOrigin.y)
              + abs(origin.z - otherOrigin.z);
    return res;
}

/*------------------------ checkIntersections --------------------*/
inline __device__ bool checkIntersections(const CacheNode* node, float R)
{
    const PerRayData* data = &(node->data);
    for (int i = 0; i < antenna_buffer.size(); ++i) {
        const AntennaBufferEntry* antenna = &antenna_buffer[i];
        if (data->type == RECEIVER_HIT) {
            float3 recPos = antenna->position;
            float3 pos = data->nextOrigin;
            float dx = abs(recPos.x - pos.x);
            float dy = abs(recPos.y - pos.y);
            float dz = abs(recPos.z - pos.z);
            if (dx < R && dy < R && dz < R)
                return true;
        }
    }
    return false;
}
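The bit manipulations in separateBy4/mortonCode5 from Listing C.1 are easy to sanity-check on the host. The standalone copy below (plain C++, with the __device__ qualifier dropped) spreads each 7-bit input so that bit i lands at output position 5i before the five fields are interleaved:

```cpp
// Host copy of the listing's separateBy4: keep 7 bits and spread them
// so that input bit i ends up at output bit 5*i.
unsigned int separateBy4(unsigned int x) {
    x &= 0x0000007f;
    x = (x ^ (x << 16)) & 0x0070000F;
    x = (x ^ (x << 8))  & 0x40300C03;
    x = (x ^ (x << 4))  & 0x42108421;
    return x;
}

// Interleave five 7-bit fields into one Morton code (the top bits of the
// later fields fall outside 32 bits and are truncated, as in the listing).
unsigned int mortonCode5(unsigned int x, unsigned int y, unsigned int z,
                         unsigned int u, unsigned int v) {
    return separateBy4(x) | (separateBy4(y) << 1) | (separateBy4(z) << 2)
         | (separateBy4(u) << 3) | (separateBy4(v) << 4);
}
```

For example, separateBy4(0x7F) yields 0x42108421 (bits 0, 5, 10, ..., 30 set), and mortonCode5(1, 1, 1, 1, 1) packs the five low bits into 0b11111 = 31.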