ECS TECHNICAL REPORT 2012/01
http://www.dur.ac.uk/resources/ecs/research/technical reports/2012 01.pdf
ADAPTER: a scalable multi-resolution particle data
format∗
Djamel Hassaine† Nicolas S. Holliman† Adrian Jenkins‡
Tom Theuns‡
Abstract: We describe and test an adaptive multi-resolution data format (ADAPTER),
designed to analyse multi-Terabyte particle data sets in real time on a high-end
graphics desktop computer attached to a file server. The data are distributed over a
set of files, each of which contains a fraction of the data at a higher sampling rate,
using an algorithm based on one or more keys (for example spatial location) defined
by the user. The hierarchy of files based on a k-d tree allows for very rapid access
to either a large fraction of the data set at low sampling rate, or a small fraction at
full resolution, without increasing the total amount of data stored. This enables data
exploration as well as more in-depth analysis of a smaller fraction of the full data.
ADAPTER consists of a data format and associated programming interface imple-
mented in a client-server model designed to be scalable to very large data sets.
Keywords: scalable indexing, multi-resolution data format, point data
AMS subject classifications: 68P05, 68P10, 68P15, 68P20, 68W10, 68W15
1. Introduction
1.1. Particle data sets and file formats. Particles (or ‘point-cloud data’) are the basic
data unit found in a wide range of applications and research fields. For example, particles
are used to represent the mass in the Universe in cosmological numerical simulations [5,
e.g.] and the topography measured by airborne laser scanning [1, e.g.]. The most general
application involves removing rendering redundancies in highly detailed 3D models by
using points instead of polygonal primitives such as triangles; this technique is known as
Point-based rendering (PBR) [11, 17] and is of great benefit when the projected primitives
are smaller than the pixels of the display screen. Particle data sets are rapidly growing in
size owing to the increasingly powerful computers used to perform simulations and the increasingly capable scanning
devices. For example, state-of-the-art supercomputer simulations are producing Terabytes
and even Petabytes of data. The sheer size of the snapshot files hampers efficient exploita-
tion of the full richness represented by the data. This is exacerbated by the fact that the
supercomputers used to generate the simulations are generally far more powerful than the
computers available for the data analysis process.
∗This work was supported by the EPSRC research grant EP F01094X.
†Institute of Advanced Research Computing, School of Engineering and Computing Sciences,
Durham University, Science Laboratories, South Road, Durham DH1 3LE
‡Institute of Computational Cosmology, Department of Physics, Durham University,
Science Laboratories, South Road, Durham DH1 3LE
A number of self-describing data file formats have been developed to store, process
and navigate through such large volumes of data efficiently. Popular file formats are:
Hierarchical Data Format (latest version is HDF5 [23]), Network Common Data Format
(NetCDF) [16], Planetary Data System (PDS) [10, 12], Flexible Image Transport System
(FITS) [24], and VAPOR and the associated Multi-resolution Toolkit (MTK) [4]. These
file formats contain metadata, which store descriptive details of all the data sets and multi-
dimensional arrays within the file, e.g. type of data and dimensions used to store it, file
offset, array size, etc. The metadata allows a corresponding runtime library to identify
and directly access the different arrays, or even sub-arrays, within the file, which greatly
improves I/O performance.
An alternative method of improving access and navigation performance is by using
adaptive spatial indexing techniques and multi-resolution access. Unfortunately, neither
of these functions are widely supported or even available with current file formats. This
paper explores the potential benefits of an adaptive multi-resolution data file format called
ADAPTER. We illustrate this new format on a particle data set of a large cosmological
simulation (GIMIC, [5]). The simulation uses ∼ 200 M particles to follow the formation
of galaxies in a cosmological setting. Each particle is endowed with many properties (32
in the case of GIMIC, e.g. density, temperature, star formation rate) in addition to its
3 dimensional (3D) position and velocity. As the simulation code marches the system
forward in time, it outputs all the particle properties at specified times in a ‘snapshot’ file
of approximately 35 Gbytes. ADAPTER reformats this snapshot file as described below.
1.2. Motivation. Extracting even small amounts of data from massive data sets is often
computationally expensive, and interacting with the data visually is almost impossible
without access to a supercomputer. In addition rendering billions of particles on a typical
display with only about one million pixels is obviously redundant [22]. When exploring
data with a large dynamic range it is therefore more efficient to first analyse and visualise
the data set at lower resolution, identify regions of interest, then ‘zoom’ into those regions
and exploit the data set at its full resolution. This ability to find and load only the
necessary data requires an adaptive spatial index with multiple resolutions or scales of the
data.
1.2.1. Requirements. From our own science researchers and a literature survey we have
identified the need for a multi-resolution particle-based data format with the following
requirements.
• R1: Spatial indexing generation using hierarchical techniques to exploit locality in
unstructured particle/point data.
• R2: Support efficient search for dense regions of the data at multiple resolutions.
• R3: Set-theoretic operations (intersection, union, difference) to extract and merge
sub-sets of the data, including the ability to extract a sub-set at any resolution up to
the highest resolution in the data.
• R4: Preservation of key physical properties of the data at multiple scales.
• R5: Particle-ID index that uses IDs consistent across timesteps allowing efficient
tracking of the path of specific particles through time.
• R6: Inherently scalable and parallel in operations so that data can be processed
on High-End Computing (HEC) and High Performance Computing (HPC) scale
computers.
• R7: The original data must be stored without any data replication since the storage
costs are already prohibitive.
2. Related work
Of the file formats mentioned in the introduction, only VAPOR supports multi-
resolution access to the data; this is achieved by using a lossless wavelet encoding of the
data. However, VAPOR is limited to point data organised on regular grids. While re-
sampling irregular point distributions to a grid is possible, it is not always desirable, since
smoothing artifacts can occur and disk storage can be an issue if an original copy of the
data must also be stored.
Despite their inherent problems, current file formats remain
widely used (HDF5 probably being the most popular). As a result, a number of solutions
have been proposed to extend the functionality of the HDF file format without having to
modify the existing HDF data sets. For example, Nam and Sussman [14] implemented
a generic indexing library based on R*-trees, which stores the minimum and
maximum values for each dimension of the data chunks. The index is built separately from
the data set so the internal structures of the file formats are left unmodified. Significant
performance gains were shown over using the standard HDF5 library; however, their
method does not address the need for multi-resolution access to the data.
Gosink et al. [7] also designed an indexing library, dubbed HDF5-FastQuery, to be
used in conjunction with the HDF file format. They use bitmap indexing, which can
provide significant performance gains for compound multi-dimensional queries, e.g. find
all particles with (temperature > 1000) AND (70 < pressure < 90). A bitmap index stores
a separate bitmap for each possible value or range of values (bin) for every attribute or
variable. Each bitmap indicates with a zero or one whether each record in the data set has
the bitmap value or range of values. The space requirement for the index is high and can
be as large as n^2, where n is the number of records, if every record has a unique value.
The use of compression techniques, such as those used by FastBit, can bound the worst-case
index size by 4n words. Bitmap indexing is effectively a fixed-grid method and is
unlikely to work well for non-uniformly distributed data.
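As a concrete illustration of the idea (this is our own minimal sketch, not the HDF5-FastQuery implementation; the attribute values and bin boundaries are invented), a compound query reduces to a bitwise AND of per-bin bitmaps:

#include <stdint.h>
#include <stdio.h>

#define N 8   /* number of records in this toy example */

/* Invented attribute values; a real index would read these from the file. */
static const float temperature[N] = { 500, 1200, 3000, 800, 1500, 950, 2500, 1100 };
static const float pressure[N]    = {  75,   80,   20,  85,   60,  72,   88,   95 };

/* One bitmap per bin: bit i is set if record i falls inside the bin.
   With N <= 64 records a single 64-bit word per bitmap suffices here. */
static uint64_t bin_bitmap(const float *attr, float lo, float hi)
{
    uint64_t bm = 0;
    for (int i = 0; i < N; i++)
        if (attr[i] > lo && attr[i] < hi)
            bm |= (uint64_t)1 << i;
    return bm;
}

int main(void)
{
    /* Bitmaps for the bins touched by the query
       (temperature > 1000) AND (70 < pressure < 90). */
    uint64_t hot = bin_bitmap(temperature, 1000.0f, 1e30f);
    uint64_t mid = bin_bitmap(pressure, 70.0f, 90.0f);

    uint64_t hits = hot & mid;          /* compound query: bitwise AND */
    for (int i = 0; i < N; i++)
        if (hits & ((uint64_t)1 << i))
            printf("record %d matches\n", i);
    return 0;
}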
Pascucci and Frank [15] consider the problem of hierarchical indexing for out-of-core
access to multi-resolution data. They use an index scheme based on the Z-order curve, a
space filling curve, which has the useful property of following points in the same order at
different scales. The use of the Z-order curve is effectively the same as using an Oct-tree
but without any pointers. There is no data replication involved in extracting data at the
different resolutions, which is highly desirable. However, the data at each resolution is
sub-sampled in a regular manner, which does not reflect random sampling and may not be
appropriate for certain scientific applications. Unfortunately, the implementation is also
limited to regular grid data.
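For reference, a Z-order (Morton) key is obtained by interleaving the bits of the integer grid coordinates; the sketch below, which assumes 10-bit coordinates in three dimensions, is only one common layout and not necessarily the exact scheme of [15].

#include <stdint.h>
#include <stdio.h>

/* Interleave the low 10 bits of x, y and z into a 30-bit Morton (Z-order) key.
   Bit k of x, y and z ends up at bits 3k, 3k+1 and 3k+2 of the key, so cells
   that are close in space receive nearby keys at every scale. */
static uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t key = 0;
    for (int bit = 0; bit < 10; bit++) {
        key |= ((x >> bit) & 1u) << (3 * bit);
        key |= ((y >> bit) & 1u) << (3 * bit + 1);
        key |= ((z >> bit) & 1u) << (3 * bit + 2);
    }
    return key;
}

int main(void)
{
    printf("key of cell (3,5,1) = %u\n", morton3d(3, 5, 1));
    return 0;
}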
Spatially indexing data with a space-filling curve method could potentially be ex-
tended to non-uniformly distributed data by building a spatial index, e.g. an Oct-tree,
on top of the space-filling curve. The spatial index can then be used to group sparsely
populated cells into larger cells. One problem with this approach is that with highly
clustered data, some cells from the original space-filling curve may contain more particles
than are desirable. In this case, the order of the curve must be increased in the simulation
run, which increases the key size. It is not hard to imagine simulations in which the data
is so densely clustered that increasing the key range is not an appropriate solution.
An area where a number of groups have designed multi-resolution structures for un-
structured point-cloud and particle data sets is in visualization and rendering algorithms.
Hopf and Ertl [9] describe a method of re-sampling the data set using principal compo-
nent analysis (PCA) and indexing the clusters at different granularities. Data is com-
pressed in a lossy manner, which means many scientists would also want to preserve
the original data set, thus dramatically increasing storage costs. Furthermore, Yeung and
Ruzzo [25] empirically compared the quality of clusters obtained from the original data
set and PCA and showed that this method captured the cluster structure poorly. It was
found that although the first few principal components contained most of the variance in
the data, they did not necessarily capture most of the cluster structure.
3. Design
The overall design of ADAPTER follows a server/client model with the server storing
the data in a specialized k-d-tree data-structure and managing the clients’ requests for
data. The server/client model was chosen because the sheer scale of the data sets restricts
their storage to powerful and expensive HEC systems or, possibly, smaller cluster-based
systems. Scientists using their desktop computers can remotely access small quantities
of the data by sending queries to the server. Furthermore, trivial operations, e.g. initial
exploratory analysis, can be offloaded to the client; this frees up more CPU cycles and I/O
time on the server and data storage system thus allowing more clients to efficiently access
the data set. Computationally demanding operations, e.g. performing analysis of the data
at full resolution, can be performed on the server system itself before the final results are
sent back to the client.
3.1. Assumptions on the data. We have made some general assumptions on the struc-
ture of particle data sets. The structure of a single particle from a typical particle data set
is illustrated in Figure 1: at the very minimum, each particle has a unique ID. The full
data set is composed of multiple time-steps (‘snapshots’), with each snapshot potentially
containing multiple types of particles.
Figure 1: Structure of a particle from a typical particle data set.
3.2. Indexing strategy. Most database management systems allow efficient updates to
the data as well as providing search mechanisms for data sets which are too large to
reside in main memory; the underlying data structures are therefore dynamic. The B-
tree is a good example: it allows searches, insertions, and deletions in logarithmic
amortized time [2]. However, the problem with dynamic spatial indexes is that they suffer
from poor storage utilisation because data is inserted into the structure at the leaf nodes,
and when a page or bin becomes full, partitions must be propagated upwards. Static data
structures, on the other hand, can partition the data in a top-down manner using knowledge
of the entire data set, greatly optimising storage utilisation. Typically data produced from
a computer simulation, e.g. the Millennium Simulation [20] using GADGET [21, 19],
will never be modified after completion; therefore, we will take advantage of the greater
storage utilisation static data structures offer.
The k-d-tree [3] is a versatile multidimensional data structure with better overall
performance properties than many other types of data structure; for example, Havran [8]
studied a large number of different ray-shooting acceleration schemes, including BSP-tree
and Oct-tree based schemes, and found that on average the k-d-tree based scheme
performed best. The k-d-tree is essentially a binary tree with every node representing a
hyper-plane that divides the underlying space into two subspaces. At each level of the
tree only one attribute is chosen as the discriminator (the variable to sort and divide the
points by); see Figure 2. The number of intersections a query must test is determined by how well the sub-
volumes of the tree enclose the objects or groups of points; the k-d-tree is generally more
efficient at partitioning the data than methods such as the Oct-tree and uniform space
subdivision, which is the most likely reason for its superior performance.
Figure 2: An example of a static bucket k-d-tree variant built from a list of two dimen-
sional points using a bucket size of two. The first step involves sorting the particles in
ascending order according to their x coordinate value and forming a hyperplane at the
median value to divide the data into two volumes; this step is repeated for the left and
right sub-volumes, but with the points sorted along the y-axis.
We propose to use a static bucket variant of the k-d-tree [18] for indexing with each
leaf node corresponding to a disk bucket or block of capacity b. A bucket is only split
when its cardinality exceeds b. The leaf nodes only point to the location
where the data points are stored on disk, i.e. the buckets are stored as files
and the leaf nodes store the file name. In order to explain our indexing strategy in more
detail we use the example of indexing spatial coordinates in this section and for the rest
of this manuscript (the same indexing scheme can be applied to any of the other attributes
in the data set). Indexing the spatial coordinates involves cycling through the x, y, z axes
in a predefined and constant order, sorting the points and setting the hyperplane at the
median point. Figure 2 illustrates an example of building a static bucket k-d-tree, with
a bucket size of b = 2, on the following list of 2D coordinates {(35,42) (52,10) (62,77)
(82,65) (5,45) (27,35) (85,15) (90,5)}.
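To make the construction concrete, the following self-contained sketch (ours, not the ADAPTER implementation) builds a static bucket k-d-tree for the 2D list above and prints the resulting splits and buckets; the choice of sending the median point to the right half is an arbitrary tie-breaking detail.

#include <stdio.h>

#define DIM 2
#define BUCKET 2   /* bucket capacity b */

typedef struct { float c[DIM]; } Point;

/* Insertion sort of p[lo..hi) on one coordinate axis (fine for a toy example). */
static void sort_axis(Point *p, int lo, int hi, int axis)
{
    for (int i = lo + 1; i < hi; i++) {
        Point key = p[i];
        int j = i - 1;
        while (j >= lo && p[j].c[axis] > key.c[axis]) { p[j + 1] = p[j]; j--; }
        p[j + 1] = key;
    }
}

/* Recursively partition p[lo..hi): sort on the current axis, place the
   hyperplane at the median point, and recurse on the two halves with the
   next axis.  A range that fits in one bucket becomes a leaf (printed here;
   ADAPTER would instead write it to a bucket file). */
static void build(Point *p, int lo, int hi, int axis, int depth)
{
    int n = hi - lo;
    if (n <= BUCKET) {
        printf("leaf (depth %d):", depth);
        for (int i = lo; i < hi; i++) printf(" (%g,%g)", p[i].c[0], p[i].c[1]);
        printf("\n");
        return;
    }
    sort_axis(p, lo, hi, axis);
    int median = lo + n / 2;
    printf("split (depth %d): axis %d at %g\n", depth, axis, p[median].c[axis]);
    build(p, lo, median, (axis + 1) % DIM, depth + 1);
    build(p, median, hi, (axis + 1) % DIM, depth + 1);
}

int main(void)
{
    /* The eight 2D points of the example, bucket size b = 2. */
    Point pts[] = { {{35,42}}, {{52,10}}, {{62,77}}, {{82,65}},
                    {{5,45}},  {{27,35}}, {{85,15}}, {{90,5}} };
    build(pts, 0, 8, 0, 0);
    return 0;
}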
3.2.1. Multi-resolution indexing. While the static indexing strategy described above
has excellent disk optimization, since every leaf node will be approximately the same size
and nearly full, all the data resides at the bottom of the tree. Multi-resolution access could
be achieved by randomly shuffling the points in each leaf node before storing them to the
disk: if, for example, we wanted a region of interest at 10 per cent resolution, a number
of intersecting leaf nodes would have to be identified and then only the first 10 per cent of
the leaf nodes’ data would need to be read in and clipped. However, for extremely large
data sets the index would be huge and a lot of CPU cycles would be required to traverse
down the tree even to extract very small amounts of data; therefore, we do not believe this
method would scale very well.
We propose a method of dividing the data set up into smaller and more manage-
able pieces in such a way that indexing these smaller data sets will allow efficient multi-
resolution access to the original data. We take advantage of the fact that the particle nature
of the data lends itself to creating low resolution versions, simply by random sampling.
This approach will, in an unbiased way, conserve mass, momentum and other particle
attributes provided these are appropriately weighted by the selection probability [13].
Random sampling will of course introduce a Poisson sampling error into estimates of any
given quantity. Some introduction of error is inevitable in a process which degrades the
resolution. One advantage of Poisson sampling is that the induced errors can often be
estimated trivially from the low-resolution data itself.
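A minimal sketch of such a weighted random sub-sample (the particle structure and function names are ours; only the mass is re-weighted here for brevity):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical particle record: only the fields needed for the example. */
typedef struct { float mass; float pos[3]; } Particle;

/* Select each particle independently with probability p and scale additive
   attributes (here just the mass) by 1/p, so the expected total mass of the
   sample equals the total mass of the input.  Returns the sample size, which
   is Binomial(n, p) and therefore carries a Poisson-like sampling error. */
static size_t subsample(const Particle *in, size_t n, double p, Particle *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if ((double)rand() / RAND_MAX < p) {
            out[m] = in[i];
            out[m].mass /= (float)p;   /* weight by the selection probability */
            m++;
        }
    }
    return m;
}

int main(void)
{
    enum { N = 100000 };
    static Particle in[N], out[N];
    for (size_t i = 0; i < N; i++) in[i].mass = 1.0f;

    size_t m = subsample(in, N, 0.01, out);       /* 1 per cent resolution */
    double total = 0.0;
    for (size_t i = 0; i < m; i++) total += out[i].mass;
    printf("%zu particles kept, weighted total mass = %g (true total %d)\n",
           m, total, N);
    return 0;
}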
Figure 3: Our multi-resolution indexing approach involves multiple stages of sub-
sampling and indexing.
To achieve scalable multi-resolution indexing we repeat a process of random sub-
sampling, division and indexing of the data set. The first stage involves generating a
coarsely sampled data set from the original data and indexing it with the static bucket k-
d-tree. Each leaf node in this index also represents a coarsely sampled region or volume
(assuming spatial indexing) of the entire data set. Presumably, if the sampling rate is low
enough, this first-level index and data set can be uploaded very quickly to the client-
side desktop and navigated/rendered with relative ease. The remaining non-indexed data
is then divided up into the regions represented by the leaf nodes and sub-sampled again.
The sub-sampled data is indexed separately so that a number of different indices will have
been built. The leaf nodes from the first index are then linked to the appropriate smaller
region indices. These steps are repeated until no further data is left to be indexed, which
depends on the sampling rate at each stage. Larger data sets will likely need more levels
of sub-sampling and indexing so that there is a sufficient range of multi-resolution access.
Figure 3 illustrates the concept of multi-resolution indexing of the spatial properties
for a single time-step snapshot file.
A further advantage of this approach of dividing the data up into smaller pieces is that
in cases where the index is too large to fit in the main memory, many smaller indices can
be more conveniently distributed or accessed out-of-core.
3.3. Minimizing the index storage requirements. One of the objectives of ADAPTER
is that it should be scalable up to very large data sets. Therefore, we cannot assume
the server system will be powerful enough to store the entire data set in memory. Many
database management systems often assume even the index will be too large to fit into
memory and so use out-of-core indexing structures such as the B-tree. However, out-of-
core access to the index would severely impede query performance and is probably too
severe a constraint for our requirements. We will assume that the point data is accessed
out-of-core, but the index can be stored in main memory; in order to achieve this goal the
index size will be minimized with the following techniques:
• Use of implicit arrays so that node pointers can be eliminated (see section 3.3.3).
• Storing the internal node discriminator and the leaf node data as a union; a union
in C/C++ allows several types of data to be stored in the same location. The
discriminator values of the internal nodes are accessed as floats, whereas the leaf
node data are accessed as integers (a sketch of such a node layout follows this list).
• The leaf node data consists of three pieces of information which are bit packed
together into a single integer to further save space (see section 3.3.1).
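A minimal sketch of such a node layout (the type and field names are ours, and the 32-bit case is assumed):

#include <stdint.h>

/* One slot of the implicit index array.  An internal node stores its
   discriminator (the position of the splitting hyperplane) as a float; a
   leaf node stores the bit-packed leaf data as an unsigned integer, so both
   kinds of node occupy the same four bytes. */
typedef union {
    float    discriminator;   /* internal node: hyperplane position        */
    uint32_t packed_leaf;     /* leaf node: ID plus local/next-index flags */
} KdNode;

/* The whole index is then simply a flat array of these unions, e.g.
   KdNode index[CAPACITY], traversed as described in section 3.3.3. */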
3.3.1. Bit packing the leaf node data. The leaf node requires three pieces of informa-
tion: leaf node ID (which is also used to derive the file name where the data points in
that region are stored), a boolean flag determining whether the data is stored locally or
externally (more on this in section 3.4), and a boolean flag to determine whether the leaf
node points to another index, i.e. the next level of detail. The boolean flags can be stored
using just one bit each, leaving either 30 or 62 bits to store the leaf node ID, depending
on the system architecture (32-bit or 64-bit machine); this allows approximately 1 × 10^9
or 4.6 × 10^18 unique IDs and leaf nodes. The entire data set can be divided up into many
smaller data sets that are then indexed separately, i.e. two different indices can use the same
IDs for their leaf nodes; the number of available unique IDs should therefore be more
than ample for any size of data set.
The following snippet of C code shows the function responsible for extracting the
information from the packed leaf node data (assuming a 32 bit architecture):
void implicit_kdtree_unpack_address(unsigned int packed_address,
                                    unsigned int *leaf_node_id,
                                    unsigned int *local,
                                    unsigned int *next_index)
{
    /* bit 0: does this leaf point to a next level-of-detail index? */
    unsigned int next_index_flag = packed_address;
    next_index_flag &= 0x00000001;

    /* bit 1: is the leaf's point data stored locally or on another node? */
    unsigned int local_flag = packed_address;
    local_flag &= 0x00000002;
    local_flag = local_flag >> 1;

    /* upper bits: the leaf node ID (here taken from bits 16-31) */
    unsigned int leaf = packed_address;
    leaf &= 0xffff0000;
    leaf = leaf >> 16;

    *leaf_node_id = leaf;
    *next_index = next_index_flag;
    *local = local_flag;
}
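For completeness, a sketch of the corresponding packing step, mirroring the bit layout used by the unpacking routine above (the function name is ours):

/* Pack the leaf node ID into the upper bits and the two boolean flags into
   bits 1 (local) and 0 (next index), mirroring the unpacking routine above. */
unsigned int implicit_kdtree_pack_address(unsigned int leaf_node_id,
                                          unsigned int local,
                                          unsigned int next_index)
{
    unsigned int packed_address = 0;
    packed_address |= (leaf_node_id & 0x0000ffff) << 16;
    packed_address |= (local & 0x1) << 1;
    packed_address |= (next_index & 0x1);
    return packed_address;
}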
3.3.2. Deriving the point data file name. Each leaf node, defining a region of the data
set, points to a file containing the corresponding particle data. The file names are derived
from a simple naming scheme: the file names are composed of the index ID appended
with the leaf node ID. Each index has a unique ID which also follows a simple naming
scheme: the index ID for the first level-of-detail index is defined as “0”, and the IDs for the
subsequent indices are derived by appending the ID of the parent index and that of the leaf
node (each leaf node points to the next level-of-detail index). Figure 4 illustrates an
example of the naming process.
Figure 4: Naming convention scheme.
Unfortunately most modern file systems, e.g. NTFS, ext3 and HFS Plus, limit the
length of file names to a maximum of 255 characters. Technically this limits the scalability
of our solution; however, it still leaves potentially up to 1 × 10^253 − 1 unique leaf
node file names for just the first level-of-detail index, and in total many more leaf nodes
and levels of detail, so in practice this is unlikely to be problematic. Furthermore,
if needed, the range of possible leaf nodes and indices could be extended significantly by
using all available characters (of which there are more than 65,000).
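The naming scheme amounts to simple string concatenation; a minimal sketch (buffer size and names are ours):

#include <stdio.h>

/* Derive the name of the file holding a leaf node's particles by appending
   the leaf node ID to the ID of the index that owns the leaf; the same rule
   yields the ID of the next level-of-detail index linked from that leaf. */
static void leaf_file_name(const char *index_id, unsigned int leaf_node_id,
                           char *name, size_t len)
{
    snprintf(name, len, "%s%u", index_id, leaf_node_id);
}

int main(void)
{
    char name[255 + 1];                 /* file systems cap names at 255 chars */
    leaf_file_name("0", 17u, name, sizeof name);   /* first-level index, leaf 17 */
    printf("%s\n", name);               /* prints "017" */
    return 0;
}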
3.3.3. Process of storing the index in an implicit array. Implicit arrays eliminate the
need for using two pointers per node in the tree (pointers from parent to the left and right
child nodes); furthermore, the index tree can be saved and loaded from disk far more
efficiently as there is no need to rebuild the entire path of pointers. Building an implicit
index is simple: if a node is stored at index k in the array, then its left child is stored at
2k + 1 and its right child at 2k + 2 (see Figure 5). A caveat to storing a binary
tree or similar type index in an implicit array is that the tree must be complete, i.e. every
level of the tree, except possibly the deepest level, must be filled completely and on the
deepest level all the nodes must be as far left as possible. This presents a problem, as our
partitioning strategy of splitting at the median position can result in a tree in which the
height difference between two leaf nodes is up to one, and which therefore may not be
complete (see [6] for more details).
Figure 5: Storing the index tree using an implicit array requires a balanced tree. In an
implicit array, if a node is stored at index k, then its left child will be stored at 2k+1 and
its right child will be stored at 2k+2.
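Assuming the union node layout sketched in section 3.3 and a tree that has been completed so that all leaves sit at the same depth, a point query descends the implicit array without any pointers; the function below is our own illustration.

#include <stddef.h>
#include <stdint.h>

typedef union {            /* repeated here so the sketch is self-contained */
    float    discriminator;   /* internal node */
    uint32_t packed_leaf;     /* leaf node     */
} KdNode;

/* Descend a complete implicit k-d-tree whose leaves all lie at level `depth`,
   cycling the discriminator axes exactly as during construction, and return
   the packed leaf data of the leaf containing the query point q.  The result
   is decoded with the unpacking routine of section 3.3.1. */
static uint32_t locate_leaf(const KdNode *tree, int depth, int dims,
                            const float *q)
{
    size_t k = 0;                                 /* root at array index 0 */
    for (int level = 0; level < depth; level++) {
        int axis = level % dims;                  /* discriminator axis    */
        k = (q[axis] < tree[k].discriminator)
                ? 2 * k + 1                       /* left child            */
                : 2 * k + 2;                      /* right child           */
    }
    return tree[k].packed_leaf;                   /* k now indexes a leaf  */
}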
To satisfy the constraint that the tree be complete, we can quite easily convert the in-
dex tree into a perfect binary tree, i.e. one in which every node except the leaf nodes has
two children and all the leaf nodes are at the same depth or level; a perfect tree is also
complete. Our solution involves duplicating certain nodes and adding discriminator val-
ues that do not affect the bounding boxes of the original leaf nodes and thus any searching
algorithm or query will give the same results regardless of whether it is performed on the
original or converted tree. The algorithm works as follows: when a leaf node is found
whose depth differs from the bottommost level of the tree, it is converted into an
internal node and its original leaf node ID is assigned to two identical child nodes; the
discriminator value of this newly converted internal node is calculated by determining the
upper right corner of its bounding box and taking the coordinate value along the
current discriminator axis. Figure 6 illustrates this conversion algorithm more clearly.
What is the storage cost advantage of using an implicit array over pointers? In the
worst-case scenario all the leaf nodes are on the same level except for two which are
one level deeper. Converting this tree into a perfect tree requires the addition of 2^d − 2
nodes, where d is the maximum depth of the tree; ultimately an array of 32- or 64-bit
integers/floats with a capacity of 2^(d+1) − 1 will be required to store the converted tree.
However, the original tree stored using pointers requires 2^d + 1 nodes, a boolean flag for
each node to determine whether it is an internal or leaf node, and 2^d pointers, resulting
in a total of 2^(d+1) + 2^d + 2 variables (32 or 64 bit). Therefore, even in the worst-case
scenario, the implicit array saves on storage cost. Furthermore, we also benefit from
the more efficient access routines of the implicit array as opposed to traversing the tree
through its pointers.
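For example (taking d = 20 purely for illustration), the implicit array needs 2^21 − 1 ≈ 2.1 million entries, whereas the pointer-based tree needs 2^21 + 2^20 + 2 ≈ 3.1 million variables of the same width, a saving of roughly one third.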
Figure 6: Completing the tree.
3.4. Parallel search process. In order to achieve scalability, we have designed and im-
plemented ADAPTER to work on a distributed computer system, e.g. a cluster of PCs.
Distributed computing is generally considered the most scalable and cost effective solu-
tion as processing nodes can simply be added to the network as required. The message
passing model is generally used for programming distributed systems, and the Message-
Passing Interface (MPI) is the most widely used library API for message passing and
related operations. The message-passing model assumes each processor has its own in-
dependent memory, but can communicate with other processors by sending and receiving
messages; data transfer from the local memory of one processor to another processor’s
memory requires explicit communication operations to be performed by both processors.
The index was designed to be as small as possible in order to fit into the main memory
of each individual PC. Therefore, our scalable solution involves storing an entire copy of
the index on every processor/node but randomly distributing the actual point data associated
with the leaf nodes amongst the processors; the indices on each processor must also be
updated to determine whether the leaf nodes’ point data is stored locally or on another
processor’s disk. During a query the indices are traversed by each CPU in parallel without
requiring any communication, and the extracted data from each processor is accumulated
or collected by the master processor. With this method the processors can perform the
query almost independently; the only communication required between the processors
is at the end, when the data must be collected. There is no need for complicated load
balancing algorithms and there are no parent-child dependencies or other common issues
with parallel programming.
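A minimal sketch of this gather pattern using MPI; the local extraction is replaced by dummy data, since the point of the example is only the communication structure:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every process holds a full copy of the index and traverses it
       independently; here the locally extracted points are faked. */
    int nlocal = 3 + rank;
    float *local = malloc(nlocal * sizeof *local);
    for (int i = 0; i < nlocal; i++) local[i] = (float)rank;

    /* The master first learns how many points each process found ... */
    int *counts = NULL, *displs = NULL, total = 0;
    float *all = NULL;
    if (rank == 0) counts = malloc(nprocs * sizeof *counts);
    MPI_Gather(&nlocal, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        displs = malloc(nprocs * sizeof *displs);
        for (int i = 0; i < nprocs; i++) { displs[i] = total; total += counts[i]; }
        all = malloc(total * sizeof *all);
    }
    /* ... and then collects the extracted data in a single step. */
    MPI_Gatherv(local, nlocal, MPI_FLOAT,
                all, counts, displs, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("collected %d points from %d processes\n", total, nprocs);
        free(counts); free(displs); free(all);
    }
    free(local);
    MPI_Finalize();
    return 0;
}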
4. Performance evaluation
For the purposes of a simple evaluation we are interested in the following three types of
query, which are quite common in data analysis:
• Extract a low resolution sub-sample of the entire volume of data; this is very useful
to quickly visualize the data.
• Extract a large region of interest from the volume at varying resolutions.
• Extract a small region from the volume at varying resolutions.
Currently the most common approach to carrying out these queries is a simple brute
force approach, i.e. using standard HDF access routines to read in all the data and prune or
clip the unwanted points. An alternative in-house built solution at the Institute for Compu-
tational Cosmology (ICC) at Durham University reorganises the points in HDF data files
using an Oct-tree data-structure. The implementation currently only allows serial execu-
tion but should still provide a more challenging competing algorithm than the brute force
approach. We will record and compare the time taken to complete the queries described
above with ADAPTER and the two alternative methods mentioned, on an example data set
from a snapshot of GIMIC, which consists of 180 million particles distributed across 512
HDF5 file parts. These original individual files each contain all particles in several cubic
blocks, which are ordered along a space-filling curve that traverses the entire computa-
tional volume [21, 19]. This spatial division is useful during the actual simulation itself, as
it preserves data locality and minimises inter-processor communication. Individual files
then simply contain all particles acted upon by a given processor core during the simula-
tion. In simulations with a large dynamic range, this division results in a potentially large
load imbalance: to avoid this, each individual processor gets more than one contiguous
section of the space-filling curve. The net result is that each individual file that makes up
the snapshot contains particles that are distributed across the computational volume.
We applied the ADAPTER algorithm with three levels of detail/resolution and the buck-
et/bin size of the leaf nodes set to a maximum of b = 10^5 particles. The sampling rate
was chosen so that the number of sub-sampled particles increases approximately
exponentially from one resolution step to the next. The resulting index was com-
posed of 131 sub-sampled indices, taking up 24.2 kB of disk space, and 2178 leaf node
files containing the actual point data, taking up 2.6 GB of disk space. This compares
favourably with the Oct-tree approach, which required 2 MB of storage for just the index;
although both indices are small in absolute terms, indexing extremely large data sets will
require a compact indexing data structure, otherwise the index may become too large to fit
in main memory.
Table 1 summarises the recorded wall time for the different queries and data extrac-
tion methods. We only show the timings of the brute force method, using HDF5 access
routines, for one query (extracting 9M points from a large region of interest) because re-
gardless of the type of query all the data has to be read in and clipped; therefore, all the
timings will be similar or even worse than those shown. Both the Oct-tree approach and
ADAPTER are vastly superior to the brute-force approach, being at least 10 times faster with
a single core.
The Oct-tree approach performs slightly better than ADAPTER when it comes to ex-
tracting large regions at full resolution. This is because the HDF5 access routines are
very efficient at extracting large amounts of contiguous data from the 512 files, whereas
the ADAPTER index in this case may have to access up to 2178 files, which would require
more random disk seeking time. However, the results show that ADAPTER has a modest
advantage when retrieving smaller regions of the volume at full resolution (about 25 per
cent performance improvement). This performance increase arises because less I/O and
CPU time is spent pruning points: the data is distributed amongst more files and the
index allows the desired points to be targeted more accurately.
Method / type of query                                    Time (sec)
                                                       1 core   5 cores
Brute force (HDF5 routines)
  large region at full resolution (9M points)           18.15      5.17
ADAPTER
  large region at full resolution (9M points)            1.86      0.48
  small region at full resolution (500K points)          0.16      0.05
  large region at medium resolution (430K points)        0.12      0.03
  entire region at low resolution (80K points)           0.01      0.01
  small region at medium resolution (23K points)         0.01      0.01
  large region at low resolution (4K points)             0.01      0.00
Oct-tree (1 core only)
  large region at full resolution (9M points)            1.50         -
  small region at full resolution (500K points)          0.20         -
  large region at medium resolution (430K points)        0.23         -
  entire region at low resolution (80K points)           0.12         -
  small region at medium resolution (23K points)         0.05         -
  large region at low resolution (4K points)             0.05         -
Table 1: Comparison of performance results for ADAPTER, an Oct-tree approach and a
brute-force method using standard HDF5 access routines for various types of queries.
The results show that the main advantage ADAPTER offers is for queries extracting
sub-samples of the data: performance improvements of up to a factor of five over the
Oct-tree method were observed, demonstrating that the concept of dividing the data and
the indexing structure works well. The results also suggest that ADAPTER is scalable;
performance scales almost linearly as the number of cores used increases from one to
five. The only situation where performance did not increase with additional cores was the
extraction of a small sub-sampled region, because the data was stored in very few leaf
nodes and most of the cores would therefore have been idle. Our solution of randomly
distributing the point data amongst the cores while replicating the index is simple but
effective.
An advantage of storing the points in many small files that is not immediately obvious
from the results is the ability to compress the entire data set once and then uncompress
only smaller selected regions of it. HDF recognises this advantage by providing data chunking;
ADAPTER effectively provides a multi-resolution data chunking solution. Enabling com-
pression (zlib library with default settings) and rebuilding the index on the example data
leads to 2 GB of storage requirements, a 23 per cent improvement. Of course, enabling
compression will degrade performance slightly.
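With zlib's one-shot API, compressing one leaf file's point data in memory might look like the following sketch (default settings, as used for the figure quoted above; buffer handling is simplified):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Compress one leaf node's point data with zlib's default settings.
   Returns the compressed size, or 0 on failure. */
static uLongf compress_bucket(const unsigned char *points, uLong nbytes,
                              unsigned char **out)
{
    uLongf size = compressBound(nbytes);          /* worst-case output size */
    *out = malloc(size);
    if (*out == NULL) return 0;
    if (compress(*out, &size, points, nbytes) != Z_OK) {
        free(*out);
        *out = NULL;
        return 0;
    }
    return size;                                   /* actual compressed size */
}

int main(void)
{
    static unsigned char data[100000];             /* stand-in for a leaf file */
    unsigned char *packed = NULL;
    uLongf packed_size = compress_bucket(data, sizeof data, &packed);
    printf("%lu -> %lu bytes\n", (unsigned long)sizeof data,
           (unsigned long)packed_size);
    free(packed);
    return 0;
}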
5. Conclusions
We have demonstrated with ADAPTER how very large (HPC scale) particle data sets can
be stored in such a way that users can focus on the more important aspects directly rel-
evant to the information they are trying to extract. Access and retrieval speeds are high
enough that interactive analysis of large data sets becomes possible, even on a
desktop computer: access performance was at least ten times better than direct HDF5
access routines and up to five times better than an Oct-tree approach.
A key component of ADAPTER is a new data format based on a ‘multi-resolution
parallel k-d-tree’. The index is compact (storage costs are relatively small) and very
efficient at extracting sub-samples of the data. Performance scales well over multiple
cores via a simple method of replicating the index but randomly distributing the point data.
This allows a low resolution view of the data to be previewed quickly. Subsequently those
regions of particular interest can be extracted and queried at higher and higher resolution;
in this way the scientists can see both the wood and the trees as needed. This is made
possible by re-writing the data without increasing the amount of data stored.
6. Future Work
A key factor in obtaining optimum performance from ADAPTER is to customize the number
of resolution levels, the sampling rate at each resolution level, and the maximum number
of points to store in each leaf node file. The optimum settings are highly dependent on
the nature of the data being indexed, e.g. is it highly clustered? In the future it would be
highly desirable to have an automatic optimization tool to calculate the optimum settings
when building the index.
One of our requirements was to be able to trace particles through the timesteps; how-
ever, due to time constraints we were not able to implement any form of temporal index-
ing. To achieve this, the particle IDs must also be indexed. However, we cannot apply
the multi-resolution indexing scheme described previously to the IDs. Instead, one large
index of all the particle IDs with their associated leaf node positions in the current
multi-resolution indices must be built. Since this index is likely to be very large, more
conventional out-of-core or database style indexing techniques are required; a B-tree or
hash-table would be an excellent choice.
Further work is also required on optimising the index building method. Currently
the implementation only allows one core to index and distribute the data. However, for
scalability a parallel solution is required.
Acknowledgments
This work was funded by the EPSRC research grant EP F01094X. We would like to thank
Lydia Heck for setting up and maintaining the PC cluster used to obtain the results and
John Helly for providing and installing his Oct-tree HDF5 indexing scheme.
References
[1] E. P. Baltsavias. Airborne laser scanning: existing systems and firms and other
resources. ISPRS Journal of Photogrammetry & Remote Sensing, 54:164–198, 1999.
[2] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes.
pages 245–262, 2002.
[3] J. L. Bentley. Multidimensional binary search trees used for associative searching.
Commun. ACM, 18(9):509–517, 1975.
[4] J. Clyne. The multiresolution toolkit: Progressive access for regular gridded data.
In Proceedings of Visualization, Images, and Image Processing, 2003.
[5] R. A. Crain, T. Theuns, C. Dalla Vecchia, V. R. Eke, C. S. Frenk, A. Jenkins, S. T.
Kay, J. A. Peacock, F. R. Pearce, J. Schaye, V. Springel, P. A. Thomas, S. D. M.
White, and R. P. C. Wiersma. Galaxies-Intergalactic Medium Interaction Calculation
–I. Galaxy formation as a function of large-scale environment. ArXiv e-prints, June
2009.
[6] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best
matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226,
1977.
[7] L. Gosink, J. Shalf, K. Stockinger, K. Wu, and W. Bethel. HDF5-FastQuery: Accel-
erating complex queries on HDF datasets using fast bitmap indices. In SSDBM,
pages 149–158, 2006.
[8] V. Havran. Heuristic Ray Shooting Algorithms. Ph.D. thesis, Department of Com-
puter Science and Engineering, Faculty of Electrical Engineering, Czech Techni-
cal University in Prague, November 2000. http://www.cgg.cvut.cz/~havran/
phdthesis.html.
[9] M. Hopf and T. Ertl. Hierarchical splatting of scattered data. In VIS ’03: Proceed-
ings of the 14th IEEE Visualization 2003 (VIS’03), page 57, Washington, DC, USA,
2003. IEEE Computer Society.
[10] J. S. Hughes and Y. P. Li. The planetary data system data model. In Proceedings of
the Twelfth IEEE Symposium on Mass Storage Systems: Putting all that Data to
Work, pages 183–189, April 1993.
[11] M. Levoy and T. Whitted. The use of points as a display primitive, 1985.
[12] S. K. McMahon. Overview of the planetary data system. Planetary and Space Sci-
ence, 44(1):3 – 12, 1996. http://www.sciencedirect.com/science/article/
B6V6T-3WBXRSJ-G/2/66c70a8e69b7977e0143ea49975dd595. Planetary data
system.
[13] D. S. Moore and G. P. McCabe. Introduction to the Practice of Statistics, chapter
Introduction to Inference, pages 416–429. W. H. Freeman and Company, 2003.
[14] B. Nam and A. Sussman. Improving access to multidimensional self-describing
scientific dataset. In The 3rd IEEE/ACM International Symposium on Cluster Com-
puting and the Grid (CCGrid 2003), 2003.
[15] V. Pascucci and R. J. Frank. Hierarchical and Geometrical Methods in Scien-
tific Visualization, chapter Hierarchical Indexing for Out-of-Core Access to Multi-
Resolution Data, pages 225–241. 2002.
[16] R. Rew, G. Davis, S. Emmerson, H. Davies, and E. Hartnett. NetCDF user’s guide:
Data model, programming interfaces, and format for self-describing, portable data,
version 4.0. http://www.unidata.ucar.edu/software/netcdf/docs/, 2008.
[17] S. Rusinkiewicz and M. Levoy. Qsplat: A multiresolution point rendering system
for large meshes, 2000.
[18] H. Samet. Foundations of Multidimensional and Metric Data Structures, chapter
Multidimensional Point Data, pages 129–130. 2006.
[19] V. Springel. The cosmological simulation code GADGET-2. Mon. Not. Roy. Astron.
Soc., 364:1105, 2005. http://www.citebase.org/abstract?id=oai:arXiv.org:astro-ph/0505010.
[20] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk, N. Yoshida, L. Gao, J. Navarro,
R. Thacker, D. Croton, J. Helly, J. A. Peacock, S. Cole, P. Thomas, H. Couchman,
A. Evrard, J. Colberg, and F. Pearce. Simulations of the formation, evolution and
clustering of galaxies and quasars. Nature, 435:629–636, June 2005.
[21] V. Springel, N. Yoshida, and S. D. M. White. GADGET: A code for collisionless
and gasdynamical cosmological simulations. New Astron., 6:79, 2001.
http://www.citebase.org/abstract?id=oai:arXiv.org:astro-ph/0003162.
[22] T. Szalay, V. Springel, and G. Lemson. Gpu-based interactive visualization of billion
point cosmological simulations. CoRR, abs/0811.2055, 2008.
[23] The HDF Group. Hierarchical Data Format. http://www.hdfgroup.org/.
[24] D. C. Wells, E. W. Greisen, and R. H. Harten. FITS: A flexible image transport system. Astronomy and
Astrophysics Supplement Series, 44:363–370, 1981.
[25] K. Y. Yeung and W. L. Ruzzo. An empirical study on
principal component analysis for clustering gene expression data. Bioinformatics,
17:763–774, 2001.
15

ADAPTER

  • 1.
    ECS TECHNICAL REPORT2012/01 http://www.dur.ac.uk/resources/ecs/research/technical reports/2012 01.pdf ADAPTER: a scalable multi-resolution particle data format∗ Djamel Hassaine† Nicolas S. Holliman† Adrian Jenkins‡ Tom Theuns‡ Abstract: We describe and test an adaptive multi-resolution data format (ADAPTER), designed to analyse in real time multi Tera-byte particle data sets on a high-end graphic desk-top computer attached to a file server. The data are distributed over a set of files, each of which containing a fraction of the data at higher sampling rate, using an algorithm based on one or more keys (for example spatial location) defined by the user. The hierarchy of files based on a k-d tree allows for very rapid access to either a large fraction of the data set at low sampling rate, or a small fraction at full resolution, without increasing the total amount of data stored. This enables data exploration as well as more in-depth analysis of a smaller fraction of the full data. ADAPTER consists of a data format and associated programming interface imple- mented in a client-server model designed to be scalable to very large data sets. Keywords: scalable indexing, multi-resolution data format, point data AMS subject classifications: 68P05, 68P10, 68P15, 68P20, 68W10, 68W15 1. Introduction 1.1. Particle data sets and file formats. Particles (or ‘point-cloud data’) are the basic data unit found in a wide range of applications and research fields. For example, particles are used to represent the mass in the Universe in cosmological numerical simulations [5, e.g.] and the topology measured by airborne laser scanning [1, e.g.]. The most general application involves removing rendering redundancies in highly detailed 3D models by using points instead of polygonal primitives such as triangles; this technique is known as Point-based rendering (PBR) [11, 17] and is of great benefit when the projected primitives are smaller than the pixels of the display screen. Particle data sets are rapidly growing in size due to increasingly powerful computers used in performing simulations, and scanning devices. For example, state-of-the-art supercomputer simulations are producing Terabytes and even Petabytes of data. The sheer size of the snapshot files hampers efficient exploita- tion of the full richness represented by the data. This is exacerbated by the fact that the super computers used to generate the simulations are generally far more powerful than computers available for the data analysis process. ∗This work was supported by the EPSRC research grant EP F01094X. †Institute of Advanced Research Computing, School of Engineering and Computing Sciences, Durham University, Science Laboratories, South Road, Durham DH1 3LE ‡Institute of Computational Cosmology, Department of Physics, Durham University, Science Laboratories, South Road, Durham DH1 3LE
  • 2.
    A number ofself-describing data file formats have been developed to store, process and navigate through such large volumes of data efficiently. Popular file formats are: Hierarchical Data Format (latest version is HDF5 [23]), Network Common Data Format (NetCDF) [16], Planetary Data System (PDS) [10, 12], Flexible Image Transport System (FITS) [24], and VAPOR and the associated Multi-resolution Toolkit (MTK) [4]. These file formats contain metadata, which store descriptive details of all the data sets and multi- dimensional arrays within the file, e.g. type of data and dimensions used to store it, file offset, array size, etc. The metadata allows a corresponding runtime library to identify and directly access the different arrays or even sub-arrays within the file which greatly improves I/O performance. An alternative method of improving access and navigation performance is by using adaptive spatial indexing techniques and multi-resolution access. Unfortunately, neither of these functions are widely supported or even available with current file formats. This paper explores the potential benefits of an adaptive multi-resolution data file format called ADAPTER. We illustrate this new format on a particle data set of a large cosmological simulation (GIMIC, [5]). The simulation uses ∼ 200 M particles to follow the formation of galaxies in a cosmological setting. Each particle is endowed with many properties (32 in the case of GIMIC, e.g. density, temperature, star formation rate) in addition to its 3 dimensional (3D) position and velocity. As the simulation code marches the system forward in time it outputs all the particle properties at specified time in a ‘snapshot’ file of approximately 35 Gbytes. ADAPTER reformats this snapshot file as described below. 1.2. Motivation. Extracting even small amounts of data from massive data sets is often computationally expensive, and interacting with the data visually is almost impossible without access to a supercomputer. In addition rendering billions of particles on a typical display with only about one million pixels is obviously redundant [22]. When exploring data with a large dynamic range it is therefore more efficient to first analyse and visualise the data set at lower resolution, identify regions of interest, then ‘zoom’into those regions and exploit the data set at its full resolution. This ability of finding and loading only the necessary data requires an adaptive spatial index with multiple resolutions or scales of the data. 1.2.1. Requirements. We have identified from our own science researchers and litera- ture survey a need for a multi-resolution particle based data format with the following requirements. • R1: Spatial indexing generation using hierarchical techniques to exploit locality in unstructured particle/point data. • R2: Support efficient search for dense regions of the data at multiple resolutions. • R3: Set-theoretic operations (intersection, union, difference) to extract and merge sub-sets of the data, including the ability to extract a sub-set at any resolution up to the highest resolution in the data. • R4: Preservation of key physical properties of the data at multiple scales. • R5: Particle-ID index that uses IDs consistent across timesteps allowing efficient tracking of the path of specific particles through time. • R6: Inherently scalable and parallel in operations so that data can be processed on High-End Computing (HEC) and High Performance Computing (HPC) scale computers. 2
  • 3.
    • R7: Theoriginal data must be stored without any data replication since the storage costs are already prohibitive. 2. Related work Out of the file formats mention in the background section only VAPOR supports multi- resolution access to the data; this is achieved by using a lossless wavelet encoding of the data. However, VAPOR is limited to point data organised in regular grids only. While re- sampling irregular point distributions to a grid is possible it is not always desirable since smoothing artifacts can occur and disk storage can be an issue if an original copy of the data must also be stored. Regardless of the inherent problems current file formats suffer from, they are still widely used (HDF5 probably being the most popular). As a result a number of solutions have been proposed to extend the functionality for the HDF file format without having to modify the existing HDF data sets. For example, Nam and Sussman [14] implemented a generic indexing library based on R*-trees which basically stores the minimum and maximum values for each dimension of the data chunks. The index is built separately from the data set so the internal structures of the file formats are left unmodified. Significant performance gains were shown over using the standard HDF5 library; however, their method does not address the need for multi-resolution access to the data. Gosnik et al. [7] also designed an indexing library, dubbed HDF5-FastQuery, to be used in conjunction with the HDF file format. They use bitmap indexing, which can provide significant performance gains for compound multi-dimensional queries, e.g. find all particles with (temperature > 1000) AND (70 < pressure < 90). A bitmap index stores a separate bitmap for each possible value or range of values (bin) for every attribute or variable. Each bitmap indicates with a zero or one whether each record in the data set has the bitmap value or range of values. The spatial requirement for the index is high and can be as large as n2, where n is the number of records, if every record has a unique value. The use of compression techniques such as that used by FastBit, can bound the index size in the worst case by 4n words. Bitmap indexing is effectively a fixed-grid method and is unlikely to work well for non-uniform distributed data. Pascucci and Frank [15] consider the problem of hierarchical indexing for out-of-core access to multi-resolution data. They use an index scheme based on the Z-order curve, a space filling curve, which has the useful property of following points in the same order at different scales. The use of the Z-order curve is effectively the same as using an Oct-tree but without any pointers. There is no data replication involved in extracting data at the different resolutions which is highly desirable. However, the data at each resolution is sub-sampled in a regular manner which does not reflect random sampling and may not be appropriate for certain scientific applications. Also, unfortunately the implementation is limited to regular grid data only. Spatially indexing data with a space-filling curve method could potentially be ex- tended to non-uniformly distributed data by building a spatial index, e.g. an Oct-tree, on top of the space-filling curve. The spatial index can then be used to group sparsely populated cells into larger cells. One problem with with this approach is that with highly clustered data, some cells from the original space-filling curve may contain more particles than are desirable. 
In this case, the order of the curve must be increased in the simulation run which increases the key size. It is not hard to imagine in some simulation instances where the data is so densely clustered that increasing the key range is not an appropriate solution. 3
  • 4.
    An area wherea number of groups have designed multi-resolution structures for un- structured point-cloud and particle data sets is in visualization and rendering algorithms. Hopf and Ertl [9] describe a method of re-sampling the data set using principle compo- nents analysis (PCA) and indexing the clusters at different granularities. Data is com- pressed in a lossy manner which would mean many scientists would also want to preserve the original data set, thus dramatically increasing storage costs. Furthermore, Yeung and Ruzzo [25] empirically compared the quality of clusters obtained from the original data set and PCA and showed that this method captured the cluster structure poorly. It was found that although the first few principal components contained most of the variance in the data, they did not necessarily capture most of the cluster structure. 3. Design The overall design of ADAPTER follows a server/client model with the server storing the data in a specialized k-d-tree data-structure and managing the clients’ requests for data. The server/client model was chosen because the sheer scale of the data sets limits the storage of them to powerful and expensive HEC or possibly smaller cluster-based systems. Scientists using their desktop computers can remotely access small quantaties of the data by sending queries to the server. Furthermore, trivial operations, e.g. initial exploratory analysis, can be offloaded to the client; this frees up more CPU cycles and I/O time on the server and data storage system thus allowing more clients to efficiently access the data set. Computationally demanding operations, e.g. performing analysis of the data at full resolution, can be performed on the server system itself before the final results are sent back to the client. 3.1. Assumptions on the data. We have made some general assumptions on the struc- ture of particle data sets. The structure of a single particle from a typical particle data set is illustrated in Figure 3.1: at the very minimum each particle has a unique ID. The full data set is composed of multiple time-steps (‘snapshots’), with each snapshot potentially containing multiple types of particles. Figure 1: Structure of a particle from a typical particle data set. 3.2. Indexing strategy. Most database management systems allow efficient updates to the data as well as provide searching mechanisms for data sets which are too large to reside in main memory, therefore, the underlying data structures are dynamic. The B- tree is a good example, which allows searches, insertions, and deletions in logarithmic amortized time [2]. However, the problem with dynamic spatial indexes is that they suffer from poor storage utilisation because data is inserted into the structure at the leaf nodes, and when a page or bin becomes full partitions must be propagated upwards. Static data 4
  • 5.
    structures, on theother hand, can partition the data in a top-down manner using knowledge of the entire data set, greatly optimising storage utilisation. Typically data produced from a computer simulation, e.g. the Millennium Simulation [20] using GADGET [21, 19], will never be modified after completion; therefore, we will take advantage of the greater storage utilisation static data structures offer. The k-d-tree [3] is a versatile multidimensional data structure that has better overall performance properties over many other types of data structures; for example, Harvan [8] studied a large number of different rayshooting acceleration schemes, including BSP tree and Oct-tree based schemes, and found that on average the k-d-tree based scheme per- formed best. The k-d-tree is essentially a binary tree with every node representing a hyper-plane that divides the underlying space into two subspaces. At each level of the tree only one attribute is chosen as a discriminator (the variable to sort and divide the points by), see Fig. 3.2. The number of intersections is determined by how well the sub- volumes of the tree enclose the objects or group of points; the k-d-tree is generally more efficient at partitioning the data than methods such as the Oct-tree and uniform space subdivision, which is the most likely reason for its superior performance. Figure 2: An example of a static bucket k-d-tree variant built from a list of two dimen- sional points using a bucket size of two. The step first involves sorting the particles in ascending order according to their x coordinate value and forming a hyper plane at the median value to the divide the data into two volumes; this step is repeated for the left and right sub-volumes but the points are sorted along the y-axes. We propose to use a static bucket variant of the k-d-tree [18] for indexing with each leaf node corresponding to a disk bucket or block of capacity b. The elements in a bucket are only split when its cardinality exceeds b. The leaf nodes actually only point to the location of where the data points are stored on the disk, i.e. the buckets are stored as files and the leaf nodes store the file name. In order to explain our indexing strategy in more detail we use the example of indexing spatial coordinates in this section and for the rest of this manuscript (the same indexing scheme can be applied to any of the other attributes in the data set). Indexing the spatial coordinates involves cycling through the x, y, z axes in a predefined and constant order, sorting the points and setting the hyperplane at the median point. Figure 3.2 illustrates an example of building a static bucket k-d-tree, with a bucket size of b = 2, on the following list of 2D coordinates {(35,42) (52,10) (62,77) (82,65) (5,45) (27,35) (85,15) (90,5)}. 5
3.2.1. Multi-resolution indexing. While the static indexing strategy described above has excellent storage utilisation on disk, since every leaf node will be approximately the same size and nearly full, all the data resides at the bottom of the tree. Multi-resolution access could be achieved by randomly shuffling the points in each leaf node before storing them to disk: if, for example, we wanted a region of interest at 10 per cent resolution, a number of intersecting leaf nodes would have to be identified and then only the first 10 per cent of each leaf node's data would need to be read in and clipped. However, for extremely large data sets the index would be huge, and a lot of CPU cycles would be required to traverse down the tree even to extract very small amounts of data; therefore, we do not believe this method would scale very well.

We propose a method of dividing the data set up into smaller and more manageable pieces in such a way that indexing these smaller data sets will allow efficient multi-resolution access to the original data. We take advantage of the fact that the particle nature of the data lends itself to creating low resolution versions simply by random sampling. This approach will, in an unbiased way, conserve mass, momentum and other particle attributes, provided these are appropriately weighted by the selection probability [13]. Random sampling will of course introduce a Poisson sampling error into estimates of any given quantity; some introduction of error is inevitable in a process which degrades the resolution. One advantage of Poisson sampling is that the induced errors can often be estimated trivially from the low resolution data itself.

Figure 3: Our multi-resolution indexing approach involves multiple stages of sub-sampling and indexing.

To achieve scalable multi-resolution indexing we repeat a process of random sub-sampling, division and indexing of the data set. The first stage involves generating a coarsely sampled data set from the original data and indexing it with the static bucket k-d-tree. Each leaf node in this index also represents a coarsely sampled region or volume (assuming spatial indexing) of the entire data set. Provided the sampling rate is low enough, this first level index and data set can be uploaded very quickly to the client-side desktop and navigated/rendered with relative ease. The remaining non-indexed data is then divided up into the regions represented by the leaf nodes and sub-sampled again.
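As a concrete illustration of the sub-sampling step used at each stage, the following hedged C sketch selects each remaining particle with probability p for the current resolution level and leaves the rest for the next, finer, stage; the function name, the use of particle indices and of rand() are our own assumptions rather than ADAPTER code:

#include <stdlib.h>

/* Partition idx[0..n-1] in place: on return the first *n_selected entries are
   the particles sampled for the current resolution level, and the remaining
   entries are the particles still to be indexed at later stages. */
static void subsample_stage(long *idx, long n, double p, long *n_selected)
{
    long kept = 0;
    for (long i = 0; i < n; i++) {
        if ((double)rand() / RAND_MAX < p) {   /* Bernoulli trial with probability p */
            long tmp  = idx[kept];
            idx[kept] = idx[i];
            idx[i]    = tmp;
            kept++;
        }
    }
    *n_selected = kept;
}

Because each particle is selected independently, the relative Poisson error on a summed quantity, such as the mass enclosed in a region containing n sampled particles, scales roughly as 1/sqrt(n), so the error can be estimated from the sampled data itself.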
The sub-sampled data is indexed separately, so that a number of different indices will have been built. The leaf nodes from the first index are then linked to the appropriate smaller region indices. These steps are repeated until no further data is left to be indexed, which depends on the sampling rate at each stage. Larger data sets will likely need more levels of sub-sampling and indexing so that there is a sufficient range of multi-resolution access. Figure 3.2.1 illustrates the concept of multi-resolution indexing on the spatial properties for a single time-step snapshot file. A further advantage of this approach of dividing the data up into smaller pieces is that, in cases where the index is too large to fit in main memory, many smaller indices can be more conveniently distributed or accessed out-of-core.

3.3. Minimizing the index storage requirements. One of the objectives of ADAPTER is that it should be scalable up to very large data sets. Therefore, we cannot assume the server system will be powerful enough to store the entire data in memory. Many database management systems assume that even the index will be too large to fit into memory and so use out-of-core indexing structures such as the B-tree. However, out-of-core access to the index would severely impede query performance and is probably too severe a constraint for our requirements. We will assume that the point data is accessed out-of-core, but that the index can be stored in main memory; in order to achieve this goal the index size will be minimized with the following techniques:
• Use of implicit arrays so that node pointers can be eliminated (see section 3.3.3).
• Storing the internal node discriminator and leaf node data as a union; a union in C/C++ is a data structure that allows several types of data to be stored in the same location. The discriminator values of the internal nodes are accessed as floats, whereas the leaf node data are accessed as integers.
• The leaf node data consists of three pieces of information which are bit packed together into a single integer to further save space (see section 3.3.1).

3.3.1. Bit packing the leaf node data. The leaf node requires three pieces of information: the leaf node ID (which is also used to derive the file name where the data points in that region are stored), a boolean flag determining whether the data is stored locally or externally (more on this in section 3.4), and a boolean flag determining whether the leaf node points to another index, i.e. the next level of detail. The boolean flags can be stored using just one bit each, leaving either 30 or 62 bits to store the leaf node ID, depending on the system architecture (32 bit or 64 bit machine); this allows approximately 1 x 10^9 or 4.6 x 10^18 unique IDs and leaf nodes. The entire data set can be divided up into many smaller data sets and then indexed separately, i.e. two different indices can use the same IDs for their leaf nodes; therefore, the number of available unique IDs should be more than ample for any sized data set.
The following snippet of C code shows the function responsible for extracting the information from the packed leaf node data (assuming a 32 bit architecture):

void implicit_kdtree_unpack_address(unsigned int packed_address,
                                    unsigned int *leaf_node_id,
                                    unsigned int *local,
                                    unsigned int *next_index)
{
    /* bit 0: does this leaf point to a next level of detail index? */
    unsigned int next_index_flag = packed_address;
    next_index_flag &= 0x00000001;

    /* bit 1: is the leaf node data stored locally or on another processor? */
    unsigned int local_flag = packed_address;
    local_flag &= 0x00000002;
    local_flag = local_flag >> 1;

    /* upper bits: the leaf node ID, used to derive the point data file name */
    unsigned int leaf = packed_address;
    leaf &= 0xffff0000;
    leaf = leaf >> 16;

    *leaf_node_id = leaf;
    *next_index = next_index_flag;
    *local = local_flag;
}
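For completeness, a packing counterpart consistent with the bit layout unpacked above might look as follows; this helper and its name are our own illustrative sketch rather than part of the ADAPTER sources:

unsigned int implicit_kdtree_pack_address(unsigned int leaf_node_id,
                                          unsigned int local,
                                          unsigned int next_index)
{
    /* Mirror of the layout unpacked above: bits 16-31 hold the leaf node ID,
       bit 1 the local/external flag, bit 0 the next-index flag. */
    unsigned int packed_address = 0;
    packed_address |= (leaf_node_id & 0x0000ffff) << 16;
    packed_address |= (local & 0x1) << 1;
    packed_address |= (next_index & 0x1);
    return packed_address;
}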
3.3.2. Deriving the point data file name. Each leaf node, defining a region of the data set, points to a file containing the corresponding particle data. The file names are derived from a simple naming scheme: each file name is composed of the index ID appended with the leaf node ID. Each index has a unique ID which also follows a simple naming scheme: the index ID for the first level of detail index is defined as "0", and the IDs for the subsequent indices are derived by appending the IDs of the parent index and leaf node (each leaf node points to the next level of detail index). Figure 3.3.2 illustrates an example of the naming process.

Figure 4: Naming convention scheme.

Unfortunately, most modern file systems, e.g. NTFS, ext3 and HFS Plus, limit the length of file names to a maximum of 255 characters. Technically this limits the scalability of our solution; however, this still leaves us with potentially up to 1 x 10^253 - 1 unique leaf node file names for just the first level of detail index, and in total many more leaf nodes and levels of detail; therefore, in practice this is unlikely to be problematic. Furthermore, if needed, the range of possible leaf nodes and indices could be extended significantly by including all the available characters (of which there are more than 65,000).

3.3.3. Process of storing the index in an implicit array. Implicit arrays eliminate the need for using two pointers per node in the tree (pointers from the parent to the left and right child nodes); furthermore, the index tree can be saved and loaded from disk far more
efficiently, as there is no need to rebuild the entire path of pointers. Building an implicit index is simple: if a node is stored at index k in the array, then its left child is stored at 2k+1 and its right child is stored at 2k+2 (see figure 3.3.3). A caveat to storing a binary tree or similar index in an implicit array is that the tree must be complete, i.e. every level of the tree, except possibly the deepest level, must be filled completely, and on the deepest level all the nodes must be as far left as possible. This presents a problem, as our partitioning strategy at the median position can result in a tree with a maximum height difference between any two leaf nodes of up to one, and therefore the tree may not be complete (see [6] for more details).

Figure 5: Storing the index tree using an implicit array requires a balanced tree. In an implicit array, if a node is stored at index k, then its left child will be stored at 2k+1 and its right child will be stored at 2k+2.

To satisfy the constraint that the tree be complete, we can quite easily convert the index tree into a perfect binary tree, i.e. a tree in which every node except the leaf nodes has two children and all the leaf nodes are at the same depth or level, which is also complete. Our solution involves duplicating certain nodes and adding discriminator values that do not affect the bounding boxes of the original leaf nodes; thus any searching algorithm or query will give the same results regardless of whether it is performed on the original or the converted tree. The algorithm works as follows: when a leaf node is found with a depth difference from the bottom-most level of the tree, it is converted into an internal node and its original leaf node ID is assigned to two identical child nodes; the discriminator value of this newly converted internal node is calculated by determining the upper right corner of its bounding box and taking the coordinate value along the current discriminator axis. Figure 3.3.3 illustrates the concept of this conversion algorithm more clearly.

What is the storage cost advantage of using an implicit array over pointers? In the worst case scenario all the leaf nodes are on the same level except for two which are one level deeper. Converting this tree into a perfect tree requires the addition of 2^d - 2 nodes, where d is the maximum depth of the tree; ultimately an array of 32 or 64 bit integers/floats with a capacity of 2^(d+1) - 1 will be required to store the converted tree. However, the original tree stored using pointers requires 2^d + 1 nodes, a boolean flag for each node to determine whether it is an internal or leaf node, and 2^d pointers, resulting in a total of 2^(d+1) + 2^d + 2 variables (32 or 64 bit). Therefore, even in the worst case scenario, the implicit array will still save on storage cost. Furthermore, we will also benefit from more efficient access to the implicit array as opposed to traversing the tree through its pointers.
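The implicit addressing and the float/integer union described above might be realised along the following lines. The kd_node union and the find_leaf routine are an illustrative sketch, assuming a tree already converted to a perfect tree of known depth, and are not taken from the ADAPTER sources:

typedef union {
    float        discriminator;    /* internal node: split value on the current axis */
    unsigned int packed_address;   /* leaf node: bit-packed data from section 3.3.1  */
} kd_node;

/* Descend the implicit array from the root (index 0) to the leaf whose region
   contains 'point', cycling through the x, y, z axes at successive levels. */
unsigned int find_leaf(const kd_node *tree, int depth, const float point[3])
{
    unsigned int k = 0;
    for (int level = 0; level < depth; level++) {
        int axis = level % 3;
        if (point[axis] <= tree[k].discriminator)
            k = 2 * k + 1;   /* left child  */
        else
            k = 2 * k + 2;   /* right child */
    }
    return tree[k].packed_address;
}

The returned value would then be unpacked with the routine shown in section 3.3.1 to obtain the leaf node ID and flags.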
Figure 6: Completing the tree.

3.4. Parallel search process. In order to achieve scalability, we have designed and implemented ADAPTER to work on a distributed computer system, e.g. a cluster of PCs. Distributed computing is generally considered the most scalable and cost effective solution, as processing nodes can simply be added to the network as required. The message passing model is generally used for programming distributed systems, and the Message-Passing Interface (MPI) is the most widely used library API for message passing and related operations. The message-passing model assumes each processor has its own independent memory, but can communicate with other processors by sending and receiving messages; data transfer from the local memory of one processor to another processor's memory requires explicit communication operations to be performed by both processors.

The index was designed to be as small as possible in order to fit into the main memory of each individual PC. Therefore, our scalable solution involves storing an entire copy of the index on every processor/node, but randomly distributing the actual point data associated with the leaf nodes amongst the processors; the indices on each processor must also be updated to record whether each leaf node's point data is stored locally or on another processor's disk. During a query the indices are traversed by each CPU in parallel without requiring any communication, and the extracted data from each processor is accumulated or collected by the master processor. With this method the processors can perform the query almost independently; the only communication required between the processors is at the end, when the data must be collected. There is no need for complicated load balancing algorithms, and there are no parent-child dependencies or other common issues with parallel programming.

4. Performance evaluation

For the purposes of a simple evaluation we are interested in the following three types of queries, which are quite common in data analysis:
• Extract a low resolution sub-sample of the entire volume of data; this is very useful
to quickly visualize the data.
• Extract a large region of interest from the volume at varying resolutions.
• Extract a small region from the volume at varying resolutions.

Currently the most common approach to carrying out these queries is a simple brute force approach, i.e. using standard HDF access routines to read in all the data and prune or clip the unwanted points. An alternative, in-house built solution at the Institute for Computational Cosmology (ICC) at Durham University reorganises the points in HDF data files using an Oct-tree data-structure. That implementation currently only allows serial execution, but should still provide a more challenging competing algorithm than the brute force approach. We will record and compare the time taken to complete the queries described above with ADAPTER and the two alternative methods mentioned, on an example data set from a snapshot of GIMIC, which consists of 180 million particles distributed across 512 HDF5 file parts. These original individual files each contain all particles in several cubic blocks, which are ordered along a space-filling curve that traverses the entire computational volume [21, 19]. This spatial division is useful during the actual simulation itself, as it preserves data locality and minimises inter-processor communication. Individual files then simply contain all particles acted upon by a given processor core during the simulation. In simulations with a large dynamic range this division results in a potentially large load imbalance: to avoid this, each individual processor gets more than one contiguous section of the space-filling curve. The net result is that each individual file that makes up the snapshot contains particles that are distributed across the computational volume.

We applied the ADAPTER algorithm with three levels of detail/resolution and the bucket/bin size of the leaf nodes set to a maximum of b = 10^5 particles. The sampling rate was calculated such that the number of sub-sampled particles increases approximately exponentially with each consecutive resolution step. The resulting index was composed of 131 sub-sampled indices, taking up 24.2 kB of disk space, and 2178 leaf node files containing the actual point data, taking up 2.6 GB of disk space. This compares favourably with the Oct-tree approach, which required 2 MB of storage for just the index; although both indices are small in absolute terms, indexing extremely large data sets will require a compact indexing data structure, otherwise the index may become too large to fit in main memory.

Table 1 summarises the recorded wall time for the different queries and data extraction methods. We only show the timings of the brute force method, using HDF5 access routines, for one query (extracting 9M points from a large region of interest) because, regardless of the type of query, all the data has to be read in and clipped; therefore, all the timings will be similar to, or even worse than, those shown. Both the Oct-tree approach and ADAPTER are vastly superior to the brute force approach, being at least 10 times faster with a single core.

The Oct-tree approach performs slightly better than ADAPTER when it comes to extracting large regions at full resolution. This is because the HDF5 access routines are very efficient at extracting large amounts of contiguous data, i.e. from 512 files, whereas the ADAPTER index in this case may have to access up to 2178 files, which would require more random disk seeking time.
However, the results show that ADAPTER has a modest advantage when retrieving smaller regions of the volume at full resolution (about a 25 per cent performance improvement). This performance increase occurs because less I/O and CPU time is required for pruning the points, as the data is distributed amongst more files and the index allows the desired points to be targeted more accurately.
                                                          Time (sec)
                                                 Brute force (HDF5 routines)        ADAPTER         Oct-tree
Type of query                                        1 core      5 cores       1 core   5 cores      1 core
large region at full resolution (9M points)           18.15        5.17          1.86     0.48        1.50
small region at full resolution (500K points)             -           -          0.16     0.05        0.20
large region at medium resolution (430K points)           -           -          0.12     0.03        0.23
entire region at low resolution (80K points)              -           -          0.01     0.01        0.12
small region at medium resolution (23K points)            -           -          0.01     0.01        0.05
large region at low resolution (4K points)                -           -          0.01     0.00        0.05

Table 1: Comparison of performance results for ADAPTER, an Oct-tree approach and a brute-force method using standard HDF5 access routines, for various types of queries. Brute-force timings are shown only for the first query (see text).
The results show that the main advantage ADAPTER offers is for queries extracting sub-samples of the data: performance increases of up to 500 per cent over the Oct-tree method were observed, which demonstrates that the concept of dividing the data and the indexing structure works well. The results also suggest that ADAPTER is scalable: performance scales almost linearly as the number of cores used increases from one to five. The only situation where performance did not increase with additional cores was the extraction of a small sub-sampled region, because the data was stored in very few leaf nodes and therefore most of the cores would have been idle. Our solution of randomly distributing the point data amongst the cores while replicating the index is simple but effective.

An advantage of storing the points in many small files, not immediately obvious from the results, is the ability to compress the entire data set once and then uncompress only smaller selected regions of the data. HDF recognises this advantage by providing data chunking; ADAPTER effectively provides a multi-resolution data chunking solution. Enabling compression (the zlib library with default settings) and rebuilding the index on the example data leads to 2 GB of storage requirements, a 23 per cent improvement. Of course, enabling compression will degrade performance slightly.

5. Conclusions

We have demonstrated with ADAPTER how very large (HPC scale) particle data sets can be stored in such a way that users can focus on the aspects directly relevant to the information they are trying to extract. Access and retrieval speeds are high enough that an interactive analysis of large data sets becomes possible, even on a desktop computer: access performance was at least ten times better than direct HDF5 access routines and up to five times better than an Oct-tree approach.

A key component of ADAPTER is a new data format based on a 'multi-resolution parallel k-d-tree'. The index is compact (storage costs are relatively small) and very efficient at extracting sub-samples of the data. Performance scales well over multiple cores via a simple method of replicating the index while randomly distributing the point data. This allows a low resolution view of the data to be previewed quickly; subsequently, those regions of particular interest can be extracted and queried at higher and higher resolution. In this way scientists can see both the wood and the trees as needed. This is made possible by re-writing the data without increasing the amount of data stored.

6. Future Work

A key factor in obtaining optimum performance from ADAPTER is to customize the number of resolution levels, the sampling rate at each resolution level, and the maximum number of points to store in each leaf node file. The optimum settings are highly dependent on the nature of the data being indexed, e.g. whether it is highly clustered. In the future it would be highly desirable to have an automatic optimization tool to calculate the optimum settings when building the index.

One of our requirements was to be able to trace particles through the time-steps; however, due to time constraints we were not able to implement any form of temporal indexing. In order to achieve this, the particle's ID must also be indexed. However, we cannot apply the multi-resolution indexing scheme described previously to the IDs. Instead, one large index of all the particle IDs, with their associated leaf node positions in the current multi-resolution indexes, must be built.
Since the index is likely to be very large, more
conventional out-of-core or database style indexing techniques are required; a B-tree or hash table would be an excellent choice.

Further work is also required on optimising the index building method. Currently the implementation only allows one core to index and distribute the data; for scalability, however, a parallel solution is required.

Acknowledgments

This work was funded by the EPSRC research grant EP F01094X. We would like to thank Lydia Heck for setting up and maintaining the PC cluster used to obtain the results and John Helly for providing and installing his Oct-tree HDF5 indexing scheme.

References

[1] E. P. Baltsavias. Airborne laser scanning: existing systems and firms and other resources. ISPRS Journal of Photogrammetry & Remote Sensing, 54:164–198, 1999.
[2] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. pages 245–262, 2002.
[3] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, 1975.
[4] J. Clyne. The multiresolution toolkit: Progressive access for regular gridded data. In Proceedings of Visualization, Images, and Image Processing, 2003.
[5] R. A. Crain, T. Theuns, C. Dalla Vecchia, V. R. Eke, C. S. Frenk, A. Jenkins, S. T. Kay, J. A. Peacock, F. R. Pearce, J. Schaye, V. Springel, P. A. Thomas, S. D. M. White, and R. P. C. Wiersma. Galaxies-Intergalactic Medium Interaction Calculation – I. Galaxy formation as a function of large-scale environment. ArXiv e-prints, June 2009.
[6] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226, 1977.
[7] L. Gosink, J. Shalf, K. Stockinger, K. Wu, and W. Bethel. HDF5-FastQuery: Accelerating complex queries on HDF datasets using fast bitmap indices. In SSDBM, pages 149–158, 2006.
[8] V. Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, November 2000. http://www.cgg.cvut.cz/~havran/phdthesis.html.
[9] M. Hopf and T. Ertl. Hierarchical splatting of scattered data. In VIS '03: Proceedings of the 14th IEEE Visualization 2003 (VIS'03), page 57, Washington, DC, USA, 2003. IEEE Computer Society.
[10] J. S. Hughes and Y. P. Li. The planetary data system data model. In Putting all that Data to Work: Proceedings of the Twelfth IEEE Symposium on Mass Storage Systems, pages 183–189, April 1993.
[11] M. Levoy and T. Whitted. The use of points as a display primitive, 1985.
[12] S. K. McMahon. Overview of the planetary data system. Planetary and Space Science, 44(1):3–12, 1996. http://www.sciencedirect.com/science/article/B6V6T-3WBXRSJ-G/2/66c70a8e69b7977e0143ea49975dd595.
[13] D. S. Moore and G. P. McCabe. Introduction to the Practice of Statistics, chapter Introduction to Inference, pages 416–429. W. H. Freeman and Company, 2003.
[14] B. Nam and A. Sussman. Improving access to multidimensional self-describing scientific datasets. In The 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), 2003.
[15] V. Pascucci and R. J. Frank. Hierarchical and Geometrical Methods in Scientific Visualization, chapter Hierarchical Indexing for Out-of-Core Access to Multi-Resolution Data, pages 225–241. 2002.
[16] R. Rew, G. Davis, S. Emmerson, H. Davies, and E. Hartnett. NetCDF user's guide: Data model, programming interfaces, and format for self-describing, portable data, version 4.0. http://www.unidata.ucar.edu/software/netcdf/docs/, 2008.
[17] S. Rusinkiewicz and M. Levoy. QSplat: A multiresolution point rendering system for large meshes, 2000.
[18] H. Samet. Foundations of Multidimensional and Metric Data Structures, chapter Multidimensional Point Data, pages 129–130. 2006.
[19] V. Springel. The cosmological simulation code GADGET-2. Monthly Notices of the Royal Astronomical Society, 364:1105, 2005. http://www.citebase.org/abstract?id=oai:arXiv.org:astro-ph/0505010.
[20] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. A. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, and F. Pearce. Simulations of the formation, evolution and clustering of galaxies and quasars. Nature, 435:629–636, June 2005.
[21] V. Springel, N. Yoshida, and S. D. M. White. GADGET: A code for collisionless and gasdynamical cosmological simulations. New Astronomy, 6:79, 2001. http://www.citebase.org/abstract?id=oai:arXiv.org:astro-ph/0003162.
[22] T. Szalay, V. Springel, and G. Lemson. GPU-based interactive visualization of billion point cosmological simulations. CoRR, abs/0811.2055, 2008.
[23] The HDF Group. Hierarchical Data Format. http://www.hdfgroup.org/.
[24] D. C. Wells, E. W. Greisen, and R. H. Harten. FITS: A Flexible Image Transport System. Astronomy and Astrophysics Supplement Series, 44:363–370, 1981.
[25] K. Y. Yeung and W. L. Ruzzo. An empirical study on principal component analysis for clustering gene expression data. Bioinformatics, 17:763–774, 2001.