CSS-trees
Cache Conscious Indexing for Decision-Support in Main Memory

Jun Rao and Kenneth A. Ross*
Columbia University
junr@cs.columbia.edu, kar@cs.columbia.edu

* This research was supported by a David and Lucile Packard Foundation Fellowship in Science and Engineering, by an NSF Young Investigator Award, by NSF grant number IIS-98-12014, and by NSF CISE award CDA-9625374.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

Abstract

We study indexing techniques for main memory, including hash indexes, binary search trees, T-trees, B+-trees, interpolation search, and binary search on arrays. In a decision-support context, our primary concerns are the lookup time and the space occupied by the index structure.

Our goal is to provide faster lookup times than binary search by paying attention to reference locality and cache behavior, without using substantial extra space. We propose a new indexing technique called "Cache-Sensitive Search Trees" (CSS-trees). Our technique stores a directory structure on top of a sorted array. Nodes in this directory have size matching the cache-line size of the machine. We store the directory in an array and do not store internal-node pointers; child nodes can be found by performing arithmetic on array offsets.

We compare the algorithms based on their time and space requirements. We have implemented all of the techniques, and present a performance study on two popular modern machines. We demonstrate that with a small space overhead, we can reduce the cost of binary search on the array by more than a factor of two. We also show that our technique dominates B+-trees, T-trees, and binary search trees in terms of both space and time. A cache simulation verifies that the gap is due largely to cache misses.

1 Introduction

As random access memory gets cheaper, it becomes increasingly affordable to build computers with large main memories. The recent "Asilomar Report" ([BBC+98]) predicts "Within ten years, it will be common to have a terabyte of main memory serving as a buffer pool for a hundred-terabyte database. All but the largest database tables will be resident in main memory." But main memory data processing is not as simple as increasing the buffer pool size. An important issue is cache behavior. The traditional assumption that memory references have uniform cost is no longer valid given the current speed gap between cache access and main memory access. So, improving cache behavior is going to be an imperative task in main memory data processing. In this paper, we focus on how to make indexes cache conscious.

Figure 1: Processor-memory performance imbalance (CPU performance improving at roughly 60% per year versus DRAM performance at roughly 10% per year, 1980-2000).

Index structures are important even in main memory database systems. Although there are no disk accesses, indexes can be used to reduce overall computation time without using too much extra space. With a large amount of RAM, most of the indexes can be memory resident. Past work
Figure 2: Space/time tradeoffs. (a) Whole picture; (b) close-up. [Lookup time (s) versus space (bytes) for array binary search, T-trees, enhanced B+-trees, full and level CSS-trees, and hashing.]

on measuring the performance of indexing in main-memory databases includes [LC86b, WK90], with [LC86b] being the most comprehensive on the specific issue of indexing. In the thirteen years since [LC86b] was published, there have been substantial changes in the architecture of computer chips. The most relevant change is that CPU speeds have been increasing at a much faster rate (60% per year) than memory speeds (10% per year) as shown in Figure 1 (borrowed from [CLH98]). Thus, the relative cost of a cache miss has increased by two orders of magnitude since 1986. As a result, we cannot assume that the ranking of indexing algorithms given in [LC86b] would be valid on today's architectures. In fact, our experimental results indicate very different relative outcomes from [LC86b] for lookup speed.

Another recent development has been the explosion of interest in On-Line Analytical Processing (OLAP). [Fre95, Fre97] contrasts the requirements of OLAP with OLTP systems and contends that the real performance gains can be obtained by separating the two systems. A dedicated OLAP system can have a much better query performance if we are willing to sacrifice update performance. Commercial systems such as Sybase IQ [Syb97] were designed for such purposes. Thus, typical OLAP workloads are query-intensive, and have infrequent batch updates. For example, a data warehouse for a university, containing all the student records, is probably updated once per day. Census data sets are collected periodically and will remain static for a relatively long period of time. These systems are in the scale of several gigabytes and can already fit in RAM today. Given the current trend, we expect more and more disk-based applications will be moved to main memory in the future. Since updates are batched in those systems, incremental updates of indexes may not be very important. In fact, we may want to rebuild indexes from scratch after a batch of updates, if that leads to faster index searches. In this paper, we focus on such an OLAP environment.

Two important criteria for the selection of index structures are space and time. Space is critical in a main memory database; we may have a limited amount of space available for precomputed structures such as indexes. Given space constraints, we try to optimize the time taken by index lookups. In a main-memory database there are several factors influencing the speed of database operations. An important factor is the degree of locality in data references for a given algorithm. Good data locality leads to fewer (expensive) cache misses, and better performance.

We study a variety of existing techniques, including hash indexes, binary search on a sorted list of record identifiers, binary trees, B+-trees [Com79], T-trees [LC86a], and interpolation search. We also introduce a new technique called "Cache-Sensitive Search Trees" (CSS-trees). CSS-trees augment binary search by storing a directory structure on top of the sorted list of elements. CSS-trees differ from B+-trees by avoiding storing the child pointers in each node. The CSS-tree is organized in such a way that traversing the tree yields good data reference locality (unlike binary search), and hence relatively few cache misses. CSS-trees also take advantage of the OLAP context to optimize search performance, at the expense of update performance.

We summarize the space/time tradeoffs of various methods in Figure 2. Each point for T-trees, enhanced B+-trees and CSS-trees corresponds to a specific node size (a multiple of the cache line size). Normally a larger node size means less space but more search time. The stepped line shows the optimal search time for a given amount of space. Our conclusion is that CSS-trees dominate T-trees and enhanced B+-trees in both space and time. There are tradeoffs between space and time for binary search, CSS-trees and hash indices. CSS-trees reasonably balance space and time. We discuss this issue in more detail in Section 7.

2 Main Memory Databases

Main Memory Database Systems: Some past
work on main memory databases has addressed the problems of concurrency, transaction processing and logging [GMS86, LN88, JLRS94], and recovery [Hag86, LC87]. Systems with a significant query-processing component include OBE [WK90], MM-DBMS [LC87], and Starburst [LSC92]. More recently, the TimesTen corporation (formerly the Smallbase project at Hewlett-Packard) has developed a commercial main-memory system, with claims of tenfold speedups over disk-based systems [Sof97]. Most of the systems do not address the issue of cache reference locality.

Data Layout: [AHK85] and others describe the concept of a domain. When data is first loaded into main memory, distinct data values are stored in an external structure, the domain, and only pointers to domain values are stored in place in each column. This has the benefits of: (a) saving space in the presence of duplicates, (b) simplified handling of variable-length fields and (c) pointer comparisons can be used for equality tests. In the main memory database project at Columbia University, we focus on an OLAP main memory database system. We go further than [AHK85] by keeping the domain values in order and associating each value with a domain ID (represented by an integer). As a result, we can process both equality and inequality tests on domain IDs directly, rather than on the original values. Although keeping values in order has extra cost, we expect the data to be updated infrequently. Independently, Tandem Inc.'s InfoCharger storage engine [Eng98] has also chosen to keep domain values in sorted order. The use of domains means that many indexes can be built with smaller keys.

Indexing in Main Memory Databases: Although sequential data access is much cheaper in main memory than in disk-based systems, indexing remains very important in main memory databases. First of all, searching an index is still useful for answering single value selection queries and range queries. Next, cheaper random access makes indexed nested loop joins more affordable in main memory databases. Indexed nested loop join is pipelinable, requires minimal storage for intermediate results and is relatively easy to implement. As a matter of fact, indexed nested loop join is the only join method used in [WK90]. This approach requires a lot of searching through indexes on the inner relations. Last but not least, transforming domain values to domain IDs (as described in the previous section) requires searching on the domain.

A list of record identifiers sorted by some columns provides ordered access to the base relation. Ordered access is useful for range queries and for satisfying interesting orders [SAC+79]. A sorted array is an index structure itself since binary search can be used.

Assumptions: We assume an OLAP environment, so we don't care too much about updates. Our main concerns are the lookup time for an index, and the space required to store the index. There are many applications in this class, as described in Section 1.

3 Cache Optimization on Indexes

In this section, we first describe cache memories and the impact of cache optimization. We then give a survey of the related work. Finally, we analyze the cache behavior of various existing index structures for searching and point out their shortcomings.

Cache memories are small, fast static RAM memories that improve program performance by holding recently referenced data [Smi82]. Memory references satisfied by the cache, called hits, proceed at processor speed; those unsatisfied, called misses, incur a cache miss penalty and have to fetch the corresponding cache block from the main memory. A cache can be parameterized by capacity, block size and associativity, where capacity is the size of the cache, block size is the basic transferring unit between cache and main memory, and associativity determines how many slots in the cache are potential destinations for a given address reference.

Cache optimization in a main memory database system is similar to main memory optimization in a disk-based system. But the management of the cache is done by the hardware, and the database system doesn't have direct control of which block to bring into a cache. This makes cache optimization more subtle.

Typical cache optimization techniques include clustering, compression and coloring [CLH98]. Clustering tries to pack, in a cache block, data structure elements that are likely to be accessed successively. Compression tries to remove irrelevant data and thus increases cache block utilization by being able to put more useful elements in a cache block. This includes key compression, and structure encodings such as pointer elimination and fluff extraction. Caches have finite associativity, which means that only a limited number of concurrently accessed data elements can map to the same cache line without causing conflict. Coloring maps contemporaneously-accessed elements to non-conflicting regions of the cache.

Previous research has attacked the processor-memory gap using the above techniques. Wolf and Lam [WL91] exploited cache reference locality to improve the performance of matrix multiplication.
LaMarca and Ladner [LL96, LL97] considered the effects of caches on sorting algorithms and improved performance by restructuring these algorithms to exploit caches. In addition, they constructed a cache-conscious heap structure that clustered and aligned heap elements to cache blocks. [CLH98] demonstrated that cache optimization techniques can improve the spatial and temporal locality of pointer-based data structures. They showed improvement on various benchmarks.

In [NBC+94], Nyberg et al. have shown that for achieving high performance sorting, one should focus carefully on cache memory behavior.

Cache conscious algorithms have been considered in database systems also. In [SKN94], the authors suggested several ways to improve the cache reference locality of query processing operations such as joins and aggregations. They showed that the new algorithms can run 8% to 200% faster.

Although cache optimization has been considered on tree-based structures, nobody has looked at the influence of the cache on index structures used in database systems. Some other papers have considered the issue of compact representations of B-tree indexes [CS83, JTR87]. These papers appeared too early to consider cache issues, since memory speed wasn't too slow compared with CPU speed. In the rest of this section, we study the cache behavior of various typical index structures used in main memory database systems.

Array Binary Search: The problem with binary search is that many accesses to elements of the sorted array result in a cache miss. We do not get misses for the first references because of temporal locality over many searches. We avoid misses for the last references, due to spatial locality, if many records from the array fit inside a single cache line. However, when the array is substantially bigger than the cache, many of the intervening accesses cause cache misses. In the worst case, the number of cache misses is of the order of the number of key comparisons.

T-Trees: T-Trees have been proposed as a better index structure in main memory database systems. A T-Tree [LC86a] is a balanced binary tree with many elements in a node. Elements in a node contain adjacent key values and are stored in order. Its aim is to balance the space overhead with searching time, and cache behavior is not considered (thirteen years ago the gap between processor and main memory speeds was not that large). T-Trees put more keys in each node and give the impression of being cache conscious. But if we think carefully, we can observe that for most of the T-Tree nodes, only the two end keys are actually used for comparison (in the improved version [LC86b], only one key is used). This means that the utilization of each node is low. Since the number of key comparisons is still the same, T-Trees do not provide any better cache behavior than binary search.

Another problem with the T-Tree is that it has to store a record pointer for each key within a node. Since most of the time the record pointers won't be needed, essentially half of the space in each node is wasted. Potentially, one could put just RIDs in the T-tree with no keys, but then search is much slower due to indirection.

B+-trees: Although B+-trees were designed for disk-based database systems, they actually have a much better cache behavior than T-trees. In each internal node we store keys and child pointers, but the record pointers are stored on leaf nodes only. Multiple keys are used to search within a node. If we fit each node in a cache line, this means that a cache load can satisfy more than one comparison. So each cache line has a better utilization ratio.

Enhanced B+-trees: In an OLAP environment, we can use all the slots in a B+-tree node (similar to compact B-Trees [CS83] and the ISAM method used in the IBM OS/360 operating system [GR93]) and rebuild the tree when batch updates arrive. We can design the node size to be exactly the same as a cache line and align each node. We can also hard-code the node search since we know all the slots are used. But B+-trees enhanced in these ways still need to store child pointers within each node (even when compact). So for any given node size, only half of the space can be used to store keys. In the rest of this paper we shall consider only "enhanced B+-trees" as described above; however, we may sometimes use "B+-trees" for brevity.

Hash: Hash indices can also benefit from cache optimization. The most common hashing method is chained bucket hashing [Knu73]. In [GBC98], the authors use the cache line size as the bucket size and squeeze in as many <key, RID> pairs as possible. This can reduce the number of cache misses when scanning through the buckets. Hash indices are fast for searching only if the length of each bucket chain is small. This requires a fairly large directory size and thus a fairly large amount of space. Skewed data can seriously affect the performance of hash indices unless we have a relatively sophisticated hash function, which will increase the computation time. Hash indices do not preserve order. To provide ordered access, an ordered list has to be kept in addition to the hash indices.

4 Cache Sensitive Search Trees

In this section, we present our cache conscious searching methods called CSS-trees.
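Before turning to CSS-trees, the array binary search analyzed above can be sketched in a few lines; this is a minimal illustration for counting probes, not the paper's implementation. Successive midpoints are far apart until the interval shrinks to about a cache line, so on arrays much larger than the cache almost every probe is a potential cache miss:

```python
def binary_search(a, key):
    """Return (index of leftmost slot >= key, number of probes).

    Each probe touches a[mid]; for arrays much larger than the cache,
    nearly every probe lands in a different cache line, so the miss
    count is of the order of the probe (comparison) count.
    """
    lo, hi = 0, len(a)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if a[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo, probes
```

The probe count is about log2 n, matching the worst-case cache-miss bound stated above.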
Section 4.1 introduces the concept of "full" CSS-trees. We talk about "level" CSS-trees in Section 4.2.

Suppose that we have a sorted array a[1..n] of n elements. The array a could contain the record-identifiers of records in some database table in the order of some attribute k. a could alternatively contain column-k keys from the records in the table, with a companion array holding the corresponding record-identifiers, using some extra space to avoid an indirect reference during the search. a could alternatively contain records of a table or domain values.

Binary search of a has a serious cache usage problem as described in Section 3. A second problem with binary search is that it requires a calculation to be performed log2 n times to find the next element to search. Even if this calculation uses a shift rather than a division by two [WK90], the calculation represents a significant portion of the execution time needed. Nevertheless, binary search has the benefit that no additional space beyond a is needed to perform a search. Our goal is to improve upon the search time of binary search without using a significant amount of additional space.

4.1 Full CSS-Trees

We create a search tree with nodes containing exactly m keys. (We'll see how to choose m later.) If the depth of the tree is d, then the tree is a complete (m+1)-ary search tree up to depth d-1, and at depth d the leaves are filled from left to right. An example tree is shown for m = 4 in the left diagram of Figure 3 (the numbers in the boxes are node numbers and each node has four keys). The nodes of a CSS-tree can be stored in an array as shown on the right in Figure 3. Note that we do not need to store explicit pointers to child nodes; the children of a node can be computed from offsets in the CSS-tree array. (The exact formulas are given below.)

Figure 3: Layout of a full CSS-tree (m = 4): internal nodes 0-15 and leaf nodes 16-80 form a directory stored in an array on top of a sorted array of 65*4 elements.

Our basic idea is to store a CSS-tree as a directory structure on top of the array a. There are two reasons why we expect a traversal of such a tree to be faster than binary search. First, if we choose m such that a node fits in a cache line, then all local searching within a node happens with at most one cache miss. As a result we get at most log_{m+1} n cache misses for a lookup, unlike binary search which gets up to log2 n cache misses. (Even if a node occupies two cache lines, half the time only one cache miss will be generated, while half the time there will be two cache misses.) Second, we can hard-code the traversal within a node, so that calculations needed to find the next node happen log_{m+1} n times rather than log2 n times. (Note that the total number of comparisons is the same.)

Suppose that we number the nodes and keys starting at 0. If we have an internal node numbered b, then it is not difficult to show that the children of that node are numbered from b(m+1)+1 to b(m+1)+(m+1). Within the directory, we have m keys per node, so key number i in the directory array maps to node number floor(i/m). As a result, one can find the offsets of the first keys of the child nodes within the directory array as (floor(i/m)*(m+1)+1)*m through (floor(i/m)*(m+1)+m+1)*m.

A subtle point in the structure of a CSS-tree is that we store the leaf nodes in a contiguous array in key order. This conflicts with the natural order of the CSS-tree, which stores the nodes from left to right, level by level. The CSS-tree order would split the array, putting the right half of the array (which appears at a higher level than the left half of the array) ahead of the left half of the array. In Figure 3, the natural tree order is to store nodes 16-30 before nodes 31-80. However, when the leaves are stored in a sorted array, nodes 31-80 come before nodes 16-30. Since maintaining a contiguous array in key order is desirable for other purposes, and since the array is given to us without assumptions that it can be restructured, we leave the array in key order. To get to the correct leaves when searching the CSS-tree, we modify the natural search algorithm.

When performing a search, we move from parent to child by recalculating the offset within the directory structure as above. We mark the end of the directory structure (in Figure 3, that's the last key in node 15), and terminate this portion of the search when the computed offset exceeds the endpoint. If the leaves were stored in the natural CSS-tree order, we'd use this offset to look up the directory directly.
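The pointer-free child computation described above can be sketched as follows; the helper names are ours, not the paper's, but the arithmetic is exactly the formulas just given (nodes and keys numbered from 0):

```python
def children_of(b, m):
    """Child node numbers of internal node b in a full CSS-tree with m
    keys per node: b*(m+1)+1 through b*(m+1)+(m+1)."""
    return list(range(b * (m + 1) + 1, b * (m + 1) + m + 2))

def child_key_offsets(i, m):
    """Given the offset i of a key in the directory array, return the
    directory offsets of the first keys of the children of the node
    containing key i, i.e. (floor(i/m)*(m+1)+1)*m through
    (floor(i/m)*(m+1)+m+1)*m."""
    node = i // m                       # key i lives in node floor(i/m)
    first = (node * (m + 1) + 1) * m    # first key of the leftmost child
    last = (node * (m + 1) + m + 1) * m # first key of the rightmost child
    return first, last
```

With m = 4 as in Figure 3, node 0's children are nodes 1-5, and their first keys sit at directory offsets 4 through 20: no child pointers are ever stored.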
However, since the two parts of the leaf nodes are stored in reverse order in a separate array, we need to process this offset further.

The two parts of the leaf nodes are mapped into the sorted array as shown in Figure 3. We use y to denote the boundary of the two parts of the leaf nodes, which is the offset of the first key at the deeper leaf level in the directory array. (In Figure 3 that's the first key in node 31.) Given an offset x of a key in a leaf node, we compare it with y to determine which part of the sorted array to look into. If x >= y, we find the element at position x - y from the start of the sorted array a. Otherwise, we find the element at position y - x from the end of a. For example, in Figure 3, the first key in leaf node 30 can be found at the first key in node 64 in the sorted array.

Note that our techniques apply to sorted arrays having elements of size different from the size of a key. Offsets into the leaf array are independent of the record size within the array; the compiler will generate the appropriate byte offsets.

The following lemma tells us how to calculate the node number of the last internal node, and the node number of y. Key offsets can be obtained by multiplying these numbers by m.

Lemma 4.1: Let n = N*m be the number of elements in the sorted array a (N is the number of leaf nodes). The total number of internal nodes in a full CSS-tree is ((m+1)^k - 1)/m - ceil(((m+1)^k - N)/m). The first leaf node in the bottom level is number ((m+1)^k - 1)/m. In both formulas, k = ceil(log_{m+1}(N)).

Notice that CSS-trees are stored in a way similar to heaps. This is possible because of the way we "virtually" split the sorted array. B+-trees can't use the same technique since they require all the leaves to be on the same level.

Building a Full CSS-Tree: To build a full CSS-tree from a sorted array, we first split the sorted array logically into two parts (based on Lemma 4.1) and establish the mapping between the leaf nodes and elements in the sorted array. We then start with the last internal node. For each entry in the node, we fill it with the value of the largest key in its immediate left subtree. Finding the largest key in a subtree can be done by following the link in the rightmost branch until we reach the leaf nodes.

Some internal nodes, namely ancestors of the last leaf node at the deepest level, may not have a full complement of keys. In our algorithm, we simply fill in those dangling keys with the last element in the first half of array a. So there may be duplicate keys in some internal nodes. In our searching algorithm, we tune the searching within each node in such a way that the leftmost key will always be found in case of duplicates. So we will never reach the leaf nodes in the deepest level with an index out of the range of the first half of the sorted array.

Although it's difficult to incrementally update a full CSS-tree, it's relatively inexpensive to build such a tree from scratch. Our experiments show that building a full CSS-tree from a sorted array of twenty-five million integer keys takes less than one second on a modern machine. Therefore, when batch updates arrive, we can afford to rebuild the CSS-tree.

Searching a Full CSS-Tree: Once a full CSS-tree is built, we can search for a key. We start from the root node. Every time we reach an internal node, we do a binary search to find out which branch to go to. We repeat until we reach a leaf node. Finally, we map the leaf node into the sorted array and binary search the node.

All the searches within a node consist of hard-coded if-else statements. When doing binary search in the internal nodes, we keep checking whether the keys in the left part are greater than or equal to the search key. We stop when we find the first slot that has a value smaller than the search key and then follow the branch on its right. (If such a slot can't be found, we follow the leftmost branch.) In this way, if there are duplicates in a node, we are guaranteed to find the leftmost key among all the duplicates. When there are duplicate keys being indexed, we can return the leftmost match in a fashion similar to B+-trees.

4.2 Level CSS-Trees

Figure 4: Node with 8 keys (the binary search tree over one node's slots, numbered 1-8).

A full CSS-tree with m entries per node will have exactly m keys, i.e., all the entries are fully used. Figure 4 shows the binary search tree of a node with m = 2^3 entries. Out of the nine possible branches, seven of them need three comparisons and two of them need four. But if we waste one entry and just put seven keys per node, we will have a full binary search tree and all the branches need three comparisons. This may give us some benefit. So for m = 2^t, we call a tree that only uses m-1 entries per node and has a branching factor of m a level CSS-tree. A level CSS-tree will be deeper than the corresponding full CSS-tree since now the branching factor is m instead of m+1. However, we have fewer comparisons per node.
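Stepping back to the full CSS-tree construction: the quantities in Lemma 4.1 and the leaf-to-array mapping can be checked against the Figure 3 example (m = 4, N = 65 leaf nodes, 260 elements). This is a sketch with our own helper names, following the formulas as stated in the text:

```python
import math

def full_css_counts(N, m):
    """Lemma 4.1: the number of internal nodes, and the node number of
    the first leaf on the deepest level, for N leaf nodes of m keys."""
    k = math.ceil(math.log(N, m + 1))
    internal = ((m + 1) ** k - 1) // m - math.ceil(((m + 1) ** k - N) / m)
    first_deep_leaf = ((m + 1) ** k - 1) // m
    return internal, first_deep_leaf

def leaf_offset_to_array_index(x, y, n):
    """Map directory key offset x of a leaf key to an index in the
    sorted array a[0..n-1]; y is the offset of the first key at the
    deeper leaf level.  Deeper leaves map to the front of a, shallower
    leaves to the back."""
    return x - y if x >= y else n - (y - x)
```

For Figure 3 this yields 16 internal nodes (0-15) and first deep leaf 31, and it maps the first key of leaf node 30 to the first key of sorted-array node 64, as in the worked example above.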
If N is the number of nodes that the elements in the sorted array can form, a level CSS-tree has log_m N levels while a full CSS-tree has log_{m+1} N levels. The number of comparisons per node is t for a level CSS-tree and t*(1 + 2/(m+1)) for a full CSS-tree. So the total number of comparisons for a level CSS-tree is log_m N * t = log2 N, and that for a full CSS-tree is log_{m+1} N * t * (1 + 2/(m+1)) = log2 N * log_{m+1} m * (1 + 2/(m+1)). The ratio of the former to the latter is (m+1)*log_m(m+1)/(m+3). Thus a level CSS-tree always uses fewer comparisons than a full CSS-tree for searching. On the other hand, level CSS-trees may require log_m N cache accesses and log_m N node traversals, compared with log_{m+1} N for full CSS-trees. Whether we obtain a net gain in speed depends upon how time-consuming a comparison operation is compared with node traversals and cache accesses. A level CSS-tree still utilizes most of the data in each cache line. It uses a little more space than a full CSS-tree, but this may be desirable for users who want to trade space for time.

The building of level CSS-trees is similar to that of full CSS-trees. We can also make good use of the empty slot in each node. During the population, we can use that slot to store the largest value in the last branch of each node. We can thus avoid traversing the whole subtree to find the largest element. The searching algorithm of a level CSS-tree is similar to that of a full CSS-tree. The only difference is the calculation of the offset of a child node.

5 Time and Space Analysis

In this section, we analytically compare the time performance and the space requirement for different methods. We let R denote the space taken by a record-identifier, K the space taken by a key, P the space taken by a child pointer, and n the number of records being indexed. h denotes a hashing fudge factor (typically about 1.2, meaning that the hash table is 20% bigger than the raw data in the hash table). c denotes the size of a cache line in bytes, and s denotes the size of a node in a T-tree, CSS-tree or enhanced B+-tree measured in cache-lines. We choose typical values as follows: R = K = P = 4 bytes, n = 10^7, h = 1.2, c = 64 bytes and s = 1.

Time Analysis: To make the analysis easy, in this section we assume that R, P and K are the same. Thus we have a single parameter m, which is the number of slots per node. The size of a node in cache-lines is given by s = mK/c.

The first part of Table 1 shows the branching factor, number of levels, comparisons per internal node and comparisons per leaf node for each method. B+-trees have a smaller branching factor than CSS-trees since they need to store child pointers explicitly.

Table 1: Time analysis

  Method            | branching factor | # of levels      | comparisons per internal node | comparisons per leaf node
  Binary search     | 2                | log2 n           | 1                             | 1
  T-trees           | 2                | log2 (n/(m-1))   | 1                             | log2 m
  enhanced B+-trees | m/2              | log_{m/2} (2n/m) | log2 m - 1                    | log2 m
  Full CSS-trees    | m+1              | log_{m+1} (n/m)  | (1 + 2/(m+1)) log2 m          | log2 m
  Level CSS-trees   | m                | log_m (n/m)      | log2 m                        | log2 m

  Method            | total comparisons                  | moving across levels     | cache misses (mK/c <= 1) | cache misses (mK/c > 1)
  Binary search     | log2 n                             | log2 n * Ab              | log2 n                   | log2 n
  T-trees           | log2 n                             | log2 (n/(m-1)) * D       | log2 n                   | log2 n
  enhanced B+-trees | log2 n                             | log_{m/2} (2n/m) * D     | log_{m/2} n              | log_{m/2} n * (log2 (mK/c) + c/(mK))
  Full CSS-trees    | log2 n * log_{m+1} m * (m+3)/(m+1) | log_{m+1} (n/m) * Afcss  | log_{m+1} n              | log_{m+1} n * (log2 (mK/c) + c/(mK))
  Level CSS-trees   | log2 n                             | log_m (n/m) * Alcss      | log_m n                  | log_m n * (log2 (mK/c) + c/(mK))

The total cost of each searching method has three parts, namely the comparison cost, the cost of moving across levels and the cache miss cost. We show the three costs for each method in the second part of Table 1. We use D to denote the cost of dereferencing a pointer, and Ab, Afcss, Alcss to denote the cost of computing the child address for binary search, full CSS-trees and level CSS-trees respectively.

First of all, the comparison cost is more or less the same for all the methods. Full CSS-trees have slightly more comparisons than level CSS-trees, as we described earlier. Some of the methods find the child node by following pointers and others by arithmetic calculations. The relative cost depends on computation complexity and the hardware. For example, while Ab could be smaller than D, Afcss is likely to be more expensive than D. Nevertheless, methods with a higher branching factor have fewer levels and thus usually have a lower cost of moving across levels. But that doesn't mean the larger the branching factor the better. Too large a node size will increase the cache miss cost, which is probably the most important factor, since each cache miss can be an order of magnitude more expensive than a unit computation.
| Method | Space (indirect) | Typical value | Space (direct) | Typical value | RID-ordered access |
|---|---|---|---|---|---|
| Binary search | 0 | 0 MB | 0 | 0 MB | Y |
| Full CSS-trees | nK^2/(sc) | 2.5 MB | nK^2/(sc) | 2.5 MB | Y |
| Level CSS-trees | nK^2/(sc-K) | 2.7 MB | nK^2/(sc-K) | 2.7 MB | Y |
| Enhanced B+-trees | nK(P+K)/(sc-P-K) | 5.7 MB | nK(P+K)/(sc-P-K) | 5.7 MB | Y |
| Hash table | (h-1)nR | 8 MB | hnR | 48 MB | N |
| T-trees | 2nP(K+R)/(sc-2P) | 11.4 MB | 2nP(K+R)/(sc-2P) + nR | 51.4 MB | Y |

Table 2: Space analysis

Each cache miss can be an order of magnitude more expensive than a unit computation.

We assume a cold start in the cache. If the node size is smaller than the cache line size, each level incurs one cache miss. When the node size is larger than the cache line size, we estimate the number of cache misses per node to be (log2 s) + 1/s = (log2 (mK/c)) + c/(mK) (the total number of cache misses for all the keys adds up to s * (log2 s) + 1, and we divide that by s, assuming each cache line is equally likely to be chosen). The results are summarized in the last two columns of the table. For most reasonable configurations, the number of cache misses is minimized when the node size is the same as the cache line size. In the third column, we can see that binary search and T-trees always have a number of misses that is independent of m. B+-trees and CSS-trees have only a fraction of the cache misses of binary search, and the fractions for CSS-trees are even smaller than those of B+-trees. So CSS-trees have the lowest values for the cache-related component of the cost. As we can see, as m gets larger, the number of cache misses for all the methods approaches log2 n (essentially all methods degrade to binary search). There is a tradeoff between full CSS-trees and level CSS-trees: while the latter have slightly more cache misses, they also perform fewer comparisons. It is hard to compare the moving cost of the two, since Alcss is cheaper than Afcss. We will show an experimental comparison in Section 6.

To summarize, we expect CSS-trees to perform significantly better than binary search and T-trees in searching, and also to outperform B+-trees. If a bunch of searches are performed in sequence, the top-level nodes will stay in the cache. Since CSS-trees have fewer levels than all the other methods, they will also gain the most benefit from a warm cache.

Space Analysis: Table 2 lists the space requirements of the various algorithms. The column "Space (indirect)" describes the space required by the algorithms if the structure being indexed is a collection of record-identifiers that can be rearranged if necessary. In other words, it is acceptable for a method to store record-identifiers internally within its structure, as opposed to leaving the list of record-identifiers untouched as a contiguous list. In this column, we do not count the space used by the record-identifiers themselves, since all methods share this space requirement.

The column "Space (direct)" describes the space required by the algorithms if the structure being indexed is a collection of records that cannot be rearranged. In other words, it is not acceptable for a method to store the records internally within its structure. In this column, we count the space used by record-identifiers for T-trees and hash tables, since record-identifiers would not be necessary with the other methods.

All methods other than hash tables support access in RID-order. The formula for level CSS-trees assumes that sc/K is a power of 2.

6 Experimental Results

We perform an experimental comparison of the algorithms on two modern platforms. We analyze the wall-clock time taken to perform a large number of random successful lookups to the index. We summarize our experiments in this section.

Experimental Setup: We ran our experiments on an Ultra Sparc II machine (296MHz, 1GB RAM) and a Pentium II (333MHz, 128M RAM) personal computer. The Ultra machine has a <16k, 32B, 1> (<cache size, cache line size, associativity>) on-chip cache and a <1M, 64B, 1> secondary level cache. The PC has a <16k, 32B, 4> on-chip cache and a <512k, 32B, 4> secondary level cache. Both machines run Solaris 2.6. We implemented chained bucket hashing, array binary search, interpolation search [Pet57], T-trees, enhanced B+-trees, full CSS-trees and level CSS-trees in C++. We chose to implement all the existing methods ourselves because we are considering them in the context of a main memory OLAP environment. Existing implementations won't be optimized for space allocation and cache-related issues such as alignment; as a result, we believe our implementations will run faster. Since cache optimization can be sensitive to compilers [SKN94], we also chose two different compilers: one is Sun's native compiler CC and the other is GNU's g++. We used the highest optimization level of both compilers.
However, the graphs for the two compilers look very similar, so we only report the results for CC. All the keys are integers, chosen randomly between 0 and 1 million; each key takes four bytes. The lookup keys are generated in advance to prevent the key-generating time from affecting our measurements. We performed 100,000 searches on randomly chosen matching keys. We repeated each test five times and report the minimal time to find the first matching key.

Algorithm Implementation Details: We tried our best to optimize all the searching methods. For methods that can have different node sizes, we implemented specialized versions for selected node sizes. We allocate a large chunk of memory at the beginning and create all the nodes from that pool to save allocation time. We use logical shifts in place of multiplication and division whenever possible. For T-trees, B+-trees and CSS-trees, we unfold the binary search loop in each internal node by hardcoding all the if-else tests to reduce the amount of overhead. The search within a leaf node is also hardcoded. Additionally, once the searching range is small enough, we simply perform the equality test sequentially on each key; this gives us better performance when there are fewer than 5 keys in the range. Code specialization is important: when our code was more "generic" (including a binary search loop for each node), we found the performance to be 20% to 45% worse than the specialized code.

The sorted array is aligned properly according to the cache line size. For T-trees, B+-trees and CSS-trees, all the tree nodes are allocated at once and the starting addresses are also aligned properly.

We implemented enhanced B+-trees. They use all the slots in each node.(1) Each node is designed to have a prespecified size and all the nodes are aligned properly. We also forced each key and child pointer to be physically adjacent.

We avoid storing the parent pointer in each node of a T-tree, since it is not necessary for searching. We implemented the improved version of T-trees [LC86b], which is a little better than the basic version. For each T-tree node, we store the two child pointers adjacent to the smallest key so that they will be brought into the same cache line together. (Most of the time, the improved version checks only the smallest key in each node.)

For chained bucket hashing, we followed the techniques used in [GBC98] by using the cache line size as the bucket size. Besides keys, each bucket also contains a counter indicating the number of occupied slots in the bucket and a pointer to the next bucket. Our hash function simply uses the low-order bits of the key and is thus cheap to compute.

(1) Since there is always one more pointer than keys, for nodes with an even number of slots, we leave one slot empty.

Results: In the first experiment, we test how long it takes to build a CSS-tree. Both building-time curves are linear in the size of the sorted array. Level CSS-trees are cheaper to build than full CSS-trees because they don't need to traverse each subtree to find the largest key. We are not claiming that CSS-trees can be built much faster than other index structures; instead, we want to show that since the absolute rebuilding time is small (less than a second for 25 million keys), we can afford to rebuild CSS-trees periodically. The graph is omitted due to space limitations.

We then measure the searching time for all the algorithms. We first fix the node size and vary the size of the sorted array. We choose the node size to be each of the cache line sizes of the two levels of cache in the Ultra Sparc (32 bytes and 64 bytes). Figure 5 shows the result on the Ultra Sparc. First of all, when all the data can fit in cache, there is hardly any difference among the algorithms (except for interpolation search). As the data size increases, we can see that our cache conscious CSS-trees perform the best among all the methods except for hashing. T-trees and binary search are the worst and run more than twice as slow as CSS-trees. The B+-tree curve falls in the middle. Although these searching methods all have to do the same number of key comparisons, they differ on how many of those comparisons cause a cache miss. T-tree search and binary search essentially have one cache miss for each comparison. For CSS-trees, all comparisons within a node are performed with only one cache miss. B+-trees also have one cache miss per node, but since a B+-tree stores half as many keys per node as a CSS-tree, it has more levels than a CSS-tree. Although it is hard to discern visually, the level CSS-tree performs slightly better than the full CSS-tree; across all of our tests we observed that level CSS-trees were up to 8% faster than full CSS-trees. The performance of interpolation search depends on how well the data fits a linear distribution. Although not shown here, we also did some tests on non-uniform data, and interpolation search performs even worse than binary search; so in practice, we would not recommend using interpolation search. For B+-trees and the two CSS-trees, the numbers in Figure 5(b) are smaller than those in Figure 5(a). The reason is that the miss penalty for the second level of cache is larger than that of the on-chip cache. B+-trees get more benefit than CSS-trees from having a larger node size. This is because B+-trees always hold half the number of keys per node as CSS-trees, and thus their benefit from avoiding extra cache misses is more significant. Although the T-tree has many keys per node, it doesn't benefit from having a larger node size. In this experiment, we chose the hash table directory size to be 4 million.
[Figure 5: Varying array size, Ultra Sparc II. Lookup time (s) vs. size of sorted array for array binary search, tree binary search, interpolation search, T-tree, enhanced B+-tree, full CSS-tree, level CSS-tree and hash. (a) 8 integers per node; (b) 16 integers per node.]

[Figure 6: Ultra Sparc II, 16 integers per node. (a) Total number of key accesses; (b) cache misses in secondary level cache.]

Hashing uses only one third of the time of CSS-trees, but we have to keep in mind that it uses 20 times as much space.

To compare the number of cache misses, we ran all the searching methods through a cache simulator.(2) Figures 6(a) and 6(b) show the total number of key accesses and the total number of key misses in the secondary level cache in a configuration that matches the Sun Ultra Sparc. As the data set gets larger, a larger proportion of the key accesses are cache misses. B+-trees have approximately 50% more cache misses than CSS-trees. Let's assume that each RAM access takes 180ns.(3) Then enhanced B+-trees take 0.06 seconds to load data from RAM to cache (when there are 25 million keys in the sorted array), which is about 30% of the total time taken (0.22 seconds). Out of the 0.04 seconds time difference between B+-trees and CSS-trees, about 0.014 seconds is caused by the difference in cache miss penalties. Thus the cache simulation verifies that 35% of the observed performance gap is due to cache miss behavior. As the speed gap between CPU and RAM widens, this ratio will get higher and the difference between the performance of each method will be even more significant.

The results on the Pentium PC are very similar and are omitted here. In [LC86a, LC86b], the authors reported that T-trees perform better than binary search and B+-trees. Our results show the contrary conclusion. The explanation is that CPU speed has improved by two orders of magnitude relative to memory latency during the past thirteen years [CLH98].

(2) We implemented the simulator ourselves and instrumented our code (using #ifdef SIMULATOR macros) to log all memory accesses.
(3) We assume the memory has a peak bandwidth of 350MB/s [Inc99]; 64/350M = 180ns.

[Figure 7: Varying node size, 10 million keys. Time (s) vs. number of entries per node for T-tree, enhanced B+-tree, full CSS-tree, level CSS-tree and hash.]
In our next experiment, we fix the size of the sorted array and vary the node size of T-trees, B+-trees, full CSS-trees and level CSS-trees. Figure 7 shows the result on the Ultra Sparc. For both CSS-trees, the lowest point is 16 integers per node, which corresponds to the cache line size of the machine. B+-trees have a minimum at 32 integers per node. Although this doesn't quite fit the time analysis in Section 5, the difference between 16 integers and 32 integers per node is not very significant. There is a bump for full CSS-trees and B+-trees when each node has 24 integers. One reason is that the node size is then not a multiple of the cache line size, so nodes are not properly aligned with cache lines and cause more cache misses. The bumps for full CSS-trees are more dramatic. This is because the arithmetic computation for the child node is more expensive for m = 24, since division and multiplication must be used to compute child nodes instead of logical shifts. We also tested the optimal hash directory size for chained bucket hashing. Each point corresponds to a directory size from 2^23 to 2^18, with the leftmost point having the largest directory size. When the directory size is increased to 2^23, the system starts to run out of memory and the searching time goes up. T-trees perform much worse at all node sizes.

7 Space/Time Tradeoffs

An indexing method is measured by the pair (S, T) of space required for the whole index structure (S) and time taken for a single lookup (T). Figure 2 shows the space requirement and searching time for each method.(4) Each point for T-trees, B+-trees and CSS-trees corresponds to a specific node size. Since keeping an ordered RID list is usually necessary, we calculate the space requirement using the formula for "direct" space listed in Table 2. The stepped line basically tells us how to find the optimal searching time for a given amount of space. All the methods on the upper-right side of the line are dominated by some method on the line. T-trees and B+-trees are both dominated by CSS-trees. Methods on the line trade space for time. On the bottom, we have binary search on the sorted array, which doesn't require any extra space but uses three times as much time as CSS-trees and seven times as much as hashing. If we invest a little bit of extra space, we can use CSS-trees, which reduce the searching time to one third of binary search. To make another factor of two improvement, we have to pay 20 times as much space as CSS-trees to be able to use hashing. When space is limited, CSS-trees balance space and time.

We show the breakdown of the cost of enhanced B+-trees and CSS-trees in Figure 8(a). For the time being, the cache miss cost corresponds to a relatively small fraction of the total cost. Given the trend shown in Figure 1, in five years we expect the CPU to be 10 times faster while the memory speed will be only 1.6 times faster. So the breakdown at that time will probably look like that in Figure 8(b): the cache miss cost will be much more significant.

[Figure 8: Time breakdown (16 integers per node) for enhanced B+-trees, level CSS-trees and full CSS-trees: (a) now; (b) in five years. Each bar is split into CPU cost and cache miss cost.]

Figure 9 shows the projected space/time tradeoff between B+-trees and CSS-trees in five years. As we can see, the optimal CSS-trees are almost 30% faster than the optimal enhanced B+-trees. In the limit, if cache miss cost dominates the total cost, CSS-trees can be up to 50% faster than enhanced B+-trees. Also, the optimal node size in five years is different from what it is now.

[Figure 9: Space/time tradeoffs (in five years). Time (s) vs. space (bytes) for enhanced B+-trees, full CSS-trees and level CSS-trees.]

(4) We present the numbers from searching a sorted array of 5 million elements on an Ultra Sparc II.

8 Conclusion

We studied the cache behavior of various in-memory indexes in an OLAP environment and showed that cache conscious methods such as CSS-trees can improve searching performance by making good use of cache lines. As the gap between CPU and memory speed widens, we expect the improvement that can be achieved by exploiting the cache to be even more significant in the future. Cache conscious searching behavior is just one step towards efficiently utilizing the cache in database systems.
We aim to study the effect of cache behavior on other database operations in the future.

Acknowledgment: We thank John Grogan for the implementation of our cache simulator.

References

[AHK85] Arthur C. Ammann, et al. Design of a memory resident DBMS. In Proceedings of the IEEE COMPCON Conference, pages 54-57, 1985.

[BBC+98] Phil Bernstein, et al. The Asilomar report on database research. ACM SIGMOD Record, 27(4), 1998.

[CLH98] Trishul M. Chilimbi, James R. Larus, and Mark D. Hill. Improving pointer-based codes through cache-conscious data placement. Technical report 98, University of Wisconsin-Madison, Computer Science Department, 1998.

[Com79] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2), 1979.

[CS83] F. Cesarini, et al. An algorithm to construct a compact B-tree in case of ordered keys. Information Processing Letters, 17(1):1612-1630, 1983.

[Eng98] InfoCharger Engine. Optimization for decision support solutions (available from http://www.tandem.com/prod_des/ifchegpd/ifchegpd.htm), 1998.

[Fre95] Clark D. French. "One size fits all" database architectures do not work for DDS. In Proceedings of the ACM SIGMOD Conference, pages 449-450, 1995.

[Fre97] Clark D. French. Teaching an OLTP database kernel advanced data warehousing techniques. In Proc. IEEE Int'l Conf. on Data Eng., 1997.

[GBC98] Goetz Graefe, et al. Hash joins and hash teams in Microsoft SQL Server. In Proceedings of the 24th VLDB Conference, pages 86-97, 1998.

[GMS86] Hector Garcia-Molina, et al. High performance transaction processing with memory resident data. In Proceedings of the International Workshop on High Performance Transaction Systems, 1986.

[GR93] Jim Gray, et al. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, CA, USA, 1993.

[Hag86] R. B. Hagmann. A crash recovery scheme for a memory-resident database system. IEEE Transactions on Computers, C-35:839-842, 1986.

[Inc99] Sun Microsystems Inc. Datasheet on memory (available from http://www.sun.com/microelectronics/datasheets/sme1040/04.html as of Feb. 15, 1999), 1999.

[JLRS94] H. V. Jagadish, et al. Dali: A high performance main memory storage manager. In Proceedings of the 20th VLDB Conference, 1994.

[JTR87] Wiebren De Jonge, et al. Two access methods using compact binary trees. IEEE Transactions on Software Engineering, 13(7):799-810, 1987.

[Knu73] Donald Ervin Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, USA, 1973.

[LC86a] Tobin J. Lehman and Michael J. Carey. Query processing in main memory database management systems. In Proceedings of the ACM SIGMOD Conference, pages 239-250, 1986.

[LC86b] Tobin J. Lehman and Michael J. Carey. A study of index structures for main memory database management systems. In Proceedings of the 12th VLDB Conference, pages 294-303, 1986.

[LC87] Tobin J. Lehman and Michael J. Carey. A recovery algorithm for a high performance memory-resident database system. In Proceedings of the ACM SIGMOD Conference, pages 104-117, 1987.

[LL96] Anthony LaMarca and Richard E. Ladner. The influence of caches on the performance of heaps. ACM Journal of Experimental Algorithmics, 1, 1996.

[LL97] Anthony LaMarca, et al. The influence of caches on the performance of sorting. In Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 1997.

[LN88] K. Li and J. F. Naughton. Multiprocessor main memory transaction processing. In International Symposium on Databases in Parallel and Distributed Systems, pages 177-189, 1988.

[LSC92] Tobin J. Lehman, et al. An evaluation of Starburst's memory resident storage component. IEEE Transactions on Knowledge and Data Engineering, 4(6):555-566, 1992.

[NBC+94] Chris Nyberg, et al. AlphaSort: A RISC machine sort. In Proceedings of the ACM SIGMOD Conference, pages 233-242, May 1994.

[Pet57] W. W. Peterson. In IBM J. Res. & Devel., No. 1, pages 131-132, 1957.

[SAC+79] Patricia G. Selinger, et al. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD Conference, pages 23-34, 1979.

[SKN94] Ambuj Shatdal, et al. Cache conscious algorithms for relational query processing. In Proceedings of the 20th VLDB Conference, 1994.

[Smi82] Alan J. Smith. Cache memories. ACM Computing Surveys, 14(3):473-530, 1982.

[Sof97] TimesTen Performance Software. Main-memory data management technical white paper (available from http://www.timesten.com), 1997.

[Syb97] Sybase Corporation. Sybase IQ 11.2.1, 1997.

[WK90] Kyu-Young Whang and Ravi Krishnamurthy. Query optimization in a memory-resident domain relational calculus database system. ACM Transactions on Database Systems, 15(1):67-95, 1990.

[WL91] Michael E. Wolf, et al. A data locality optimizing algorithm. SIGPLAN Notices, 26(6):30-44, 1991.