Maximal Vector Computation in Large Data Sets

Parke Godfrey (York University, Toronto, ON M3J 1P3, Canada)
Ryan Shipley (The College of William and Mary, Williamsburg, VA 23187-8795, USA)
Jarek Gryz (York University)
{godfrey, jarek}

Part of this work was conducted at William & Mary, where Ryan Shipley was a student and Parke Godfrey was on faculty while on leave of absence from York.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

Abstract

Finding the maximals in a collection of vectors is relevant to many applications. The maximal set is related to the convex hull (and hence, linear optimization) and nearest neighbors. The maximal vector problem has resurfaced with the advent of skyline queries for relational databases and skyline algorithms that are external and relationally well behaved.

The initial algorithms proposed for maximals are based on divide-and-conquer. These established good average and worst case asymptotic running times, showing it to be O(n) average-case, where n is the number of vectors. However, they are not amenable to externalizing. We prove, furthermore, that their performance is quite bad with respect to the dimensionality, k, of the problem. We demonstrate that the more recent external skyline algorithms are actually better behaved, although they do not have as good an apparent asymptotic complexity. We introduce a new external algorithm, LESS, that combines the best features of these, experimentally evaluate its effectiveness and improvement over the field, and prove its average-case running time is O(kn).

1 Introduction

The maximal vector problem is to find the subset of the vectors such that each is not dominated by any of the vectors from the set. One vector dominates another if each of its components has an equal or higher value than the other vector's corresponding component, and it has a higher value on at least one of the corresponding components. One may equivalently consider points in a k-dimensional space instead of vectors. In this context, the maximals have also been called the admissible points, and the set of maximals called the Pareto set. This problem has been considered for many years, as identifying the maximal vectors (or admissible points) is useful in many applications. A number of algorithms have been proposed for efficiently finding the maximals.

The maximal vector problem has been rediscovered recently in the database context with the introduction of skyline queries. Instead of vectors or points, this is to find the maximals over tuples. Certain columns (with numeric domains) of the input relation are designated as the skyline criteria, and dominance is then defined with respect to these. The non-dominated tuples then constitute the skyline set.

Skyline queries have attracted a fair amount of attention since their introduction in [5]. It is thought that skyline offers a good mechanism for incorporating preferences into relational queries and, of course, its implementation could enable maximal vector applications to be built efficiently on relational database systems. While the idea itself is older, much of the recent skyline work has focused on designing good algorithms that are well behaved in the context of a relational query engine and are external (that is, that work over data sets too large to fit in main memory).

On the one hand, intuition says it should be fairly trivial to design a reasonably good algorithm for finding the maximal vectors. We shall see that many approaches are O(n) average-case running time. On the other hand, performance varies widely for the algorithms when applied to input sets of large, but realistic, sizes (n) and reasonable dimensionality (k). In truth, designing a good algorithm for the maximal vector problem is far from simple, and there are many subtle, but important, issues to attend to.

In this paper, we focus on generic maximal-vector algorithms; that is, on algorithms for which preprocessing steps or data structures such as indexes are not required. The contributions are two-fold: first, we thoroughly analyze the existing field of generic maximal vector algorithms, especially with consideration of the dimensionality k's impact; second, we present a new algorithm that essentially combines aspects of a number of the established algorithms, and offers a substantial improvement over the field.

In §2, we simultaneously review the work in the area and analyze the proposed algorithms' runtime performances. We first introduce a cost model on which we can base average-case analyses (§2.1). This assumes an estimate of the number of maximals expected, on average, assuming statistical independence of the dimensions and distinct values of the vectors along each dimension. We summarize the generic algorithmic approaches, both the older algorithms and the newer, external skyline algorithms, for computing maximal vector sets (§2.2). We briefly discuss some of the index-based approaches to maximal vector computation, and why index-based approaches are necessarily of limited utility (§2.3). We then formally analyze the run-time performances of the generic algorithms to identify the bottlenecks and compare advantages and disadvantages between the approaches, first the divide-and-conquer approaches (§2.4), and then the external, "skyline" approaches (§2.5).

In §3, we present a new algorithm for maximal vector computation, LESS (linear elimination sort for skyline), the design of which is motivated by our analyses and observations (§3.1). We present briefly some experimental evaluation of LESS that demonstrates its improvement over the existing field (§3.2). We formally analyze its runtime characteristics, prove it has O(kn) average runtime performance, and demonstrate its advantages with respect to the other algorithms (§3.3). Finally, we identify the key bottlenecks for any maximal-vector algorithm, and discuss further ways that LESS could be improved (§3.4). We discuss future issues and conclude in §4.

2 Algorithms and Analyses

2.1 Cost Model

A simple approach would be to compare each point against every other point to determine whether it is dominated. This would be O(n^2), for any fixed dimensionality k. Of course, once a point is found that dominates the point in question, processing for that point can be curtailed. So average-case running time should be significantly better, even for this simple approach. In the best-case scenario, for each non-maximal point, we would find a dominating point for it immediately, so each non-maximal point would be eliminated in O(1) steps. Each maximal point would still be expensive to verify; at the least, it would need to be compared against each of the other maximal points to show it is not dominated. If there are not too many maximals, this will not be too expensive. Given m, the expected number of maximals, if m < √n, the number of maximal-to-maximal comparisons, O(m^2), is O(n). Thus, assuming m is sufficiently small, in best-case, this approach is O(n). A goal, then, is for algorithms with average-case running time of O(n).

Performance is then also dependent on the number of maximals (m). In worst-case, all points are maximal (m = n); that is, no point dominates any other. We shall consider average-case performance based on the expected value of m. To do this, we shall need certain assumptions about the input set.

Definition 1 Consider the following properties of a set of points:
a. (independence) the values of the points over a single dimension are statistically independent of the values of the points along any other dimension;
b. (distinct values) points (mostly) have distinct values along any dimension (that is, there are not many repeated values); and
c. (uniformity) the values of the points along any one dimension are uniformly distributed.

Collectively, the properties of independence and distinct values are called component independence (CI) [4]. (In some cases, we shall just assume CI, as the property of uniformity is not necessary for the result at hand.) Let us call the three properties collectively uniform independence (UI).¹ Additionally, assume under uniformity, without loss of generality, that any value is on the interval (0, 1).²

Under CI, the expected value of the number of maximals is known [6, 11]: m = H_{k-1,n}, where H_{k,n} is the k-th order harmonic of n. Let H_{0,n} = 1, for n > 0, and let H_{k,n} be inductively defined, for k >= 1, as

    H_{k,n} = sum_{i=1}^{n} H_{k-1,i} / i.

H_{k,n} ≈ (H_{1,n})^k / k! ≈ ((ln n) + γ)^k / k!.

When the distinct-values assumption is violated and there are many repeated values along a dimension, the expected value of m goes down, up to the point at which the set is dominated by duplicate points (that is, points equivalent on all the dimensions) [11].

For best-case, assume that there is a total ordering of the points, p_1, ..., p_n, such that any p_i dominates all p_j, for i < j. Thus, in best-case, m = 1 (the one maximal point being p_1).³

¹ Note that, given a set of points that is CI, we could replace the points' values with their ordinal ranks over the data set with respect to each dimension. Then the set of points would be UI. However, this transformation would not be computationally insignificant.
² This is without loss of generality, since it makes no further assumptions about the data distributions. The data values can always be normalized onto (0, 1). Furthermore, this adds no significant computational load: knowing the maximum and minimum values of the points for each dimension is sufficient to make the mapping in a single pass.
³ We consider a total order so that, for any subset of the points, there is just one maximal with respect to that subset. This is necessary for discussing the divide-and-conquer-based algorithms.
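To make the definitions of §2.1 concrete, here is a small Python sketch (ours, not part of the paper; the function names are our own) of the dominance test, the simple quadratic maximal computation, and the k-th order harmonic H_{k,n} with its closed-form approximation:

```python
import math

def dominates(p, q):
    """p dominates q: p is >= q on every dimension and > q on at least one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def maximals(points):
    """The simple O(n^2) approach of Section 2.1: keep the non-dominated points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def harmonic(k, n):
    """k-th order harmonic of n: H_{0,n} = 1 and H_{k,n} = sum_{i=1}^n H_{k-1,i} / i."""
    row = [1.0] * (n + 1)                 # H_{0,i} = 1 for all i
    for _ in range(k):
        acc, nxt = 0.0, [1.0] * (n + 1)
        for i in range(1, n + 1):
            acc += row[i] / i             # running prefix sum of H_{k-1,i} / i
            nxt[i] = acc
        row = nxt
    return row[n]

def harmonic_approx(k, n):
    """H_{k,n} ~ ((ln n) + gamma)^k / k!, gamma being Euler's constant."""
    return (math.log(n) + 0.5772156649) ** k / math.factorial(k)
```

Under CI, harmonic(k - 1, n) estimates the expected number of maximals m; for n = 10^6 and k = 7, the approximation gives m on the order of 10^4, small relative to n, which is why the maximal-to-maximal comparison cost stays o(n).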
algorithm   ext.  best-case                  average-case                              worst-case
DD&C [14]   no    O(kn lg n)      (§2.2)     Ω(kn lg n + (k-1)^{k-3} n)  (Thm. 12)     O(n lg^{k-2} n)  [14]
LD&C [4]    no    O(kn)           (§2.2)     O(n), Ω((k-1)^{k-2} n)  ([4], Thm. 11)    O(n lg^{k-1} n)  [4]
FLET [3]    no    O(kn)           (§2.2)     O(kn)  [3]                                O(n lg^{k-2} n)  [3]
SD&C [5]    –     O(kn)           (Thm. 2)   Ω(√k 2^{2k} n)  (Thm. 10)                 O(kn^2)  (Thm. 3)
BNL [5]     yes   O(kn)           (Thm. 4)   –                                         O(kn^2)  (Thm. 5)
SFS [8]     yes   O(n lg n + kn)  (Thm. 6)   O(n lg n + kn)  (Thm. 8)                  O(kn^2)  (Thm. 9)
LESS        yes   O(kn)           (Thm. 14)  O(kn)  (Thm. 13)                          O(kn^2)  (Thm. 15)

Figure 1: The generic maximal vector algorithms.

We shall assume that k ≪ n. Furthermore, we assume that, generally, k < lg n. We include the dimensionality k in our O-analyses. We are now equipped to review the proposed algorithms for finding the maximal vectors, and to analyze their asymptotic runtime complexities (O's). Not all of the O(n) average cases can be considered equivalent without factoring in the impact of the dimensionality k. Furthermore, we are interested in external algorithms, so I/O cost is pertinent. After our initial analyses, we shall look into these details.

2.2 The Algorithms

The main (generic) algorithms that have been proposed for maximal vectors are listed in Figure 1. We have given our own names to the algorithms (not necessarily the same names as used in the original papers) for the sake of discussion. For each, whether the algorithm was designed to be external is indicated, and the known best, average, and worst case running-time analyses (with respect to CI or UI and our model for average case in §2.1) are shown. For each runtime analysis, it is indicated where the analysis appears. For each marked with '§', it follows readily from the discussion of the algorithm in that section. BNL's average-case, marked with '–', is not directly amenable to analysis (as discussed in §2.5). The rest are proven in the indicated theorems.⁴

The first group consists of divide-and-conquer-based algorithms. DD&C (double divide and conquer) [14], LD&C (linear divide and conquer) [4], and FLET (fast linear expected time) [3] are "theoretical" algorithms that were proposed to establish the best bounds possible on the maximal vector problem. No attention was paid to making the algorithms external. Their initial asymptotic analyses make them look attractive, however.

DD&C does divide-and-conquer over both the data (n) and the dimensions (k). First, the input set is sorted in k ways, once for each dimension. Then, the sorted set is split in half along one of the dimensions, say d_k, with respect to the sorted order over d_k. This is recursively repeated until the resulting set is below threshold in size (say, a single point). At the bottom of this recursive divide, each set (one point) consists of just maximals with respect to that set. Next, these (locally) maximal sets are merged. On each merge, we need to eliminate any point that is not maximal with respect to the unioned set. Consider merging sets A and B. Let all the maximals in A have a higher value on dimension d_k than those in B (given that the original set was divided over the sorted list of points with respect to dimension d_k). The maximals of A ∪ B are determined by applying DD&C, but now over dimensions d_1, ..., d_{k-1}, so with reduced dimensionality.⁵

Once the dimensionality is three, an efficient special-case algorithm can be applied. Thus, in worst-case, O(n lg^{k-2} n) steps are taken. In the best-case, the double divide-and-conquer is inexpensive, since each maximal set only has a single point. (It resolves to O(n).) However, DD&C needs to sort the data by each dimension initially, and this costs O(kn lg n). We establish DD&C's average-case performance in Section 2.4. Note that DD&C establishes that the maximal vector problem is, in fact, o(n^2).

LD&C [4] improves on the average-case over DD&C. Their analysis exploits the fact that they showed m to be O(ln^{k-1} n) average-case. LD&C does a basic divide-and-conquer recursion first, randomly splitting the set into two equal sets each time. (The points have not been sorted.) Once a set is below threshold size, the maximals are found. To merge sets, the DD&C algorithm is applied. This can be modeled by the recurrence

    T(1) = 1
    T(n) = 2 T(n/2) + (ln^{k-1} n) lg^{k-2} (ln^{k-1} n)

Note that (ln^{k-1} n) lg^{k-2} (ln^{k-1} n) is o(n). Therefore, LD&C is average-case linear, O(n) [4].

In best case, each time LD&C calls DD&C to merge two maximal sets, each maximal set contains a single point. Only one of the two points survives in the resulting maximal set. This requires that DD&C recurse to the bottom of its dimensional divide, which is k deep, to determine the winning point. O(n) merges are then done at a cost of O(k) steps each. Thus, LD&C's average-case running time is, at least, Ω(kn). (In Section 2.4, we establish that, in fact, it is far worse.) In worst case, the set has been recursively divided an extra time, so LD&C is lg n times worse than DD&C.

FLET [3] takes a rather different approach to improving on DD&C's average-case. Under UI,⁶ a virtual point x (not necessarily an actual point in the set) is determined so that the probability that no point from the set dominates it is less than 1/n. The set of points is then scanned, and any point that is dominated by x is eliminated. It is shown that the number of points x will dominate, on average, converges on n in the limit, and the number it does not is o(n). It is also tracked, while scanning the set, whether any point is found that dominates x. If some point did dominate x, it does not matter that the points that x dominates were thrown away: those eliminated points are dominated by a real point from the set anyway. DD&C is then applied to the o(n) remaining points, for an O(kn) average-case running time. This happens in at least an (n-1)/n fraction of trials. In the case that no point was seen to dominate x, which should occur in less than a 1/n fraction of trials, DD&C is applied to the whole set. However, DD&C's O(n lg^{k-2} n) running time in this case is amortized by 1/n, and so contributes O(lg^{k-2} n), which is o(n). Thus, the amortized, average-case running time of FLET is O(kn). FLET is no worse asymptotically than DD&C in worst case.

FLET's average-case runtime is O(kn) because FLET compares O(n) points against the point x. Each comparison involves comparing all k components of the two points, and so is k steps. DD&C and LD&C never compare two points directly on all k dimensions, since they do divide-and-conquer also over the dimensions. In [4] and [14], DD&C and LD&C were analyzed with respect to a fixed k. We are interested in how k affects their performance, though.

The second group, the skyline group, consists of external algorithms designed for skyline queries. Skyline queries were introduced in [5], along with two general algorithms proposed for computing the skyline in the context of a relational query engine.

The first general algorithm in [5] is SD&C, single divide-and-conquer. It is a divide-and-conquer algorithm similar to DD&C and LD&C. It recursively divides the data set. Unlike in LD&C, DD&C is not called to merge the resulting maximal sets, and a divide-and-conquer is not performed over the dimensions. Consider two maximal sets A and B. SD&C merges them by comparing each point in A against each point in B, and vice versa, eliminating any point in A dominated by a point in B and any point in B dominated by a point in A, to result in just the maximals with respect to A ∪ B.

Theorem 2 Under CI (Def. 1) and the model in §2.1, SD&C has a best-case runtime of O(kn).

Proof 2 Let m_A denote the number of points in A (which are maximal with respect to A). Let m_{AB} denote the number of points in A that are maximal with respect to A ∪ B. Likewise, define m_B and m_{BA} in the same way with respect to B. An upper bound on the cost of merging A and B is k m_A m_B, and a lower bound is k m_{AB} m_{BA}. In best case, SD&C is O(kn). □

For a fixed k, average case is O(n). (We shall consider more closely the impact of k on the average case in §2.4.)

Theorem 3 Under CI (Def. 1), SD&C has a worst-case runtime of O(kn^2).

Proof 3 The recurrence for SD&C under worst case is

    T(1) = 1
    T(n) = 2 T(n/2) + (n/2)^2

This is O(n^2) comparisons. Each comparison under SD&C costs k steps, so the runtime is O(kn^2). □

No provisions were made to make SD&C particularly well behaved relationally, although it is clearly more amenable to use as an external algorithm than DD&C (and hence, than LD&C and, to an extent, FLET too, as they rely on DD&C). The divide stage of SD&C is accomplished trivially by bookkeeping. In the merge stage, two files, say A and B, are read into main memory, and their points pairwise compared. The result is written out. As long as the two input files fit in main memory, this works well. At the point at which the two files are too large, it is much less efficient: a block-nested-loops strategy is employed to compare all of A's points against all of B's, and vice versa.

The second algorithm proposed in [5] is BNL, block nested loops. This is basically an implementation of the simple approach discussed in §2.1, and works remarkably well. A window is allocated in main memory for collecting points (tuples). The input file is scanned. Each point from the input stream is compared against the window's points. If it is dominated by any of them, it is eliminated. Otherwise, any window points dominated by the new point are removed, and the new point itself is added to the window.

At some stage, the window may become full. Once this happens, the rest of the input file is processed differently. As before, if a new point is dominated by a window point, it is eliminated. However, if the new point is not dominated, it is written to an overflow file. (Dominated window points are still eliminated as before.) The creation of an overflow file means another pass will be needed to process the overflow points. Thus, BNL is a multi-pass algorithm. On a subsequent pass, the previous overflow file is read as the input. Appropriate bookkeeping tracks when a window point has gone full cycle (that is, when it has been compared against all currently surviving points). Such window points can be removed from the window and written out, or pipelined along, as maximals.

⁴ Some of the theorems are relatively straightforward, but we put them in for consistency.
⁵ All points in A are marked so none will be thrown away. Only points in B can be dominated by points in A, since those in A are better along dimension d_k.
⁶ For the analysis of FLET, we need to make the additional assumption of uniformity from Def. 1.
BNL differs substantially from the divide-and-conquer algorithms. As points are continuously replaced in the window, those in the window are a subset of the maximals with respect to the points seen so far (modulo those written to overflow). These global maximals are much more effective at eliminating other points than are the local maximals computed at each recursive stage in divide-and-conquer.

Theorem 4 Under CI (Def. 1) and the model in §2.1, BNL has a best-case runtime of O(kn).

Proof 4 BNL's window will only ever contain one point. Each new point off the stream will either replace it or be eliminated by it. Thus BNL will only require one pass. □

A good average-case argument with respect to UI is difficult to make. We discuss this in §2.5. Let w be the size limit of the window, in number of points.

Theorem 5 Under CI (Def. 1), BNL has a worst-case runtime of O(kn^2).

Proof 5 In worst case, every point will need to be compared against every other point, for O(kn^2). This requires ⌈n/w⌉ passes. Each subsequent overflow file is smaller by w points, so this requires writing n^2/2w points and reading n^2/2w points. (The size of w is fixed.) In addition to requiring O(n^2) I/O's, every record will need to be compared against every other record: every record is added to the window, and none is ever removed. Each comparison costs k steps, so the work of the comparisons is O(kn^2). □

In [8], SFS, sort filter skyline, is presented. It differs from BNL in that the data set is topologically sorted initially. A common nested sort over the dimensions d_1, ..., d_k, for instance, would suffice. In [9], the utility of sorting for finding maximals, and SFS itself, are considered in greater depth. Processing the sorted data stream has the advantage that no point in the stream can be dominated by any point that comes after it. In [8, 9], sorting the records by volume descending, t[d_1] · ... · t[d_k],⁷ or, equivalently, by entropy descending, ln t[d_1] + ... + ln t[d_k] (with the assumption that the values t[d_i] > 0 for all records t and dimensions i), is advocated.⁸ This has the advantage of tending to push records that dominate many records towards the beginning of the stream.

SFS sorts by volume because records with higher volumes are more likely to dominate more records in the set; they have high "dominance" numbers. Assuming UI, the number of records a given record dominates is proportional to its volume. By putting these earlier in the stream, non-maximal records are eliminated in fewer comparisons, on average. The importance of this effect is emphasized in the discussion of LESS in §3 and in the proof that LESS is O(kn) (Thm. 13).

SFS maintains a main-memory window, as does BNL. It is impossible for a point off the stream to dominate any of the points in the window, however. Any point is thus known to be maximal at the time it is placed in the window. The window's points are used to eliminate stream points, and any stream point not eliminated is itself added to the window. As in BNL, once the window becomes full, surviving stream points must be written to an overflow file. At the end of the input stream, if an overflow file was opened, another pass is required. Unlike in BNL, the window can be cleared at the beginning of each pass, since all points have been compared against those maximals. The overflow file is then used as the input stream.

SFS has less bookkeeping overhead than BNL since, when a point is added to the window, it is already known that the point is maximal. This also means that SFS is progressive: at the time a point is added to the window, it can also be shipped as a maximal to the next operation. In [8], it was shown that SFS performs better I/O-wise than BNL, and runs in better time (and this includes SFS's necessary sorting step). The experiments in [8] were run over million-tuple data sets and with dimensions of five to seven.

Theorem 6 Under CI (Def. 1) and the model in §2.1, SFS has a best-case runtime of O(kn + n lg n).

Proof 6 Under our best-case scenario, there is one maximal point. This point must have the largest volume. Thus it will be the first point in SFS's sorted stream, and the only point to ever be added to the window. This point will eliminate all others in one pass. So SFS is sorting plus O(kn) in best-case, and works in one filtering pass. □

For average-case analysis of SFS, we need to know how many of the maximal points dominate any given non-maximal point. Any maximal point will be compared against every maximal point before it in the sorted stream to confirm its maximality; thus there will be m^2/2 of these comparisons. For any non-maximal point, how many maximals (points in the window) will it need to be compared against before being eliminated?

Lemma 7 Under UI (Def. 1), in the limit of n, the probability that any non-maximal point is dominated by the maximal point with the highest volume converges to one.

Proof 7 Assume UI. Let the values of the points be distributed uniformly on (0, 1) on each dimension. We draw on the proof of FLET's average-case runtime in [3]. Consider the (virtual) point x with coordinates x[d_i] = 1 - ((ln n)/n)^{1/k}, for each i in 1, ..., k. The probability that no point from the data set dominates x is then (1 - (ln n)/n)^n, which is at most e^{-ln n} = 1/n. The expected fraction of the points dominated by x (and hence, dominated by any point that dominates x) is (1 - ((ln n)/n)^{1/k})^k, and

    lim_{n→∞} (1 - ((ln n)/n)^{1/k})^k = 1.

Thus any maximal with a volume greater than x's (which would include any point that dominates x) will dominate all points in the limit of n. The probability that there is such a maximal is greater than (n-1)/n, which converges to one in the limit of n. □

⁷ As in the discussion about FLET, we are assuming uniformity (Def. 1).
⁸ Keeping entropy instead of volume avoids register overflow.
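Before the average-case analysis, SFS's sort-then-filter structure can be sketched for the in-memory case (a sketch of ours, not the paper's implementation: it ignores the overflow machinery and assumes all coordinates are positive, as the entropy formulation requires; the function names are hypothetical):

```python
import math

def dominates(p, q):
    """p dominates q: p is >= q on every dimension and > q on at least one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def sfs_maximals(points):
    """Sort Filter Skyline, single in-memory pass: sort by entropy (equivalently,
    by volume) descending, then filter against the window.  Since no later point
    in the sorted stream can dominate an earlier one, every point that survives
    the window test is maximal the moment it is appended."""
    stream = sorted(points, key=lambda t: sum(math.log(v) for v in t), reverse=True)
    window = []
    for p in stream:
        if not any(dominates(w, p) for w in window):
            window.append(p)   # p is already known maximal here: it could be
                               # shipped to the next operator immediately
    return window
```

The design choice the sort buys is visible in the loop: unlike BNL, there is no eviction step, because nothing already in the window can ever be dominated by a later stream point.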
  6. 6. The expected number of points dominated by x (and available for scanning in k sorted ways, sorted alonghence, dominated by any point that dominates x) is each dimension. If a tree index were available for each(1 − ((ln n)/n)1/k )k . dimension, this approach could be applied. Index-based algorithms for computing the skyline lim (1 − ((ln n)/n)1/k )k = 1 (the maximal vectors) additionally have serious limi- n→∞ tations. The performance of indexes—such as R-treesThus any maximal with a volume greater than x’s as used in [13, 15]—does not scale well with the num-(which would include any points that dominate x) will ber of dimensions. Although the dimensionality of adominate all points in the limit of n. The probabil- given skyline query will be typically small, the rangeity there is such a maximal is greater than (n − 1)/n, of the dimensions over which queries can be composedwhich converges to one in the limit of n. 2 can be quite large, often exceeding the performanceTheorem 8 Under UI (Def. 1), SFS has an average- limit of the indexes. For an index to be of practicalcase runtime of O(kn + nlg n). use, it would need to cover most of the dimensionsProof 8 The sort phase for SFS is O(nlg n). On used in queries.the initial pass, the volume of each point can be com- Note also that building several indexes on small sub-puted at O(kn) expense. During the filter phase of SFS, sets of dimensions (so that the union covers all them2 /2 maximal-to-maximal comparisons will be made. dimensions) does not suffice, as the skyline of a setExpected m is Θ((ln k−1 n)/(k − 1)!), so this is o(n). of dimensions cannot be computed from the skylinesNumber of comparisons of non-maximal to maximal is of the subsets of its dimensions. It is possible, andO(n). Thus the comparison cost is O(kn). 2 probable, thatTheorem 9 Under CI (Def. 1), SFS has a worst-case maxes{d1 ,...,di } (T) ∪ maxes{di+1 ,...,dk } (T)runtime of O(kn2 ). 
maxes{d1 ,...,dk } (T)Proof 9 In the worst-case, all records are maximal. Furthermore, if the distinct-values assumption fromEach record will be placed in the skyline window af- §2.1 is lifted, the union is no longer even guaranteedter being compared against the records currently there. to be a subset. (This is due to the possibility of tiesThis results in n(n − 1)/2 comparisons, each taking k over, say, d1 , . . . , di .)steps. The sorting phase is O(nlg n) again. 2 Another difficulty with the use of indexes for com- On the one hand, it is observed experimentally that puting skyline queries is the fact that the skyline oper-SFS makes fewer comparisons than BNL. SFS com- ator is holistic, in the sense of holistic aggregation op-pares only against maximals, whereas BNL will often erators. The skyline operator is not, in general, com-have non-maximal points in its window. On the other mutative with selections.9 For any skyline query thathand, SFS does require sorting. For much larger n, the involves a select condition then, an index that wouldsorting cost will begin to dominate SFS’s performance. have applied to the query without the select will not be applicable.2.3 Index-based Algorithms and Others Finally, in a relational setting, it is quite possible that the input set—for which the maximals are to beIn this paper, as stated already, we focus on generic al- found—is itself computed via a sub-query. In such agorithms to find the maximal vectors, algorithms that case, there are no available indexes on the input not require any pre-processing or pre-existing data-structures. Any query facility that is to offer maximal 2.4 The Case against Divide and Conquervector, or skyline, computation would require a genericalgorithm for those queries for which the pre-existing Divide-and-conquer algorithms for maximal vectorsindexes are not adequate. face two problems: There has been a good deal of interest though in 1. 
it is not evident how to make an efficient externalindex-based algorithms for skyline. The goals are to version; and,be able to evaluate the skyline without needing to 2. although the asymptotic complexity with respectscan the entire dataset—so for sub-linear performance, to n is good, the multiplicative “constant”—ando(n)—and to produce skyline points progressively, to the effect of the dimensionality k—may be bad.return initial answers as quickly as possible. Since there are algorithms with better average-case The shooting-stars algorithm [13] exploits R-trees run-times, we would not consider DD&C. Furthermore,and modifies nearest-neighbors approaches for finding devising an effective external version for it seems im-skyline points progressively. This work is extended possible. In DD&C, the data set is sorted first inupon in [15] in which they apply branch-and-bound k ways, once for each dimension. The sorted orderstechniques to reduce significantly the I/O overhead. could be implemented in main memory with one nodeIn [10, 12], bitmaps are explored for skyline evalua- per point and a linked list through the nodes for eachtion, appropriate when the number of values possible dimension. During the merge phase, DD&C does notalong a dimension is small. In [1], an algorithm is 9 In [7], cases of commutativity of skyline with other relationalpresented as instance-optimal when the input data is operators are shown. 234
re-sort the data points; rather, the sorted orders are maintained. In a linked-list implementation, it is easy to see how this could be done. It does not look possible to do this efficiently as an external algorithm.

LD&C calls DD&C repeatedly. Thus, for the same reasons, it does not seem possible to make an effective external version of LD&C. FLET calls DD&C just once. Still, since the number of points that remain after FLET's initial scan and elimination could be significant, FLET would also be hard to externalize.

SD&C was introduced in [5] as a viable external divide-and-conquer for computing maximal vectors. As we argued above, and as is argued in [15], SD&C is still far from ideal as an external algorithm. Furthermore, its runtime performance is far from what one might expect.

In each merge that SD&C performs of sets, say, A and B, every maximal with respect to A ∪ B that survives from A must have been compared against every maximal that survives from B, and vice-versa. This is a floor on the number of comparisons done by the merge. We know the number of maximals in average case under CI. Thus we can model SD&C's cost via a recurrence. The expected number of maximals out of n points of k dimensions under CI is H_{k−1,n}; (ln^{k−1} n)/(k−1)! converges on this from below, so we can use this in a floor analysis.

Theorem 10 Under CI (Def. 1), SD&C has average-case runtime of Ω((2^{2k}/√k) n).

Proof 10 Let n = 2^q for some positive integer q, without loss of generality. Consider the function T as follows.

T(1) = 1
T(n) = 2T(n/2) + ((1/2)(ln^{k−1} n)/(k−1)!)^2
     = 2T(n/2) + c_1 ln^D n,          where c_1 = 1/(4((k−1)!)^2) and D = 2k−2
     = c_1 Σ_{i=1}^{q} 2^i (ln n − ln 2^i)^D
     = c_2 Σ_{i=1}^{q} 2^i (lg n − lg 2^i)^D,          where c_2 = c_1/(lg e)^D
     = c_2 Σ_{i=1}^{q} 2^i (q − i)^D  =  c_2 Σ_{i=0}^{q−1} 2^{q−i} i^D

Using Σ_{i=0}^{j} 2^{j−i} i^D ≈ (lg e)^{D−1} D! 2^{j+1},

     ≈ c_2 (lg e)^{D−1} D! 2^q
     = (ln 2 / 4) C(2k−2, k−1) n
     ≈ (ln 2 / √(π(k−1))) 2^{2k−4} n,

since C(2j, j) ≈ 2^{2j}/√(πj), by Stirling's approximation. This counts the number of comparisons. Each comparison costs k steps.

For each merge step, we assume that the expected value of maximals survive, and that exactly half came from each of the two input sets. In truth, fewer might come from A and more from B sometimes. So the square of an even split is an over-estimate, given variance of resulting set sizes. In [2], it is established that the variance of the number of maximals under CI converges on H_{k−1,n}. Thus, in the limit of n, the runtime of SD&C will converge up to an asymptote above the recurrence. □

This is bad news. SFS requires n comparisons in the limit of n, for any fixed dimensionality k. SD&C, however, requires on the order of (2^{2k}/√k) × n comparisons in n's limit!

In [5], it was advocated that SD&C is more appropriate for larger k (say, for k > 7) than BNL, and is the preferred solution for data sets with large k. Our analysis conclusively shows the opposite: SD&C will perform increasingly worse for larger k and with larger n. We believe their observation was an artifact of the experiments in [5]. The data sets they used were only 100,000 points, and up to 10 dimensions. Had the data sets used been a million points instead (and 10 dimensions), SD&C would have performed proportionally much worse.

We can model LD&C's behavior similarly. For a merge (of A and B) in LD&C, it calls DD&C. Since A and B are maximal sets, most points will survive the merge. The cost of the call to DD&C is bounded below by its worst-case runtime over the number of points that survive. The double recursion must run to complete depth for these. So if m points survive the merge, the cost is m lg^{k−2} m steps. As in the proof of Thm. 10 for SD&C, we can approximate the expected number of maximals from below. Let m_{k,n} = (ln^{k−1}(n + γ))/(k−1)!. The recurrence is

T(1) = 1
T(n) = 2T(n/2) + max(m_{k,n} lg^{k−2} m_{k,n}, 1)

Figure 2: Behavior of LD&C. (A surface plot of the ratio of the number of comparisons to n, against lg(#vectors) and #dimensions.)

¹⁰ The behavior near the dimensions axis is an artifact of our log approximation of H_{k−1,i}, the expected number of maximals. In computing the graph, m_{k,i} lg^{k−2} m_{k,i} is rounded up to one whenever it evaluates to less than one.

We plot this in Figure 2.¹⁰ This shows the ratio of the number of comparisons over n. The recurrence asymptotically converges to a constant value for any given k. It is startling to observe that the k-overhead of LD&C appears to be worse than that of SD&C! The explanation is that m_{k,i} lg^{k−2} m_{k,i} is larger initially than
m²_{k,i}, for the small sizes i of data sets encountered near the bottom of the divide-and-conquer. (Of course, m²_{k,i} ≫ m_{k,i} lg^{k−2} m_{k,i} in i's limit; or, in other words, as i approaches n at each subsequent merge level, for very large n.) However, it is those initial merges near the bottom of the divide-and-conquer that contribute most to the cost overall, since there are many more pairs of sets to merge at those levels. Next, we prove a lower bound on LD&C's average case.

Theorem 11 Under CI (Def. 1), LD&C has average-case runtime of Ω((k − 1)^{k−2} n).

Proof 11 Let n = 2^q for some positive integer q, without loss of generality.

m_{k,n} lg^{k−2} m_{k,n}
  ≈ ((ln^{k−1} n)/(k−1)!) lg^{k−2}((ln^{k−1} n)/(k−1)!),          where c_1 = 1/(k−1)!
  = c_1 (ln 2)^{k−1} (lg n)^{k−1} (lg((ln 2)^{k−1} (lg n)^{k−1}) − lg (k−1)!)^{k−2},          where c_2 = ln^{k−1} 2
  > c_1 c_2 (lg n)^{k−1} ((k − 1)(lg q + ln 2 − lg (k − 1)))^{k−2},          when lg q > lg (k − 1)
  > c_1 c_2 (lg n)^{k−1} (k − 1)^{k−2},          when lg q − lg (k − 1) ≥ 1, thus q ≥ 2(k − 1)

Let l = 2(k − 1). Consider the function T as follows.

T(n) = 2T(n/2) + max(m_{k,n} lg^{k−2} m_{k,n}, 1)
     > c_1 c_2 (k − 1)^{k−2} Σ_{i=l}^{q} 2^{q−i} i^{k−1},          for n ≥ 2^l
     = c_1 c_2 (k − 1)^{k−2} 2^l Σ_{i=0}^{q−l} 2^{(q−l)−i} i^{k−1}
     ≈ c_1 c_2 (k − 1)^{k−2} 2^l (lg e)^{k−2} (k − 1)! 2^{q−l}
     ≈ (k − 1)^{k−2} 2^q = (k − 1)^{k−2} n

T(n) is a strict lower bound on the number of comparisons that LD&C makes, in average case. We only sum T(n) for n ≥ 2^l and show T(n) > (k − 1)^{k−2} n. □

We can use the same reasoning to obtain an asymptotic lower bound on DD&C's average-case runtime.

Theorem 12 Under CI (Def. 1), DD&C has an average-case runtime of Ω(kn lg n + (k − 1)^{k−3} n).

Proof 12 DD&C first does a divide-and-conquer over the data on the first dimension. During a merge step of this divide-and-conquer, it recursively calls DD&C to do the merge, but considering one dimension fewer. The following recurrence provides a lower bound.

T(1) = 1
T(n) = 2T(n/2) + max(m_{k,n} lg^{k−3} m_{k,n}, 1)

By the same proof steps as in the proof for Thm. 11, we can show T(n) > (k − 1)^{k−3} n. Of course, DD&C sorts the data along each dimension before it commences divide-and-conquer. The sorting costs Θ(kn lg n). Thus, DD&C considered under CI has average-case runtime of Ω(kn lg n + (k − 1)^{k−3} n). □

Should one even attempt to adapt a divide-and-conquer approach to a high-performance, external algorithm for maximal vectors? Divide-and-conquer is quite elegant and efficient in other contexts. We have already noted, however, that it is quite unclear how one could externalize a divide-and-conquer approach for maximal vectors effectively. Furthermore, we believe their average-case run-times are so bad, in light of the dimensionality k, that it would not be worthwhile.

Divide-and-conquer has high overhead with respect to k because the ratio of the number of maximals to the size of the set for a small set is much greater than for a large set. Vectors are not eliminated quickly as they are compared against local maximals. The scan-based approaches such as BNL and SFS find global maximals—maximals with respect to the entire set—early, and so eliminate non-maximals more quickly.

2.5 The Case against the Skyline Algorithms

SFS needs to sort initially, which gives it too high an average-case runtime. However, it was shown in [8] to perform better than BNL. Furthermore, it was shown in [8] that BNL is ill-behaved relationally. If the data set is already ordered in some way (but not for the benefit of finding the maximals), BNL can perform very badly. Of course, SFS is immune to input order since it must sort. When BNL is given more main-memory allocation—and thus, its window is larger—its performance deteriorates. This is because any maximal point is necessarily compared against all the points in the window. There no doubt exists an optimal window-size for BNL for a data set of any given n and k. However, not being able to adjust the window size freely means one cannot reduce the number of passes BNL takes.

All algorithms we have observed are CPU-bound, as the number of comparisons to be performed often dominates the cost. SFS has been observed to have a lower CPU-overhead than BNL, and when the window-size is increased for SFS, its performance improves. This is because all points must be compared against all points in the window (since they are maximals) eventually anyway. The number of maximal-to-maximal comparisons for SFS is m²/2, because each point only needs to be compared against those before it in the stream. BNL of course has this cost too, and also compares eventual maximals against additional non-maximals along the way. Often, m is reasonably large, and so this cost is substantial.

3 The LESS Algorithm

3.1 Description

We devise an external, maximal-vector algorithm that we call LESS (linear elimination sort for skyline) that combines aspects of SFS, BNL, and FLET, but that
does not contain any aspects of divide-and-conquer. LESS filters the records via a skyline-filter (SF) window, as does SFS. The record stream must be in sorted order by this point. Thus LESS must sort the records initially too, as does SFS. LESS makes two major changes:
1. it uses an elimination-filter (EF) window in pass zero of the external sort routine to eliminate records quickly; and
2. it combines the final pass of the external sort with the first skyline-filter (SF) pass.

The external sort routine used to sort the records is integrated into LESS. Let b be the number of buffer-pool frames allocated to LESS. Pass zero of the standard external sort routine reads in b pages of the data, sorts the records across those b pages (say, using quicksort), and writes the b sorted pages out as a b-length sorted run. All subsequent passes of external sort are merge passes. During a merge pass, external sort does a number of (b − 1)-way merges, consuming all the runs created by the previous pass. For each merge, (up to) b − 1 of the runs created by the previous pass are read in one page at a time, and written out as a single sorted run.

Figure 3: Buffer pool for LESS. (a) In pass zero: a block for quicksort, plus the EF window. (b) In pass f: one input frame per run 1, ..., k, an output frame, and the SF window.

LESS sorts the records by their entropy scores, as discussed in §2.2 with regards to SFS. LESS additionally eliminates records during pass zero of its external-sort phase. It does this by maintaining a small elimination-filter window. Copies of the records with the best entropy scores seen so far are kept in the EF window (Fig. 3(a)). When a block of records is read in, the records are compared against those in the EF window. Any input record that is dominated by any EF record is dropped. Of the surviving input records, the one with the highest entropy is found. Any records in the EF window that are dominated by this highest-entropy record are dropped. If the EF window has room, (a copy of) the input record is added. Else, if the EF window is full but there is a record in it with a lower entropy than this input record, the input record replaces it in the window. Otherwise, the window is not modified.

The EF window acts then similarly to the elimination window used by BNL. The records in the EF window are accumulated from the entire input stream. They are not guaranteed to be maximals, of course, but as records are replaced in the EF window, the collection has records with increasingly higher entropy scores. Thus the collection performs well to eliminate other records.

LESS's merge passes of its external-sort phase are the same as for standard external sort, except for the last merge pass. Let pass f be the last merge pass. The final merge pass is combined with the initial skyline-filter pass. Thus LESS creates a skyline-filter window (like SFS's window) for this pass. Of course, there must be room in the buffer pool to perform a multi-way merge over all the runs from pass f − 1 and for a SF window (Fig. 3(b)). As long as there are fewer than b − 2 runs, this can be done: one frame per run for input, one frame for accumulating maximal records as found, and the rest for the SF window. (If not, another merge pass has to be done before commencing the SF passes.) This is the same optimization done in the standard two-pass sort-merge join, implemented by many database systems, which saves a pass over the data by combining the last merge pass of external sort with the join-merge pass. For LESS, this typically saves a pass by combining the last merge pass of the external sort with the first SF pass.

As with SFS, multiple SF passes may be needed. If the SF window becomes full, then an overflow file will be created. Another pass then is needed to process the overflow file. After pass f—if there is an overflow file and thus more passes are required—LESS can allocate b − 2 frames of its buffer-pool allocation to the SF window for the subsequent passes.

In effect, LESS has all of SFS's benefits with no additional disadvantages. LESS should consistently perform better than SFS. Some buffer-pool space is allocated to the EF window in pass zero for LESS which is not for SFS. Consequently, the initial runs produced by LESS's pass zero are smaller than SFS's; this may occasionally force LESS to require an additional pass to complete the sort. Of course, LESS saves a pass since it combines the last sort pass with the first skyline pass.

LESS also has BNL's advantages, but effectively none of its disadvantages. BNL has the overhead of tracking when window records can be promoted as known maximals. LESS does not need this. Maximals are identified more efficiently once the input is effectively sorted. Thus LESS has the same advantages as does SFS in comparison to BNL. LESS will drop many records in pass zero via use of the EF window. The EF window works to the same advantage as BNL's window. All subsequent passes of LESS then are over much smaller runs. Indeed, LESS's efficiency rests on how effective the EF window is at eliminating records early. In §3.3, we show this elimination is very effective—as it is for FLET, and much for the same reason—enough to reduce the sort time to O(n).
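The pass-zero filtering just described can be sketched as follows. This is a minimal, in-memory illustration only (not the external C prototype): the helper names are hypothetical, and the entropy function shown, a sum of logs, merely stands in for the scoring of §2.2; any score that is monotone under dominance works, since a record can then only be dominated by records scoring at least as high.

```python
import math

def dominates(p, q):
    """p dominates q iff p >= q on every dimension and > on at least one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def entropy(p):
    # Assumed stand-in score: monotone under dominance.
    return sum(math.log(1.0 + x) for x in p)

def pass_zero(block, ef_window, capacity):
    """Filter one input block against the EF window, as in LESS's pass zero.

    Returns the block's survivors in entropy order (one sorted run);
    updates ef_window in place."""
    survivors = [r for r in block
                 if not any(dominates(e, r) for e in ef_window)]
    if survivors:
        best = max(survivors, key=entropy)
        # Drop EF records dominated by the best surviving record.
        ef_window[:] = [e for e in ef_window if not dominates(best, e)]
        if len(ef_window) < capacity:
            ef_window.append(best)
        else:
            low = min(range(len(ef_window)), key=lambda i: entropy(ef_window[i]))
            if entropy(ef_window[low]) < entropy(best):
                ef_window[low] = best
    # These survivors are what pass zero would quicksort and write as a run.
    return sorted(survivors, key=entropy, reverse=True)
```

Once a strong record sits in the EF window, most records in later blocks are dropped before ever being written to a run, which is the effect that drives the O(n) sort-phase cost argued in §3.3.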
Figure 4: Performance of LESS versus SFS: (a) I/O's (thousands) and (b) time (secs), each plotted against 5, 6, and 7 dimensions.

Figure 5: Choosing point v. (Point v has coordinate x on each dimension; region A dominates v, and region B is dominated by v.)

3.2 Experimental Evaluation

The LESS prototype is in C and uses Pthreads to implement non-blocking reads and writes. It implements external sort with double buffering. It was implemented and run on RedHat Linux 7.3 on an Intel Pentium III 733 MHz machine.

All tests calculate the skyline of a 500,000-record set with respect to 5, 6, and 7 dimensions. Each record is 100 bytes. A disk-page size of 4,096 bytes is used. Each column used as a skyline criterion has a value 1 ... 10,000. The values were chosen randomly, and the record sets obey the UI criteria from Def. 1.

Input and output is double-buffered to increase performance and to simulate a commercial-caliber relational algorithm. Each input/output block has a thread watchdog that handles the reading/writing of the block. The thread blocks on write, but the main process is free to continue processing.

If the EF window is too large, LESS will take more time simply as management of the EF window starts to have an impact. If the EF window is too small (say, a single page), the algorithm may become less effective at eliminating records early. As more records survive the sort to the SF phase, LESS's performance degrades. We experimented with varying the size of the EF window from one to thirty pages. Its size makes virtually no difference to LESS's performance in time or I/O usage. (We make clear why this should be the case in §3.3.) Below five pages, there was some modest increase in LESS's I/O usage. We set the EF window at five pages for what we report here.

We experimented with various buffer-pool allotments from 10 to 500 pages. The size affects primarily the efficiency of the sorting phase, as expected. We set the allocation at 76 pages for what we report here.

SFS and BNL are benchmarked in [8], where SFS is shown to provide a distinct improvement. Hence we benchmarked LESS against SFS. We implemented SFS within our LESS prototype. In essence, when in SFS-mode, the external sort is done to completion without using an EF window for elimination, and then the SF passes are commenced.

I/O performance for SFS and LESS is shown in Fig. 4(a). It is the same for SFS in this case for the five-, six-, and seven-dimensional runs. This is because the external sorting is the same in each case, and the number of SF-pass I/O's needed was the same. LESS shows a remarkable improvement in I/O usage, as we expect. Many records are eliminated by the EF window during the sorting phase. The I/O usage climbs slightly as the dimensionality increases. This is due to the fact that the elimination during sorting becomes less effective (for a fixed n) as the dimensionality k increases.

The time performance for SFS and LESS is shown in Fig. 4(b). For five dimensions, LESS clocks in at a third of SFS's time. The difference between LESS and SFS closes as the dimensionality increases. This is because, for higher dimensionality, more time is spent by LESS and SFS in the skyline-filtering phase. This is simply due to the fact that more records are maximal. The algorithms become CPU-bound, as most of the time is spent comparing skyline records against one another to verify that each is, in fact, maximal.

3.3 Analysis

LESS also incorporates implicitly aspects of FLET. Unlike FLET, we do not want to guess a virtual point to use for elimination. In the rare occasion that the virtual point was not found to be dominated, FLET must process the entire data set by calling DD&C. Such hit-or-miss algorithms are not amenable to relational systems. Instead, LESS uses real points accumulated in the EF window for eliminating. We shall show that these collected points ultimately do as good a job of elimination as does FLET's virtual point. Furthermore, the EF points are points from the data set, so there is no danger of failing in the first pass, as there is with FLET.

To prove that the EF points are effective at eliminating most points, we can construct an argument similar to that used in [3] to prove FLET's O(n) average-case runtime performance and in Lemma 7.

Theorem 13 Under UI (Def. 1), LESS has an average-case runtime of O(kn).

Proof 13 Let the data set be distributed on (0, 1)^k under UI. Consider a virtual point v with coordinate x ∈ (0, 1) on each dimension. Call the "box" of space that dominates v A, and the "box" of space dominated by v B. (This is shown in Fig. 5 for k = 2.) The size of B is then x^k, and the size of A is (1 − x)^k. Let x =
(1 − n^{−1/2k}). Thus the size of B, x^k, is (1 − n^{−1/2k})^k. In the limit of n, the size of B is 1:

lim_{n→∞} (1 − n^{−1/2k})^k = 1

If a point exists in A, it dominates all points in B. The expected number of points that occupy A is proportional to A's volume, which is 1/√n by our construction. There are n points, thus √n is the expected number of points occupying A.

If points are drawn at random with replacement from the data set, how many must be drawn, on average, before finding one belonging to A?¹¹ If there were exactly √n points in A, the expected number of draws would be n/√n = √n.

¹¹ This is simpler to consider than without replacement, and is an upper bound with respect to the number of draws needed without replacement.

Of course, √n is only the expected number of points occupying A. Sometimes fewer than √n points fall in A; sometimes, more. The actual number is distributed around √n via a binomial distribution. Taking the reciprocal of this distribution, the number of draws, on average, to find a point in A (or to find that no point is in A) is bounded above by (ln n)√n.

So during LESS's pass zero, in average case, the number of points that will be processed before finding an A point is bounded above by (ln n)√n. Once found, that A point will be added to the EF window; else, there is a point in the EF window already that has a better volume score than this A point. After this happens, every subsequent B point will be eliminated.

The number of points that remain, on average, after pass zero then is at most (1 − (1 − n^{−1/2k})^k) n + (ln n)√n. This is o(n). Thus, the surviving set is bounded above by n^f, for some f < 1. Effectively, LESS only spends effort to sort these surviving points, and n^f lg n^f is O(n). Thus the sort phase of LESS is O(kn). The skyline phase of LESS is clearly bounded above by SFS's average case, minus the sorting cost. SFS's average-case cost after sorting is O(kn) (Thm. 8). In this case, only n^f points survived the sorting phase, so LESS's SF phase is bounded above by O(kn). □

Proving LESS's best-case performance directly is not as straightforward. Of course, it follows directly from the average-case analysis.

Theorem 14 Under CI (Def. 1) and the model in §2.1, LESS has a best-case runtime of O(kn).

Proof 14 The records have a linear ordering. Thus, this is like considering the average-case runtime for the skyline problem with dimensionality one. □

Worst-case analysis is straightforward.

Theorem 15 Under CI (Def. 1), LESS has a worst-case runtime of O(kn²).

Proof 15 Nothing is eliminated in the sort phase, which costs O(n lg n). The SF phase costs the same as the worst-case of SFS, O(kn²) (Thm. 9). □

3.4 Issues and Improvements

Since our experiments in §3.2, we have been focusing on how to decrease the CPU load of LESS, and of maximal-vector algorithms generally. LESS and SFS must make m²/2 comparisons just to verify that the maximals are, indeed, maximals. The m² is cut in half since the input stream is in sorted order; we know no record can be dominated by any after it. BNL faces this same computational load, and does cumulatively more comparisons, as records are compared against non-maximal records in its window.

There are two ways to address the comparison load: reduce further somehow the number of comparisons that must be made; and improve the efficiency of the comparison operation itself. The divide-and-conquer algorithms have a seeming advantage here. DD&C, LD&C, and FLET have o(n²) worst-case performance. They need not compare every maximal against every maximal. Of course, §2.4 demonstrates that the divide-and-conquer algorithms have their own limitations.

The sorted order of the input stream need not be the same as that in which the records are kept in the EF and the SF windows. Indeed, we have learned that using two different orderings can be advantageous. (Likewise, this is true for SFS also.) Say that we sort the data in a nested sort with respect to the skyline columns, and keep the EF and SF windows sorted by entropy as before. This has the additional benefit that the data can be sorted in a natural way, perhaps useful to other parts of a query plan. Now when a stream record is compared against the SF records, the comparison can be stopped early, as soon as the stream record's entropy is greater than the next SF record's. At this point, we know the stream record is maximal. We have observed this change to reduce the maximal-to-maximal comparisons needed to 26% of that required before, for 500,000 points and seven dimensions. This would reduce LESS's time on the seven-dimensional data set in Fig. 4 from 15 to, say, 7 seconds.

LESS can be used with any tree index that provides a sorted order which is also a topological order with respect to the maximality conditions. In this case, the index replaces the need to sort the data initially. The existence of an applicable index for a given skyline query is much more likely than for the types of index structures the index-based skyline algorithms employ (as discussed in §2.3).¹² The index also can be a standard type commonly employed in databases, such as a B+-tree index. It only needs to provide a way to read the data in sorted order.

¹² To be fair, researchers investigating index-based skyline algorithms are seeking o(n) performance.

The dual-order versions of LESS—one order for the input stream and one for the skyline window—that we are investigating have given us insight into how we can better handle sets with anti-correlation. This represents the worst-case scenario for maximal-vector algorithms (§2.1). We are able to handle well some cases of anti-correlation, for instance, when one dimension is highly anti-correlated with another one. A modified version of LESS will still run in O(kn) average-case time for this. We believe we shall be able to extend this to handle most anti-correlation effects in the input data and still achieve good running time. Furthermore, this may lead to ways to improve LESS-like algorithms' worst-case running time to better than O(kn²).

There may be other ways to reduce the computational load of the comparisons themselves. Clearly, there is much to gain by making the comparison operation that the maximal-vector algorithm must do so often more efficient. We are exploring these techniques further, both experimentally and analytically, to see how much improvement we can accomplish. We anticipate improving upon the algorithm significantly more.

4 Conclusions

We have reviewed extensively the existing field of algorithms for maximal vector computation and analyzed their runtime performances. We show that the divide-and-conquer based algorithms are flawed in that the dimensionality k results in very large "multiplicative constants" over their O(n) average-case performance. The scan-based skyline algorithms, while seemingly more naïve, are much better behaved in practice. We introduced a new algorithm, LESS, which improves significantly over the existing skyline algorithms, and we prove that its average-case performance is O(kn). This is linear in the number of data points for fixed dimensionality k, and scales linearly as k is increased.

There remains room for improvement, and there are clear directions for future work. Our proofs for average-case performance rely on a uniformity assumption (Def. 1). In truth, LESS and related algorithms are quite robust in the absence of uniformity. We may be able to reduce the number of assumptions that need to be made for the proofs of asymptotic complexity. We want to reduce the comparison load of maximal-to-maximal comparisons necessary in LESS-like algorithms. While the divide-and-conquer algorithms do not work well, their worst-case running times are o(n²), while LESS's is O(n²). It is a question whether the O(n²) worst case of scan-based algorithms can be improved. Even if not, we want an algorithm to avoid worst-case scenarios as much as possible. For maximal vectors, anti-correlation in the data causes m to approach n. We want to be able to handle sets with anti-correlation much better. We are presently working on promising ideas for this, as discussed in §3.4.

We have found it fascinating that a problem as seemingly simple as maximal vector computation is, in fact, fairly complex to accomplish well. While there have been a number of efforts to develop good algorithms for finding the maximals, there has not been a clear understanding of the performance issues involved. This work should help to clarify these issues, and lead to better understanding of maximal-vector computation and related problems. Lastly, LESS represents a high-performance algorithm for maximal-vector computation.

References

[1] W.-T. Balke and U. Güntzer. Multi-objective query processing for database systems. In VLDB, pp. 936–947, Aug. 2004.
[2] O. Barndorff-Nielsen and M. Sobel. On the distribution of the number of admissible points in a vector random sample. Theory of Prob. and its Appl., 11(2):249–269, 1966.
[3] J. L. Bentley, K. L. Clarkson, and D. B. Levine. Fast linear expected-time algorithms for computing maxima and convex hulls. In SODA, pp. 179–187, Jan. 1990.
[4] J. L. Bentley, H. T. Kung, M. Schkolnick, and C. D. Thompson. On the average number of maxima in a set of vectors and applications. JACM, 25(4):536–543, 1978.
[5] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pp. 421–430, 2001.
[6] C. Buchta. On the average number of maxima in a set of vectors. Information Processing Letters, 33:63–65, 1989.
[7] J. Chomicki. Querying with intrinsic preferences. In EDBT, pp. 34–51, Mar. 2002.
[8] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In ICDE, pp. 717–719, Mar. 2003.
[9] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting: Theory and optimization. In Int. Inf. Sys. Conference (IIS), pp. 593–602, Jun. 2005.
[10] P.-K. Eng, B. C. Ooi, and K.-L. Tan. Indexing for progressive skyline computation. Data and Knowledge Engineering, 46(2):169–201, 2003.
[11] P. Godfrey. Skyline cardinality for relational processing. In FoIKS, pp. 78–97, Feb. 2004.
[12] K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efficient progressive skyline computation. In VLDB, pp. 301–310, Sept. 2001.
[13] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In VLDB, pp. 275–286, Aug. 2002.
[14] H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. JACM, 22(4):469–476, 1975.
[15] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In SIGMOD, pp. 467–478, Jun. 2003.