# Maximal Vector Computation in Large Data Sets

Parke Godfrey¹, Ryan Shipley², Jarek Gryz¹

¹ York University, Toronto, ON M3J 1P3, Canada
² The College of William and Mary, Williamsburg, VA 23187-8795, USA

{godfrey, jarek}@cs.yorku.ca

## Abstract

Finding the maximals in a collection of vectors is relevant to many applications. The maximal set is related to the convex hull—and hence, linear optimization—and nearest neighbors. The maximal vector problem has resurfaced with the advent of skyline queries for relational databases and skyline algorithms that are external and relationally well behaved.

The initial algorithms proposed for maximals are based on divide-and-conquer. These established good average and worst case asymptotic running times, showing it to be O(n) average-case, where n is the number of vectors. However, they are not amenable to externalizing. We prove, furthermore, that their performance is quite bad with respect to the dimensionality, k, of the problem. We demonstrate that the more recent external skyline algorithms are actually better behaved, although they do not have as good an apparent asymptotic complexity. We introduce a new external algorithm, LESS, that combines the best features of these, experimentally evaluate its effectiveness and improvement over the field, and prove its average-case running time is O(kn).

## 1 Introduction

The maximal vector problem is to find the subset of the vectors such that each is not dominated by any of the vectors from the set. One vector dominates another if each of its components has an equal or higher value than the other vector's corresponding component, and it has a higher value on at least one of the corresponding components. One may equivalently consider points in a k-dimensional space instead of vectors. In this context, the maximals have also been called the admissible points, and the set of maximals called the Pareto set. This problem has been considered for many years, as identifying the maximal vectors—or admissible points—is useful in many applications. A number of algorithms have been proposed for efficiently finding the maximals.

The maximal vector problem has been rediscovered recently in the database context with the introduction of skyline queries. Instead of vectors or points, this is to find the maximals over tuples. Certain columns (with numeric domains) of the input relation are designated as the skyline criteria, and dominance is then defined with respect to these. The non-dominated tuples then constitute the skyline set.

Skyline queries have attracted a fair amount of attention since their introduction in [5]. It is thought that skyline offers a good mechanism for incorporating preferences into relational queries, and, of course, its implementation could enable maximal vector applications to be built on relational database systems efficiently. While the idea itself is older, much of the recent skyline work has focused on designing good algorithms that are well behaved in the context of a relational query engine and are external (that is, that work over data sets too large to fit in main memory).

On the one hand, intuition says it should be fairly trivial to design a reasonably good algorithm for finding the maximal vectors. We shall see that many approaches are O(n) average-case running time. On the other hand, performance varies widely for the algorithms when applied to input sets of large, but realistic, sizes (n) and reasonable dimensionality (k). In truth, designing a good algorithm for the maximal vector problem is far from simple, and there are many subtle, but important, issues to attend to.

In this paper, we focus on generic maximal-vector algorithms; that is, on algorithms for which preprocessing steps or data structures such as indexes are not required.

> Part of this work was conducted at William & Mary, where Ryan Shipley was a student and Parke Godfrey was on faculty while on leave of absence from York.
>
> Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
>
> Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
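The dominance relation defined above, and the simple quadratic filter analyzed in §2.1, can be sketched in a few lines of Python. This is an illustrative sketch (function names are ours, maximizing each coordinate), not any of the paper's algorithms:

```python
def dominates(p, q):
    """p dominates q: p >= q on every component, p > q on at least one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def maximals(points):
    """Naive O(n^2) filter: keep the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(5, 1), (3, 3), (1, 4), (2, 2), (4, 1)]
# (2, 2) is dominated by (3, 3); (4, 1) is dominated by (5, 1)
print(maximals(pts))  # → [(5, 1), (3, 3), (1, 4)]
```

Note that dominance is only a partial order: (5, 1) and (1, 4) are incomparable, so both are maximal.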
The contributions are two-fold: first, we thoroughly analyze the existing field of generic maximal vector algorithms, especially with consideration of the dimensionality k's impact; second, we present a new algorithm that essentially combines aspects of a number of the established algorithms, and offers a substantial improvement over the field.

In §2, we simultaneously review the work in the area and analyze the proposed algorithms' runtime performances. We first introduce a cost model on which we can base average-case analyses (§2.1). This assumes an estimate of the number of maximals expected, on average, assuming statistical independence of the dimensions and distinct values of the vectors along each dimension. We summarize the generic algorithmic approaches—both older algorithms and newer, external skyline algorithms—for computing maximal vector sets (§2.2). We briefly discuss some of the index-based approaches to maximal vector computation, and why index-based approaches are necessarily of limited utility (§2.3). We then formally analyze the run-time performances of the generic algorithms to identify the bottlenecks and compare advantages and disadvantages between the approaches, first the divide-and-conquer approaches (§2.4), and then the external, "skyline" approaches (§2.5).

In §3, we present a new algorithm for maximal vector computation, LESS (linear elimination sort for skyline), the design of which is motivated by our analyses and observations (§3.1). We present briefly some experimental evaluation of LESS that demonstrates its improvement over the existing field (§3.2). We formally analyze its runtime characteristics, prove it has O(kn) average runtime performance, and demonstrate its advantages with respect to the other algorithms (§3.3). Finally, we identify the key bottlenecks for any maximal-vector algorithm, and discuss further ways that LESS could be improved (§3.4).

We discuss future issues and conclude in §4.

## 2 Algorithms and Analyses

### 2.1 Cost Model

A simple approach would be to compare each point against every other point to determine whether it is dominated. This would be O(n²), for any fixed dimensionality k. Of course, once a point is found that dominates the point in question, processing for that point can be curtailed. So average-case running time should be significantly better, even for this simple approach. In the best-case scenario, for each non-maximal point, we would find a dominating point for it immediately. So each non-maximal point would be eliminated in O(1) steps. Each maximal point would still be expensive to verify; at the least, it would need to be compared against each of the other maximal points to show it is not dominated. If there are not too many maximals, this will not be too expensive. Given m, the expected number of maximals, if m < √n, the number of maximal-to-maximal comparisons, O(m²), is O(n). Thus, assuming m is sufficiently small, in the best case, this approach is O(n). A goal then is for algorithms with average-case running time of O(n).

Performance then is also dependent on the number of maximals (m). In the worst case, all points are maximal (m = n); that is, no point dominates any other. We shall consider average-case performance based on the expected value of m. To do this, we shall need certain assumptions about the input set.

**Definition 1** Consider the following properties of a set of points:

- a. (independence) the values of the points over a single dimension are statistically independent of the values of the points along any other dimension;
- b. (distinct values) points (mostly) have distinct values along any dimension (that is, there are not many repeated values); and
- c. (uniformity) the values of the points along any one dimension are uniformly distributed.

Collectively, the properties of independence and distinct values are called component independence (CI) [4]. (In some cases, we shall just assume CI, as the property of uniformity is not necessary for the result at hand.) Let us call the three properties collectively uniform independence (UI).¹ Additionally, assume under uniformity, without loss of generality, that any value is on the interval (0, 1).²

Under CI, the expected value of the number of maximals is known [6, 11]: m = H_{k−1,n}, where H_{k,n} is the k-th order harmonic of n. Let H_{0,n} = 1, for n > 0, and let H_{k,n} be inductively defined as H_{k,n} = Σ_{i=1}^{n} H_{k−1,i} / i, for k ≥ 1. H_{k,n} ≈ (H_{1,n})^k / k! ≈ ((ln n) + γ)^k / k!.

When the distinct-value assumption is violated and there are many repeated values along a dimension, the expected value of m goes down, up to the point at which the set is dominated by duplicate points (that is, points equivalent on all the dimensions) [11].

For the best case, assume that there is a total ordering of the points, p₁, ..., pₙ, such that any p_i dominates all p_j, for i < j. Thus, in the best case, m = 1 (the one point being p₁).³

> ¹ Note that, given a set of points that is CI, we could replace the points' values with their ordinal ranks over the data set with respect to each dimension. Then the set of points would be UI. However, this transformation would not be computationally insignificant.
>
> ² This is without loss of generality, since it makes no further assumptions about the data distributions. The data values can always be normalized onto (0, 1). Furthermore, this adds no significant computational load: knowing the maximum and minimum values of the points for each dimension is sufficient to make the mapping in a single pass.
>
> ³ We consider a total order so that, for any subset of the points, there is just one maximal with respect to that subset. This is necessary for discussing the divide-and-conquer-based algorithms.
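The k-th order harmonic recurrence above is easy to evaluate directly with prefix sums, which gives a feel for how slowly m grows under CI. A small sketch (the function name and layout are ours):

```python
def harmonic(k, n):
    """k-th order harmonic H_{k,n}: H_{0,i} = 1; H_{k,n} = sum_{i=1..n} H_{k-1,i} / i."""
    h = [1.0] * (n + 1)                   # h[i] = H_{0,i} = 1
    for _ in range(k):                    # lift the order k times
        running, nxt = 0.0, [0.0] * (n + 1)
        for i in range(1, n + 1):
            running += h[i] / i           # prefix sum of H_{order-1,i} / i
            nxt[i] = running
        h = nxt
    return h[n]

# Expected number of maximals m = H_{k-1,n}, roughly ((ln n) + 0.577)^(k-1) / (k-1)!.
# For n = 10,000: H_{0,n} = 1 (k = 1), H_{1,n} is about ln n + gamma (k = 2), and
# the count stays far below n even for moderate k.
for k in (1, 2, 3, 5, 7):
    print(k, round(harmonic(k - 1, 10_000), 1))
```

This is why the average case can be so much better than the worst case: under CI, almost all points are non-maximal and can be eliminated cheaply.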
| algorithm | ext. | best-case | average-case | worst-case |
|-----------|------|-----------|--------------|------------|
| DD&C [14] | no | O(kn lg n) §2.2 | Ω(kn lg n + (k−1)^{k−3} n) Thm. 12 | O(n lg^{k−2} n) [14] |
| LD&C [4] | no | O(kn) §2.2 | O(n) [4], Ω((k−1)^{k−2} n) Thm. 11 | O(n lg^{k−1} n) [4] |
| FLET [3] | no | O(kn) §2.2 | O(kn) [3] | O(n lg^{k−2} n) [3] |
| SD&C [5] | – | O(kn) Thm. 2 | Ω((2^{2k}/√k) n) Thm. 10 | O(kn²) Thm. 3 |
| BNL [5] | yes | O(kn) Thm. 4 | – | O(kn²) Thm. 5 |
| SFS [8] | yes | O(n lg n + kn) Thm. 6 | O(n lg n + kn) Thm. 8 | O(kn²) Thm. 9 |
| LESS | yes | O(kn) Thm. 14 | O(kn) Thm. 13 | O(kn²) Thm. 15 |

Figure 1: The generic maximal vector algorithms.

We shall assume that k ≪ n. Furthermore, we assume that, generally, k < lg n. We include the dimensionality k in our O-analyses.

We are now equipped to review the proposed algorithms for finding the maximal vectors, and to analyze their asymptotic runtime complexities (O's). Not all of the O(n) average cases can be considered equivalent without factoring in the impact of the dimensionality k. Furthermore, we are interested in external algorithms, so I/O cost is pertinent. After our initial analyses, we shall look into these details.

### 2.2 The Algorithms

The main (generic) algorithms that have been proposed for maximal vectors are listed in Figure 1. We have given our own names to the algorithms (not necessarily the same names as used in the original papers) for the sake of discussion. For each, whether the algorithm was designed to be external is indicated, and the known best, average, and worst case running time analyses—with respect to CI or UI and our model for average case in §2.1—are shown. For each runtime analysis, it is indicated where the analysis appears. For each marked with '§', it follows readily from the discussion of the algorithm in that section. BNL's average case, marked with '–', is not directly amenable to analysis (as discussed in §2.5). The rest are proven in the indicated theorems.⁴

The first group consists of divide-and-conquer-based algorithms. DD&C (double divide-and-conquer) [14], LD&C (linear divide-and-conquer) [4], and FLET (fast linear expected time) [3] are "theoretical" algorithms that were proposed to establish the best bounds possible on the maximal vector problem. No attention was paid to making the algorithms external. Their initial asymptotic analyses make them look attractive, however.

DD&C does divide-and-conquer over both the data (n) and the dimensions (k). First, the input set is sorted in k ways, once for each dimension. Then, the sorted set is split in half along one of the dimensions, say d_k, with respect to the sorted order over d_k. This is recursively repeated until the resulting set is below threshold in size (say, a single point). At the bottom of this recursive divide, each set (one point) consists of just maximals with respect to that set. Next, these (locally) maximal sets are merged. On each merge, we need to eliminate any point that is not maximal with respect to the unioned set. Consider merging sets A and B. Let all the maximals in A have a higher value on dimension d_k than those in B (given the original set was divided over the sorted list of points with respect to dimension d_k). The maximals of A ∪ B are determined by applying DD&C, but now over dimensions d₁, ..., d_{k−1}, so with reduced dimensionality.⁵

Once the dimensionality is three, an efficient special-case algorithm can be applied. Thus, in the worst case, O(n lg^{k−2} n) steps are taken. In the best case, the double divide-and-conquer is inexpensive, since each maximal set has only a single point. (It resolves to O(n).) However, DD&C needs to sort the data by each dimension initially, and this costs O(kn lg n). We establish DD&C's average-case performance in Section 2.4. Note that DD&C establishes that the maximal vector problem is, in fact, o(n²).

LD&C [4] improves on the average case over DD&C. Their analysis exploits the fact that they showed m to be O(ln^{k−1} n) average-case. LD&C does a basic divide-and-conquer recursion first, randomly splitting the set into two equal sets each time. (The points have not been sorted.) Once a set is below threshold size, the maximals are found. To merge sets, the DD&C algorithm is applied. This can be modeled by the recurrence

T(1) = 1
T(n) = 2T(n/2) + (ln^{k−1} n) lg^{k−2}(ln^{k−1} n)

Note that (ln^{k−1} n) lg^{k−2}(ln^{k−1} n) is o(n). Therefore, LD&C is average-case linear, O(n) [4].

In the best case, each time LD&C calls DD&C to merge two maximal sets, each maximal set contains a single point. Only one of the two points survives in the resulting maximal set. This requires that DD&C recurse to the bottom of its dimensional divide, which is k deep, to determine the winning point. O(n) merges are then done at a cost of O(k) steps each. Thus, LD&C's average-case running time is, at least, Ω(kn). (In Section 2.4, we establish that, in fact, it is far worse.) In the worst case, the set has been recursively divided an extra time, so LD&C is lg n times worse than DD&C.

FLET [3] takes a rather different approach to improving on DD&C's average case. Under UI,⁶ a virtual point x—not necessarily an actual point in the set—is determined such that the probability that no point from the set dominates it is less than 1/n. The set of points is then scanned, and any point that is dominated by x is eliminated. It is shown that the number of points x will dominate, on average, converges on n in the limit, and the number it does not is o(n). It is also tracked while scanning the set whether any point is found that dominates x. If some point did dominate x, it does not matter that the points that x dominates were thrown away: those eliminated points are dominated by a real point from the set anyway. DD&C is then applied to the o(n) remaining points, for an O(kn) average-case running time. This happens in at least an (n−1)/n fraction of trials. In the case that no point was seen to dominate x, which should occur in less than a 1/n fraction of trials, DD&C is applied to the whole set. However, DD&C's O(n lg^{k−2} n) running time in this case is amortized by 1/n, and so contributes O(lg^{k−2} n), which is o(n). Thus, the amortized, average-case running time of FLET is O(kn). FLET is no worse asymptotically than DD&C in the worst case.

FLET's average-case runtime is O(kn) because FLET compares O(n) points against the point x. Each comparison involves comparing all k components of the two points, and so takes k steps. DD&C and LD&C never compare two points directly on all k dimensions, since they do divide-and-conquer also over the dimensions. In [4] and [14], DD&C and LD&C were analyzed with respect to a fixed k. We are interested in how k affects their performance, though.

The second group—the skyline group—consists of external algorithms designed for skyline queries. Skyline queries were introduced in [5], along with two general algorithms proposed for computing the skyline in the context of a relational query engine.

The first general algorithm in [5] is SD&C, single divide-and-conquer. It is a divide-and-conquer algorithm similar to DD&C and LD&C. It recursively divides the data set. Unlike in LD&C, DD&C is not called to merge the resulting maximal sets; a divide-and-conquer is not performed over the dimensions. Consider two maximal sets A and B. SD&C merges them by comparing each point in A against each point in B, and vice versa, to eliminate any point in A dominated by a point in B, and vice versa, to result in just the maximals with respect to A ∪ B.

**Theorem 2** Under CI (Def. 1) and the model in §2.1, SD&C has a best-case runtime of O(kn).

**Proof 2** Let m_A denote the number of points in A (which are maximal with respect to A). Let m_AB denote the number of points in A that are maximal with respect to A ∪ B. Likewise, define m_B and m_BA in the same way with respect to B. An upper bound on the cost of merging A and B is k·m_A·m_B, and a lower bound is k·m_AB·m_BA. In the best case, every maximal set is a singleton (m_A = m_B = 1), so each of the O(n) merges costs O(k), and SD&C is O(kn). □

For a fixed k, the average case is O(n). (We shall consider more closely the impact of k on the average case in §2.4.)

**Theorem 3** Under CI (Def. 1), SD&C has a worst-case runtime of O(kn²).

**Proof 3** The recurrence for SD&C under the worst case is

T(1) = 1
T(n) = 2T(n/2) + (n/2)²

This is O(n²) comparisons. Each comparison under SD&C costs k steps, so the runtime is O(kn²). □

No provisions were made to make SD&C particularly well behaved relationally, although it is clearly more amenable to use as an external algorithm than DD&C (and hence LD&C and, to an extent, FLET too, as they rely on DD&C). The divide stage of SD&C is accomplished trivially by bookkeeping. In the merge stage, two files, say A and B, are read into main memory, and their points pairwise compared. The result is written out. As long as the two input files fit in main memory, this works well. At the point at which the two files are too large, it is much less efficient: a block-nested-loops strategy is employed to compare all of A's points against all of B's, and vice versa.

The second algorithm proposed in [5] is BNL, block nested loops. This is basically an implementation of the simple approach discussed in §2.1, and works remarkably well. A window is allocated in main memory for collecting points (tuples). The input file is scanned. Each point from the input stream is compared against the window's points. If it is dominated by any of them, it is eliminated. Otherwise, any window points dominated by the new point are removed, and the new point itself is added to the window.

At some stage, the window may become full. Once this happens, the rest of the input file is processed differently. As before, if a new point is dominated by a window point, it is eliminated. However, if the new point is not dominated, it is written to an overflow file. (Dominated window points are still eliminated as before.) The creation of an overflow file means another pass will be needed to process the overflow points. Thus, BNL is a multi-pass algorithm. On a subsequent pass, the previous overflow file is read as the input. Appropriate bookkeeping tracks when a

> ⁴ Some of the theorems are relatively straightforward, but we put them in for consistency.
>
> ⁵ All points in A are marked so none will be thrown away. Only points in B can be dominated by points in A, since those in A are better along dimension d_k.
>
> ⁶ For the analysis of FLET, we need to make the additional assumption of uniformity from Def. 1.