Measuring the sky on computing data cubes via skylining the measures.bak


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Measuring the sky on computing data cubes via skylining the measures.bak

  1. 1. 492 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 Measuring the Sky: On Computing Data Cubes via Skylining the Measures Man Lung Yiu, Eric Lo, and Duncan Yung Abstract—Data cube is a key element in supporting fast OLAP. Traditionally, an aggregate function is used to compute the values in data cubes. In this paper, we extend the notion of data cubes with a new perspective. Instead of using an aggregate function, we propose to build data cubes using the skyline operation as the “aggregate function.” Data cubes built in this way are called “group-by skyline cubes” and can support a variety of analytical tasks. Nevertheless, there are several challenges in implementing group-by skyline cubes in data warehouses: 1) the skyline operation is computational intensive, 2) the skyline operation is holistic, and 3) a group-by skyline cube contains both grouping and skyline dimensions, rendering it infeasible to precompute all cuboids in advance. This paper gives details on how to store, materialize, and query such cubes. Index Terms—Query processing; data warehouse and repository. Ç1 INTRODUCTIONI N the data-warehousing environment, OLAP tools have attribute Ai of the tuple t. Given a set D of tuples, the been extensively used for a wide range of decision- skyline operation É on D is defined assupport applications such as sales analysis, customeranalysis, marketing, and services planning. These OLAP ÉðD; SÞ ¼ ft 2 D j6 9t0 2 D; t0 1S tg: ð2Þtools are built upon a multidimensional data model, in which In other words, a tuple t belongs to the skyline result set ifdata tuples are partitioned into different cells based on the no other tuple dominates it.values of their dimension attributes. For each cell of tuples, We observe that a group-by skyline cube has twoan aggregate function (e.g., SUM()) is applied to their measure distinguished features when compared to a traditional dataattributes (e.g., sales) and the resulting aggregated value http://ieeexploreprojects.blogspot.comcontributes to the value of that cell. A data cuboid is then cube. First, a cell of a group-by skyline cube stores a set offormed by constituting cells that are created from the same skyline tuples rather than a single numeric aggregate value.set of dimension attributes. Each data cuboid represents a Second, since the skyline operation can be applied tounique view of the underlying data. The collection of data multiple measure attributes, a group-by skyline cubecuboids, which are based on different combinations of includes not only a set of usual grouping dimensions, butdimension attributes, forms a data cube. also a set of skyline dimensions, in which the latter is defined In traditional data cubes, an aggregate function takes as on the measure attributes.input a set of measure values and returns a single numeric We illustrate the concept of group-by skyline cube by thevalue. In this paper, we extend the notion of data cube with example in Fig. 1, which is a table D of employee records. Eacha new perspective. Specifically, we study the issues of tuple represents an employee, with the employee’s “Region”building data cubes that exploit the skyline operator [4] as and “Position,” as well as his/her evaluated scores “Mental-the postoperation instead of the traditional aggregate ity,” “Attitude,” and “Flexibility.” Suppose that lower scoresfunctions. We name this type of data cubes as group-by are preferable over higher ones. If we want to find the outstanding employees of each position, the table in Fig. 2askyline cubes. The skyline operator has been well recognized can be used to answer the query directly. That table is called aas a very important decision-support operator in recent group-by skyline cuboid with grouping dimension set G ¼literature and has started to be implemented in commercial f‘‘Position’’g and skyline dimension set S ¼ f‘‘Mentality; ’’query engines [7]. Given a set S of skyline attributes, a tuple ‘‘Attitude; ’’ ‘‘Flexibility’’g. That essentially means that thet is said to dominate another tuple t0 , denoted by t 1S t0 , if employee in D is first grouped based on their positions and ð9 Ai 2 S; t½Ai Š < t0 ½Ai ŠÞ ^ ð8 Ai 2 S; t½Ai Š t0 ½Ai ŠÞ ð1Þ then the skyline operation is applied to each group to find those who are not dominated by the others on all three scores.assuming that smaller values are preferable over larger Figs. 2b and 2c show two more examples of group-by skylineones. Here, we use t½Ai Š to represent the value of the cuboids for the employee data set. Cuboid C2 groups the employees based on their regions and positions and the. The authors are with the Department of Computing, Hong Kong skyline operation is applied to their “Mentality” and Polytechnic University, Hung Hom, Kowloon, Hong Kong. “Attitude” attributes. Cuboid C3 essentially partitions the E-mail: {csmlyiu, ericlo, cskwyung} employees in the same way as C2 but the skyline operation isManuscript received 4 Dec. 2009; revised 5 May 2010; accepted 17 Aug. 2010; applied to only the attribute “Attitude.” Hence, for the datapublished online 22 Dec. 2011. set D in Fig. 1, its corresponding group-by skyline cube is aFor information on obtaining reprints of this article, please send e-mail, and reference IEEECS Log Number TKDE-2009-12-0819. collection of all cuboids, whereas each such cuboid Ci ðGi ; Si ÞDigital Object Identifier no. 10.1109/TKDE.2010.253. has its grouping dimension set Gi as a nonempty subset of 1041-4347/12/$31.00 ß 2012 IEEE Published by the IEEE Computer Society
  2. 2. YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 493 NBA in different periods and played different positions. In contrast, we can analyze the NBA data using a group-by skyline cube, treating all dimensions such as “Year,” “Team,” and “Position” as the set of all possible grouping dimensions and treating all measure attributes such as “Points,” “Assists,” and “Rebounds” as the set of all possible skyline dimensions.Fig. 1. Employee database D. Implementing the concept of group-by skyline cube in today’s data warehouses poses several technical challenges.{“Region,” “Position”} and its skyline dimension set Si as a First, the skyline operation is much more computationalnonempty subset of {“Mentality,” “Attitude,” “Flexibility”}. intensive [4], [7] when compared to simple aggregations such We find group-by skyline cube useful for other data as COUNT(), SUM(), and AVG(). Consequently, building oranalysis applications as well. One can extend a super- querying group-by skyline cubes needs to consider not onlymarket’s transactional data warehouse with group-by sky- the I/O cost, but also the computation (CPU) cost. Second,line cube so that attributes like “Location,” “Time,” and the skyline operation is holistic in nature as the skyline of a“Product” still serve as the grouping dimensions and the set cuboid is not necessarily derivable from another cuboid. Forof measure attributes like “Sales” and “Number of visitors” example, consider cuboids C2 and C3 in Fig. 2, although theyserve as the skyline dimensions. For hotel analysis, we can share the same set of grouping dimensions and the skylinebuild a group-by skyline cube with respect to all dimensions dimension set of C3 is a subset of that of C2 , their skylinesuch as “Star” (e.g., 5-star, 4-star) and “City” and all measure results are not derivable from each other. Specifically, we canattributes such as “Room price,” “Room quality,” “Trans- see that C2 is not derivable from C3 because the skylineportation convenience,” and “Entertainment quality.” employee b in C2 does not exist in C3 . Moreover, C3 is not Using the skyline operation as an “aggregate function” in derivable from C2 because the skyline employee f in C3 doesdata cubes and the notion of group-by skyline cubes is a not exist in C2 as well. Third, the dimensions that define acompletely new concept. First, while the skyline operator group-by skyline cube include not only the usual groupingand its variants have been extensively studied (e.g., [23], dimensions but also the measure attributes. For a group-by[10]), none of them has ever discussed the incorporation of skyline cube that is built on a data set with jGj groupingthe skyline operation into traditional data warehouses as a dimensions and jSj measure attributes, there would be jGj jSjpostoperation. Second, although the concept of skycubes [36], ð2 À 1Þð2 À 1Þ possible cuboids. That explosive number http://ieeexploreprojects.blogspot.commakes materialization of group-by of possible cuboids[26] exists, that is fundamentally different from our conceptof group-by skyline cube. Skycubes are designed for the skyline cubes especially challenging.efficient evaluation of skyline queries in any nonempty To the best of our knowledge, the building of group-by skyline cube, or the implementation of the skyline operationsubspaces. In other words, that does not consider any as a postoperation in data warehouses, has not beengrouping at all. In practice, however, skyline analysis is often addressed previously in the research literature or into be more meaningful if the data tuples are first grouped commercial products. This paper studies this issue in detail.based on their dimensions before finding their skylines. Let’s Our contributions can be summarized as follows:take the employee data set in Fig. 1 as an example. Withoutgrouping, finding the skyline of all employee, no matter 1. The concept of group-by skyline cube is presented.whether it is in the full space (i.e., all three measure That includes the discussion of what a “group-byattributes) or in any subspace (e.g., “Mentality” and skyline cuboid” is, the relationships between differ-“Attitude”), may not be that meaningful because it is unfair ent group-by skyline cuboids, and how theseto compare an Admin staff with a Technical staff. In contrast, cuboids constitute a group-by skyline cube.building a group-by skyline cube for that data set allows the 2. The technical details of supporting group-by skylinemanagement to find out the outstanding employee of each cube are presented. Specifically, we propose toregion and/or position. Consider the NBA player statistics materialize a group-by skyline cube as an extendedas another example. Obviously, finding the skyline, despite group-by skyline cube (ES-cube). In an ES-cube, sky-that is full-space skyline or subspace skyline, of all NBA line results across cuboids are derivable from eachplayers is not that meaningful because it is unfair to compare other. We further develop construction and queryMichael Jordon with Wilt Chamberlain, which were active in processing algorithms for ES-cube. We also developFig. 2. Group-by skyline cuboids. (a) C1 . (b) C2 . (c) C3 .
  3. 3. 494 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 TABLE 1 Summary of Symbols Fig. 3. A lattice of cuboids with attributes A1 , A2 , and A3 . The number of tuples in each cuboid is in the bracket. refers to the precomputation of a subset of cuboids (i.e., the subcube) and is generally more frequently used. The cost of processing a query using a cuboid Ci is described by a linear cost model [15], i.e., the query time is assumed to solely depend on the I/O time of scanning Ci , which is linear to the size of Ci . During query evaluation, if a cuboid Ci requested by an OLAP query has already been materialized, it can serve as the query result right away. Even if Ci has not been materialized, it can still be efficiently derived from the smallest materialized cuboid Cj in which the dimensions of Ci are a subset of those of Cj . Given a space budget, the materialization algorithm in [15] selects cuboids to be materialized based on the greedy heuristics, aiming at improving the overall query cost. Initially, it materializes the top cuboid, i.e., the cuboid with a budget-based partial materialization algorithm all dimensions, and puts it into a materialized set MT that is able to select and materialize a good subset because other cuboids can always be derived from the top of cuboids in ES-cube that yields the highest query cuboid directly. Afterward, it iteratively materializes a improvement in terms of both CPU and I/O costs. cuboid Ci 62 MT that is expected to bring the maximum 3. An extensive set of experiments has been carried out benefit to the overall query cost and adds it to MT . Based on on both real and synthetic data. Experimental results the linear cost model, the benefit of materializing a cuboid C i show that queries can be answered efficiently is defined as the improvement in I/O, with respect to the set The rest of the paper is organized as follows: Section 2 of already materialized cuboids. For example, Fig. 3 showsgives the background and the related work of this paper. the number of tuples of each cuboid in a bracket. The topSection 3 elaborates the concept of group-by skyline cube in cuboid C1 , with 90 tuples, is the first to be materialized. Next,more detail. Section 4 presents the issues related to the benefit of materializing cuboid C4 is calculated as ð90 Àimplementation of group-by skyline cube. Section 5 reports 50Þ Â 4 because answering queries related to C4 , C6 , C7 , or C8experimental results. Section 6 concludes the paper with can now exploit C4 (which requires scanning only 50 tuples)future research directions. The symbols to be used in instead of the top cuboid C1 (which requires scanningsubsequent discussion are summarized in Table 1. 90 tuples). Since the benefit of C4 is the highest among all the cuboids that have not been materialized, it is next selected to2 BACKGROUND AND RELATED WORK be materialized and added to MT . The iteration continues until the selected cuboids exceed the space budget.2.1 Data Cubes Owning to its importance to data-warehousing technol-A data cube [13] can be viewed as a collection of cuboids, in ogy, research related to data cube has continuouslywhich each cuboid stores the group-by aggregate result attracted the attentions of many researchers. Some focuswith respect to a particular set of attributes called on the efficient computation of cubes (e.g., [3], [27]), somedimensions. For a data cube with nonholistic aggregate focus on the compression and summarization of cubes (e.g.,functions, cuboids can be organized as a lattice. Fig. 3 shows [29]), and some focus on the variations of cubes (e.g., [35]).a lattice of 23 cuboids of a data set with three dimensionattributes A1 , A2 , and A3 . There exists a path from a cuboid 2.2 Skyline ProblemsCi to a cuboid Cj if the attributes of Ci contain those of Cj , 2.2.1 Skyline Computation Algorithmsmeaning that Ci can be used to derive Cj . For example, anycuboid in Fig. 3 can be derived from the top cuboid C1 . The skyline operator, introduced in [4], retrieves each tuple t To meet the performance demands imposed by OLAP from the data set D such that it is not dominated by any other 0operations, the cube materialization approach [14] which tuple t of D. Existing skyline computation methods can beprecomputes some cuboids in advance, is extensively divided into two categories: 1) index-based solutions [30],applied to speed up the evaluation of various OLAP [17], [23], and 2) nonindexed solutions [4], [8], [12], [2], [37].queries. Full materialization refers to the precomputation Early index-based solutions [30] operate on bitmap andof all cuboids (i.e., the full cube) and is impractical because Bþ -tree indices on the skyline attributes, whereas the othersthe number of cuboids is exponential to the number of [17], [23] operate on an R-tree that indexes the data set bydimensions. On the other hand, partial materialization skyline attributes. The Branch and Bound Skyline (BBS)
  4. 4. YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 495 processing using parallel execution [9] and skylining from distributed data sources [1]. In the OLTP environment, Papadias et al. [23] and Luk et al. [22] have studied the evaluation of skyline queries with group-by attributes. Papadias et al. [23] proposed an efficient R-tree-based algorithm for processing skyline queries with group-by attributes. Unfortunately, their solution relies on the availability of an R-tree (with both group-by attributes and skyline attributes) so it is not suitable for handling ad-hocFig. 4. Example of an OSP tree. queries. Luk et al. [22] focused on the relational engine and investigated how to find an efficient query plan for skylinealgorithm [23] is the state-of-the-art indexed approach for queries with group-by attributes. Their solution, however,skyline computation, and its I/O cost is proven to be does not study the materialization of skyline cuboids in theoptimal with respect to the R-tree instance. OLAP environment. The state-of-the-art skyline computation algorithm fornonindexed data is OSP [37]. It will be adopted as the 2.2.3 Subspace Skyline Analysisskyline computation method in our proposed solution (see Building data cubes using the skyline operation as theSection 4.3). Thus, we describe this method in detail “aggregate function,” to the best of our knowledge, havebelow. Like the SFS algorithm [8], the OSP algorithm first not been addressed before. Work in [24], [36], [25], [26],sorts the data set by a monotone score function F (on [31], [34] is developed for subspace skyline analysis. In theskyline attributes). F is simply the SUM function in [37]. example of Fig. 2, the cuboids in our group-by skylineThe main difference is that OSP employs a (main- cube contain both group-by attributes and skyline attri-memory) OSP tree for indexing the skyline points found butes. In contrast, the skycube [24], [36], [25] of a data set Dso far, rendering it efficient to process the remaining data is defined as the collection of skyline result set ÉðD; SÞ forpoints. Fig. 4 illustrates how OSP works on a set of points each nonempty subset S of S Ã but without any grouping(in the space defined by two skyline attributes S1 and S2 ). attributes. A compressed skycube (CSC) [34], [26] is aEach contour (in dotted line) depicts all possible locations compressed structure for storing a skycube; it still doeswith the same SUM score. The sorted points follow the not involve any grouping attributes. Those works do notorder: t2 ; t4 ; t7 ; t5 ; t1 ; t6 ; t3 . consider any grouping operation and partial materializa- When a point t is retrieved (by the sorted order), we tion, and thus they are different from our work here.check whether the OSP tree contains any point t0 that Another related work is P-cube [35]. P-cube is designed todominates t. This is implemented by a depth-first traversal facilitate the processing of selection skyline queries, i.e.,of the OSP tree branches that potentially contain some point finding the skyline from tuples that satisfy the WHEREdominating t. If there is no such point t0 , then we report t as clause in SQL. P-cube is orthogonal to us because it doesa skyline point and insert it into the OSP tree. Each node of not consider partial materialization and it is designed forthe OSP tree stores exactly one skyline point, and its selection queries that access a small portion of data. Inmaximum number of children is 2jSj À 2, where S is the set contrast, group-by skyline cube is designed for OLAP-of skyline attributes. In the example, when the first point t2 style queries that access a significant portion of data foris retrieved, it is inserted as the tree root, which decomposes report generation.the domain space into 22 ¼ 4 regions. Since t2 is a skylinepoint, its bottom-left region is guaranteed to contain no 2.2.4 Skyline Cardinality and Cost Estimationpoints. Also, its top-right region is dominated by t2 so the In this paper, we adopt a budget-based partial materializa-region cannot contain any skyline points. Those two regions tion approach for selecting group-by skyline cuboids to bedo not need to be maintained by the OSP tree. The next materialized such that they yield the highest improvementretrieved point t4 is a skyline as it is not dominated by any in query cost. Recall that each cell of a cuboid correspondspoint in the OSP tree. Since t4 falls into the top-left region of to a set of skyline tuples (with respect to particular group-t2 , it is inserted into the left subtree of t2 . Then, the next by attributes). Thus, it is important to study existing workpoint t7 is again a skyline point; it falls into the bottom-right on estimating skyline cardinality and its computation costregion of t2 so it becomes the right child of t2 . The next point [11], [7], [12], [38].t5 is also a skyline point, so it is inserted into the left subtree Godfrey et al. [11], [12] propose a mathematical model ofof t7 . The algorithm proceeds in this fashion until all the the skyline cardinality for a uniformly distributed data set,remaining points have been examined. and the asymptotic cost of several skyline algorithms in the average case and the worst case. Chaudhuri et al. [7] apply2.2.2 Variants of Skyline Queries the log sampling technique for estimating the skylineSince the introduction of the skyline query in [4], numerous cardinality and the skyline computation cost of the BNLskyline variants have been proposed, for example, skyband and SFS algorithms, with respect to arbitrary data distribu-[23], spatial skyline [28], reverse skyline [10], subspace tion. Zhang et al. [38] propose a kernel-based model for askyline [32], [31], [16], dominance-based analysis [20], more accurate estimation of the skyline cardinality ofrelaxation of the skyline conditions [6], [5], k most arbitrary data sets; however, it does not study therepresentative skyline points [21], as well as skyline computation cost of any skyline algorithm.
  5. 5. 496 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012Fig. 5. An example data set (on the left) and some of its group-by skyline cuboids (on the right).3 CONCEPT OF GROUP-BY SKYLINE CUBE i.e., 7  3 ¼ 21 cuboids, in total. The right part of Fig. 5 depicts the contents of six cuboids. For instance, cuboid AIn this section, we first introduce the concept of group-by contains three cells, indicating the distinct combinationsskyline cube whose measure is defined by a skylineoperation, and then show that some nice properties of ðc1 ; d1 Þ, ðc1 ; d2 Þ, ðc2 ; d2 Þ on the grouping attributes ðG2 ; G3 Þ,traditional data cubes are no longer valid in this context. We for tuples in the table D. For the cell ðc1 ; d1 Þ of cuboid A, thewill discuss how those invalidated properties can be re- skyline of this group are tuples t1 and t2 .enabled again by using a skyline variant definition in 3.2 Conditions for Sharing Computations of CubesSection 4. Most data cube algorithms, regardless of whether they are3.1 Definition of Group-By Skyline Cube used to compute data cubes or to answer queries, also relyGroup-by skyline cube is intended for the efficient evalua- heavily on the sharing of computations to achieve hightion of various group-by skyline queries. Let D be a relational efficiency. For example, in a sales application, if a cuboidtable instance, with the schema A ¼ ðA1 ; A2 ; . . . ; Ak Þ. For (year, department: COUNT(sales)) that stores the sales of eachsimplicity, suppose that every attribute Ai is numeric. department in every year has been computed, it can be usedGiven a set G & A of grouping attributes and a group to compute the cuboid (year : COUNT(sales)) that stores theinstance g of G, we define the set DðgÞ as the set of tuples of sales of all departments in every year, without accessing theD belonging to the group instance g. Accordingly, a group- base data. However, it is unclear whether such sharing isby skyline query is defined as follows: of group-by skyline cube. still valid in the contextDefinition 1 (Group-by skyline query). Given the sets G and In the following, we show that group-by skyline cubes can S of attributes, a group-by skyline query Q ¼ ðG; SÞ share some but not all computational effort. Specifically, for computes a skyline result set ÉðDðgÞ; SÞ for each group group-by skyline cuboids that share the same set of skyline instance g defined on G.1 dimensions S, the computational effort can be shared: Lemma 1 (Union of skyline). Given any subset D0 D, Considering the data set D in Fig. 1, the result of a group- ÉðD; SÞ ÉðD0 ; SÞ [ ÉðD À D0 ; SÞ.by skyline query Q with its grouping dimension setG ¼ {“Region,” “Position”} and its skyline dimension set Through Lemma 1 [19], we arrive at the followingS ¼ {“Mentality,” “Attitude”} is shown in Fig. 2b. An lemma that shows that one group-by skyline cuboid can beexample of a group instance g is (“Asia,” “Admin”). derived from another if they share the same set of skyline In order to evaluate various group-by skyline queries dimensions:efficiently, we introduce the concepts of group-by skyline Lemma 2 (Group-by skyline cuboid hierarchy). Given twocuboid and group-by skyline cube. group-by skyline cuboids CðG; SÞ and CðG0 ; SÞ such thatDefinition 2 (Group-by skyline cube). Let Gà be the set of all G0 G, the group-by skyline cuboid CðG0 ; SÞ can be derived possible grouping dimensions and S à be the set of all possible from CðG; SÞ. skyline dimensions. Given a subset G Gà and a subset S S à , a group-by skyline cuboid CðG; SÞ is defined as a Proof. By the definition of group-by skyline cuboid, each collection of cells, each cell g (corresponding to a group of G) cell g0 in CðG0 ; SÞ corresponds to a subset V of cells in S stores the set of skyline result set ÉðDðgÞ; SÞ. The group-by CðGÞ, i.e., Dðg0 Þ ¼ g2V DðgÞ and different DðgÞ are skyline cube is defined as the collection of CðG; SÞ, for any disjoint. By applying Lemma 1, the result of g0 can be à subset G G and nonempty S S . à derived from the results of the set V of cells in CðG; SÞ. Consider the relational table D in the left part of Fig. 5. In Thus, this lemma is proved. u tthis example, Gà ¼ fG1 ; G2 ; G3 g and S à ¼ fS1 ; S2 g. Thegroup-by skyline cube for D thus contains Fig. 6 shows the lattice of group-by skyline cubiods (i.e., group-by skyline cube) for the data set in Fig. 5. We à à ð2jG j À 1Þð2jS j À 1Þ; organize the group-by skyline cuboids in a way such that each edge represents a parent-child relationship between two group-by skyline cuboids, meaning that the result of a 1. Note that when S contains only one attribute, the skyline operationbecomes a simple aggregate function MAX or MIN. Nonetheless, our child cuboid can be derived from its parent cuboid throughdefinition includes such case for generality. Lemma 2. For instance, there is an edge from cuboid A to
  6. 6. YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 497Fig. 6. Group-by skyline cube.cuboid D, showing that cuboid D can be derived fromcuboid A. To illustrate, cell c1 of cuboid D in Fig. 5 can beobtained by finding the skylines among the cells ðc1 ; d1 Þ andðc1 ; d2 Þ of cuboid A. Unfortunately, we remark that computational efforts Fig. 7. Extended group-by skyline cube.between cuboids that share different sets of skyline dimen-sions cannot be shared unless the distinct value condition (no Definition 3 (Extended group-by skyline). The extendedtwo tuples share the same value on any subset of skyline group-by skyline of the tuple set DðgÞ with respect to theattributes) [36] holds. For instance, in Fig. 6, although cuboid skyline attribute set S is defined asC shares the same set of grouping dimensions as cuboid A Éþ ðDðgÞ; SÞ ¼ ft 2 DðgÞ j6 9t0 2 DðgÞ; t0 1þ tg; Sand the skyline dimension of cuboid C is a subset of cuboid A,C cannot be derived from A, unless the tuples have distinct where a tuple t is said to strictly dominate another tuple t0values on the skyline attributes. To illustrate, in Fig. 5, the cell with respect to the attribute set S, denoted by t 1þ t0 , if Sðc2 ; d2 Þ of cuboid C includes a tuple t6 that does not exist in 0 ð8 Ai 2 S; t½Ai Š t ½Ai ŠÞ:cuboid A. As we have mentioned, the efficiency of the state-of-the-art data cube construction/query algorithms relies The notion of extended skyline [33] was originallyheavily on the fact that many computational efforts can be proposed for skyline evaluation in P2P network. We borrowshared. However, in a group-by skyline cube, the sharing of that concept and propose to materialize a group-by skylinecomputation is limited by the distinct value condition, which cube as an extended group-by skyline cube.almost never holds in real data sets. Therefore, the group-by group-by skyline cube, ES-cube). Definition 4 (Extendedskyline cuboids in Fig. 6 are separated into three disjoint Given a set G of grouping attributes and a set S of skylineclusters in which computational efforts are not sharable attributes, we define an extended group-by skyline cuboidacross the clusters. Fortunately, there are ways to remove Cþ ðG; SÞ as a collection of cells, where each cell gsuch barriers and we will discuss them in the next section. (corresponding to a group of G) stores the set of extended skyline Éþ ðDðgÞ; SÞ. The extended group-by skyline cube4 IMPLEMENTATION is defined as the collection of Cþ ðG; SÞ, for any subset G GÃIn this section, we address the issues related to the and S S à .implementation of group-by skyline cube. Specifically, we Lemma 3 (Extended group-by skyline cuboid hierarchy).attempt to answer the following research questions: Given two extended group-by skyline cuboids Cþ ðG; SÞ and Cþ ðG0 ; S 0 Þ such that G0 G and S 0 S, the cuboid . ðIRQQ1Þ Given the poor connections between group-by Cþ ðG0 ; S 0 Þ can be derived from Cþ ðG; SÞ. skyline cuboids, how can we enable the sharing of computational efforts between cuboids? (Section 4.1) Proof. Each cell g0 in the cuboid Cþ ðG0 ; S 0 Þ corresponds to a . ðIRQ Given a materialized cuboid, how can a query Q Q2Þ subset V of cells in the cuboid Cþ ðG; SÞ, i.e., Dðg0 Þ ¼ S be answered efficiently? (Section 4.2) g2V DðgÞ and different DðgÞ are disjoint. Thus, the result . ðIRQ Given a limited storage space budget, how shall Q3Þ of g0 can be derived from the results of the set V of cells we select a subset of cuboids that offers the best in Cþ ðG; SÞ. u t improvement in query performance? (Section 4.3) . ðIRQ After selecting the cuboids, how can we construct Q4Þ Lemma 3 is powerful. It allows us to maximize the them in an efficient manner? (Section 4.4) possibility of sharing computations among the ES-cuboids. Fig. 7 shows the lattice of ES-cuboids (i.e., ES-cube) for the4.1 Materialization: Extended Group-By Skyline data set in Fig. 5. Note that, through Lemma 3, the three Cube disjoint clusters of cuboids in Fig. 6 are now connected. ForIn order to maximize the sharing of computation of group- instance, cuboids B and C cannot be derived from anyby skyline cuboids, especially the sharing of computation cuboid in the cluster of cuboids S1 S2 (the left cluster in Fig. 6)across the set of cuboids separated by the distinct value but now they can be derived from extended cuboid Aþ . Tocondition, we materialize a group-by skyline cuboid using illustrate, Fig. 8 depicts the extended group-by skylinean extended definition of skyline. We show that group-by version of the six cuboids we illustrated in Fig. 5. Note that,skyline cubes materialized in this way can enable sharing now cuboid Aþ contains tuple t6 because t5 cannot strictlyacross various cuboids. dominate t6 on every skyline attribute. Consequently, cell
  7. 7. 498 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 Fig. 9. Extended skyline of a group. (a) A set of extended skyline tuples. (b) Storing extended skyline. Fig. 9b illustrates the decomposition of Éþ in terms of É b and É, with respect to the example in Fig. 9a. The parent b skyline set of each tuple in É is also shown in the figure. ForFig. 8. Example of extended group-by skyline cuboids. instance, the parent set of t3 contains t1 because t1 dominatesðc2 ; d2 Þ of group-by skyline cuboid C can now be obtained by t3 but t1 does not strictly dominates t3 . Note that we onlycomputing the skyline of cell ðc2 ; d2 Þ of cuboid Aþ on store the parent skyline set as a set of IDs. To avoid randomdimension S2 . Similarly, group-by skyline cuboids E and F b page accesses (during query processing), each tuple t of É iscannot be derived from any group-by skyline cuboid in the stored with its parent skyline set sequentially in the disk. Bycluster of cuboids S1 S2 but now they can be derived from decomposing and storing the extended skyline this way, weextended group-by skyline cuboid Dþ . For brevity, we do apply Lemma 4 to exploit parent skyline for pruningnot show all parent-child relationships in Fig. 7, but every unqualified tuples in the child set effectively, therebyES-cuboid on the right-hand side of Fig. 7 (and every group- reducing the query CPU cost (i.e., number of dominanceby skyline cuboid) indeed can be derived from a cuboid on comparisons). http://ieeexploreprojects.blogspot.comthe left-hand side. Furthermore, every ES-cuboid (and every Lemma 4 (Parent-based pruning). Given a sub-data set D0 group-by skyline cuboid) can be derived from the top cuboid D and a skyline attribute set S. Let SQ be any skyline attributeCþ ðG1 G2 G3 ; S1 S2 Þ. set that satisfies S S. Given a tuple t 2 É, if b Q4.2 Storage and Query Processing 9 t0 2 P S ðtÞ; t0 62 ÉðD0 ; SQ Þ, then t 62 ÉðD0 ; SQ Þ.Now, we discuss how to answer a group-by skyline query Proof. It is given that t0 2 P S ðtÞ, so we have t0 1S t. ByQ ¼ ðGQ ; SQ Þ using a materialized ES-cuboid Cþ ðG; SÞ, imposing the condition SQ S on the above, we obtainwhere GQ G and SQ S. We first concentrate on the 8 Ai 2 SQ ; t0 ½Ai Š t½Ai Š.case that GQ ¼ G and SQ S. A simple method in this case It is also given that t0 62 ÉðD0 ; SQ Þ. Thus, there exists ais to project the points in each cell g 2 Cþ to the query tuple t00 2 ÉðD0 ; SQ Þ such that t00 1SQ t0 . By combiningskyline subspace SQ , and then find the skyline. Fig. 9 shows both t00 1SQ t0 and 8 Ai 2 SQ ; t0 ½Ai Š t½Ai Š, we derivea set of extended skyline points in a cell. If SQ ¼ fS1 g, this t00 1SQ t. This implies that t does not belong to the resultmethod projects all four points to the S1 and then finds the set ÉðD0 ; SQ Þ. u tskyline among those points. Note that if the skylinedimensions have low cardinalities, the performance gain Therefore, the idea of the query algorithm is as follows:brought by ES-cubes may significantly decrease because the for a cell g, we first compute the projected skyline ÉþQ on the Snumber of extended skylines increases. However, we are skyline set Éþ along the query subspace SQ . These tuples aregoing to show that, by carefully storing the extended part of the final skyline for that cell. Then, we examine the b b child set É and a tuple t 2 É is pruned when PðtÞ 62 ÉþQ .skyline points, ES-cubes can still provide certain improve- Sment when the skyline dimensions have low cardinalities. Finally, we compare the remaining tuples with the projected skyline ÉþQ on the query subspace SQ and those still remainFirst, from the example above, we can see that t3 is at most S as skyline are also added as the final skyline for that good as t1 in subspace S2 , but t3 gets dominated by t1 inthe full space and subspace S1 . In other words, if t1 is not a 4.2.1 Query Processing Algorithmskyline in the query subspace, then it is not necessary to In the above discussion, we require the cuboid Cþ ðG; SÞ andexamine t3 at all. Therefore, we decompose a set of the query QðGQ ; SQ Þ to have the same group-by attributes,extended skyline points Éþ into two disjoint subsets: b i.e., GQ ¼ G.1) the skyline set É and 2) the child set É, whereb b We now present the query processing method in Algo-É ¼ Éþ À É. Given a point t in the child set É, we define rithm 1, which handles the general case where GQ G. Itthe parent skyline set of t as follows: assumes that a particular cuboid Cþ has been chosen for P S ðtÞ ¼ ft0 2 É j ðt0 1S tÞ ^ ðt0 61þ tÞg: processing the query. The issues on choosing a cuboid and S quantifying its “goodness” will be studied in Section 4.3.
  8. 8. YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 499 as follows: let PAQ ðCþ Þ and CP Q ðCþ Þ be the I/O cost (page accesses) and CPU cost (number of comparisons) of evaluat- ing any group-by skyline query QðGQ ; SQ Þ using an ES- cuboid Cþ ðG; SÞ (where GQ G; SQ S). The benefit B of including ES-cuboid Cþ ðG; SÞ into MT is based on how much gain in terms of I/O accesses and tuple comparisons if Cþ ðG; SÞ is materialized: X maxf0; minCþ ðG0 ;S0 Þ2MT ;GQ G0 ;SQ S0 i GQ G;SQ S T IO ðPAQ ðCþ Þ À PAQ ðCþ ÞÞ þ T CP ðCP Q ðCþ Þ À CP Q ðCþ ÞÞg; i i ð3Þ where T IO denotes the time of accessing a disk page, T CP denotes the time of performing a dominance comparison, For the general case GQ G, a group instance gQ of GQ and Cþ ðG0 ; S 0 Þ is any ES-cuboid in MT that could be used to icorresponds to a subset Vc of cells of Cþ (see Line 3). At answer Q. T IO and T CP are machine-specific systemLines 4-7, we extract the skyline set Éðgc Þ of each cell gc 2 Vc parameters. They are usually stored in the system catalogas the set Tfirst , and then compute its skyline (along SQ ) as or can be derived, for example, by measuring the number ofthe candidate set Rfirst . At Lines 9-12, we examine the child page accesses and the number of dominance comparisons bset Éðgc Þ of each cell gc 2 Vc . A tuple t from a child set is that can be done in 1 second.inserted into the candidate set Tsecond if all parents of t exist Regarding the cuboid selection problem, we apply thein the set Rfirst . Finally, we merge the sets Rfirst and Tsecond greedy approach (as described in Section 2.1) for pickingtogether, and compute the skyline of the merged tuple set. cuboids to be included into the set MT . The only differenceThe specific skyline computation algorithm used at Line 13 is that, (3) is used to compute the benefit of each cuboid.will be clarified in Section 4.3. Next, we elaborate the I/O and CPU component costs of (3).4.3 Partial Materialization 4.3.1 Derivation of PAQ ðCþ ÞIn order to ensure fast online group-by skyline query We proceed to discuss the first component of benefit, i.e.,processing, it is often desirable to precompute/materialize the I/O cost PAQ ðCþ Þ of evaluating a query QðGQ ; SQ Þ by http://ieeexploreprojects.blogspot.comthe extended skyline cuboids in the ES-cube. Since it is not using an ES-cuboid Cþ ðG; SÞ (where GQ G; SQ S).a practical option to materialize all ES-cuboids, we focus We assume that the main memory is large enough to holdon partial materialization [15], which is one of the most the ES-cuboid Cþ . It suffices to read Cþ from the disk once.popular adopted methods in commercial products. Given a Both the grouping and skyline operations are performedspecific space budget, we greedily choose to materialize a solely in main memory, so they do not incur additional I/Osubset of ES-cuboids that can bring the maximum query cost. Thus, PA ðCþ Þ can be derived as the I/O cost of Qprocessing improvement. scanning Cþ once. Hence, the I/O cost PAQ ðCþ Þ depends on In traditional data cube, the selection of a cuboid for the size (i.e., the number of tuples and the number ofmaterialization is based on a linear cost model, i.e., that the þcost of evaluating a query using a cuboid C is linear to the attributes) of C but it is independent of Q.size of C, i.e., the I/O cost of scanning C once (or at most If the domain of the attributes are real values (i.e., verytwice). Recent work in skyline [7], [38] has mentioned that few tuples share duplicate values), the size of an ES-cuboidthe CPU cost contributes a significant portion of the overall is close to the size of a group-by skyline cuboid. In thatskyline query processing cost. However, does that argu- case, the size of an ES-cuboid can be estimated byment also hold in group-by skyline processing? If yes, we summing the estimated skyline size of each cell of tuplesshall really consider the CPU cost in selecting which ES- in the cuboid through the skyline cardinality estimationcuboid to be materialized. To answer the question, we have functions developed in [7], [38]. However, when thecarried out an investigation experiment. Here, we use the domain of an attribute is finite (e.g., an integer domain ofdefault parameter setting as in the experimental section and (say) [0,99k] for number of rebounds in the NBA data set),we measure the wall clock time of answering 100 random the size of an ES-cuboid would be larger than the size of avalid group-by skyline queries from a set of materialized group-by skyline cuboid. Consequently, in order to þES-cuboids. In this paper and in the experiment, we adopt compute PAQ ðC Þ, we first devise techniques to estimateOSP [37], the best skyline algorithm to date, for computing the size of an ES-cuboid, or essentially, the size of anskyline. Our findings are: “only 26.9 percent of the execution extended skyline, on a finite domain.þAfter that, we willtime is spent on the I/O cost (of reading an ES-cuboid from disk), come back to the derivation of PAQ ðC Þ.whereas skyline computation shares 55.7 percent of the time. The Estimating extended skyline size. Fig. 9a illustrates a setremaining 17.4 percent of time contributes to the grouping of tuples with two skyline dimensions S1 and S2 . Theoperation and the other overhead.” skyline set contains t1 ; t2 , which collectively form a (dotted) Therefore, it is important to capture the CPU cost in “contour” in the figure and the extended skyline setselecting a cuboid to be materialized. Now, we define the contains all tuples on the contour: t1 ; t2 ; t3 ; t4 . Suppose thatbenefit B of including an ES-cuboid Cþ ðG; SÞ into MT (which each skyline attribute has an integer domain ½0; Š, i.e., therepresents the set of ES-cuboids that should be materialized) number of possible values is þ 1. A tuple t belongs to the
  9. 9. 500 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012extended skyline if there exists no tuple t0 2 D that strictlydominates it. We now generalize the above observation to a data setwith a set S of skyline dimensions and N tuples. Let ÄðSÞ bethe space of all possible points t, satisfying t½Ai Š 2 ½0; Š;8Ai 2 S. Clearly, there are ð þ 1ÞjSj possible distinct pointsin the space domain. Given a point t 2 ÄðSÞ, we define itsprobability density as jft0 2 D j t0 ½Ai Š ¼ t½Ai Š; 8Ai 2 Sgj pðt; SÞ ¼ ð4Þ Nand its cumulative strict dominated density as Fig. 10. Estimation of the comparisons on the OSP tree. X P ðt; SÞ ¼ pðt0 ; SÞ: estimated skyline and estimate the query CPU cost. Step 1 t0 2ÄðSÞ; 8Ai 2S;t0 ½Ai Št½Ai Š can be efficiently done by projecting the dimensions of SQ from the original histogram HC . Step 2 can be implemented Each possible point has a probability of pðt; SÞ being an by the approximate skyline location estimation technique inactual tuple in the data set D, which contains N tuples. The [23]. Thus, our main focus is on Step 3.probability of t being an extended skyline tuple is N Histogram-based CPU cost estimation of OSP.ð1 À P ðt; SÞÞ . Thus, we obtain the size of an extended It remains to discuss how we utilize the histogram HC forskyline of N data tuples as estimating the CPU cost of OSP. We suppose that the X ÈES ðNÞ ¼ pðt; SÞð1 À P ðt; SÞÞN : ð5Þ readers are already familiar with how OSP works, as t2ÄðSÞ described in Section 2.2. The subsequent example follows jSj the setting of the OSP example described in Fig. 4. The above equations (containing ð þ 1Þ terms) can be We shall use the terms “point” and “OSP tree node”efficiently computed by using a numerical technique called interchangeably in the discussion below. Assume, aftermidpoint approximation [18]. estimating the skyline locations (i.e., after Steps 1 and 2 we Derivation of PAQ ðCþ Þ, revisit. Now, we can derive mentioned above), the points of a cell on a subspace SQ ¼PAQ ðCþ Þ by using (5). Recall that PAQ ðCþ Þ can be simply fS1 ; S2 g are shown in the left-hand side of Fig. 10 and those Cþ once, let ! be thederived as the I/O cost of scanning þ points form an OSP tree on the right-hand side of the figure.number of distinct groups in C and each group gi has Let t be a tree node (with the point t). We use tp tojDðgi Þj tuples, we have represent the parent of t; it is null if t is the root. The 1 þ jGj þ jSj X! bounding region t: of t is a hyperrectangle formed by its PAQ ðCþ Þ ¼ Á ÈES ðjDðgi ÞjÞ; ð6Þ lower corner t: À and upper corner t: þ , which are in turn Bitem i¼1 defined aswhereas the factor ð1 þ jGj þ jSjÞ accounts for the size of 8a tuple in C (a tuple ID takes 1 unit of storage), and Bitem 0; 1 ; if ðtp ¼ nullÞ; À þ p Àrepresents the number of items/attributes that fit in a t: ½Si Š; t: ½Si Š ¼ t : ½Si Š; t ½Si Š; if ðt½Si Š tp ½Si ŠÞ; p : pdisk page. t ½Si Š; tp : þ ½Si Š; if ðt½Si Š ! tp ½Si ŠÞ: Given a data set D and a group-by skyline query For instance, in Fig. 10, node t1 has no parent, so we haveQ ¼ ðGQ ; SQ Þ, the number of distinct groups ! and the t : ¼ ½0; 1Þ Â ½0; 1Þ. Node t is the “top-left” child of t so 1 2 1number of tuples of each group jDðgi Þj can be easily derived we have t : ¼ ½0; 0:3Þ Â ½0:4; 1Þ. 2from the data distribution using some basic textbook Let v be the next (arbitrary) point in the space to beequations. Please refer to [22] for details. visited by OSP, t be a node in the OSP tree, and ÇðtÞ be the4.3.2 Derivation of CP Q ðCþ Þ set of ancestor tree nodes of t. In OSP, we need to perform a dominance comparison between v and t iff t gets visited.We now discuss the second component of benefit (3), i.e.,the CPU cost CP Q ðC Þ (i.e., number of dominance compar- This happens when 1) the lower corner of t dominates v, i.e., þ À 1isons) of evaluating the query QðGQ ; SQ Þ by using an ES- t: V v, and0 2) none of the ancestor nodes of t dominates v,cuboid C ðG; SÞ (where GQ G; SQ S) for arbitrary data i.e., t0 2ÇðtÞ t 61 v. To capture all the possible points that þdistribution. The model is derived based on OSP [37], the cause a node t to be visited, we define a region Rt for t soalgorithm that we adopted in query processing, which is the that points in Rt satisfy conditions 1 and 2. Consequently,most efficient index-free skyline algorithm to date. Rt can be written as Rt ¼ Rt À Rt , where Rt dom anc dom is the À t Suppose that we have a multidimensional histogram HC region dominated by t: (condition 1), and Ranc is the þ 0 À 0that models the distribution of tuples of C in the space of S (union of) region dominated by t : for any ancestor t 2with values in histogram buckets are normalized to have a ÇðtÞ (condition 2). For example, for node t2 , its lower corner t2sum of 1. The CPU cost of OSP can be estimated by t2 : À dominates the region Rdom ¼ ½0; 1Þ Â ½0:4; 1Þ and its1) construct a histogram HQ (for the data distribution of Cþ parent t1 dominates the region Rt2 ¼ ½0:3; 1Þ Â ½0:4; 1Þ. ancin the subspace SQ ); 2) apply HQ to estimate the skyline Thus, any points that fall into the region Rt2 À Rt2 ¼ dom anclocations in the space SQ ; and 3) build an OSP tree with the ½0; 0:3Þ Â ½0:4; 1Þ cause node t2 to be visited.
  10. 10. YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 501 Given the histogram HQ , we use HQ ðRÞ and HQ ðv1 ; v2 Þ to Regarding the cuboid used for constructing Cþ ðG2 ; S2 Þ, adenote the probability density of a region R and the region sensible choice would be to use its smallest ancestor cuboidformed by the corners v1 and v2 , respectively. Therefore, the Cþ ðG2 G3 ; S2 Þ (having four tuples), which incurs lownumber of dominance comparisons of OSP is estimated as construction cost. Algorithm 2 is the pseudocode for constructing ES- X N CP Q ðCþ Þ ¼ ! Á Á HQ ðRt Þ cuboids efficiently. It takes the set MT of chosen cuboids as t2À ! input. Let AMT be the set of constructed cuboids, 0 1 initialized to the empty set. First, we sort the cuboids of X X ¼NÁ @HQ ðt: ; U Þ À À þ UQ 0 0 þ A HQ ðt ; t : Þ ; MT by the topological order. For each cuboid Cþ ðG0 ; S 0 Þ in t2À t0 2ÇðtÞ the sorted MT , we find its set of ancestor cuboids that have been constructed (i.e., found in AMT ). Then, we pickwhere À denotes the set of OSP tree nodes and U þ denotes UQ the smallest ancestor cuboid CÃ from , for constructingthe upper corner of the (skyline) space domain in HQ . Cþ ðG0 ; S 0 Þ. After that, we insert Cþ ðG0 ; S 0 Þ into AMT for Note that the above equation is a pessimistic estimate of subsequent use.the CPU cost because it does not capture the pruning effect Algorithm 2. Cuboid Construction Algorithmof Lemma 4. Nevertheless, this cost model is adequate toprovide good estimates—our experiments show that thiscost model yields query improvement close to that offeredby an optimal cost model.4.4 ConstructionIn this section, we develop an efficient method forconstructing the cuboids of MT , which denotes the set ofchosen ES-cuboids according to Section 4.3. For the runningexample here, suppose that we use the data set shown inFig. 5 and the set MT contains the following ES-cuboids(see their tuples in Fig. 8): 5 EXPERIMENTS . Aþ : Cþ ðG2 G3 ; S1 S2 Þ In this section, we experimentally evaluate the efficiency and . C þ : Cþ ðG2 G3 ; S2 Þ effectiveness of the proposed technique (we name it as ES). . Dþ : Cþ ðG2 ; S1 S2 Þ The objective is to evaluate the query I/O cost (i.e., disk page þ þ . F : C ðG2 ; S2 Þ accesses), CPU cost (i.e., dominance comparisons), and cube construction time (if applicable) of using the proposed A simple method for constructing a cuboid C þ ðG; SÞ 2 technique with respect to a workload of 100 randomlyMT is to take the data set D as input, apply the group-by generated group-by skyline queries, each of which has aoperator on D with the group-by attribute set G, and then random number of 1-5 group-by attributes and 1-5 skylineapply the skyline operator (e.g., OSP) as the “aggregation attributes. All the experiments were conducted on an Intelfunction” on each group with the skyline attribute set S. Core 2 Quad 2.83 GHz PC with 4 GB of memory. All theHowever, this method is slow as it does not exploit the algorithms were implemented using C++. The page size is setsharing of computations among cuboids. to 4 K Bytes. We use the term “budget-to-data ratio” to We then discuss how to construct these ES-cuboids in an represent the ratio of the data cube budget space (in diskefficient manner. Let AMT be the set of constructed pages) to the data set storage size (in disk pages). The defaultcuboids, which is empty in the beginning. According to budget-to-data ratio (for data cube construction) is set to 10:1.Lemma 3, a cuboid Cþ ðG0 ; S 0 Þ can be derived from anancestor cuboid Cþ ðG; SÞ, i.e., G0 G and S 0 S (provided 5.1 Description of Data Sets and Methodsthat such a cuboid Cþ ðG; SÞ has been constructed, i.e., it We have carried out experiments on synthetic data setsexists in AMT ). and three real data sets (NBA, Baseball (BB), and Forest In order to utilize this sharing effect, we propose to sort Cover (FC)).the (chosen) ES-cuboids of MT in the topological order oftheir attributes, such that a cuboid appears after all its . NBA. It stores the NBA players’ statistics2 from 1946(possible) ancestors. For instance, the ES-cuboids are sorted to now. It contains 20,788 tuples, where each tuple þ þ þin the order: C ðG2 G3 ; S1 S2 Þ, C ðG2 G3 ; S2 Þ, C ðG2 ; S1 S2 Þ, stores the statistics of a player in a season. We identify attributes “Year” and “Team” as groupingand Cþ ðG2 ; S2 Þ. þ dimensions and attributes “Games played,” Continuing with the example, we first build C ðG2 G3 ; “Points,” “Rebounds,” “Assists,” “Steals,” “Blocks,”S1 S2 Þ and insert it into AMT . For the next two cuboids þ þ “Free throws,” and “Three-point shots” as theC ðG2 G3 ; S2 Þ and C ðG2 ; S1 S2 Þ, we build them from their þ measures (i.e., skyline dimensions).ancestors C ðG2 G3 ; S1 S2 Þ (according to Lemma 3). These . Baseball. It stores the baseball batters’ statistics3 fromtwo cuboids are then inserted into AMT . Finally, we 1871 to now. It contains 88,686 tuples, where eachattempt to construct the cuboid Cþ ðG2 ; S2 Þ and find that it tuple stores the statistics of a batter in a season. Wehas multiple ancestors: Cþ ðG2 G3 ; S1 S2 Þ, Cþ ðG2 G3 ; S2 Þ,Cþ ðG2 ; S1 S2 Þ. These three cuboids contain five, four, and 2. tuples, respectively, as shown in Fig. 8. 3.
  11. 11. 502 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012Fig. 11. Performance versus budget-to-data ratio, on real data sets. (a) NBA. (b) BB. (c) FC. identify attributes “Year,” “Team,” and “League” as select the cuboids. The cuboid selection algorithm grouping dimensions and attributes “Games,” “At strictly follows the linear cost model. As a result, it bats,” “Runs,” “Hits,” “Doubles,” and “Runs batted considers only the I/O costs by setting T CP of (3) to order to find skyline, each cell in” as the measures (i.e., skyline dimensions). zero. Also, in . Forest Cover. It is obtained from UCI Machine materializes the set of tuples belonging to that cell Learning Repository.4 It stores 581,012 tuples, each (only the projected values of the selected dimensions representing a 30 Â 30 forest land cell. We identify are stored), instead of an aggregated value. attributes “Soil type A,” “Soil type B,” “Slope,” and 5.2 Experimental Results on Real Data “Cover type” as grouping dimensions and attributes “Roadway distance,” “Vertical hydrology,” “Hori- 5.2.1 Comparison with Baseline Approaches zontal hydrology,” “Hill-shade at am,” “Hill-shade at We first compare the performance of ES with the baseline noon,” and “Hill-shade at pm” as skyline dimensions. methods REL and PROJ on real data sets. Fig. 11 shows the . Synthetic data sets. Our synthetic data sets are query I/O cost, CPU cost, and construction time of the generated similar to the benchmark setting de- methods on the real data sets under different budget-to-data scribed in [7]. Unless stated otherwise, each data ratio. Observe that the query cost of REL remains constant as set contains N ¼ 1;000;000 tuples, with a total of it does not materialize any cuboids. The query I/O costs 14 attributes (a1 , a2 ; . . . ; a14 ): five attributes a1 . . . a5 offered by PROJ and ES gradually reduce when the budget- are for grouping, and nine attributes a6 . . . a14 are for to-data ratio increases. This is consistent with the behavior of skyline. Specifically, a6 . . . a8 are generated indepen- the greedy materialization on traditional data cubes [15]—the dently, a9 . . . a11 are correlated, and a12 . . . a14 are early selected cuboids provide more significant query cost anticorrelated. By default, the domain size
  12. 12. of each group-by attribute is 5 and the domain size of each savings. Given the same budget, ES is able to materialize skyline attribute is 1,000. more cuboids than PROJ; thus, ES offers better query I/O cost In our study, we mainly compare our approach (ES-cube) saving than PROJ. In addition to the reduction of I/O, ES can dramatically reduce the CPU cost when compared with PROJwith two baseline approaches: and REL, even for cases with tight space budget. Note that . Relational (REL). This method evaluates a query PROJ does not reduce any comparisons because the cuboid using a relational engine that supports group-by materialized by PROJ contains all the tuples (but with fewer skyline queries [22]. It does not exploit any materi- attributes). The construction time rises slowly when the alized cuboids. budget-to-data ratio increases. . Projection data cube (PROJ). This method does not distinguish between grouping dimensions and sky- 5.2.2 Comparison with Other Cost Models line dimensions and follows the method in [15] to In order to evaluate the proposed cost model, we compare 4. the performance of ES with two alternate cost models:
  13. 13. YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 503Fig. 12. Comparison of cost models, on real data sets. (a) NBA. (b) BB. (c) FC. . ES-cube with optimal cost model (OPT). This of an ES-cuboid becomes small. Thus, more ES-cuboids method is a variant of ES. It differs from ES in the could be constructed from the fixed space budget. During following: 1) instead of estimating the cuboid size, it query processing, small ES-cuboids are accessed for really materializes every cuboid ahead in order to answering queries, so ES has small query I/O cost. Since know their actual sizes, and 2) it utilizes the actual the size of a PROJ-cuboid is independent of
  14. 14. , it has values obtained from the materialized cuboids to constant construction time and query I/O cost. At a small
  15. 15. construct an optimal model for cost estimation. Note value, the number of tuples per group in a PROJ-cuboid is that this method is impractical due to its very high high, causing PROJ to have high query CPU time. However, construction time and tremendous storage space for storing all cuboids. It merely serves as an indicator of the lowest query cost that can be achieved with a perfect cost model. . ES-cube with random strategy (RAN). This method is a variant of ES, except that each time a random cuboid is chosen to be materialized. Fig. 12 shows the query I/O cost, CPU cost, andconstruction time of these methods on real data sets, as afunction of the budget-to-data ratio. We can see that thequery performance offered by ES is much better than RANand it stays very close to OPT. Furthermore, the construc-tion time of ES is significantly lower than that of OPT.Because of that, we will simply ignore RAN and OPT in theremaining experiments.5.3 Experimental Results on Synthetic DataIn the following, we study the effects of various parameterson the performance of the methods.5.3.1 Varying the Group-By Attribute Domain Size
  16. 16. Fig. 13a investigates the impact of the group-by attributedomain size
  17. 17. on the construction time and query costs ofthe three methods. Each square-bracketed number (alongthe x-axis) represents the average number of measured Fig. 13. Effect of attribute domain size. (a) Versus group-by domain
  18. 18. .groups per query in the workload. When
  19. 19. is small, the size (b) Versus skyline domain size .
  20. 20. 504 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012Fig. 14. Effect of the data size. Fig. 15. Effect of the number of query attributes. (a) Versus group-bythe CPU cost trend of ES is slightly different from that of attributes. (b) Versus skyline attributes.PROJ. Even at a small
  21. 21. value, the number of tuples pergroup in an ES-cuboid remains small, and thus the CPU skyline is Oððln nÞdÀ1 =ðd À 1Þ!Þ (on a uniform data set with ncost of ES is not high. tuples and d attributes) [11], it explains the increasing trend of the query costs of all methods.5.3.2 Varying the Skyline Attribute Domain Size We then study the effect of the skyline attribute domain size on the construction time and query cost of all methods 6 CONCLUSIONS AND FUTURE WORK(see Fig. 13b). When is small, a tuple t has high probability This paper studies the support of group-by skyline queriesof belonging to the extended skyline of its group. As a in data warehouses by proposing the concept of group-byresult, the size of each materialized cuboid is large, so ES skyline cube. To implement this concept, we have devel-materializes few cuboids within the given budget. There- oped techniques for addressing the following importantfore, ES incurs more query I/O cost at a low value. On the issues: http://ieeexploreprojects.blogspot.comother hand, all methods have lower CPU cost at a low value, as it becomes easier to obtain a tuple with low values 1. the concept of ES-cuboids for re-enabling the sharingin skyline attributes, which helps pruning other unqualified of computational efforts between cuboids,tuples effectively. 2. an algorithm for answering a query from a materi- alized cuboid efficiently,5.3.3 Varying the Data Size N 3. cost models for facilitating the selection of cuboids toFig. 14 shows the construction time and query costs of REL, be materialized for the best query performancePROJ, and ES with respect to the data size. When N rises, improvement, andthe extended skyline size of a group increases sublinearly 4. an efficient algorithm for constructing cuboids.with N. Thus, the query cost of ES grows sublinearly. The Experimental results show that the proposal techniquesperformance gap between ES and the baseline methods significantly reduce the query costs in terms of both CPUwidens when N increases. time and I/O time.5.3.4 Varying the Number of Query Attributes Our future work includes incremental maintenance of skyline cube. We also plan to study the integration of otherOur default workload contains queries with random 1-5 useful decision-making operators such as top-k into datagroup-by attributes and 1-5 skyline attributes. We now warehouses as “aggregate functions.” Furthermore,study the effect of group-by and skyline attributes on thequery cost. ES outperforms REL and PROJ in all cases. throughout this paper, we assume that smaller values areFig. 15a shows the query cost when we control the preferable over larger ones, for skyline attributes. We plan tonumber of group-by attributes in queries. Observe that ES remove this limitation by defining the concept of generic ES-saves significant I/O cost and CPU cost over REL and cuboid, which consists of the union of ES-cuboids obtainedPROJ in all cases. When a query has a higher number of from different combinations of preferences on skylinegroup-by attributes, ES answers it with a larger cuboid, attributes. Such a generic ES-cuboid can then be used tothus incurring higher I/O cost. However, when there are answer a query irrespective of its preferences on attributes.more group-by attributes, the number of groups increases An important issue would be to study an efficient methodand thus there are fewer tuples (and also skyline tuples) for estimating the size of a generic ES-cuboid, withoutper group, so the CPU cost decreases. having to enumerate all combinations of preferences. Fig. 15b shows the query cost when we control thenumber of skyline attributes in queries. ES also outperformsREL and PROJ in all cases. When a query contains more ACKNOWLEDGMENTSskyline attributes, ES answers it with a larger cuboid, thus The research was supported by grant PolyU 525009E fromincurring higher I/O cost and CPU cost. Since the size of Hong Kong RGC.
  22. 22. YIU ET AL.: MEASURING THE SKY: ON COMPUTING DATA CUBES VIA SKYLINING THE MEASURES 505REFERENCES [30] K.-L. Tan, P.-K. Eng, and B.C. Ooi, “Efficient Progressive Skyline Computation,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2001.[1] ¨ W.T. Balke, U. Guntzer, and J.X. Zheng, “Efficient Distributed [31] Y. Tao et al., “Efficient Skyline and Top-k Retrieval in Subspaces,” Skylining for Web Information Systems,” Proc. Int’l Conf. Extend- IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, pp. 1072-1088, ing Database Technology (EDBT), 2004. Aug. 2007.[2] I. Bartolini, P. Ciaccia, and M. Patella, “Efficient Sort-Based [32] Y. Tao, X. Xiao, and J. Pei, “SUBSKY: Efficient Computation of Skyline Evaluation,” ACM Trans. Database Systems, vol. 33, no. 4, Skylines in Subspaces,” Proc. Int’l Conf. Data Eng. (ICDE), 2006. 2008. [33] A. Vlachou et al., “SKYPEER: Efficient Subspace Skyline Compu-[3] K.S. Beyer et al., “Bottom-Up Computation of Sparse and Iceberg tation over Distributed Data,” Proc. Int’l Conf. Data Eng. (ICDE), CUBEs,” Proc. ACM SIGMOD Int’l Conf. Management of Data, 1999. pp. 416-425, 2007.[4] ¨ ¨ S. Borzsonyi et al., “The Skyline Operator,” Proc. Int’l Conf. Data [34] T. Xia et al., “Refreshing the Sky: The Compressed Skycube with Eng. (ICDE), 2001. Efficient Support for Frequent Updates,” Proc. ACM SIGMOD Int’l[5] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang, Conf. Management of Data, pp. 491-502, 2006. “Finding k-Dominant Skylines in High Dimensional Space,” Proc. [35] D. Xin et al., “P-Cube: Answering Preference Queries in Multi- ACM SIGMOD Int’l Conf. Management of Data, 2006. Dimensional Space,” Proc. Int’l Conf. Data Eng. (ICDE), 2008.[6] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang, “On [36] Y. Yuan et al., “Efficient Computation of the Skyline Cube,” Proc. High Dimensional Skylines,” Proc. Int’l Conf. Extending Database Int’l Conf. Very Large Data Bases (VLDB), 2005. Technology (EDBT), 2006. [37] S. Zhang et al., “Scalable Skyline Computation Using Object-Based[7] S. Chaudhuri et al., “Robust Cardinality and Cost Estimation for Space Partitioning,” Proc. ACM SIGMOD Int’l Conf. Management of Skyline Operator,” Proc. Int’l Conf. Data Eng. (ICDE), 2006. Data, 2009.[8] J. Chomicki et al., “Skyline with Presorting,” Proc. Int’l Conf. Data [38] Z. Zhang et al., “Kernel-Based Skyline Cardinality Estimation,” Eng. (ICDE), 2003. Proc. ACM SIGMOD Int’l Conf. Management of Data, 2009.[9] B. Cui, H. Lu, Q. Xu, L. Chen, Y. Dai, and Y. Zhou, “Parallel Distributed Processing of Constrained Skyline Queries by Filter- Man Lung Yiu received the bachelor’s degree in ing,” Proc. IEEE Int’l Conf. Data Eng. (ICDE), 2008. computer engineering and the PhD degree in[10] E. Dellis et al., “Efficient Computation of Reverse Skyline computer science from the University of Hong Queries,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2007. Kong in 2002 and 2006, respectively. Prior to his[11] P. Godfrey, “Skyline Cardinality for Relational Processing,” Proc. current post, he worked at Aalborg University for Int’l Symp. Foundations of Information and Knowledge Systems three years starting in the Fall of 2006. He is now (FoIKS), 2004. an assistant professor in the Department of[12] P. Godfrey, R. Shipley, and J. Gryz, “Algorithms and Analysis for Computing, Hong Kong Polytechnic University. Maximal Vector Computation,” VLDB J., vol. 16, no. 1, pp. 5-28, His research focuses on the management of 2007. complex data, in particular query processing[13] J. Gray et al., “Data Cube: A Relational Aggregation Operator topics on spatiotemporal data and multidimensional data. Generalizing Group-By, Cross-Tab, and Sub-Total,” Proc. Int’l Conf. Data Eng. (ICDE), 1996.[14] A. Gupta and I.S. Mumick, Materialized Views: Techniques, Eric Lo received the PhD degree in 2007 from Implementations, and Applications. MIT Press, 1999.[15] inZurich. He is currently an assistant profes- V. Harinarayan et al., “Implementing Data Cubes Efficiently,” ETH sor the Department of Computing, Hong Kong Proc. ACM SIGMOD Int’l Conf. Management of Data, 1996. Polytechnic University. His research interests[16] W. Jin, A.K.H. Tung, M. Ester, and J. Han, “On Efficient include databases and software engineering. Processing of Subspace Skyline Queries on High Dimensional Data,” Proc. Int’l Conf. Scientific and Statistical Database Management (SSDBM), 2007.[17] D. Kossmann, F. Ramsak, and S. Rost, “Shooting Stars in the Sky: An Online Algorithm for Skyline Queries,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2002.[18] A.R. Krommer and C.W. Ueberhuber, Numerical Integration on Advanced Computer Systems. Springer, 1994. Duncan Yung received the bachelor’s degree in computer science in 2009 from the University of[19] H.T. Kung et al., “On Finding the Maxima of a Set of Vectors,” Hong Kong, Hong Kong. He is currently working J. ACM, vol. 22, no. 4, pp. 469-476, 1975. toward the MPhil degree at the Department of[20] C. Li, B.C. Ooi, A. Tung, and S. Wang, “DADA: A Data Cube for Computing, Hong Kong Polytechnic University, Dominant Relationship Analysis,” Proc. ACM SIGMOD Int’l Conf. Hong Kong, under the supervision of Dr. Eric Lo. Management of Data, 2006. His research focuses on different kinds of spatial[21] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang, “Selecting Stars: The k data management and authentication. Most Representative Skyline Operator,” Proc. Int’l Conf. Data Eng. (ICDE), 2007.[22] M.-H. Luk et al., “Group-By Skyline Query Processing in Relational Engines,” Proc. ACM Conf. Information and Knowledge Management (CIKM), 2009. . For more information on this or any other computing topic,[23] D. Papadias et al., “Progressive Skyline Computation in Database please visit our Digital Library at Systems,” ACM Trans. Database Systems, vol. 30, no. 1, pp. 41-82, 2005.[24] J. Pei et al., “Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2005.[25] J. Pei et al., “Towards Multidimensional Subspace Skyline Analysis,” ACM Trans. Database Systems, vol. 31, no. 4, pp. 1335- 1381, 2006.[26] J. Pei et al., “Computing Compressed Multidimensional Skyline Cubes Efficiently,” Proc. Int’l Conf. Data Eng. (ICDE), 2007.[27] K.A. Ross et al., “Fast Computation of Sparse Datacubes,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 1997.[28] M. Sharifzadeh and C. Shahabi, “The Spatial Skyline Queries,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2006.[29] Y. Sismanis et al., “Dwarf: Shrinking the PetaCube,” Proc. ACM SIGMOD Int’l Conf. Management of Data, 2002.