IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 4, APRIL 2012
Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Carlos Ordonez and Zhibo Chen

Abstract—Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets, where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.

Index Terms—Aggregation, data preparation, pivoting, SQL.

1 INTRODUCTION

In a relational database, especially one with normalized tables, a significant effort is required to prepare a summary data set [16] that can be used as input for a data mining or statistical algorithm [17], [15]. Most algorithms require as input a data set with a horizontal layout, with several records and one variable or dimension per column. That is the case with models like clustering, classification, regression, and PCA; consult [10], [15]. Each research discipline uses different terminology to describe the data set. In data mining the common terms are point-dimension. Statistics literature generally uses observation-variable. Machine learning research uses instance-feature. This paper introduces a new class of aggregate functions that can be used to build data sets in a horizontal layout (denormalized with aggregations), automating SQL query writing and extending SQL capabilities. We show evaluating horizontal aggregations is a challenging and interesting problem and we introduce alternative methods and optimizations for their efficient evaluation.

1.1 Motivation

As mentioned above, building a suitable data set for data mining purposes is a time-consuming task. This task generally requires writing long SQL statements or customizing SQL code if it is automatically generated by some tool. There are two main ingredients in such SQL code: joins and aggregations [16]; we focus on the second one. The most widely known aggregation is the sum of a column over groups of rows. Some other aggregations return the average, maximum, minimum, or row count over groups of rows. There exist many aggregation functions and operators in SQL. Unfortunately, all these aggregations have limitations to build data sets for data mining purposes. The main reason is that, in general, data sets that are stored in a relational database (or a data warehouse) come from On-Line Transaction Processing (OLTP) systems where database schemas are highly normalized. But data mining, statistical, or machine learning algorithms generally require aggregated data in summarized form. Based on currently available functions and clauses in SQL, a significant effort is required to compute aggregations when they are desired in a cross-tabular (horizontal) form, suitable to be used by a data mining algorithm. Such effort is due to the amount and complexity of SQL code that needs to be written, optimized, and tested. There are further practical reasons to return aggregation results in a horizontal (cross-tabular) layout. Standard aggregations are hard to interpret when there are many result rows, especially when grouping attributes have high cardinalities. To perform analysis of exported tables in spreadsheets it may be more convenient to have aggregations on the same group in one row (e.g., to produce graphs or to compare data sets with repetitive information). OLAP tools generate SQL code to transpose results (sometimes called PIVOT [5]). Transposition can be more efficient if there are mechanisms combining aggregation and transposition together.

With such limitations in mind, we propose a new class of aggregate functions that aggregate numeric expressions and transpose results to produce a data set with a horizontal layout. Functions belonging to this class are called horizontal aggregations. Horizontal aggregations represent an extended form of traditional SQL aggregations, which return a set of values in a horizontal layout (somewhat similar to a multidimensional vector), instead of a single value per row. This paper explains how to evaluate and optimize horizontal aggregations generating standard SQL code.

The authors are with the Department of Computer Science, University of Houston, Houston, TX 77204. E-mail: {ordonez, zchen6}@cs.uh.edu.
Manuscript received 17 Nov. 2009; revised 2 June 2010; accepted 16 Aug. 2010; published online 23 Dec. 2010. Recommended for acceptance by T. Grust. Digital Object Identifier no. 10.1109/TKDE.2011.16.
Fig. 1. Example of F, FV, and FH.

1.2 Advantages

Our proposed horizontal aggregations provide several unique features and advantages. First, they represent a template to generate SQL code from a data mining tool. Such SQL code automates writing SQL queries, optimizing them, and testing them for correctness. This SQL code reduces manual work in the data preparation phase in a data mining project. Second, since SQL code is automatically generated it is likely to be more efficient than SQL code written by an end user, for instance, a person who does not know SQL well or someone who is not familiar with the database schema (e.g., a data mining practitioner). Therefore, data sets can be created in less time. Third, the data set can be created entirely inside the DBMS. In modern database environments, it is common to export denormalized data sets to be further cleaned and transformed outside a DBMS in external tools (e.g., statistical packages). Unfortunately, exporting large tables outside a DBMS is slow, creates inconsistent copies of the same data, and compromises database security. Therefore, we provide a more efficient, better integrated, and more secure solution compared to external data mining tools. Horizontal aggregations just require a small syntax extension to aggregate functions called in a SELECT statement. Alternatively, horizontal aggregations can be used to generate SQL code from a data mining tool to build data sets for data mining analysis.

1.3 Paper Organization

This paper is organized as follows: Section 2 introduces definitions and examples. Section 3 introduces horizontal aggregations. We propose three methods to evaluate horizontal aggregations using existing SQL constructs, we prove the three methods produce the same result, and we analyze time complexity and I/O cost. Section 4 presents experiments comparing evaluation methods, evaluating the impact of optimizations, assessing scalability, and understanding I/O cost with large tables. Related work is discussed in Section 5, comparing our proposal with existing approaches and positioning our work within data preparation and OLAP query evaluation. Section 6 gives conclusions and directions for future work.

2 DEFINITIONS

This section defines the table that will be used to explain SQL queries throughout this work. In order to present definitions and concepts in an intuitive manner, we present our definitions in OLAP terms. Let F be a table having a simple primary key K represented by an integer, p discrete attributes, and one numeric attribute: F(K, D1, ..., Dp, A). Our definitions can be easily generalized to multiple numeric attributes. In OLAP terms, F is a fact table with one column used as primary key, p dimensions, and one measure column passed to standard SQL aggregations. That is, table F will be manipulated as a cube with p dimensions [9]. Subsets of dimension columns are used to group rows to aggregate the measure column. F is assumed to have a star schema to simplify exposition. Column K will not be used to compute aggregations. Dimension lookup tables will be based on simple foreign keys. That is, one dimension column Dj will be a foreign key linked to a lookup table that has Dj as primary key. Input table F size is called N (not to be confused with n, the size of the answer set). That is, |F| = N. Table F represents a temporary table or a view based on a "star join" query on several tables.

We now explain tables FV (vertical) and FH (horizontal) that are used throughout the paper. Consider a standard SQL aggregation (e.g., sum()) with the GROUP BY clause, which returns results in a vertical layout. Assume there are j + k GROUP BY columns and the aggregated attribute is A. The results are stored in table FV, having the j + k columns making up the primary key and A as a nonkey attribute. Table FV has a vertical layout. The goal of a horizontal aggregation is to transform FV into a table FH with a horizontal layout having n rows and j + d columns, where each of the d columns represents a unique combination of the k grouping columns. Table FV may be more efficient than FH to handle sparse matrices (having many zeroes), but some DBMSs like SQL Server [2] can handle sparse columns in a horizontal layout. The n rows represent records for analysis and the d columns represent dimensions or features for analysis. Therefore, n is data set size and d is dimensionality. In other words, each aggregated column represents a numeric variable as defined in statistics research or a numeric feature as typically defined in machine learning research.

2.1 Examples

Fig. 1 gives an example showing the input table F, a traditional vertical sum() aggregation stored in FV, and a horizontal aggregation stored in FH. The basic SQL aggregation query is:

SELECT D1, D2, sum(A)
FROM F
GROUP BY D1, D2
ORDER BY D1, D2;
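To make the FV-versus-FH contrast concrete, the following sketch builds a small F in SQLite and computes both layouts. The row values are made up for illustration (they are not the exact rows of Fig. 1), and the CASE expressions used to pivot are a preview of the evaluation method developed in Section 3.4.2:

```python
import sqlite3

# Toy instance of F (values illustrative, not the exact rows of Fig. 1),
# with D1 in {1,2,3}, D2 in {'X','Y'}, and the combination (3,'Y') absent.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE F (K INTEGER PRIMARY KEY, D1 INT, D2 TEXT, A INT);
INSERT INTO F VALUES
  (1,1,'X',NULL),(2,1,'Y',2),(3,1,'Y',3),
  (4,2,'X',4),(5,2,'Y',5),(6,3,'X',6);
""")

# FV: the vertical aggregation, one row per (D1, D2) group.
fv = con.execute(
    "SELECT D1, D2, sum(A) FROM F GROUP BY D1, D2 ORDER BY D1, D2").fetchall()

# FH: the same aggregation transposed by D2, one column per distinct value
# (computed here with CASE expressions, previewing Section 3.4.2).
fh = con.execute("""
  SELECT D1,
         sum(CASE WHEN D2 = 'X' THEN A END) AS D2_X,
         sum(CASE WHEN D2 = 'Y' THEN A END) AS D2_Y
  FROM F GROUP BY D1 ORDER BY D1
""").fetchall()
print(fv)  # 5 rows; sum is NULL for group (1,'X') since its only A is NULL
print(fh)  # 3 rows; the D2_Y cell for D1 = 3 is NULL, introduced by pivoting
```

Note how both sources of nulls discussed above appear: the NULL in A propagates from F into FV and FH, while the NULL for the missing (3,'Y') combination is introduced by the horizontal layout itself.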
Notice table FV has only five rows because D1 = 3 and D2 = Y do not appear together. Also, the first row in FV has null in A, following SQL evaluation semantics. On the other hand, table FH has three rows and two (d = 2) nonkey columns, effectively storing six aggregated values. In FH it is necessary to populate the last row with null. Therefore, nulls may come from F or may be introduced by the horizontal layout.

We now give other examples with a store (retail) database that requires data mining analysis. To give examples of F, we will use a table transactionLine that represents the transaction table from a store. Table transactionLine has dimensions grouped in three taxonomies (product hierarchy, location, time), used to group rows, and three measures, represented by itemQty, costAmt, and salesAmt, to pass as arguments to aggregate functions.

We want to compute queries like "summarize sales for each store by each day of the week" or "compute the total number of items sold by department for each store." These queries can be answered with standard SQL, but additional code needs to be written or generated to return results in tabular (horizontal) form. Consider the following two queries:

SELECT storeId, dayofweekNo, sum(salesAmt)
FROM transactionLine
GROUP BY storeId, dayofweekNo
ORDER BY storeId, dayofweekNo;

SELECT storeId, deptId, sum(itemQty)
FROM transactionLine
GROUP BY storeId, deptId
ORDER BY storeId, deptId;

Assume there are 200 stores, 30 store departments, and stores are open 7 days a week. The first query returns 1,400 rows, which may be time consuming to compare with each other for each day of the week to get trends. The second query returns 6,000 rows, which in a similar manner makes it difficult to compare store performance across departments. Even further, if we want to build a data mining model by store (e.g., clustering, regression), most algorithms require store id as primary key and the remaining aggregated columns as nonkey columns. That is, data mining algorithms expect a horizontal layout. In addition, a horizontal layout is generally more I/O efficient than a vertical layout for analysis. Notice these queries have ORDER BY clauses to make output easier to understand, but such order is irrelevant for data mining algorithms. In general, we omit ORDER BY clauses.

2.2 Typical Data Mining Problems

Let us consider data mining problems that may be solved by typical data mining or statistical algorithms, which assume each nonkey column represents a dimension, variable (statistics), or feature (machine learning). Stores can be clustered based on sales for each day of the week. On the other hand, we can predict sales per store department based on the sales in other departments using decision trees or regression. PCA analysis on department sales can reveal which departments tend to sell together. We can find out potential correlation of number of employees by gender within each department. Most data mining algorithms (e.g., clustering, decision trees, regression, correlation analysis) require result tables from these queries to be transformed into a horizontal layout. We must mention there exist data mining algorithms that can directly analyze data sets having a vertical layout (e.g., in transaction format) [14], but they require reprogramming the algorithm to have a better I/O pattern and they are efficient only when there are many zero values (i.e., sparse matrices).

3 HORIZONTAL AGGREGATIONS

We introduce a new class of aggregations that have similar behavior to SQL standard aggregations, but which produce tables with a horizontal layout. In contrast, we call standard SQL aggregations vertical aggregations since they produce tables with a vertical layout. Horizontal aggregations just require a small syntax extension to aggregate functions called in a SELECT statement. Alternatively, horizontal aggregations can be used to generate SQL code from a data mining tool to build data sets for data mining analysis. We start by explaining how to automatically generate SQL code.

3.1 SQL Code Generation

Our main goal is to define a template to generate SQL code combining aggregation and transposition (pivoting). A second goal is to extend the SELECT statement with a clause that combines transposition with aggregation. Consider the following GROUP BY query in standard SQL that takes a subset L1, ..., Lm from D1, ..., Dp:

SELECT L1, ..., Lm, sum(A)
FROM F
GROUP BY L1, ..., Lm;

This aggregation query will produce a wide table with m + 1 columns (automatically determined), with one group for each unique combination of values L1, ..., Lm and one aggregated value per group (sum(A) in this case). In order to evaluate this query the query optimizer takes three input parameters: 1) the input table F, 2) the list of grouping columns L1, ..., Lm, 3) the column to aggregate (A). The basic goal of a horizontal aggregation is to transpose (pivot) the aggregated column A by a column subset of L1, ..., Lm; for simplicity assume such subset is R1, ..., Rk, where k < m. In other words, we partition the GROUP BY list into two sublists: one list to produce each group (j columns L1, ..., Lj) and another list (k columns R1, ..., Rk) to transpose aggregated values, where {L1, ..., Lj} ∩ {R1, ..., Rk} = ∅. Each distinct combination of {R1, ..., Rk} will automatically produce an output column. In particular, if k = 1 then there are |π_R1(F)| columns (i.e., each distinct value in R1 becomes a column storing one aggregation). Therefore, in a horizontal aggregation there are four input parameters to generate SQL code:

1. the input table F,
2. the list of GROUP BY columns L1, ..., Lj,
3. the column to aggregate (A),
4. the list of transposing columns R1, ..., Rk.

Horizontal aggregations preserve evaluation semantics of standard (vertical) SQL aggregations. The main difference will be returning a table with a horizontal layout, possibly having extra nulls. The SQL code generation aspect is
explained in technical detail in Section 3.4. Our definition allows a straightforward generalization to transpose multiple aggregated columns, each one with a different list of transposing columns.

3.2 Proposed Syntax in Extended SQL

We now turn our attention to a small syntax extension to the SELECT statement, which allows understanding our proposal in an intuitive manner. We must point out the proposed extension represents nonstandard SQL because the columns in the output table are not known when the query is parsed. We assume F does not change while a horizontal aggregation is evaluated because new values may create new result columns. Conceptually, we extend standard SQL aggregate functions with a "transposing" BY clause followed by a list of columns (i.e., R1, ..., Rk), to produce a horizontal set of numbers instead of one number. Our proposed syntax is as follows:

SELECT L1, ..., Lj, H(A BY R1, ..., Rk)
FROM F
GROUP BY L1, ..., Lj;

We believe the subgroup columns R1, ..., Rk should be a parameter associated to the aggregation itself. That is why they appear inside the parentheses as arguments, but alternative syntax definitions are feasible. In the context of our work, H() represents some SQL aggregation (e.g., sum(), count(), min(), max(), avg()). The function H() must have at least one argument represented by A, followed by a list of columns. The result rows are determined by columns L1, ..., Lj in the GROUP BY clause if present. Result columns are determined by all potential combinations of columns R1, ..., Rk, where k = 1 is the default. Also, {L1, ..., Lj} ∩ {R1, ..., Rk} = ∅.

We intend to preserve standard SQL evaluation semantics as much as possible. Our goal is to develop sound and efficient evaluation mechanisms. Thus, we propose the following rules:

1. The GROUP BY clause is optional, like in a vertical aggregation. That is, the list L1, ..., Lj may be empty. When the GROUP BY clause is not present then there is only one result row. Equivalently, rows can be grouped by a constant value (e.g., L1 = 0) to always include a GROUP BY clause in code generation.
2. When the clause GROUP BY is present there should not be a HAVING clause that may produce cross-tabulation of the same group (i.e., multiple rows with aggregated values per group).
3. The transposing BY clause is optional. When BY is not present then a horizontal aggregation reduces to a vertical aggregation.
4. When the BY clause is present the list R1, ..., Rk is required, where k = 1 is the default.
5. Horizontal aggregations can be combined with vertical aggregations or other horizontal aggregations on the same query, provided all use the same GROUP BY columns {L1, ..., Lj}.
6. As long as F does not change during query processing, horizontal aggregations can be freely combined. Such restriction requires locking [11], which we will explain later.
7. The argument to aggregate represented by A is required; A can be a column name or an arithmetic expression. In the particular case of count(), A can be the "DISTINCT" keyword followed by a list of columns.
8. When H() is used more than once, in different terms, it should be used with different sets of BY columns.

3.2.1 Examples

In a data mining project, most of the effort is spent in preparing and cleaning a data set. A big part of this effort involves deriving metrics and coding categorical attributes from the data set in question and storing them in a tabular (observation, record) form for analysis so that they can be used by a data mining algorithm.

Assume we want to summarize sales information with one store per row for one year of sales. In more detail, we need the sales amount broken down by day of the week, the number of transactions by store per month, the number of items sold by department, and total sales. The following query in our extended SELECT syntax provides the desired data set, by calling three horizontal aggregations:

SELECT
  storeId,
  sum(salesAmt BY dayofweekName),
  count(distinct transactionId BY salesMonth),
  sum(1 BY deptName),
  sum(salesAmt)
FROM transactionLine,
  DimDayOfWeek, DimDepartment, DimMonth
WHERE salesYear = 2009
  AND transactionLine.dayOfWeekNo = DimDayOfWeek.dayOfWeekNo
  AND transactionLine.deptId = DimDepartment.deptId
  AND transactionLine.MonthId = DimMonth.MonthId
GROUP BY storeId;

This query produces a result table like the one shown in Table 1. Observe each horizontal aggregation effectively returns a set of columns as result, and there is a call to a standard vertical aggregation with no subgrouping columns. For the first horizontal aggregation we show day names, and for the second one we show the number of the day of the week. These columns can be used for linear regression, clustering, or factor analysis. We can analyze correlation of sales based on daily sales. Total sales can be predicted based on volume of items sold each day of the week. Stores can be clustered based on similar sales for each day of the week or similar sales in the same department.

Consider a more complex example where we want to know, for each store subdepartment, how sales compare for each region-month, showing total sales for each region/month combination. Subdepartments can be clustered based on similar sales amounts for each region/month combination. We assume all stores in all regions have the same departments, but local preferences lead to different buying patterns. This query in our extended SELECT builds the required data set:
TABLE 1. A Multidimensional Data Set in Horizontal Layout, Suitable for Data Mining

SELECT subdeptId,
  sum(salesAmt BY regionNo, monthNo)
FROM transactionLine
GROUP BY subdeptId;

We turn our attention to another common data preparation task: transforming columns with categorical attributes into binary columns. The basic idea is to create a binary dimension for each distinct value of a categorical attribute. This can be accomplished by simply calling max(1 BY ..), grouping by the appropriate columns. The next query produces a vector showing 1 for the departments where the customer made a purchase, and 0 otherwise:

SELECT
  transactionId,
  max(1 BY deptId DEFAULT 0)
FROM transactionLine
GROUP BY transactionId;

3.3 SQL Code Generation: Locking and Table Definition

In this section, we discuss how to automatically generate efficient SQL code to evaluate horizontal aggregations. Modifying the internal data structures and mechanisms of the query optimizer is outside the scope of this paper, but we give some pointers. We start by discussing the structure of the result table and then query optimization methods to populate it. We will prove the three proposed evaluation methods produce the same result table FH.

3.3.1 Locking

In order to get a consistent query evaluation it is necessary to use locking [7], [11]. The main reasons are that any insertion into F during evaluation may cause inconsistencies: 1) it can create extra columns in FH, for a new combination of R1, ..., Rk; 2) it may change the number of rows of FH, for a new combination of L1, ..., Lj; 3) it may change actual aggregation values in FH. In order to return consistent answers, we basically use table-level locks on F, FV, and FH, acquired before the first statement starts and released after FH has been populated. In other words, the entire set of SQL statements becomes a long transaction. We use the highest SQL isolation level: SERIALIZABLE. Notice an alternative, simpler solution would be to use a static (read-only) copy of F during query evaluation. That is, horizontal aggregations can operate on a read-only database without consistency issues.

3.3.2 Result Table Definition

Let the result table be FH. Recall from Section 2 that FH has d aggregation columns, plus its primary key. The horizontal aggregation function H() returns not a single value, but a set of values for each group L1, ..., Lj. Therefore, the result table FH must have as primary key the set of grouping columns {L1, ..., Lj} and as nonkey columns all existing combinations of values R1, ..., Rk. We get the distinct value combinations of R1, ..., Rk using the following statement:

SELECT DISTINCT R1, ..., Rk
FROM F;

Assume this statement returns a table with d distinct rows. Then each row is used to define one column to store an aggregation for one specific combination of dimension values. Table FH has {L1, ..., Lj} as primary key and d columns corresponding to each distinct subgroup. Therefore, FH has d columns for data mining analysis and j + d columns in total, where each Xi corresponds to one aggregated value based on one specific combination of R1, ..., Rk values.

CREATE TABLE FH (
  L1 int
  ,...
  ,Lj int
  ,X1 real
  ,...
  ,Xd real
) PRIMARY KEY(L1, ..., Lj);

3.4 SQL Code Generation: Query Evaluation Methods

We propose three methods to evaluate horizontal aggregations. The first method relies only on relational operations. That is, only doing select, project, join, and aggregation queries; we call it the SPJ method. The second form relies on the SQL "case" construct; we call it the CASE method. Each table has an index on its primary key for efficient join processing. We do not consider additional indexing mechanisms to accelerate query evaluation. The third method uses the built-in PIVOT operator, which transforms rows to columns (e.g., transposing). Figs. 2 and 3 show an overview of the main steps to be explained below (for a sum() aggregation).

3.4.1 SPJ Method

The SPJ method is interesting from a theoretical point of view because it is based on relational operators only. The basic idea is to create one table with a vertical aggregation for each result column, and then join all those tables to produce FH. We aggregate from F into d projected tables with d Select-Project-Join-Aggregation queries (selection, projection, join, aggregation). Each table FI corresponds to one subgrouping combination and has {L1, ..., Lj} as primary key and an aggregation on A as the only nonkey
ORDONEZ AND CHEN: HORIZONTAL AGGREGATIONS IN SQL TO PREPARE DATA SETS FOR DATA MINING ANALYSIS

Fig. 2. Main steps of methods based on F (unoptimized). Fig. 3. Main steps of methods based on FV (optimized).

column. It is necessary to introduce an additional table F0, that will be outer joined with projected tables to get a complete result set. We propose two basic substrategies to compute FH. The first one directly aggregates from F. The second one computes the equivalent vertical aggregation in a temporary table FV, grouping by L1,...,Lj, R1,...,Rk. Then horizontal aggregations can instead be computed from FV, which is a compressed version of F, since standard aggregations are distributive [9].

We now introduce the indirect aggregation based on the intermediate table FV, which will be used for both the SPJ and the CASE method. Let FV be a table containing the vertical aggregation, based on L1,...,Lj, R1,...,Rk. Let V() represent the corresponding vertical aggregation for H(). The statement to compute FV gets a cube:

INSERT INTO FV
SELECT L1,...,Lj, R1,...,Rk, V(A)
FROM F
GROUP BY L1,...,Lj, R1,...,Rk;

Table F0 defines the number of result rows and builds the primary key. F0 is populated so that it contains every existing combination of L1,...,Lj. Table F0 has {L1,...,Lj} as primary key and it does not have any nonkey column.

INSERT INTO F0
SELECT DISTINCT L1,...,Lj
FROM {F|FV};

In the following discussion I ∈ {1,...,d}; we use h to make writing clear, mainly to define boolean expressions. We need to get all distinct combinations of subgrouping columns R1,...,Rk, to create the names of the dimension columns, to get d (the number of dimensions), and to generate the boolean expressions for the WHERE clauses. Each WHERE clause consists of a conjunction of k equalities based on R1,...,Rk.

SELECT DISTINCT R1,...,Rk
FROM {F|FV};

Tables F1,...,Fd contain individual aggregations for each combination of R1,...,Rk. The primary key of table FI is {L1,...,Lj}.

INSERT INTO FI
SELECT L1,...,Lj, V(A)
FROM {F|FV}
WHERE R1 = v1I AND .. AND Rk = vkI
GROUP BY L1,...,Lj;

Then each table FI aggregates only those rows that correspond to the Ith unique combination of R1,...,Rk, given by the WHERE clause. A possible optimization is synchronizing table scans to compute the d tables in one pass. Finally, to get FH we need d left outer joins over the d+1 tables so that all individual aggregations are properly assembled as a set of d dimensions for each group. Outer joins set result columns to null for missing combinations in the given group. In general, nulls should be the default value for groups with missing combinations. We believe it would be incorrect to set the result to zero or some other number by default if there are no qualifying rows. Such an approach should be considered on a per-case basis.

INSERT INTO FH
SELECT F0.L1, F0.L2,...,F0.Lj, F1.A, F2.A,...,Fd.A
FROM F0
LEFT OUTER JOIN F1 ON F0.L1 = F1.L1 and .. and F0.Lj = F1.Lj
LEFT OUTER JOIN F2 ON F0.L1 = F2.L1 and .. and F0.Lj = F2.Lj
...
LEFT OUTER JOIN Fd ON F0.L1 = Fd.L1 and .. and F0.Lj = Fd.Lj;

This statement may look complex, but it is easy to see that each left outer join is based on the same columns L1,...,Lj. To avoid ambiguity in column references, L1,...,Lj are qualified with F0; result column I is qualified with table FI. Since F0 has n rows, each left outer join produces a partial table with n rows and one additional column. Then at the end, FH will have n rows and d aggregation columns. The statement above is equivalent to an update-based strategy: table FH can be initialized by inserting n rows with key L1,...,Lj and nulls on the d dimension aggregation columns, and then FH is iteratively updated from each FI joining on L1,...,Lj. That strategy basically incurs twice the I/O by doing updates instead of insertions. Reordering the d projected tables to join cannot accelerate processing because each partial table has n rows. Another claim is that it is not possible to correctly compute horizontal aggregations without using outer joins; in other words, natural joins would produce an incomplete result set.
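As a concrete illustration of the SPJ strategy above, the following is a minimal sketch using SQLite from Python. The toy fact table F(D1, D2, A), its contents, and all generated table names are our own assumptions for illustration; this is not the paper's actual code generator.

```python
import sqlite3

# Minimal sqlite3 sketch of the SPJ strategy on a toy fact table
# F(D1, D2, A): left key D1, right key D2 (names are illustrative).
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE F (D1 INT, D2 TEXT, A REAL)")
c.executemany("INSERT INTO F VALUES (?,?,?)",
              [(1, 'X', 10), (1, 'Y', 20), (2, 'X', 30), (3, 'Y', 40)])

# F0 defines one result row per group and carries the primary key {D1}.
c.execute("CREATE TABLE F0 AS SELECT DISTINCT D1 FROM F")

# One table F_I per distinct right-key value, each aggregating only the
# rows matching that value (literal substitution kept simple on purpose).
values = [r[0] for r in c.execute("SELECT DISTINCT D2 FROM F ORDER BY D2")]
for i, v in enumerate(values, start=1):
    c.execute(f"CREATE TABLE F{i} AS SELECT D1, SUM(A) AS A "
              f"FROM F WHERE D2 = '{v}' GROUP BY D1")

# d left outer joins assemble FH; missing combinations surface as NULL.
cols = ", ".join(f"F{i}.A AS D2_{v}" for i, v in enumerate(values, start=1))
joins = " ".join(f"LEFT OUTER JOIN F{i} ON F0.D1 = F{i}.D1"
                 for i in range(1, len(values) + 1))
c.execute(f"CREATE TABLE FH AS SELECT F0.D1, {cols} FROM F0 {joins}")
rows = c.execute("SELECT * FROM FH ORDER BY D1").fetchall()
# rows -> [(1, 10.0, 20.0), (2, 30.0, None), (3, None, 40.0)]
```

Note how group 2 (no 'Y' rows) and group 3 (no 'X' rows) get NULL in the corresponding dimension column, exactly the behavior the outer joins are introduced for.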
3.4.2 CASE Method
For this method, we use the "case" programming construct available in SQL. The case statement returns a value selected from a set of values based on boolean expressions. From a relational database theory point of view this is equivalent to doing a simple projection/aggregation query where each nonkey value is given by a function that returns a number based on some conjunction of conditions. We propose two basic substrategies to compute FH. In a similar manner to SPJ, the first one directly aggregates from F and the second one computes the vertical aggregation in a temporary table FV, from which horizontal aggregations are then indirectly computed.

We now present the direct aggregation method. Horizontal aggregation queries can be evaluated by directly aggregating from F and transposing rows at the same time to produce FH. First, we need to get the unique combinations of R1,...,Rk that define the matching boolean expression for result columns. The SQL code to compute horizontal aggregations directly from F is as follows; observe V() is a standard (vertical) SQL aggregation that has a "case" statement as argument. Horizontal aggregations need to set the result to null when there are no qualifying rows for the specific horizontal group, to be consistent with the SPJ method and also with the extended relational model [4].

SELECT DISTINCT R1,...,Rk
FROM F;

INSERT INTO FH
SELECT L1,...,Lj
,V(CASE WHEN R1 = v11 and .. and Rk = vk1
        THEN A ELSE null END)
..
,V(CASE WHEN R1 = v1d and .. and Rk = vkd
        THEN A ELSE null END)
FROM F
GROUP BY L1, L2,...,Lj;

This statement computes aggregations in only one scan on F. The main difficulty is that there must be a feedback process to produce the "case" boolean expressions.

We now consider an optimized version using FV. Based on FV, we need to transpose rows to get groups based on L1,...,Lj. Query evaluation needs to combine the desired aggregation with "CASE" statements for each distinct combination of values of R1,...,Rk. As explained above, horizontal aggregations must set the result to null when there are no qualifying rows for the specific horizontal group. The boolean expression for each case statement has a conjunction of k equality comparisons. The following statements compute FH:

SELECT DISTINCT R1,...,Rk
FROM FV;

INSERT INTO FH
SELECT L1,..,Lj
,sum(CASE WHEN R1 = v11 and .. and Rk = vk1
          THEN A ELSE null END)
..
,sum(CASE WHEN R1 = v1d and .. and Rk = vkd
          THEN A ELSE null END)
FROM FV
GROUP BY L1, L2,...,Lj;

As can be seen, the code is similar to the code presented before, the main difference being that we have a call to sum() in each term, which preserves whatever values were previously computed by the vertical aggregation. It has the disadvantage of using two tables instead of one, as required by the direct computation from F. For very large tables F, computing FV first may be more efficient than computing directly from F.

3.4.3 PIVOT Method
We consider the PIVOT operator, which is a built-in operator in a commercial DBMS. Since this operator can perform transposition, it can help evaluating horizontal aggregations. The PIVOT method internally needs to determine how many columns are needed to store the transposed table, and it can be combined with the GROUP BY clause.

The basic syntax to exploit the PIVOT operator to compute a horizontal aggregation, assuming one BY column for the right key columns (i.e., k = 1), is as follows:

SELECT DISTINCT R1
FROM F; /* produces v1,...,vd */

SELECT L1, L2,...,Lj
,v1, v2,...,vd
INTO Ft
FROM F
PIVOT(
V(A) FOR R1 in (v1, v2,...,vd)
) AS P;

SELECT L1, L2,...,Lj
,V(v1), V(v2),...,V(vd)
INTO FH
FROM Ft
GROUP BY L1, L2,...,Lj;

This set of queries may be inefficient because Ft can be a large intermediate table. We introduce the following optimized set of queries, which reduces the size of the intermediate table:

SELECT DISTINCT R1
FROM F; /* produces v1,...,vd */

SELECT
L1, L2,...,Lj
,v1, v2,...,vd
INTO FH
FROM (
SELECT L1, L2,...,Lj, R1, A
FROM F) Ft
PIVOT(
V(A) FOR R1 in (v1, v2,...,vd)
) AS P;
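The "feedback process" that produces the case boolean expressions can be sketched as a small two-step code generator: first query the distinct right-key values, then emit one CASE term per value. This Python/sqlite3 sketch uses hypothetical names (F, D1, D2, A) and is not the authors' actual Java generator.

```python
import sqlite3

# Sketch of the feedback process: query the distinct right-key values,
# then generate the CASE-based horizontal aggregation from them.
# All table/column names here (F, D1, D2, A) are illustrative.
def generate_case_query(cursor, fact, left_keys, right_key, agg_col):
    values = [r[0] for r in cursor.execute(
        f"SELECT DISTINCT {right_key} FROM {fact} ORDER BY {right_key}")]
    terms = ",\n       ".join(
        f"SUM(CASE WHEN {right_key} = '{v}' THEN {agg_col} ELSE NULL END)"
        f" AS {right_key}_{v}" for v in values)
    keys = ", ".join(left_keys)
    return f"SELECT {keys},\n       {terms}\nFROM {fact}\nGROUP BY {keys}"

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE F (D1 INT, D2 TEXT, A REAL)")
c.executemany("INSERT INTO F VALUES (?,?,?)",
              [(1, 'X', 10), (1, 'Y', 20), (2, 'X', 30)])
sql = generate_case_query(c, "F", ["D1"], "D2", "A")
rows = c.execute(sql + " ORDER BY D1").fetchall()
# rows -> [(1, 10.0, 20.0), (2, 30.0, None)]
```

The two round trips (one SELECT DISTINCT to discover result columns, one generated GROUP BY query) mirror why a horizontal aggregation cannot be planned by the DBMS in a single pass: the output columns are unknown until the first query runs.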
Notice that in the optimized query the nested query trims F from columns that are not later needed. That is, the nested query projects only those columns that will participate in FH. Also, the first and second queries can be computed from FV; this optimization is evaluated in Section 4.

3.4.4 Example of Generated SQL Queries
We now show actual SQL code for our small example. This SQL code produces FH in Fig. 1. Notice the three methods can compute from either F or FV, but we use F to make the code more compact.

The SPJ method code is as follows (computed from F):

/* SPJ method */
INSERT INTO F1
SELECT D1,sum(A) AS A
FROM F
WHERE D2='X'
GROUP BY D1;

INSERT INTO F2
SELECT D1,sum(A) AS A
FROM F
WHERE D2='Y'
GROUP BY D1;

INSERT INTO FH
SELECT F0.D1,F1.A AS D2_X,F2.A AS D2_Y
FROM F0 LEFT OUTER JOIN F1 on F0.D1=F1.D1
LEFT OUTER JOIN F2 on F0.D1=F2.D1;

The CASE method code is as follows (computed from F):

/* CASE method */
INSERT INTO FH
SELECT
D1
,SUM(CASE WHEN D2='X' THEN A
ELSE null END) as D2_X
,SUM(CASE WHEN D2='Y' THEN A
ELSE null END) as D2_Y
FROM F
GROUP BY D1;

Finally, the PIVOT method SQL is as follows (computed from F):

/* PIVOT method */
INSERT INTO FH
SELECT
D1
,[X] as D2_X
,[Y] as D2_Y
FROM (
SELECT D1, D2, A FROM F) as p
PIVOT (
SUM(A)
FOR D2 IN ([X], [Y])
) as pvt;

3.5 Properties of Horizontal Aggregations
A horizontal aggregation exhibits the following properties:
1. n = |FH| matches the number of rows in a vertical aggregation grouped by L1,...,Lj.
2. d = |π_{R1,...,Rk}(F)|.
3. Table FH may potentially store more aggregated values than FV due to nulls. That is, |FV| <= nd.

3.6 Equivalence of Methods
We will now prove the three methods produce the same result.

Theorem 1. SPJ and CASE evaluation methods produce the same result.

Proof. Let S = σ_{R1=v1I ∧ ... ∧ Rk=vkI}(F). Each table FI in SPJ is computed as FI = L1,...,Lj ℱ V(A) (S). The ℱ notation is used to extend relational algebra with aggregations: the GROUP BY columns are L1,...,Lj and the aggregation function is V(). Note: in the following equations all joins are left outer joins. We can follow an induction on d, the number of distinct combinations for R1,...,Rk.

When d = 1 (base case) it holds |π_{R1,...,Rk}(F)| = 1 and S1 = σ_{R1=v11 ∧ ... ∧ Rk=vk1}(F). Then F1 = L1,...,Lj ℱ V(A) (S1). By definition F0 = π_{L1,...,Lj}(F). Since |π_{R1,...,Rk}(F)| = 1, it holds |π_{L1,...,Lj}(F)| = |π_{L1,...,Lj,R1,...,Rk}(F)|. Then FH = F0 ⟕ F1 = F1 (the left join does not insert nulls). On the other hand, for the CASE method let G = L1,...,Lj ℱ V(γ(A,1)) (F), where γ(·,I) represents the CASE statement and I is the Ith dimension. But since |π_{R1,...,Rk}(F)| = 1, then G = L1,...,Lj ℱ V(γ(A,1)) (F) = L1,...,Lj ℱ V(A) (F) (i.e., the conjunction in γ() always evaluates to true). Therefore, G = F1, which proves both methods return the same result.

For the general case, assume the result holds for d-1. Consider F0 ⟕ F1 ⟕ ... ⟕ Fd. By the induction hypothesis this means

F0 ⟕ F1 ⟕ ... ⟕ F(d-1) = L1,...,Lj ℱ V(γ(A,1)),V(γ(A,2)),...,V(γ(A,d-1)) (F).

Let us analyze Fd. Table Sd = σ_{R1=v1d ∧ ... ∧ Rk=vkd}(F) and Fd = L1,...,Lj ℱ V(A) (Sd). Now, F0 ⟕ Fd augments Fd with nulls so that |F0| = |FH|. Since the dth conjunction is the same for Fd and for γ(A,d), then

F0 ⟕ Fd = L1,...,Lj ℱ V(γ(A,d)) (F).

Finally,

F0 ⟕ F1 ⟕ ... ⟕ Fd = L1,...,Lj ℱ V(γ(A,1)),V(γ(A,2)),...,V(γ(A,d-1)),V(γ(A,d)) (F). □

Theorem 2. The CASE and PIVOT evaluation methods produce the same result.

Proof. (sketch) The SQL PIVOT operator works in a similar manner to the CASE method. We consider the optimized version of PIVOT, where we project only the columns
required by FH (i.e., trimming F). When the PIVOT operator is applied, one aggregation column is produced for every distinct value vj, producing the d desired columns. An induction on j proves the clause "V(A) FOR R1 IN (v1,...,vd)" is transformed into d "V(CASE WHEN R1 = vj THEN A END)" statements, where each pivoted value produces one Xj for j = 1...d. Notice a GROUP BY clause is not required for the outer SELECT statement in the optimized PIVOT version. □

Based on the two previous theoretical results we present our main theorem.

Theorem 3. Given the same input table F and horizontal aggregation query, the SPJ, CASE, and PIVOT methods produce the same result.

3.7 Time Complexity and I/O Cost
We now analyze time complexity for each method. Recall that N = |F|, n = |FH|, and d is the data set dimensionality (number of cross-tabulated aggregations). We consider one I/O to read/write one row as the basic unit to analyze the cost to evaluate the query. This analysis assumes every method precomputes FV.

SPJ: We assume hash or sort-merge joins [7] are available. Thus a join between two tables of size O(n) can be evaluated in time O(n) on average; otherwise, joins take time O(n log2 n). Computing the sort in the initial query "SELECT DISTINCT..." takes O(N log2(N)). If the right key produces a high d (say d >= 10 and a uniform distribution of values), then each query will have a high selectivity predicate. Each |Fi| <= n; therefore, we can expect |Fi| < N. There are d queries with different selectivity with a conjunction of k terms, O(kn + N) each. Then total time for all selection queries is O(dkn + dN). There are d GROUP-BY operations with L1,...,Lj producing a table O(n) each; therefore, the d GROUP-BYs take time O(dn) with I/O cost 2dn (to read and write). Finally, there are d outer joins taking O(n) or O(n log2(n)) each, giving a total time O(dn) or O(dn log2(n)). In short, time is O(N log2(N) + dkn + dN) and I/O cost is N log2(N) + 3dn + dN with hash joins. Otherwise, time is O(N log2(N) + dkn log2(n) + dN) and I/O cost is N log2(N) + 2dn + dn log2(n) + dN with sort-merge joins. Time depends on the number of distinct values, their combination, and the probabilistic distribution of values.

CASE: Computing the sort in the initial query "SELECT DISTINCT..." takes O(N log2(N)). There are O(dkN) comparisons; notice this is fixed. There is one GROUP-BY with L1,...,Lj in time O(dkn) producing a table O(dn). Evaluation time depends on the number of distinct value combinations, but not on their probabilistic distribution. In short, time is O(N log2(N) + dkn + N) and I/O cost is N log2(N) + n + N. As we can see, time complexity is the same, but I/O cost is significantly smaller compared to SPJ.

PIVOT: We consider the optimized version which trims F from irrelevant columns and k = 1. Like the SPJ and CASE methods, PIVOT depends on selecting the distinct values from the right keys R1,...,Rk. It avoids joins and saves I/O when it receives as input the trimmed version of F. Then it has similar time complexity to CASE. Also, time depends on the number of distinct values, their combination, and the probabilistic distribution of values.

3.8 Discussion
For all proposed methods to evaluate horizontal aggregations we summarize common requirements.

1. All methods require grouping rows by L1,...,Lj in one or several queries.
2. All methods must initially get all distinct combinations of R1,...,Rk to know the number and names of result columns. Each combination will match an input row with a result column. This step makes query optimization difficult by standard query optimization methods because such columns cannot be known when a horizontal aggregation query is parsed and optimized.
3. It is necessary to set result columns to null when there are no qualifying rows. This is done either by outer joins or by the CASE statement.
4. Computation can be accelerated in some cases by first computing FV and then computing further aggregations from FV instead of F. The amount of acceleration depends on how much larger N is with respect to n (i.e., whether N >> n).

These requirements can be used to develop more efficient query evaluation algorithms.

The correct way to treat missing combinations for one group is to set the result column to null. But in some cases it may make sense to change nulls to zero, as was the case when coding a categorical attribute into binary dimensions. Some aspects of both CASE substrategies are worth discussing in more depth. Notice the boolean expressions in each term produce disjoint subsets (i.e., they partition F). The queries above can be significantly accelerated using a smarter evaluation because each input row falls on only one result column and the rest remain unaffected. Unfortunately, the SQL parser does not know this fact and it unnecessarily evaluates d boolean expressions for each input row in F. This requires O(d) time complexity for each row, instead of O(1). In theory, the SQL query optimizer could reduce the number of conjunctions to evaluate down to one using a hash table that maps one conjunction to one dimension column. Then the complexity for one row can decrease from O(d) down to O(1).

If an input query has several terms with a mix of horizontal aggregations and some of them share similar subgrouping columns R1,...,Rk, the query optimizer can avoid redundant comparisons by reordering operations. If a pair of horizontal aggregations does not share the same set of subgrouping columns, further optimization is not possible.

Horizontal aggregations should not be used when the set of columns {R1,...,Rk} has many distinct values (such as the primary key of F); for instance, getting horizontal aggregations on transactionLine using itemId. In theory such a query would produce a very wide and sparse table, but in practice it would cause a runtime error because the maximum number of columns allowed in the DBMS could be exceeded.

3.9 DBMS Limitations
There exist two DBMS limitations with horizontal aggregations: reaching the maximum number of columns in one table, and reaching the maximum column name length when columns are automatically named. To elaborate on this, a horizontal aggregation can return a table that goes
TABLE 2: Summary of Grouping Columns from TPC-H Table Transactionline (N = 6M)
TABLE 3: Query Optimization: Precompute Vertical Aggregation in FV (N = 12M). Times in Seconds

beyond the maximum number of columns in the DBMS when the set of columns {R1,...,Rk} has a large number of distinct combinations of values, or when there are multiple horizontal aggregations in the same query. On the other hand, the second important issue is automatically generating unique column names. If there are many subgrouping columns R1,...,Rk, or columns are of string data types, this may lead to generating very long column names, which may exceed DBMS limits. However, these are not important limitations because if there are many dimensions that is likely to correspond to a sparse matrix (having many zeroes or nulls) on which it will be difficult or impossible to compute a data mining model. On the other hand, the large column name length can be solved as explained below.

The problem of d going beyond the maximum number of columns can be solved by vertically partitioning FH so that each partition table does not exceed the maximum number of columns allowed by the DBMS. Evidently, each partition table must have L1,...,Lj as its primary key. Alternatively, the column name length issue can be solved by generating column identifiers with integers and creating a "dimension" description table that maps identifiers to full descriptions, but the meaning of each dimension is lost. An alternative is the use of abbreviations, which may require manual input.

4 EXPERIMENTAL EVALUATION
In this section, we present our experimental evaluation on a commercial DBMS. We evaluate query optimizations, compare the three query evaluation methods, and analyze time complexity varying table sizes and output data set dimensionality.

4.1 Setup: Computer Configuration and Data Sets
We used SQL Server V9, running on a DBMS server running at 3.2 GHz, with a Dual Core processor, 4 GB of RAM, and 1 TB on disk. The SQL code generator was programmed in the Java language and connected to the server via JDBC. The PIVOT operator was used as available in the SQL language implementation provided by the DBMS.

We used large synthetic data sets, described below. We analyzed queries having only one horizontal aggregation, with different grouping and horizontalization columns. Each experiment was repeated five times and we report the average time in seconds. We cleared cache memory before each method started in order to evaluate query optimization under pessimistic conditions.

We evaluated optimization strategies for aggregation queries with synthetic data sets generated by the TPC-H generator. In general, we evaluated horizontal aggregation queries using the fact table transactionLine as input. Table 2 shows the specific columns from the TPC-H fact table we use for the left key and the right key in a horizontal aggregation. Basically, we picked several column combinations to get different d and n. In order to get meaningful data sets for data mining we picked high selectivity columns for the left key and low selectivity columns for the right key. In this manner d << n, which is the most common scenario in data mining.

Since TPC-H only creates uniformly distributed values, we created a similar data generator of our own to control the probabilistic distribution of column values. We also created synthetic data sets controlling the number of distinct values in grouping keys and their probabilistic distribution. We used two probabilistic distributions: uniform (unskewed) and zipf (skewed), which represent two common and complementary density functions in query optimization. The table definition for these data sets is similar to the fact table from TPC-H. The key value distribution has an impact both in grouping rows and in writing rows to the output table.

4.2 Query Optimizations
Table 3 analyzes our first query optimization, applied to the three methods. Our goal is to assess the acceleration obtained by precomputing a cube and storing it on FV. We can see this optimization uniformly accelerates all methods. This optimization provides a different gain, depending on the method: for SPJ the optimization is best for small n, for PIVOT for large n, and for CASE there is a rather less dramatic improvement all across n. It is noteworthy that PIVOT is accelerated by our optimization, despite the fact it is handled by the query optimizer. Since this optimization produces significant acceleration for the three methods (at least 2x faster) we will use it by default. Notice that precomputing FV takes the same time within each method; therefore, comparisons are fair.

We now evaluate an optimization specific to the PIVOT operator. This PIVOT optimization is well known, as we learned from SQL Server DBMS users groups. Table 4 shows the impact of removing (trimming) columns not needed by PIVOT; that is, removing columns that will not appear in FH. We can see the impact is significant, accelerating evaluation time from three to five times. All our experiments incorporate this optimization by default.

4.3 Comparing Evaluation Methods
Table 5 compares the three query evaluation methods. Notice Table 5 is a summarized version of Table 3, showing
TABLE 4: Query Optimization: Remove (Trim) Unnecessary Columns from FV for PIVOT (N = 12M). Times in Seconds
TABLE 6: Variability of Mean Time (N = 12M, One Standard Deviation, Percentage of Mean Time). Times in Seconds

best times for each method. Table 6 is a complement, showing time variability around the mean time for the times reported in Table 5; we show one standard deviation and the percentage it represents with respect to the mean. As can be seen, times exhibit small variability; PIVOT exhibits the smallest variability, followed by CASE. As we explained before in the time complexity and I/O cost analysis, the two main factors influencing query evaluation time are data set size and grouping column (dimension) cardinalities. We consider different combinations of L1,...,Lj and R1,...,Rk columns to get different values of n and d, respectively. Refer to Table 2 for the correspondence between TPC-H columns and FH table sizes d, n. The three methods use the optimization precomputing FV in order to make a uniform and fair comparison (recall this optimization works well for large tables). In general, SPJ is the slowest method, as expected. On the other hand, PIVOT and CASE have similar time performance, with slightly bigger differences in some cases. Time grows as n grows for all methods, but much more for SPJ. On the other hand, there is a small time growth when d increases for SPJ, but not a clear trend for PIVOT and CASE. Notice we are comparing against the optimized version of the PIVOT method (the CASE method is significantly faster than the unoptimized PIVOT method). We conclude that n is the main performance factor for the PIVOT and CASE methods, whereas both d and n impact the SPJ method.

We investigated the reason for time differences by analyzing query evaluation plans. It turned out PIVOT and CASE have similar query evaluation plans when they use FV, but they are slightly different when they use F. The most important difference is that the PIVOT and CASE methods have a parallel step which allows evaluating aggregations in parallel. Such parallel step does not appear when PIVOT evaluates the query from F. Therefore, in order to conduct a fair comparison we use FV by default.

TABLE 5: Comparing Query Evaluation Methods (All with Optimization Computing FV). Times in Seconds

4.4 Time Complexity
We now verify the time complexity analysis given in Section 3.7. We plot time complexity varying one parameter and keeping the remaining parameters fixed. In these experiments, we generated synthetic data sets similar to the fact table of TPC-H of different sizes, with grouping columns of varying selectivities (number of distinct values). We consider two basic probabilistic distributions of values: uniform (unskewed) and zipf (skewed). The uniform distribution is the distribution used by default.

Fig. 4 shows the impact of increasing the size of the fact table (N). The left plot analyzes a small FH table (n = 1k), whereas the right plot presents a much larger FH table (n = 128k). Recall from Section 3.7 that time is O(N log2 N) + N when N grows and the other sizes (n, d) are fixed (the first term corresponds to SELECT DISTINCT and the second one to the method). The trend indicates all evaluation methods are indeed impacted by N. Times tend to be linear as N grows for the three methods, which shows the method queries have more weight than getting the distinct dimension combinations. The PIVOT and CASE methods show very similar performance. On the other hand, there is a big gap between PIVOT/CASE and SPJ for large n. The trend indicates such gap will not shrink as N grows.

Fig. 5 now leaves N fixed and computes horizontal aggregations increasing n. The left plot uses d = 16 (to get the individual horizontal aggregation columns) and the right plot shows d = 64. Recall from Section 3.7 that time should grow O(n) keeping N and d fixed. Clearly time is linear and grows slowly for PIVOT and CASE. On the other hand, time grows faster for SPJ and the trend indicates it is mostly linear, especially for d = 64. From a comparison perspective, PIVOT and CASE have similar performance for low d and the same performance for high d. SPJ is much slower, as expected. In short, for PIVOT and CASE time complexity is O(n), keeping N, d fixed.

Fig. 6 shows time complexity varying d. According to our analysis in Section 3.7, time should be O(d). The left plot shows a medium sized data set with n = 32k, whereas the right plot shows a large data set with n = 128k. For small n (left plot) the trend for the PIVOT and CASE methods is not clear; overall, time is almost constant. On the other hand, for SPJ time indeed grows in a linear fashion. For large n (right plot) trends are clean; the PIVOT and CASE methods have the same performance, and their time is clearly linear, although with a small slope. On the other hand, time is not linear overall for SPJ; it initially grows linearly, but it has a jump beyond d = 64. The reason is the size of the intermediate join result tables, which get increasingly wider.
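To make the contrast between the SPJ and CASE cost expressions from Section 3.7 concrete, the following sketch plugs representative sizes into the hash-join I/O formulas. The specific values of N, n, and d below are illustrative only, not measurements from the paper.

```python
import math

# Back-of-envelope comparison of the Section 3.7 I/O cost expressions
# (hash-join case). SPJ pays N*log2(N) + 3*d*n + d*N row I/Os, while
# CASE pays N*log2(N) + n + N. Sizes below are illustrative.
def io_spj(N, n, d):
    return N * math.log2(N) + 3 * d * n + d * N

def io_case(N, n, d):
    return N * math.log2(N) + n + N

N, n, d = 12_000_000, 100_000, 16
ratio = io_spj(N, n, d) / io_case(N, n, d)
# The d*N term dominates SPJ's cost, so its I/O grows with d while
# CASE's stays essentially flat; the gap widens as d increases.
```

This matches the experimental observation above: d barely affects PIVOT/CASE, whereas SPJ degrades as d grows.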
Fig. 4. Time complexity varying N (d = 16, uniform distribution).
Fig. 5. Time complexity varying n (N = 4M, uniform distribution).
Fig. 6. Time complexity varying d (N = 4M, uniform distribution).

Table 7 analyzes the impact of having a uniform or a zipf distribution of values on the right key. That is, we analyze the impact of an unskewed and a skewed distribution. In general, all methods are impacted by the distribution of values, but the impact is significant when d is high. Such trend indicates higher I/O cost to update aggregations on more subgroups. The impact is marginal for all methods at low d. PIVOT shows a slightly higher impact than the CASE method under the skewed distribution at high d, but overall both show similar behavior. SPJ is again the slowest and shows the biggest impact at high d.

5 RELATED WORK
We first discuss research on extending SQL code for data mining processing. We briefly discuss related work on query optimization. We then compare horizontal aggregations with alternative proposals to perform transposition or pivoting.

There exist many proposals that have extended SQL syntax. The closest data mining problem associated to OLAP processing is association rule mining [18]. SQL extensions to define aggregate functions for association rule
  13. 13. 690 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 4, APRIL 2012 TABLE 7 TRANSPOSE are inverse operators with respect to horizontal Impact of Probabilistic Distribution of Right Key aggregations. A vertical layout may give more flexibility Grouping Values (N ¼ 8M) expressing data mining computations (e.g., decision trees) with SQL aggregations and group-by queries, but it is generally less efficient than a horizontal layout. Later, SQL operators to pivot and unpivot a column were introduced in [5] (now part of the SQL Server DBMS); this work took a step beyond by considering both complementary operations: one to transpose rows into columns and the other one to convert columns into rows (i.e., the inverse operation). There are several important differences with our proposal, though the list of distinct to values must be provided by the user, whereas ours does it automatically; output columns are automatically created; the PIVOT operator can only trans- pose by one column, whereas ours can do it with several columns; as we saw in experiments, the PIVOT operatormining are introduced in [19]. In this case, the goal is to requires removing unneeded columns (trimming) from theefficiently compute itemset support. Unfortunately, there is input table for efficient evaluation (a well-known optimiza-no notion of transposing results since transactions are given tion to users), whereas ours work directly on the input table.in a vertical layout. Programming a clustering algorithm Horizontal aggregations are related to horizontal percentagewith SQL queries is explored in [14], which shows a aggregations [13]. The differences between both approacheshorizontal layout of the data set enables easier and simpler are that percentage aggregations require aggregating at twoSQL queries. Alternative SQL extensions to perform grouping levels, require dividing numbers and need takingspreadsheet-like operations were introduced in [20]. 
Their optimizations have the purpose of avoiding joins to express cell formulas, but are not optimized to perform partial transposition for each group of result rows. The PIVOT and CASE methods avoid joins as well.

Our SPJ method proved horizontal aggregations can be evaluated with relational algebra, exploiting outer joins, showing our work is connected to traditional query optimization [7]. The problem of optimizing queries with outer joins is not new. Optimizing joins by reordering operations and using transformation rules is studied in [6]. This work does not consider optimizing a complex query that contains several outer joins on primary keys only, which is fundamental to prepare data sets for data mining. Traditional query optimizers use a tree-based execution plan, but there is work that advocates the use of hypergraphs to provide a more comprehensive set of potential plans [1]. This approach is related to our SPJ method. Even though the CASE construct is an SQL feature commonly used in practice, optimizing queries that have a list of similar CASE statements has not been studied in depth before.

Research on efficiently evaluating queries with aggregations is extensive. We focus on discussing approaches that allow transposition, pivoting, or cross-tabulation. The importance of producing an aggregation table with a cross-tabulation of aggregated values is recognized in [9] in the context of cube computations. An operator to unpivot a table, producing several rows in a vertical layout for each input row to compute decision trees, was proposed in [8]. The unpivot operator basically produces many rows with attribute-value pairs for each input row, and thus it is an inverse operator of horizontal aggregations. Several SQL primitive operators for transforming data sets for data mining were introduced in [3]; the most similar one to ours is an operator to transpose a table, based on one chosen column. The TRANSPOSE operator [3] is equivalent to the unpivot operator, producing several rows for one input row. An important difference is that, compared to PIVOT, TRANSPOSE allows two or more columns to be transposed in the same query, reducing the number of table scans.

Percentage aggregations require taking care of numerical issues (e.g., dividing by zero). Horizontal aggregations are more general, have wider applicability, and in fact can be used as a primitive extended operator to compute percentages. Finally, our present paper is a significant extension of the preliminary work presented in [12], where horizontal aggregations were first proposed. The most important additional technical contributions are the following. We now consider three evaluation methods, instead of one, and DBMS system programming issues like SQL code generation and locking. Also, the older work did not show the theoretical equivalence of methods, nor the (now popular) PIVOT operator, which did not exist back then. Experiments in this newer paper use much larger tables, exploit the TPC-H database generator, and carefully study query optimization.

6 CONCLUSIONS

We introduced a new class of extended aggregate functions, called horizontal aggregations, which help prepare data sets for data mining and OLAP cube exploration. Specifically, horizontal aggregations are useful to create data sets with a horizontal layout, as commonly required by data mining algorithms and OLAP cross-tabulation. Basically, a horizontal aggregation returns a set of numbers instead of a single number for each group, resembling a multidimensional vector. We proposed an abstract, but minimal, extension to SQL standard aggregate functions to compute horizontal aggregations, which just requires specifying subgrouping columns inside the aggregation function call. From a query optimization perspective, we proposed three query evaluation methods. The first one (SPJ) relies on standard relational operators. The second one (CASE) relies on the SQL CASE construct. The third (PIVOT) uses a built-in operator in a commercial DBMS that is not widely available. The SPJ method is important from a theoretical point of view because it is based on select, project, and join (SPJ) queries. The CASE method is our most important contribution. It is in general the most efficient evaluation method, and it has wide applicability since it can be programmed by combining GROUP-BY and CASE statements.
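To make the CASE method concrete: because the extended syntax (e.g., an aggregation with a subgrouping column) is not standard SQL, a code generator must first discover the distinct subgrouping values and then emit one SUM(CASE ...) term per value. The following self-contained sketch uses a hypothetical toy table F(storeId, month, sales) and illustrative column names, not the paper's actual generator:

```python
import sqlite3

# Toy fact table F(storeId, month, sales); example data, not from the paper.
# We emulate a horizontal aggregation that sums sales per store,
# broken down horizontally by month, using the CASE method.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE F (storeId INT, month TEXT, sales REAL)")
conn.executemany("INSERT INTO F VALUES (?, ?, ?)", [
    (1, "Jan", 10.0), (1, "Feb", 20.0), (1, "Jan", 5.0),
    (2, "Feb", 7.0), (2, "Mar", 3.0),
])

# Step 1: the distinct subgrouping values become the new horizontal columns.
months = [r[0] for r in conn.execute(
    "SELECT DISTINCT month FROM F ORDER BY month")]

# Step 2: generate one SUM(CASE ...) term per subgrouping value
# (SQL code generation combining GROUP-BY and CASE).
terms = ", ".join(
    f"SUM(CASE WHEN month = '{m}' THEN sales END) AS sales_{m}"
    for m in months)
query = f"SELECT storeId, {terms} FROM F GROUP BY storeId ORDER BY storeId"

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Subgroup combinations with no input rows surface as NULLs in the horizontal layout; the SPJ method would build the same table with one left outer join per distinct subgrouping value instead.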
We proved the three methods produce the same result. We have explained that it is not possible to evaluate horizontal aggregations using standard SQL without either joins or "case" constructs. Our proposed horizontal aggregations can be used as a database method to automatically generate efficient SQL queries with three sets of parameters: grouping columns, subgrouping columns, and the aggregated column. The fact that the output horizontal columns are not available when the query is parsed (when the query plan is explored and chosen) makes its evaluation through standard SQL mechanisms infeasible. Our experiments with large tables show our proposed horizontal aggregations evaluated with the CASE method have similar performance to the built-in PIVOT operator. We believe this is remarkable since our proposal is based on generating SQL code and not on internally modifying the query optimizer. Both CASE and PIVOT evaluation methods are significantly faster than the SPJ method. Precomputing a cube on selected dimensions produced an acceleration on all methods.

There are several research issues. Efficiently evaluating horizontal aggregations using left outer joins presents opportunities for query optimization. Secondary indexes on common grouping columns, besides indexes on primary keys, can accelerate computation. We have shown our proposed horizontal aggregations do not introduce conflicts with vertical aggregations, but we need to develop a more formal model of evaluation. In particular, we want to study the possibility of extending SQL OLAP aggregations with horizontal layout capabilities. Horizontal aggregations produce tables with fewer rows, but with more columns. Thus query optimization techniques used for standard (vertical) aggregations are inappropriate for horizontal aggregations. We plan to develop more complete I/O cost models for cost-based query optimization. We want to study optimization of horizontal aggregations processed in parallel in a shared-nothing DBMS architecture. Cube properties can be generalized to multivalued aggregation results produced by a horizontal aggregation. We need to understand if horizontal aggregations can be applied to holistic functions (e.g., rank()). Optimizing a workload of horizontal aggregation queries is another challenging problem.

ACKNOWLEDGMENTS

This work was partially supported by US National Science Foundation grants CCF 0937562 and IIS 0914861.

REFERENCES

[1] G. Bhargava, P. Goel, and B.R. Iyer, "Hypergraph Based Reorderings of Outer Join Queries with Complex Predicates," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '95), pp. 304-315, 1995.
[2] J.A. Blakeley, V. Rao, I. Kunen, A. Prout, M. Henaire, and C. Kleinerman, ".NET Database Programmability and Extensibility in Microsoft SQL Server," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), pp. 1087-1098, 2008.
[3] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman, "Non-Stop SQL/MX Primitives for Knowledge Discovery," Proc. ACM SIGKDD Fifth Int'l Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 425-429, 1999.
[4] E.F. Codd, "Extending the Database Relational Model to Capture More Meaning," ACM Trans. Database Systems, vol. 4, no. 4, pp. 397-434, 1979.
[5] C. Cunningham, G. Graefe, and C.A. Galindo-Legaria, "PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS," Proc. 13th Int'l Conf. Very Large Data Bases (VLDB '04), pp. 998-1009, 2004.
[6] C. Galindo-Legaria and A. Rosenthal, "Outer Join Simplification and Reordering for Query Optimization," ACM Trans. Database Systems, vol. 22, no. 1, pp. 43-73, 1997.
[7] H. Garcia-Molina, J.D. Ullman, and J. Widom, Database Systems: The Complete Book, first ed. Prentice Hall, 2001.
[8] G. Graefe, U. Fayyad, and S. Chaudhuri, "On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases," Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD '98), pp. 204-208, 1998.
[9] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab and Sub-Total," Proc. Int'l Conf. Data Eng., pp. 152-159, 1996.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, first ed. Morgan Kaufmann, 2001.
[11] G. Luo, J.F. Naughton, C.J. Ellmann, and M. Watzke, "Locking Protocols for Materialized Aggregate Join Views," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 6, pp. 796-807, June 2005.
[12] C. Ordonez, "Horizontal Aggregations for Building Tabular Data Sets," Proc. Ninth ACM SIGMOD Workshop Data Mining and Knowledge Discovery (DMKD '04), pp. 35-42, 2004.
[13] C. Ordonez, "Vertical and Horizontal Percentage Aggregations," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), pp. 866-871, 2004.
[14] C. Ordonez, "Integrating K-Means Clustering with a Relational DBMS Using SQL," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201, Feb. 2006.
[15] C. Ordonez, "Statistical Model Computation with UDFs," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 12, pp. 1752-1765, Dec. 2010.
[16] C. Ordonez, "Data Set Preprocessing and Transformation in a Database System," Intelligent Data Analysis, vol. 15, no. 4, pp. 613-631, 2011.
[17] C. Ordonez and S. Pitchaimalai, "Bayesian Classifiers Programmed in SQL," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 1, pp. 139-144, Jan. 2010.
[18] S. Sarawagi, S. Thomas, and R. Agrawal, "Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 343-354, 1998.
[19] H. Wang, C. Zaniolo, and C.R. Luo, "ATLAS: A Small But Complete SQL Extension for Data Mining and Data Streams," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB '03), pp. 1113-1116, 2003.
[20] A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, A. Gupta, L. Sheng, and S. Subramanian, "Spreadsheets in RDBMS for OLAP," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '03), pp. 52-63, 2003.

Carlos Ordonez received the degree in applied mathematics and the MS degree in computer science from UNAM University, Mexico, in 1992 and 1996, respectively. He received the PhD degree in computer science from the Georgia Institute of Technology in 2000. He worked six years extending the Teradata DBMS with data mining algorithms. He is currently an associate professor at the University of Houston. His research is centered on the integration of statistical and data mining techniques into database systems and their application to scientific problems.

Zhibo Chen received the BS degree in electrical engineering and computer science in 2005 from the University of California, Berkeley, and the MS and PhD degrees in computer science from the University of Houston in 2008 and 2011, respectively. His research focuses on query optimization for OLAP cube processing.