Top “n” Projects
Dec. 7, 2007

1. Introduction

Top “n” queries are those using the “top n”, “first n” (as in Ingres), “fetch first n rows”, etc. features of the various RDBMS products. They are common in decision support applications and have spawned a significant amount of activity in both the commercial DBMS and research communities. This document describes current Ingres support for “first n” and outlines several potential development directions to enhance Ingres support of “first n”.

2. Current Ingres “first n” Support

The optional “first n” clause (where “n” is a positive integer) was added to the select list in Ingres in 2000 (actually, it was added informally earlier than that, but was made an official feature in 2000). Very simply, it passes the “n” value to QEF in the top-level action header of the query plan to direct qea_fetch() to return only the first n rows from the result set. It was originally coded because of potential applicability in the TPC C benchmarks that we were running at the time, and because of its utility to application designers and developers in limiting the size of a result set (I’ve used it countless times for exactly this reason).

It was implemented as simply as possible, with no concern for optimization potential or broader application utility. Indeed, in all my discussions with clients, I have always pointed out that it simply takes the top n rows from the result set. If the query plan is inherently expensive, with blocking operators (e.g. sorts or aggregation), “first n” will not make it any faster. Additionally, to avoid complications or inexplicable side effects, its use is limited to the outermost select of a query and the leftmost select of a union. While it is allowed in a “create table as select …”, it is not permitted in a view definition.

3. “top n” and the SQL Standard

In recognition of the fact that most vendors already had “top n” support of one sort or another, the standards committee recently (last year) added it to the SQL standard. The approach is more ambitious than most (all?) current implementations and permits both “fetch first n rows” (yes, I know – it’s ugly) and “order by” at the end of each “query expression” in a query. That means you can code them after each select in a union and even after subqueries nested in a query (most typically, in a derived table). The optional pairing of “order by” clauses with “fetch first” clauses is done to allow the queries to produce deterministic results.

Separate features are defined that allow an implementation to support only a single “fetch first” (as in Ingres), multiple “fetch first”s in <query expression>s (one per union), or even “fetch first”s in subqueries. A separate feature also defines whether an implementation permits “fetch first” in a view definition. Moreover, the “n” value can now be a variable (host language variable or procedure parameter), and that capability is defined as yet another distinct feature.

To complement the “fetch first” clause, an “offset n” clause has also just been introduced to the standard. “fetch first n” defines how many rows will be returned, and “offset n” defines where in the result set the returned rows start. “fetch first”, “offset” and “order by” clauses are all optional and can be used in any combination with one another. Feature codes analogous to those for “fetch first n” are defined for “offset n” to limit its use in a given implementation.

4. Ingres and Standards Syntax

4.1 Present Ingres Support

Ingres already supports the “fetch first n rows” and “offset n” clauses in changes submitted to the main codeline. They are only permitted once in a query – the most restrictive feature as defined in the standard. Again, no attempt has been made to perform any optimization in implementing these features. However, they do satisfy the standard.

4.2 Extensions to Ingres Support

Support of multiple “fetch first” and “offset” clauses in a single query is possible in Ingres. This might be useful in at least the union case. Implementation in OPF and QEF would be straightforward and would likely take no more than a day or so. Implementation in the grammar would likely be more problematic – in particular if multiple “order by” clauses are also to be supported.
Changes to the parse tree structure would be required that, along with grammar changes, would likely take at least a week.

Supporting parameters for the “n” value would provide challenges of a different sort, though not insurmountable. As it currently stands, parameters to Ingres queries are always coded in contexts that result in their materialization in ADF CXs. In the case of “top n” parameters, the parameters would fill in values in query plan action headers (the “fetch first” and “offset” values) and aren’t used in the context of CXs. Parameter descriptors and values are made available through a more general interface to QEF (based in the QEF_RCB), so we are not bound to fake a CX to materialize them. Accordingly, flags set in the action headers would probably suffice to indicate that the values must be materialized from the parameter infrastructure, rather than directly from action header fields. This would probably involve a day or so of work in each of PSF, OPF and QEF.
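As a point of reference, on an already-ordered result set the standard’s “offset n” / “fetch first n rows” semantics amount to a simple slice. The sketch below is illustrative Python, not Ingres code, and the function name is invented:

```python
def fetch_clause(rows, offset=0, first=None):
    """Apply SQL-standard "offset n" / "fetch first n rows" semantics
    to an already-ordered result set: skip `offset` rows, then return
    at most `first` rows (all remaining rows when `first` is None)."""
    remaining = rows[offset:]
    return remaining if first is None else remaining[:first]
```

For example, fetch_clause(ordered_rows, offset=10, first=5) returns rows 11 through 15 of the ordered result, and an offset past the end of the result set simply yields no rows.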
5. Optimization for “top n”

A query plan optimized to materialize all rows of a result set may not offer the most efficient execution of a “top n” execution of the same query. DB2 has long supported an “optimize for n rows” hint that causes the query plan to be built on the assumption that the result set will only contain “n” rows. “top n” optimization can be as simple as assuming that each node, table/index access or join, will only produce n rows. However, in the current climate of data warehouse applications and the prevalence of “top n” queries in various TPC benchmarks, more sophisticated analysis of a “top n” query is required for building an optimal plan.

5.1 Carey/Kossmann Approach

One of the earlier “top n” optimization papers was “On Saying ‘Enough Already!’ in SQL”, presented at SIGMOD 1997. It describes the problem nicely and proposes a simple framework for handling some “top n” queries more efficiently. Their target is select-project-join queries, typically with an order by clause to give some meaning to the “top n” result. Without an “order by”, the results of a “top n” query are non-deterministic, as noted in section 3, above.

Their fundamental idea is to introduce a “stop after n” operator into strategic locations in the query plan to prevent the materialization of all rows, both in the final result set and at intermediate places in the query plan. The purpose of the “stop after” operators is to reduce the number of rows passing through the query plan to a minimum. Costs are associated with the new operator so that optimization of “top n” queries is sensitive to the potential savings of using the “stop after” operator.

Needless to say, the value of “n” in the “stop after” operator will depend on the query itself. If a join can multiply the number of instances of a row in the result set, the “stop after” value might be smaller than “n” in the “top n” syntax. On the other hand, if predicates can remove rows from the result set (so-called “reductive” predicates), the “stop after” value may have to be larger. The paper talks about reductive and non-reductive predicates. It also discusses conservative and aggressive placement of the operators. A conservative placement is guaranteed to return the required rows, and possibly more. An aggressive placement may not return enough rows and, therefore, may require the query to be restarted. In general, all the approaches to “top n” optimization incorporate the potential need to restart a query.

Another interesting technique the paper discusses is a “sort stop” variant on the “stop after” operator. Any sort operation must clearly consume all input rows, even if it is only required to return the “top n” sorted rows. However, a very simple variation to the Ingres heap sort algorithms would permit us to keep only the first “n” rows in the sort structures at any point in time. At the start of the sort, the first “n” rows would be loaded into the heap structure. After that, only those remaining rows whose sort key was in the first “n” would be added to the heap (replacing one already in the heap). All other rows would simply be discarded. For typical values of “n” this could always be done in the QEF memory sort. Apparently Oracle incorporates this optimization in its “top n” processing.

Finally, another Carey/Kossmann paper (“Reducing the Braking Distance of an SQL Query Engine” from VLDB 1998) supplements these ideas with a technique using range partitioning to split rows into partitions designed to capture the “top n” in one or more materialized partitions, while discarding the remaining rows.

5.2 Donjerkovic/Ramakrishnan Approach

“Probabilistic Optimization of Top N Queries” (presented at VLDB 1999) builds on the results of Carey and Kossmann. Rather than introduce explicit “stop after” operators, the paper recommends introducing a “cutoff predicate” at strategic locations in the query plan. The cutoff predicate compares the order by column (which determines the “top n”) to an appropriately estimated value of the column. The cutoff value is chosen to assure that the “top n” rows are produced, though again, a restart operator is required in the event that the estimated cutoff value is too restrictive.

Given that the “top n”-ness of a query is optimized by means of introduced cutoff predicates in this approach, they also introduce a more holistic approach to optimization that not only uses statistics and histograms to predict the numbers of qualified rows at each step of execution, but also uses probability to assess the risk of choosing incorrect cutoff values (resulting in restarts) and to incorporate restart risk into the overall plan cost estimate.
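Stepping back to the “sort stop” of section 5.1, the bounded-heap variation can be sketched in illustrative Python (this is not the QEF or DMF sort code; the function name and interface are invented):

```python
import heapq

def sort_stop(rows, n, key=lambda r: r):
    """Keep only the current "top n" rows: load the first n rows into
    a min-heap, then admit a later row only if its key beats the
    smallest key retained so far (replacing that row). All other rows
    are discarded without ever being held by the sort."""
    heap = []  # min-heap of (key, sequence, row); sequence breaks key ties
    for seq, row in enumerate(rows):
        entry = (key(row), seq, row)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heapreplace(heap, entry)
    # return the retained rows in descending key order
    return [row for _, _, row in sorted(heap, reverse=True)]
```

The sort never holds more than “n” rows, which is why, for typical small values of “n”, such a sort could live entirely in the QEF memory sort with no disk overflow.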
Because of the acknowledged importance of statistical accuracy, the paper also introduces the idea of evaluating the “quality” of histograms. The resulting maximum error estimates in the histograms are then also incorporated in the probabilities used in plan optimization.

The paper also discusses a couple of specialized “top n” queries. Queries in which the ranked attribute is a count (i.e., computed as “count(*)”) can use histograms on the counted column, along with the histogram error estimates described earlier, to predict the values that will produce the “top n” counts. This is risky business, though, and the values chosen must be the right ones (i.e., the candidate set of values must be chosen conservatively), since there is no way to verify that the result is correct other than to materialize the entire result set.

5.3 Ilyas/Shah et al. Approach

A more recent paper (“Rank-aware Query Optimization”, presented at SIGMOD 2004) addresses a subset of the ranking problem, with rank expressions (the thing in the “order by” clause) consisting of values from more than one table. It proposes the use of specialized “rank joins” that discard rows more quickly based on their rank scores. As well as the modified join algorithms, the paper also describes techniques for incorporating the new joins as alternate strategies that can be integrated into the overall optimization process.

6. TPC Benchmarks and Classes of “top n” Queries

There are several categories of “top n” queries, some far easier than others to optimize. Each of these categories is represented in the TPC benchmarks and so deserves some consideration.

The “easy” “top n” queries are single table queries that order on a single column. These could be handled efficiently with the “top n” sort described earlier.

Adding joins to a “top n” query, even if the “order by” is on a single column, can complicate the optimization considerably. The multiplicative and/or reductive characteristics of the join predicates must then be accounted for in order to estimate the number of rows to input to the joins before the stop or cutoff should take place. Blocking joins such as sort/merge and hash may be good for retrieving all rows of the result set, but bad for the “top n”. Optimizing joins for “top n” queries is much easier when the joins map onto known referential relationships. But Ingres referential relationships are not known during optimization (I’ve made numerous recommendations on how to deal with this).

An “order by” on multiple columns from one or more tables, and/or on multi-column expressions, adds further complexity. Multi-column ranking expressions are discussed in the paper referenced in section 5.3, above.

“top n” queries on aggregate results are very difficult to handle. The paper in 5.2, above, discusses how to handle the problem when the aggregate is count(). Even that case would require significant changes to Ingres. However, sum() and avg() aggregates are even more difficult.
Personally, I can’t see any practical way of determining ahead of time which groups in an aggregate query will produce the largest sums. Even in a single table query, the grouping and sum/avg columns will undoubtedly be different, and any optimization would require some idea of the degree of correlation between the columns. Add a join into the mix, or compute the sum/avg on an expression, and the problem is unsolvable. The only optimization that could be brought to bear is the “top n” sort on the results of the grouping and aggregation.

While one might hope that “top n” aggregate queries are rare, my intuition is that they are the rule rather than the exception in warehousing applications. Each of TPC H, E and DS contains “top n” queries on sum(). They all also contain “top n” queries ranked on individual columns, though usually on joins. TPC H has a “top n” query on a count() aggregate. TPC DS queries are typically very complex, with nesting, derived tables, unions and so forth. “top n” is then performed on the results of these extremely complex queries. It seems hard to believe that such queries could be handled efficiently by any means other than pre-computed materialized views.

7. Ingres Optimization of “top n” Queries

7.1 “top n” Sort

Adding a “top n” sort to Ingres should be a trivial task in QEF. The DMF sort will be more difficult, both because of the potential for overflow to disk and in the face of parallel sort execution, where multiple threads may each be sorting a subset of the rows. However, the likelihood of a value of “n” large enough to require a DMF sort with disk overflow is very small. A modified sort node will be required in the query plan to support “top n” sorting. This would be generated by the OPF code generator and will also require some knowledge in the query optimizer to use it effectively. A trivial alternative would be to detect a sort at the top of a “top n” query plan and modify it, with no optimization.

7.2 Knowledge of Joins across Referential Relationships

Optimizing “top n” joins is much easier if it is known that the joins map to referential relationships. The Ingres catalog structure is not designed for easy determination of the sets of matching columns in a referential relationship. I have recommended before that new catalogs be introduced that record these relationships in a manner that is useful to query optimization. Even without considering “top n” queries, optimization of Ingres join queries would benefit from this information. A more general catalog that simply records the proportions of rows from a cross product of 2 tables that participate in joins on specified columns is already informally in place in Ingres. Populating it with statistics describing common joins is another technique that would improve join estimates.

7.3 Restart Operator

The techniques for processing “top n” queries as described in the literature all incorporate the notion of a restart operator in the event that not enough rows are processed to honour the “top n” count.
This could be done as simply as spooling the result rows until we know we have the right number. If enough rows aren’t returned, the spool file is discarded and the query is started over with modified cutoff values. This approach is obviously the least efficient as it wastes all the effort required to materialize the first set of result rows. Moreover, it delays the return of the first result row until all have been materialized and spooled. “top n” queries typically also want the first rows to be returned as quickly as possible.
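The spool-and-restart scheme just described can be sketched as follows. This is illustrative Python with invented names throughout: `run_plan` stands in for executing the plan with a given cutoff predicate pushed into it, and `cutoffs` is a list of successively looser cutoff values to try:

```python
def top_n_with_restart(run_plan, n, key, cutoffs):
    """Execute run_plan(cutoff) with successively looser cutoffs,
    spooling each result set in full; if a run yields fewer than n
    rows, the spool is discarded and the query restarts. The final
    attempt passes cutoff=None (no cutoff), so termination is
    guaranteed even if every estimated cutoff was too aggressive."""
    for cutoff in list(cutoffs) + [None]:
        spool = list(run_plan(cutoff))   # spool fully before returning anything
        if len(spool) >= n or cutoff is None:
            spool.sort(key=key, reverse=True)
            return spool[:n]
        # not enough rows: discard the spool, restart with a looser cutoff
```

The sketch makes the inefficiency plain: each failed attempt throws away all the spooled work, and no row reaches the caller until the spool for a successful attempt is complete.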
The field of adaptive query processing has examined the problem of restarting query plans while minimizing the amount of wasted work. The context of the restart is not the same as for “top n”, though the techniques should be analogous. However, this is yet another complication in solving the “top n” optimization problem.

7.4 Probabilistic Optimization

The techniques proposed by Donjerkovic and Ramakrishnan are applicable to the Ingres optimizer, but only with a large amount of implementation effort. The mechanisms for computing histogram quality (maximum error) proposed in the paper and in Donjerkovic’s doctoral thesis wouldn’t be difficult to add to Ingres histogram construction. However, the Ingres optimizer is not easily changed. Replacing the current selectivity estimation in Ingres with the probabilistic approach discussed in the paper would be very difficult. OPF is an “old school” query optimizer in which cost estimation is interleaved with plan enumeration. Consideration of the different execution algorithms (join techniques, sorting, etc.) is all “hard coded” in OPF. The optimizer extensions described in this and other papers are much more easily applied to rule-based optimizers. But that would involve a complete rewrite of the Ingres optimizer.

7.5 Optimization of sum/avg “top n” Queries

As suggested before, I don’t really see any way to effectively optimize a “top n” query ranked on a sum() or avg() aggregate. While important subsets of “top n” can be effectively optimized with strategies that could ultimately be introduced to Ingres, there will always be other important subsets of “top n” queries with no solution, or with entirely different solutions.

8. Summary

This document was intended to trigger discussion of various aspects of “top n” query optimization and execution. This is a very wide field, much of it as yet unexplored. We can certainly change Ingres to solve some of the problems that are faced in processing such queries, but at least some of those changes will be very large in scope and will require significant resources to implement.