AN, Add animation for the additional Picasso predicates.
AN: Swap X and Y axis in picture. Also remove the 0,0 point.
AN: Is this SQL Server 2000 ? (is that the latest?)
TPC-H Query 2 is shown in picture AN: Screen dump should be Q5P and fire interval 1
What was the coverage of 68 th plan? 0.02 Plan num 61-68 all had 0.02. Was this at highest level? Yes, DB2, level9
AN, swap the X and Y axis; also remove the 0,0 point.
AN: Should Q8/9 first before Q18 Cardinality of the optimal plan set is high 80% of the plan space is covered by 20% of the plans on average Gini index values are mostly in excess of 0.5 High Cardinality not solely query specific Example: Q18 has 13 plans for Oracle, but only 5 plans each for DB2 and SQL-Server Q8 and Q9, which have a large number of plans across all three systems, are join-intensive and feature dynamic base relations This skew present across the board , in all the optimizers
AN: Make bottom line (Avg) like that in first table. Conservative Estimates Restricted to first quadrant , Grid granularity of 100 x 100 Average and Maxima are upper bounds and the real cost estimates may be even lower in practice We can drastically reduce the complexity of plan diagrams without materially affecting the query processing quality If it were possible to simplify current optimizers to produce only reduced plan diagrams, then considerable computational overheads typically associated with the query optimization process may also be substantially lowered
AN, see if you can fix the double pointers to be neater. Plans P3 , P6 , P10 have duplicates. P18 is an island. P18 has a hash-join between CUSTOMER and NATION followed by a hash-join with a sub-tree whose root is a nested-loops join. P6 first hash-joins the CUSTOMER relation with the sub-tree and then executes hash-join with NATION. Both plans have almost equal cost.
Plan P2 , P7 are duplicates. P2 also has two islands. P7 uses a nested-loops join between relations NATION and REGION , while P1 employs a sort-merge join Cost of both plans is almost equal as the relations are very small
Optimizer changes from P6 to P1 , P16 to P4 , P25 to P23 , P7 to P18 , P8 to P9 and P42 to P47 .
An additional “sort” operator gets added between 26% and 50% selectivities of PARTSUPP relation, that decreases the cost Upward jumps are known to occur like w hen a join ceases to fit in memory But downward jumps! Presence of rules in the optimizer, Parameterized changes in search space AN, actually run the query at highest peak, then middle of valley and then at crest again.
AN, check which NL join is the problem case? Run queries on peak and valley (as far from each other as possible and see whether run times show similar behavior to estimated costs) Inconsistency in Cost Model!
Analyzing Plan Diagrams of Database Query Optimizers Naveen Reddy Jayant Haritsa Database Systems Lab Indian Institute of Science Bangalore, INDIA
Query Execution Plans <ul><li>SQL, the standard database query language, is declarative in nature </li></ul><ul><ul><li>does not specify how the query should be evaluated </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><li>select StudentName, CourseName </li></ul></ul><ul><ul><li>from STUDENT, COURSE, REGISTER where STUDENT.RollNo = REGISTER.RollNo and REGISTER.CourseNo = COURSE.CourseNo </li></ul></ul><ul><ul><li> join order and techniques are left unspecified </li></ul></ul><ul><li>DBMS query optimizer identifies efficient execution strategy: “query execution plan” </li></ul>
Example Query and Plan select c_custkey, c_name, c_acctbal, c_address, c_phone, c_comment from customer, orders, nation where c_custkey = o_custkey and c_nationkey = n_nationkey and o_orderdate < date ('2001-11-09') and n_name = 'IRAQ' RETURN 201,689 HSJOIN 201,689 TBSCAN 175,025 ORDERS HSJOIN 26,571 TBSCAN 50 CUSTOMER NATION TBSCAN 26,512
Query Plan Selection <ul><li>Core technique </li></ul><ul><li>Cost difference between best plan choice and a sub-optimal choice can be enormous (orders of magnitude) </li></ul>Query (Q) Query Optimizer (dynamic programming) Minimum Cost Plan P(Q) DB catalogs Cost Model
Plan and Cost Diagrams <ul><li>Given a query, the optimizer’s plan choice is a function (among other factors) of the selectivities of the base relations participating in the query </li></ul><ul><ul><li>selectivity is the estimated number of rows of a relation relevant to producing the final result </li></ul></ul><ul><li>A plan diagram is a pictorial enumeration of the plan choices of a database query optimizer over the relational selectivity space </li></ul><ul><li>A cost diagram is a visualization of the associated (estimated) plan execution costs over the same relational selectivity space </li></ul>
Example Query [Q7 of TPC-H] <ul><li>select supp_nation, cust_nation, l_year, sum(volume) as revenue from ( select n1.n_name as supp_nation, n2.n_name as cust_nation, extract(year from l_shipdate) as l_year, l_extendedprice * (1 - l_discount) as volume from supplier, lineitem, orders, customer, nation n1, nation n2 where s_suppkey = l_suppkey and o_orderkey = l_orderkey and c_custkey = o_custkey and s_nationkey = n1.n_nationkey and c_nationkey = n2.n_nationkey and ((n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY') or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')) and l_shipdate between date '1995-01-01' and date '1996-12-31' </li></ul>group by supp_nation, cust_nation, l_year order by supp_nation, cust_nation, l_year and o_totalprice < C1 and c_acctbal < C2 ) as shipping
Picasso <ul><li>A Java tool that, given a query template, automatically generates plan and cost diagrams </li></ul><ul><ul><li>Fires queries at user-specified granularity (default 100x100 grid) </li></ul></ul><ul><ul><li>Currently restricted to 2-D plan diagrams and 3-D cost diagrams </li></ul></ul><ul><li>Using the tool, enumerated the plan/cost diagrams produced by industrial-strength query optimizers on TPC-H-based queries </li></ul><ul><ul><li>IBM DB2 v8, Oracle 9i and Microsoft SQL Server 2000 </li></ul></ul><ul><li>Plan diagrams appear similar to cubist paintings </li></ul><ul><ul><li>[ Pablo Picasso founder of the cubist painting genre ] </li></ul></ul>
Testbed Environment <ul><li>Database </li></ul><ul><ul><li>TPC-H database (1 GB scale) representing a manufacturing environment, featuring the following relations: </li></ul></ul><ul><li>Query Set </li></ul><ul><ul><li>Queries based on TPC-H </li></ul></ul><ul><ul><li>benchmark [Q1 through Q22] </li></ul></ul><ul><ul><li>Uniform 100x100 grid (10000 queries) [0.5%, 0.5%] to [99.5%, 99.5%] </li></ul></ul><ul><li>Relational Engines </li></ul><ul><ul><li>Default installations (with all optimization features on) </li></ul></ul><ul><ul><li>Stats on all columns, no extra indices </li></ul></ul><ul><li>Computational Platform </li></ul><ul><ul><li>Pentium-IV 2.4 GHz, 1GB RAM, Windows XP Professional </li></ul></ul>Relation Cardinality 5 25 10000 150000 200000 800000 1500000 6001215 REGION NATION SUPPLIER CUSTOMER PART PARTSUPP ORDERS LINEITEM
RESULTS <ul><li>Optimizers randomly identified as Opt A, Opt B, Opt C </li></ul><ul><li>NOT intended to make comparisons across optimizers </li></ul><ul><li>Black-box testing our conclusions are speculative </li></ul><ul><li>Full result listing at http://dsl.serc.iisc.ernet.in/projects/PICASSO/ </li></ul>
Cost Domination Principle Cost of executing any “foreign” query point in the first quadrant of q s is an upper bound on the cost of executing the foreign plan at q s Cost of executing q s with Plans P 4 and P 1 is less than 91 and 90, respectively. Cost of Query point q s with plan P 2 is 88 q s
Cost Domination Principle <ul><li>Dominating Point </li></ul><ul><ul><li>Given a pair of distinct points q 1 (x 1 ,y 1 ) and q 2 (x 2 ,y 2 ) in 2-D selectivity space, we say that q 2 ≻ q 1 , if and only if x 2 ≥ x 1 , y 2 ≥ y 1 and result cardinality R q2 ≥ R q1 </li></ul></ul><ul><li>Cost Domination Principle </li></ul><ul><ul><li>If points q 1 (x 1 ,y 1 ) and q 2 (x 2 ,y 2 ) are associated with distinct plans P 1 and P 2 respectively, in the original space, the cost of executing query q 1 with plan P 2 is upper-bounded by the cost of executing q 2 with P 2 , if and only if q 2 ≻ q 1 </li></ul></ul>
Plan Swallowing Algorithm <ul><li>For each query point q s , look for replacements by “foreign” query points that are in the first quadrant relative to q s as the origin. </li></ul><ul><li>For the foreign points that are within λ (e.g. λ =10%) threshold, choose point with lowest cost as potential replacement. </li></ul><ul><li>An entire plan is “swallowed” only if all its query points can be swallowed by a single plan or group of plans . </li></ul><ul><li>Order the plans in ascending order of size; go up the list, checking for possibility of swallowing each plan. </li></ul>
Complex Plan Diagram [Q8, OptA*] Reduced Plan Diagram ( λ =10%) Reduced to 7 plans from 68 <ul><li>Note: Plan Reduction ≠ Change in Optimization Levels </li></ul>
Duplicates and Islands [Q10, OptA] Duplicate locations of P3 Duplicate locations of P10 P18 is an island within P6
Duplicates and Islands [Q5, OptC] Three duplicates of P7, which are islands within P1
Duplicates and Islands Removal <ul><li>With Plan Reduction by Swallowing , significant decrease in duplicates and islands </li></ul># Duplicates Original Reduced # Islands Original Reduced Databases Opt A Opt B Opt C 130 13 80 15 55 7 38 3 1 0 8 3
Plan Switch Points [Q9, OptA] Plan Switch Point: line parallel to axis with a plan shift for all plans bordering the line. Hash-Join sequence PARTSUPP ►◄ SUPPLIER ►◄ PART is altered to PARTSUPP ►◄ PART ►◄ SUPPLIER
Venetian Blinds [Q9, Opt B] Six plans simultaneously change with rapid alternations to produce a “Venetian blinds” effect. Left-deep hash join across NATION, SUPPLIER and LINEITEM relations gets replaced by a right-deep hash join.
Footprint Pattern [Q7, Opt A] P7 exhibits a thin broken curved pattern in the middle of P2 ’s region. P2 has sort-merge-join at the top of the plan tree, while P7 has hash-join
Speckle Pattern [Q17, Opt A*] An additional sort operation is present on PART relation in P2 , whose cost is very low
Plan-Switch Non-Monotonic Costs [Q2, Opt A] Plan Diagram Cost Diagram 26% Selectivity 50% Selectivity 26%: Cost decreases by a factor of 50 50%: Cost increases by a factor of 70 Presence of Rules? Parameterized changes in search space?
Intra-Plan Non-Monotonic Costs [Q21, Opt A] Plan Diagram Cost Diagram Plans P1, P3, P4 and P6 Nested loops join whose cost decreases with increasing input cardinalities
PQO (Parametric Query Optimization) <ul><li>Active research area for last 15 years </li></ul><ul><li>Identify the optimal set of plans for the entire relational selectivity space at compile time </li></ul><ul><li>At run time , use actual selectivity values to identify the appropriate plan choice </li></ul><ul><li>Assumptions </li></ul><ul><ul><li>Plan Convexity: If a plan is optimal at two points, then it is optimal at all points on the straight line joining them </li></ul></ul><ul><ul><li>Plan Uniqueness: An optimal plan appears at only one contiguous region in the entire space </li></ul></ul><ul><ul><li>Plan Homogeneity: An optimal plan is optimal within the entire region enclosed by its plan boundaries </li></ul></ul>
Validity of PQO [Q8, Opt A] Plan Convexity is severely violated by regions covered by P12 (dark green) and P16 (light gray). Plan Homogeneity is violated by P14 Plan uniqueness is violated by P4
Note: <ul><li>PQO is a more viable proposition in the space of reduced plan diagrams due to the removal of most duplicates and islands </li></ul>
Conclusions <ul><li>Conceived and developed the Picasso tool for automatically generating plan and cost diagrams </li></ul><ul><li>Presented and analyzed representative plan and cost diagrams on popular commercial query optimizers </li></ul><ul><ul><li>Optimizers make fine grained choices </li></ul></ul><ul><ul><li>Complexity of plan diagrams can be drastically reduced without materially affecting the query processing quality </li></ul></ul><ul><ul><li>Plan optimality regions can have intricate patterns and complex boundaries </li></ul></ul><ul><ul><li>Non-Monotonic cost behavior exists where increasing result cardinalities decrease the estimated cost </li></ul></ul><ul><ul><li>Basic assumptions underlying research literature on PQO do not hold in practice; but hold approximately for reduced plan diagrams </li></ul></ul>
Future Work <ul><li>Extend Picasso to generate higher-dimensional plan and cost diagrams </li></ul><ul><li>Port Picasso to Sybase and PostgreSQL </li></ul><ul><li>Conduct a deeper investigation on the query features that result in high plan cardinalities </li></ul><ul><li>Investigate whether it is possible to simplify the optimizer, so as to be able to directly produce reduced plan diagrams </li></ul><ul><li>Public release of Picasso software (by end-2005) </li></ul>