# dbtech3.ppt

## by Tess98 on Jul 19, 2010

### Slide notes

• From the above we can easily derive that σ_{c1}(σ_{c2}(R)) = σ_{c2}(σ_{c1}(R))
• The meaning of the arrows: rewritings versus axioms. What we write down are axioms. The left-to-right direction is the beneficial one in 99% of cases; however, views complicate matters. Generalization to theta join: see Example 6.14
• Watch out for rules like these: you cannot apply them blindly; doing so may lead to an explosion of alternative expressions. In the exam, use the rules for duplicates

## Presentation Transcript

• Database Techniek Query Optimization (chapter 14)
• Lecture 3
• Query Rewriting
• Equivalence Rules
• Query Optimization
• Dynamic Programming (System R/DB2)
• Heuristics (Ingres/Postgres)
• De-correlation of nested queries
• Result Size Estimation
• Practicum Assignment 2
• Transformation of Relational Expressions
• Two relational algebra expressions are said to be equivalent if on every legal database instance the two expressions generate the same set of tuples
• Note: order of tuples is irrelevant
• In SQL, inputs and outputs are bags (multi-sets) of tuples
• Two expressions in the bag version of the relational algebra are said to be equivalent if on every legal database instance the two expressions generate the same bag of tuples
• An equivalence rule says that expressions of two forms are equivalent
• Can replace expression of first form by second, or vice versa
• Equivalence Rules
• 1. Conjunctive selection operations can be deconstructed into a sequence of individual selections.
• 2. Selection operations are commutative.
• 3. Only the last in a sequence of projection operations is needed, the others can be omitted.
• 4. Selections can be combined with Cartesian products and theta joins:
• (a) σ_θ(E1 × E2) = E1 ⋈_θ E2
• (b) σ_{θ1}(E1 ⋈_{θ2} E2) = E1 ⋈_{θ1 ∧ θ2} E2
• Algebraic rewritings for selection: σ_{cond1}(σ_{cond2}(R)) = σ_{cond1 AND cond2}(R) = σ_{cond2}(σ_{cond1}(R)), and σ_{cond1 OR cond2}(R) = σ_{cond1}(R) ∪ σ_{cond2}(R)
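These rewritings are easy to check mechanically. A minimal sketch on a toy relation (relations as Python sets of tuples, conditions as predicates; all names and values are illustrative, not from the lecture):

```python
# Toy relation R(a, b) as a set of tuples; selection = filtering.
R = {(1, 10), (2, 20), (3, 30), (4, 40)}

def select(cond, rel):
    """sigma_cond(rel): keep only the tuples satisfying cond."""
    return {t for t in rel if cond(t)}

c1 = lambda t: t[0] >= 2   # illustrative condition on attribute a
c2 = lambda t: t[1] <= 30  # illustrative condition on attribute b

# Deconstruction (rule 1) and commutativity (rule 2):
assert select(c1, select(c2, R)) == select(lambda t: c1(t) and c2(t), R)
assert select(c1, select(c2, R)) == select(c2, select(c1, R))

# OR rewriting: sigma_{c1 OR c2}(R) = sigma_{c1}(R) ∪ sigma_{c2}(R)
assert select(lambda t: c1(t) or c2(t), R) == select(c1, R) | select(c2, R)
print("all selection equivalences hold on R")
```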
• Equivalence Rules (Cont.)
• 5. Theta-join operations (and natural joins) are commutative : E1 ⋈_θ E2 = E2 ⋈_θ E1
• 6. Natural join operations are associative :
• (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
• Equivalence rules for joins (figure): commutative and associative
• For pushing a selection down into a (theta) join we have the following cases:
• (push) When θ0 involves only the attributes of one of the expressions (E1) being joined: σ_{θ0}(E1 ⋈_θ E2) = (σ_{θ0}(E1)) ⋈_θ E2
• (split) When θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2: σ_{θ1 ∧ θ2}(E1 ⋈_θ E2) = (σ_{θ1}(E1)) ⋈_θ (σ_{θ2}(E2))
• (impossible) When θ0 involves attributes of both E1 and E2 (it is a join condition)
Equivalence Rules (Cont.)
• Pushing selection through Cartesian product and join: σ_cond(R × S) = R × σ_cond(S) and σ_cond(R ⋈ S) = R ⋈ σ_cond(S); the pushed-down (right-hand) form requires that cond refers to S attributes only
• Projection decomposition: a projection above a join can be split and pushed into the join inputs, π_{X ∪ Y}(X ⋈ Y) = π_X(X) ⋈ π_Y(Y). Example: for total = X.tax * Y.price, computing π_{total} only needs π_{tax}(X) and π_{price}(Y)
• 8. The projection operation distributes over the theta-join operation as follows. Consider a join E1 ⋈_θ E2, and let L1 and L2 be sets of attributes from E1 and E2, respectively:
• (a) If θ involves only attributes from L1 ∪ L2: π_{L1 ∪ L2}(E1 ⋈_θ E2) = (π_{L1}(E1)) ⋈_θ (π_{L2}(E2))
• (b) In general, let L3 be the attributes of E1 that are involved in the join condition θ but are not in L1 ∪ L2, and let L4 be the attributes of E2 that are involved in θ but are not in L1 ∪ L2; then π_{L1 ∪ L2}(E1 ⋈_θ E2) = π_{L1 ∪ L2}((π_{L1 ∪ L3}(E1)) ⋈_θ (π_{L2 ∪ L4}(E2)))
More Equivalence Rules
• Join Ordering Example
• For all relations r 1, r 2, and r 3 ,
• (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
• If r2 ⋈ r3 is quite large and r1 ⋈ r2 is small, we choose
• (r1 ⋈ r2) ⋈ r3
• so that we compute and store a smaller temporary relation.
• Join Ordering Example (Cont.)
• Consider the expression
• π_{customer-name}((σ_{branch-city = “Brooklyn”}(branch)) ⋈ account ⋈ depositor)
• We could compute account ⋈ depositor first and join the result with σ_{branch-city = “Brooklyn”}(branch), but account ⋈ depositor is likely to be a large relation.
• Since only a small fraction of the bank’s customers are likely to have accounts in branches located in Brooklyn, it is better to compute
• σ_{branch-city = “Brooklyn”}(branch) ⋈ account
• first.
• Lecture 3
• Query Rewriting
• Equivalence Rules
• Query Optimization
• Dynamic Programming (System R/DB2)
• Heuristics (Ingres/Postgres)
• De-correlation of nested queries
• Result Size Estimation
• The role of query optimization: SQL → (parsing, normalization) → logical algebra → (logical query optimization) → (physical query optimization) → physical algebra → query execution
• Logical query optimization compares different relational algebra plans on estimated result size (Practicum 2A)
• Physical query optimization compares different execution algorithms on true cost (I/O, CPU, cache)
• Enumeration of Equivalent Expressions
• Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression
• repeated until no more expressions can be found:
• for each expression found so far, use all applicable equivalence rules, and add newly generated expressions to the set of expressions found so far
• The above approach is very expensive in space and time
• Time and space requirements are reduced by not generating all expressions
• Finding A Good Join Order
• Consider finding the best join order for r1 ⋈ r2 ⋈ … ⋈ rn.
• There are (2(n − 1))!/(n − 1)! different join orders for this expression. With n = 7 the number is 665280; with n = 10 it is greater than 17.6 billion!
• No need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of { r 1 , r 2 , . . . r n } is computed only once and stored for future use.
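A quick sanity check of these counts (plain Python, no framework assumed):

```python
from math import factorial

def num_join_orders(n: int) -> int:
    """Number of different join orders for n relations: (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(num_join_orders(7))   # 665280
print(num_join_orders(10))  # 17643225600, i.e. more than 17.6 billion
```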
• Dynamic Programming in Optimization
• To find best join tree for a set of n relations:
• To find the best plan for a set S of n relations, consider all plans of the form S1 ⋈ (S − S1), where S1 is any non-empty proper subset of S.
• Recursively compute costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2^n − 2 alternatives.
• When plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it
• Dynamic programming
• Join Order Optimization Algorithm
```
procedure findbestplan(S)
    if (bestplan[S].cost ≠ ∞)   // already computed earlier
        return bestplan[S]
    // else bestplan[S] has not been computed earlier, compute it now
    for each non-empty subset S1 of S such that S1 ≠ S
        P1 = findbestplan(S1)
        P2 = findbestplan(S - S1)
        A = best algorithm for joining results of P1 and P2
        cost = P1.cost + P2.cost + cost of A
        if cost < bestplan[S].cost
            bestplan[S].cost = cost
            bestplan[S].plan = "execute P1.plan; execute P2.plan;
                                join results of P1 and P2 using A"
    return bestplan[S]
```
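For concreteness, here is a minimal executable sketch of the same dynamic program in Python. The cardinalities, the uniform join selectivity, and the cost model (cost of a join = sum of the estimated input sizes) are illustrative assumptions, not part of the lecture:

```python
from functools import lru_cache
from itertools import combinations

CARD = {"r1": 1000, "r2": 50, "r3": 2000, "r4": 10}  # assumed cardinalities
SEL = 0.001                                          # assumed join selectivity

def est_size(rels: frozenset) -> float:
    """Toy estimate: product of cardinalities, scaled once per join."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    return size * SEL ** (len(rels) - 1)

@lru_cache(maxsize=None)                 # memoization: each subset solved once
def best_plan(rels: frozenset):
    """Return (cost, plan string) for joining the given set of relations."""
    if len(rels) == 1:
        (r,) = rels
        return 0.0, r                    # base relation: scan cost ignored here
    best_cost, best_tree = float("inf"), None
    items = sorted(rels)
    for k in range(1, len(items)):       # all non-empty proper subsets S1
        for left in combinations(items, k):
            s1 = frozenset(left)
            s2 = rels - s1
            c1, p1 = best_plan(s1)
            c2, p2 = best_plan(s2)
            cost = c1 + c2 + est_size(s1) + est_size(s2)
            if cost < best_cost:
                best_cost, best_tree = cost, f"({p1} ⋈ {p2})"
    return best_cost, best_tree

print(best_plan(frozenset(CARD)))        # cheapest bushy plan under the toy model
```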
• Left Deep Join Trees
• In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.
• Cost of Optimization
• With dynamic programming, the time complexity of optimization with bushy trees is O(3^n).
• With n = 10, this number is 59049 instead of more than 17.6 billion!
• Space complexity is O(2^n)
• To find best left-deep join tree for a set of n relations:
• Consider n alternatives with one relation as right-hand side input and the other relations as left-hand side input.
• Using (recursively computed and stored) least-cost join order for each alternative on left-hand-side, choose the cheapest of the n alternatives.
• If only left-deep trees are considered, the time complexity of finding the best join order is O(n 2^n)
• Space complexity remains at O(2^n)
• Cost-based optimization is expensive, but worthwhile for queries on large datasets (typical queries have small n, generally < 10)
• Physical Query Optimization
• Minimizes absolute cost
• Minimize I/Os
• Minimize CPU, cache miss cost (main memory DBMS)
• Must consider the interaction of evaluation techniques when choosing evaluation plans: choosing the cheapest algorithm for each operation independently may not yield best overall algorithm. E.g.
• merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer level aggregation.
• nested-loop join may provide opportunity for pipelining
• Physical Optimization: Interesting Orders
• Consider the expression (r1 ⋈ r2 ⋈ r3) ⋈ r4 ⋈ r5
• An interesting sort order is a particular sort order of tuples that could be useful for a later operation.
• Generating the result of r1 ⋈ r2 ⋈ r3 sorted on the attributes it shares with r4 or r5 may be useful, but generating it sorted on the attributes shared by only r1 and r2 is not useful.
• Using merge-join to compute r1 ⋈ r2 ⋈ r3 may be costlier, but may provide an output sorted in an interesting order.
• Not sufficient to find the best join order for each subset of the set of n given relations; must find the best join order for each subset, for each interesting sort order
• Simple extension of earlier dynamic programming algorithms
• Usually, number of interesting orders is quite small and doesn’t affect time/space complexity significantly
• Heuristic Optimization
• Cost-based optimization is expensive, even with dynamic programming.
• Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
• Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:
• Perform selection early (reduces the number of tuples)
• Perform projection early (reduces the number of attributes)
• Perform most restrictive selection and join operations before other similar operations.
• Some systems use only heuristics, others combine heuristics with partial cost-based optimization.
• Steps in Typical Heuristic Optimization
• 1. Deconstruct conjunctive selections into a sequence of single selection operations (Equiv. rule 1.).
• 2. Move selection operations down the query tree for the earliest possible execution (Equiv. rules 2, 7a, 7b, 11).
• 3. Execute first those selection and join operations that will produce the smallest relations (Equiv. rule 6).
• 4. Replace Cartesian product operations that are followed by a selection condition by join operations (Equiv. rule 4a).
• 5. Deconstruct lists of projection attributes and move them as far down the tree as possible, creating new projections where needed (Equiv. rules 3, 8a, 8b, 12).
• 6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining.
• Heuristic Join Order: the Wong-Youssefi algorithm (INGRES)
• Sample TPC-H schema: Nation(NationKey, NName), Customer(CustKey, CName, NationKey), Order(OrderKey, CustKey, Status), Lineitem(OrderKey, PartKey, Quantity), Product(SuppKey, PartKey, PName), Supplier(SuppKey, SName)
• Query: find the names of suppliers that sell a product that appears in a line item of an order made by a customer who is in Canada:

```sql
SELECT SName
FROM   Nation, Customer, Order, LineItem, Product, Supplier
WHERE  Nation.NationKey = Customer.NationKey
  AND  Customer.CustKey = Order.CustKey
  AND  Order.OrderKey = LineItem.OrderKey
  AND  LineItem.PartKey = Product.PartKey
  AND  Product.SuppKey = Supplier.SuppKey
  AND  NName = 'Canada'
```
• Challenges with Large Natural Join Expressions
• For simplicity, assume that in the query
• All joins are natural
• whenever two tables in the FROM clause have common attributes, we join on them
• Consider Right-Index only
(Figure: one possible order. Start with an index selection σ_{NName=“Canada”} on Nation, then right-index (RI) joins with Customer, Order, LineItem, Product, Supplier, and a final projection π_{SName}.)
• Wong-Youssefi algorithm assumptions and objectives
• Assumption 1 (weak): Indexes on all join attributes (keys and foreign keys)
• Assumption 2 (strong): At least one selection creates a small relation
• A join with a small relation results in a small relation
• Objective: Create sequence of index-based joins such that all intermediate results are small
• Hypergraphs (figure: the example query as a hypergraph over the attributes CName, CustKey, NationKey, NName, Status, OrderKey, Quantity, PartKey, SuppKey, PName, SName)
• relations are hyperedges
• two hyperedges for the same relation are possible
• each node is an attribute
• can be extended to non-natural equality joins by merging nodes
• Small relations / hypergraph reduction: “Nation” is small because it has the equality selection NName = “Canada”. Pick a small relation (and its conditions) to start the plan (figure: index selection σ_{NName=“Canada”} on Nation).
• (1) Remove the small relation (hypergraph reduction) and color as “small” any relation that joins with the removed “small” relation (here: Customer). (2) Pick a small relation (and its conditions, if any) and join it with the small relation that has been reduced.
• After a number of such steps the complete plan emerges (figure: σ_{NName=“Canada”}(Nation), right-index joins with Customer, Order, LineItem, Product, Supplier, and a final π_{SName}).
• Some Query Optimizers
• System R/Starburst: dynamic programming on left-deep join orders. Also uses heuristics to push selections and projections down the query tree.
• DB2 and SQL Server are cost-based optimizers
• SQL Server is transformation-based and also uses dynamic programming
• The MySQL optimizer is heuristics-based (rather weak)
• Heuristic optimization used in some versions of Oracle:
• Repeatedly pick the “best” relation to join next
• Starting from each of the n possible starting relations; pick the best among these
• Lecture 3
• Query Rewriting
• Equivalence Rules
• Query Optimization
• Dynamic Programming (System R/DB2)
• Heuristics (Ingres/Postgres)
• De-correlation of nested queries
• Result Size Estimation
• Practicum Assignment 2
• Optimizing Nested Subqueries
• SQL conceptually treats nested subqueries in the where clause as functions that take parameters and return a single value or set of values
• Parameters are variables from outer level query that are used in the nested subquery; such variables are called correlation variables
• E.g.:

```sql
select customer-name
from borrower
where exists (select *
              from depositor
              where depositor.customer-name = borrower.customer-name)
```
• Conceptually, nested subquery is executed once for each tuple in the cross-product generated by the outer level from clause
• Such evaluation is called correlated evaluation
• Note: other conditions in where clause may be used to compute a join (instead of a cross-product) before executing the nested subquery
• Optimizing Nested Subqueries (Cont.)
• Correlated evaluation may be quite inefficient since
• a large number of calls may be made to the nested query
• there may be unnecessary random I/O as a result
• SQL optimizers attempt to transform nested subqueries to joins where possible, enabling use of efficient join techniques
• E.g.: the earlier nested query can be rewritten as

```sql
select customer-name
from borrower, depositor
where depositor.customer-name = borrower.customer-name
```
• Note: above query doesn’t correctly deal with duplicates, can be modified to do so as we will see
• In general, it is not possible/straightforward to move the entire nested subquery from clause into the outer level query from clause
• A temporary relation is created instead, and used in body of outer level query
• Optimizing Nested Subqueries (Cont.)
• In general, SQL queries of the form below can be rewritten as shown
• Rewrite:

```sql
select …
from L1
where P1 and exists (select * from L2 where P2)
```

• To:

```sql
create table t1 as
    select distinct V
    from L2
    where P2¹

select …
from L1, t1
where P1 and P2²
```

• P2¹ contains the predicates in P2 that do not involve any correlation variables
• P2² reintroduces the predicates involving correlation variables, with relations renamed appropriately
• V contains all attributes used in predicates with correlation variables
• Optimizing Nested Subqueries (Cont.)
• In our example, the original nested query would be transformed to

```sql
create table t1 as
    select distinct customer-name
    from depositor

select customer-name
from borrower, t1
where t1.customer-name = borrower.customer-name
```
• The process of replacing a nested query by a query with a join (possibly with a temporary relation) is called decorrelation .
• Decorrelation is more complicated when
• the nested subquery uses aggregation, or
• when the result of the nested subquery is used to test for equality, or
• when the condition linking the nested subquery to the other query is not exists ,
• and so on.
• Practicum Assignment 2A
• Get the XML metadata description for TPC-H
• xslt script for plotting histograms
• Take our solution for your second query (assignment 1)
• For each operator in the tree give:
• Selectivity
• Intermediate Result size
• A short description of how you computed this
• An explanation of how to compute histograms on all result columns
• Sum all intermediate result sizes into the total query cost
• The Big Picture
• 1. Parsing and translation
• 2. Optimization
• 3. Evaluation
• Optimization
• Query Optimization : Amongst all equivalent evaluation plans choose the one with lowest cost.
• Cost is estimated using statistical information from the database catalog
• e.g. number of tuples in each relation, size of tuples, etc.
• In this lecture we study logical cost estimation
• introduction to histograms
• estimating the number of tuples in the result with perfect and equi-height histograms
• propagation of histograms into result columns
• How to compute result size from width and #tuples
• Cost Estimation
• Physical cost estimation
• predict I/O blocks, seeks, cache misses, RAM consumption, …
• Depends on the execution algorithm
• In this lecture we study logical cost estimation
• “the plan with the smallest intermediate results tends to be best”
• need estimations for intermediate result sizes
• Histogram-based estimation (practicum, assignment 2)
• estimating the number of tuples in the result with perfect and equi-height histograms
• propagation of histograms into result columns
• compute result size as tuple-width * #tuples
• Selectivities
• select
• expr := X(col, const): X in { <, <=, =, >=, > } | expr && expr | expr || expr
• s_op(R) = |R'| / |R|
• join
• only 1-n / n-1 foreign-key joins
• s_join(R1, R2) = |R'| / (|R1| * |R2|)
• aggr
• s(A(R; g1; a)) = |A(R; g1; a)| / |R| = distinct(R.g1) / |R|
• s(A(R; g1, g2; a)) = (distinct(R.g1) * distinct(R.g2)) / |R|
• project, order
• s_project(R) = s_order(R) = 1
• topn
• s_topN(R) = min(N, |R|) / |R| = min(N / |R|, 1)
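A minimal sketch of these formulas as Python helpers (the function names and the clamp to 1.0 on the aggregation selectivity are assumptions, not part of the practicum framework):

```python
def sel_select(result_size: int, input_size: int) -> float:
    """s_op(R) = |R'| / |R| for a selection."""
    return result_size / input_size

def sel_join(result_size: int, size1: int, size2: int) -> float:
    """s_join(R1, R2) = |R'| / (|R1| * |R2|) for a foreign-key join."""
    return result_size / (size1 * size2)

def sel_aggr(distincts: list, input_size: int) -> float:
    """s(A(R; g1..gn; a)) = (distinct(g1) * ... * distinct(gn)) / |R|,
    clamped since a selectivity cannot exceed 1 (assumed clamp)."""
    prod = 1
    for d in distincts:
        prod *= d
    return min(prod / input_size, 1.0)

def sel_topn(n: int, input_size: int) -> float:
    """s_topN(R) = min(N / |R|, 1)."""
    return min(n / input_size, 1.0)
```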
• Result Size
• #tuples_max * selectivity * #columns
• We disregard differences in column-width
• project:
• #columns = |projectlist|
• #tuples_max = |R|
• aggr:
• #columns = |groupbys| + |aggrs|
• #tuples_max = min(|R|, |g1| * .. * |gn|)
• join:
• #columns = |child1| + |child2|
• #tuples_max = |R1| * |R2|
• other :
• #columns stays equal to that of the child
• #tuples_max = |R|
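Worked instance of the rule above (a sketch; the numbers are illustrative):

```python
def result_size(tuples_max: int, selectivity: float, columns: int) -> float:
    """Result size = #tuples_max * selectivity * #columns (column widths ignored)."""
    return tuples_max * selectivity * columns

# Join of R1 (1000 tuples, 4 columns) with R2 (500 tuples, 3 columns)
# at an assumed join selectivity of 0.002:
print(result_size(1000 * 500, 0.002, 4 + 3))  # 7000.0
```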
• Selectivity estimation
• We can estimate the selectivities using:
• domain constraints
• min/max statistics
• histograms
• Histograms
• Buckets:
• B = <min, max, total, distinct>
• min can be left out (B_i.min = B_{i-1}.max)
• Different Kinds of Histograms
• Perfect
• Equi-width
• Equi-height
• In the practicum we use
• Perfect histograms, when distinct(R.a) < 25
• Equi-height histograms of 10 buckets, otherwise
• Not perfectly equi-height: value ranges are disjoint between buckets
• (i.e. a frequent value is not split over equi-height buckets; it may create a bigger-than-average bucket)
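A sketch of this bucket layout and construction policy in Python (the Bucket class follows the ⟨min, max, total, distinct⟩ definition above; the finite lower bound for the first bucket and the boundary handling are assumptions):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Bucket:
    min: float       # exclusive lower bound (the previous bucket's max)
    max: float       # inclusive upper bound
    total: float     # number of tuples in the bucket (may become fractional later)
    distinct: float  # number of distinct values in the bucket

def build_histogram(values, max_perfect=25, nbuckets=10):
    """Perfect histogram when distinct(R.a) < max_perfect, else ~equi-height
    buckets with disjoint value ranges (a frequent value stays in one bucket,
    which may then grow taller than the target height)."""
    freq = sorted(Counter(values).items())   # (value, count), ascending
    lo = freq[0][0] - 1                      # finite bound below the minimum (numeric data assumed)
    buckets = []
    if len(freq) < max_perfect:              # perfect: one bucket per value
        for v, c in freq:
            buckets.append(Bucket(lo, v, c, 1))
            lo = v
        return buckets
    target = len(values) / nbuckets          # target height per bucket
    total = distinct = 0
    for v, c in freq:
        total += c
        distinct += 1
        if total >= target:                  # close the bucket at a value boundary
            buckets.append(Bucket(lo, v, total, distinct))
            lo, total, distinct = v, 0, 0
    if total:                                # remainder bucket, if any
        buckets.append(Bucket(lo, freq[-1][0], total, distinct))
    return buckets
```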
• Perfect Histograms: Equi-Selection
• s(R.a = C) = B_k.total * (1 / |R|)
• in case there is a k with B_k.max = C
• s(R.a = C) = 0
• otherwise
• (Figure: perfect histogram over the values a, c, d, f; the bucket for d gives s(R.a = d).)
• Perfect Histograms: Range-Selection
• s(R.a < C) = sum(B_i.total) * (1 / |R|),
• over all 1 <= i < k, where B_{k-1}.max < C <= B_k.max
• (Figure: perfect histogram over the values a, c, d, f; the buckets below d give s(R.a < d).)
• Equi-Height Histograms: Equi-Selection
• s(R.a = C) = avg_freq(B_k) * (1 / |R|)
• in case there is a k with B_{k-1}.max < C <= B_k.max
• avg_freq(B_k) = B_k.total / B_k.distinct
• s(R.a = C) = 0
• otherwise
• (Figure: equi-height histogram with bucket boundaries a, d, e, f; the bucket containing c gives s(R.a = c).)
• Equi-Height Histograms: Range-Selection
• s(R.a < C) = (sum(B_i.total) + freq_lt(B_k, C)) * (1 / |R|),
• over all 1 <= i < k, where B_{k-1}.max < C <= B_k.max
• (Figure: equi-height histogram with bucket boundaries a, d, e, f; the full buckets below c plus a fraction of c's bucket give s(R.a < c).)
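A minimal sketch of both estimators, reusing the Bucket class from the sketch above. freq_lt is assumed to interpolate linearly within the bucket, which the slides leave unspecified; with perfect histograms every bucket has distinct = 1, so the same functions apply:

```python
def sel_eq(hist, c, nr_tuples):
    """s(R.a = C) = avg_freq(B_k) / |R| for the bucket containing C, else 0."""
    for b in hist:
        if b.min < c <= b.max:
            return (b.total / b.distinct) / nr_tuples
    return 0.0

def sel_lt(hist, c, nr_tuples):
    """s(R.a < C): full buckets below C plus a fraction of C's bucket."""
    acc = 0.0
    for b in hist:
        if b.max < c:
            acc += b.total                                  # bucket entirely below C
        elif b.min < c <= b.max:
            acc += b.total * (c - b.min) / (b.max - b.min)  # assumed linear interpolation
    return acc / nr_tuples
```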
• Select with And and Or
• Assume no correlation between attributes:
• s(θ_a and θ_c) = s(θ_a) * s(θ_c)
• s(θ_a or θ_c) = s(θ_a) + (1 − s(θ_a)) * s(θ_c)
• Note: must normalize θ_a, θ_c into non-overlapping conditions
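• For example (illustrative numbers): with s(θ_a) = 0.1 and s(θ_c) = 0.2, the conjunction gives 0.1 * 0.2 = 0.02 and the disjunction gives 0.1 + 0.9 * 0.2 = 0.28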
• Foreign-key Join Selectivity/Hitrate Estimation
• Foreign-key constraint:
• R 1 matches at most once with R 2
• “each order matches on average with 7 lineitems” ⇒ hitrate = 7
• But what if R'2 (e.g. order) is an intermediate result?
• R'2 may have multiple occurrences of a key due to a previous join
• R'2 may have fewer key occurrences (missing keys) due to a select (or join)
• Simple approach (practicum):
• hitrate *= |R'2| / |R2|
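• For example (illustrative numbers): if each order matches on average 7 lineitems but a selection kept only half of the order tuples (|R'2| / |R2| = 0.5), the adjusted hitrate is 7 * 0.5 = 3.5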
• Aggr(R,[g1..gn],[..])
• Can only predict groupby columns and size:
• Expected result size =
• min(|R|, distinct(g1) * … * distinct(gn))
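• For example (illustrative numbers): with |R| = 1000, distinct(g1) = 10 and distinct(g2) = 50, the expected result size is min(1000, 10 * 50) = 500 tuples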
• Histogram Propagation
• order: histogram stays identical
• project: histogram stays identical
• Expression (e.g. l_tax*l_price) not required for the practicum
• possible to use cartesian product on histograms, followed by expression evaluation and re-bucketizing.
• topn: not required for the practicum
• Use last bucket (and backwards) to take highest N distinct values and their frequencies
• aggr: not required for the practicum
• Groupbys: distinct is multiplication of distincts, freq=1
• Aggregates: only possible for global aggregates (no groupbys)
• fk-join: multiply totals by the join hitrate
• distinct = min(distinct, total) (this is a simplification!)
• select: multiply totals by the selectivity
• distinct = min(distinct, total)
• Select (selection attribute):
• Get totals/distincts from subset of buckets
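The select and fk-join cases share one propagation step; a sketch reusing the Bucket class from above (the scaling helper and the fractional totals are assumptions):

```python
def scale_histogram(hist, factor):
    """Propagate a histogram through a select (factor = selectivity)
    or an fk-join (factor = hitrate): scale totals, clamp distinct."""
    out = []
    for b in hist:
        total = b.total * factor
        # Simplification from the slides: distinct cannot exceed total.
        out.append(Bucket(b.min, b.max, total, min(b.distinct, total)))
    return out
```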
• Practicum Assignment 2
• Get the XML metadata description for TPC-H
• ps/pdf histograms are also available
• Take our solution for your second query (assignment 1)
• For each operator in the tree give:
• Selectivity
• Intermediate Result size
• A short description of how you computed this
• An explanation of how to compute histograms on all result columns
• Sum all intermediate result sizes into the total query cost