Query Optimization


1. 1. <ul><li>Query Optimizer </li></ul><ul><li>(Chapter 9.0 - 9.6) </li></ul>
2. 2. Optimization <ul><li>Minimizes use of resources by choosing the best of the alternative query access plans </li></ul><ul><li>Considers I/O cost and CPU cost </li></ul><ul><li>Gathers statistics, which may become out of date (DB2: RUNSTATS command) </li></ul><ul><ul><li>E.g. selectivity of values, estimated as 1/domain size, is used to determine the number of tuples with each value </li></ul></ul>
3. 3. Filter Factor FF (selectivity) <ul><li>Fraction of rows with the specified value(s) for a specific attribute that result from the predicate restriction </li></ul><ul><li>FF(c) = (# records satisfying condition c) / (total # of records in relation) </li></ul><ul><li>Estimate for an attribute with i distinct values: </li></ul><ul><ul><li>Assume |R| is the number of rows in table R </li></ul></ul><ul><li>(|R|/i) / |R| = 1/i = 1/col_cardinality </li></ul><ul><li>e.g. (10,000/2)/10,000 = 1/2 </li></ul>
4. 4. Filter Factor FF <ul><li>FF tells what fraction of tuples satisfy the predicate; ideally only those tuples, plus the index, need to be accessed </li></ul><ul><li>Statistical assumptions: uniform distribution of column values, and independent joint distribution of values from any 2 columns </li></ul>
5. 5. Assumptions <ul><li>Attribute values are independent </li></ul><ul><li>Conjunctive select (independent): for C1 and C2, FF = FF(C1) * FF(C2) </li></ul><ul><li>e.g. 1/2 (gender) * 1/4 (class) = 1/8 </li></ul><ul><ul><li>so 1/8 of CS students are female freshmen </li></ul></ul>
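The independence assumption above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `conjunctive_ff` is hypothetical and not part of any DBMS.

```python
from fractions import Fraction
from math import prod


def conjunctive_ff(*factors: Fraction) -> Fraction:
    """Combined filter factor for predicates joined by AND,
    assuming the columns are statistically independent."""
    return prod(factors, start=Fraction(1))


# Slide example: gender (1/2) AND class (1/4)
ff = conjunctive_ff(Fraction(1, 2), Fraction(1, 4))
print(ff)  # 1/8
```

If the columns are correlated, this product can underestimate or overestimate the true selectivity, which is why the slide lists independence as an assumption.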
6. 6. Information for Optimization <ul><li>SYSCOLUMNS: col_name, table_name, # of values, High, Low </li></ul><ul><li>Statistics on columns that deviate strongly from the uniform assumption </li></ul><ul><li>Cluster ratio: how well the clustering property holds for rows with respect to a given index </li></ul><ul><ul><li>Even if 100% clustered, becomes less clustered with updates </li></ul></ul><ul><ul><li>If the cluster ratio is 80% or more, use sequential prefetch </li></ul></ul>
7. 7. Examples of FF <ul><li>If the SQL statement specifies: </li></ul><ul><ul><li>col = const </li></ul></ul><ul><ul><ul><li>DB2 assumes FF = 1/col_cardinality </li></ul></ul></ul><ul><ul><li>col between const1 and const2 </li></ul></ul><ul><ul><ul><li>DB2 assumes FF = (const2 - const1)/(High - Low) </li></ul></ul></ul><ul><li>For some predicates, FF is not predictable by a simple formula </li></ul>
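The two formulas above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the sample column statistics (Low = 0, High = 100) are made up, and clamping the range estimate to [0, 1] is an added assumption rather than something the slides state.

```python
from fractions import Fraction


def ff_equals(col_cardinality: int) -> Fraction:
    """FF for `col = const`: 1 / number of distinct values."""
    return Fraction(1, col_cardinality)


def ff_between(const1: int, const2: int, low: int, high: int) -> Fraction:
    """FF for `col between const1 and const2`, using the column's
    Low and High statistics and a uniform-distribution assumption."""
    if high <= low:
        raise ValueError("High must be greater than Low")
    ff = Fraction(const2 - const1, high - low)
    # Assumption: clamp so out-of-range constants never give FF < 0 or > 1.
    return max(Fraction(0), min(Fraction(1), ff))


print(ff_equals(4))               # 1/4
print(ff_between(20, 30, 0, 100)) # 1/10
```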
8. 8. Explain Plan <ul><li>You can view the query plan with the </li></ul><ul><li>EXPLAIN PLAN FOR SQL_query statement in Oracle </li></ul><ul><li>Gives the access type (e.g. index) and the column </li></ul>
9. 9. Using Indexes <ul><li>System must decide whether to use an index </li></ul><ul><li>If there is more than one index, which one? </li></ul><ul><li>What if there is a composite index? </li></ul>
10. 10. Plans using Indexes <ul><li>Can use an index if the index matches a select condition in the where clause: </li></ul><ul><li>Matching index scan – only have to access a limited number of contiguous leaf entries to access data </li></ul><ul><li>Predicate screening – use index entries to eliminate RIDs </li></ul><ul><li>Non-matching index scan – use index to identify RIDs </li></ul><ul><li>Index-only retrieval – don’t have to access data, only the index </li></ul><ul><li>Multiple index retrieval – use >1 index to identify RIDs </li></ul>
11. 11. Indexes – Matching index scan <ul><li>A matching index scan is a single-step query plan </li></ul><ul><li>Only have to access contiguous leaf nodes </li></ul><ul><li>Example: assume a table T1 with multiple indexes on columns C1, C2 and C3 </li></ul><ul><li>Single where clause and (one) index matches </li></ul><ul><li>Select * from T1 where C1=10 </li></ul><ul><li>Search the B+-tree to leaf level for the leftmost entry having the specified value; useful for =, between </li></ul>
12. 12. Index Scan <ul><li>If multiple where clauses and all '=' </li></ul><ul><li>Select * from T1 where C1=10 and C2=5 and C3=1 </li></ul><ul><li>a) If there is a composite index and the select condition matches all index columns, only have to read contiguous leaf pages </li></ul><ul><ul><li>FF = FF(P1) * FF(P2) * ... </li></ul></ul><ul><li>b) If there is a separate index for each clause, must choose one of the indexes </li></ul>
13. 13. Index Scan – Predicate screening <ul><li>3. If all select conditions match composite index columns and some selects are a range </li></ul><ul><li>Select * from T1 where C1=10 and C2 between 5 and 50 and C3 like ‘A%’ </li></ul><ul><li>- Access contiguous leaf pages, but not every entry on those pages is in the result </li></ul><ul><li>- Must examine index entries to determine whether each is in the result; this is called predicate screening </li></ul>
14. 14. Predicate screening <ul><li>Discard RIDs based on the column values stored in the index entries </li></ul><ul><li>Accesses fewer tuples, because candidates are eliminated using the index entries rather than by reading the rows </li></ul>
15. 15. Index Scan <ul><li>4. If select conditions match some index columns of a composite index </li></ul><ul><li>Select * from T1 where C1=10 and C2=30 and C6=20 </li></ul><ul><li>- A matching scan can be used if at least one of the columns in the select condition is the first column of the index </li></ul><ul><ul><li>Eliminate tuples with whatever index columns you can, then examine the remaining tuples </li></ul></ul>
16. 16. Rules for predicate matching <ul><li>Decide how many attributes to match in a composite index after the first column, so that only a small contiguous range of leaf entries in the B+-tree is read to get RIDs </li></ul><ul><li>Match first column of composite index, then: </li></ul><ul><ul><li>Look at index columns from left to right </li></ul></ul><ul><ul><li>Match ends when no predicate is found for a column </li></ul></ul><ul><ul><li>If the predicate on a column is a range (<=, like, between), the match terminates after that column </li></ul></ul><ul><li>If a range, it is easier to scan all entries in the range and treat the remaining predicates as screening predicates </li></ul>
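The left-to-right matching rules above can be sketched as a small function. This is an illustration under stated assumptions, not DB2's actual algorithm: the name `split_predicates` and the `"eq"` / `"range"` labels are hypothetical, and predicates on columns outside the index are simply ignored here.

```python
def split_predicates(index_columns, predicates):
    """Split predicates on a composite index into matching and screening.

    index_columns: index columns in left-to-right order.
    predicates: dict mapping column name -> "eq" or "range".
    Returns (matching_columns, screening_columns).
    """
    matching, screening = [], []
    still_matching = True
    for col in index_columns:
        kind = predicates.get(col)
        if kind is None:
            # No predicate on this column: the match ends here.
            still_matching = False
            continue
        if still_matching:
            matching.append(col)
            if kind == "range":
                # A range predicate is the last matching column.
                still_matching = False
        else:
            # Later indexed predicates can only screen index entries.
            screening.append(col)
    return matching, screening


index = ["C1", "C2", "C3"]
# C1=10 and C2 between 5 and 50 and C3 like 'A%'
print(split_predicates(index, {"C1": "eq", "C2": "range", "C3": "range"}))
# C2=30 and C3=15: first index column is missing, so nothing matches
print(split_predicates(index, {"C2": "eq", "C3": "eq"}))
```

An empty matching list corresponds to the non-matching index scan case, where every leaf page has to be read.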
17. 17. Non-matching index scan <ul><li>Not always used by DBMSs </li></ul><ul><li>Attributes in the where clause don't include the initial attribute of the index </li></ul><ul><li>Select * from T1 where C2=30 and C3=15 </li></ul><ul><li>Search the leaf entries of the index and compare values for each entry </li></ul><ul><li>Must read in all leaf pages to find the C2, C3 values </li></ul><ul><ul><li>e.g. 50 index pages vs 500,000 data pages </li></ul></ul>
18. 18. Index only retrieval <ul><li>Elements retrieved in the select clause are attributes of the composite index </li></ul><ul><li>Don't need to access rows (actual data) </li></ul><ul><li>Select C1, C3 from T1 where C1=5 and C3 between 2 and 5 </li></ul><ul><li>Select count(*) from T1 </li></ul>
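The condition for index-only retrieval amounts to a subset test: every column the query touches must be stored in the index. A minimal sketch, with the hypothetical helper name `is_index_only` and an assumed composite index on (C1, C2, C3):

```python
def is_index_only(select_columns, where_columns, index_columns):
    """True if every column referenced by the query is in the index,
    so the data rows never have to be read."""
    needed = set(select_columns) | set(where_columns)
    return needed <= set(index_columns)


index = ["C1", "C2", "C3"]
# Select C1, C3 from T1 where C1=5 and C3 between 2 and 5
print(is_index_only(["C1", "C3"], ["C1", "C3"], index))  # True
# Select count(*) from T1 references no columns at all
print(is_index_only([], [], index))                      # True
# A column outside the index forces access to the rows
print(is_index_only(["C1", "C4"], ["C1"], index))        # False
```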
19. 19. Multiple Index Access <ul><li>If conjunctive conditions in where clause (and), can use >1 index </li></ul><ul><ul><li>Extract RIDs from each index satisfying the matching predicate </li></ul></ul><ul><ul><li>Intersect the lists of RIDs (and them) from each index </li></ul></ul><ul><ul><li>Final list satisfies all indexed predicates </li></ul></ul><ul><li>If disjunctive conditions (or) </li></ul><ul><ul><li>Union the lists of RIDs </li></ul></ul>
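Combining RID lists is set intersection for AND and set union for OR. The sketch below assumes a RID is a `(page, slot)` tuple and uses made-up RID values; both the representation and the function names are illustrative only.

```python
def and_rid_lists(*rid_lists):
    """RIDs present in every list (conjunctive predicates)."""
    if not rid_lists:
        return []
    result = set(rid_lists[0])
    for rids in rid_lists[1:]:
        result &= set(rids)
    return sorted(result)


def or_rid_lists(*rid_lists):
    """RIDs present in at least one list (disjunctive predicates)."""
    return sorted(set().union(*rid_lists))


# Hypothetical RID lists extracted from two different indexes
from_index_a = [(1, 0), (1, 3), (7, 2)]
from_index_b = [(1, 3), (4, 1), (7, 2)]
print(and_rid_lists(from_index_a, from_index_b))  # [(1, 3), (7, 2)]
print(or_rid_lists(from_index_a, from_index_b))   # [(1, 0), (1, 3), (4, 1), (7, 2)]
```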
20. 20. Some query optimizer rules for using RID lists (then use list prefetch) <ul><li>1. Predicted number of active resulting RIDs must not be > 50% of the RID pool </li></ul><ul><li>2. No single RID list may exceed the size of the RID memory pool (16M RIDs) </li></ul><ul><li>3. RID list cannot be generated by screening predicates </li></ul>
21. 21. Rules for multiple indexes <ul><li>Optimizer determines the point of diminishing returns when using multiple index access </li></ul><ul><li>1. List indexes with matching predicates in where clause </li></ul><ul><li>2. Place indexes in order by increasing filter factor </li></ul><ul><li>3. For successive indexes, extract the RID list only if it reduces the cost for the final rows returned </li></ul><ul><ul><li>e.g. no sense reading 100's of pages of a new index to reduce the number of rows to only 1 tuple </li></ul></ul>
22. 22. List Prefetch – for accessing rows with RID list <ul><li>Assume once a list of RIDs is created, the system can order pages to minimize disk I/O </li></ul><ul><ul><li>E.g. elevator algorithm for disk request scheduling </li></ul></ul>
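One simple way to picture list prefetch is to reduce the RID list to the distinct pages it touches and visit them in ascending order, so the disk arm sweeps in one direction. This is a sketch of that idea only, assuming RIDs are `(page, slot)` tuples; real systems schedule I/O in more sophisticated ways.

```python
def pages_in_fetch_order(rids):
    """Distinct page numbers from a RID list, in ascending order.

    Each page is read once even if several qualifying rows live on it.
    """
    return sorted({page for page, _slot in rids})


# Hypothetical RID list in the order the index produced it
rids = [(90, 1), (12, 4), (90, 7), (3, 0), (12, 2)]
print(pages_in_fetch_order(rids))  # [3, 12, 90]
```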
23. 23. Problem: Using RID lists with Multiple Indexes <ul><li>Prospects table: 50M rows, 10 rows per page </li></ul><ul><li>Indexes: </li></ul><ul><ul><li>zipcode – 100,000 values (100 entries per page) </li></ul></ul><ul><ul><li>hobby – 100 values (1000 entries per page) </li></ul></ul><ul><ul><li>age – 50 values (1000 entries per page) </li></ul></ul><ul><ul><li>incomeclass – 10 values (1000 entries per page) </li></ul></ul>
24. 24. Problem cont’d <ul><li>Select name, stradr from prospects </li></ul><ul><li>where zipcode between 02159 and 02658 </li></ul><ul><li>and age = 40 and hobby = ‘chess’ and incomeclass = 10; </li></ul><ul><li>FF in ascending order: </li></ul><ul><li>FF(zipcode) = 500/100,000 = 1/200 </li></ul><ul><li>FF(hobby) = 1/100 </li></ul><ul><li>FF(age) = 1/50 </li></ul><ul><li>FF(incomeclass) = 1/10 </li></ul>
25. 25. Problem cont’d <ul><li>Pages in table: 50,000,000 rows / 10 rows per page = 5,000,000 </li></ul><ul><li>Data rows read if indexes are used: </li></ul><ul><li>(1) 50,000,000/200 = 250,000 </li></ul><ul><li>(1,2) 250,000/100 = 2500 </li></ul><ul><li>(1,2,3) 2500/50 = 50 </li></ul><ul><li>(1,2,3,4) 50/10 = 5 </li></ul><ul><li>How much time will this take? Is it cost effective to use all of these indexes? </li></ul><ul><li>see textbook Pg. 579 </li></ul>
26. 26. Problem cont’d I/O costs <ul><li>Cost per page read, in seconds: </li></ul><ul><ul><li>RIO (random I/O): 1/80 </li></ul></ul><ul><ul><li>Sequential prefetch: 1/800 </li></ul></ul><ul><ul><li>List prefetch: 1/200 </li></ul></ul><ul><li>Note: textbook assumes that if <= 3 pages are read, RIO is used </li></ul>
27. 27. Problem cont’d <ul><li>Table scan </li></ul><ul><ul><li>5,000,000/800 = 6250 </li></ul></ul><ul><li>Using index 1: </li></ul><ul><li>index: (500,000/200 + 3)/800 = 3.13, rounded up to 4 </li></ul><ul><li>data: 250,000/200 = 1250 </li></ul><ul><li>Using indexes 1&2: </li></ul><ul><li>index: (50,000/100)/800 = 0.625 </li></ul><ul><li>data: 2500/200 = 12.5 </li></ul>
28. 28. Problem cont’d <ul><li>Using indexes 1,2,3: </li></ul><ul><li>index (50,000/50)/800 = 1.25 </li></ul><ul><li>data: 50/200 = 0.25 </li></ul><ul><li>Using indexes 1,2,3,4: </li></ul><ul><li>index (50,000/10)/800 = 6.25 </li></ul><ul><li>data: 5/200 = 0.025 </li></ul>
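29. 29. Problem cont’d <table>
<tr><th>Index used</th><th>Data rows</th><th>Data I/O cost</th><th>Index I/O cost</th><th>Cost increase if use index</th></tr>
<tr><td>None</td><td>50M</td><td>6250 s</td><td></td><td></td></tr>
<tr><td>1</td><td>250,000</td><td>1250 s</td><td>4 s</td><td>Decrease 6250 to 1250 s with 4 additional s</td></tr>
<tr><td>1,2</td><td>2500</td><td>12.5 s</td><td>4 + 0.625 s</td><td>Decrease 1250 to 12.5 s with 0.625 additional s</td></tr>
<tr><td>1,2,3</td><td>50</td><td>0.25 s</td><td>4 + 0.625 + 1.25 s</td><td>Decrease 12.5 to 0.25 s with 1.25 additional s</td></tr>
<tr><td>1,2,3,4</td><td>5</td><td>0.025 s</td><td>4 + 0.625 + 1.25 + 6.25 s</td><td>Decrease 0.25 to 0.025 s with 6.25 additional s</td></tr>
</table>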
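The worked problem can be replayed as a greedy loop: take the indexes in order of increasing filter factor and stop as soon as reading the next index costs more than it saves in data I/O. This is a sketch under the problem's stated figures, not the textbook's or DB2's actual procedure. The function name `choose_indexes` is hypothetical, leaf-page counts are derived from 50M entries divided by entries per page, and the 3 extra pages the slides add for the zipcode index are left out, so that index costs 3.125 s here rather than the 4 s the slides use.

```python
from fractions import Fraction

TABLE_ROWS = 50_000_000
TABLE_PAGES = 5_000_000   # 10 rows per page
SEQ_PREFETCH = 800        # pages per second
LIST_PREFETCH = 200       # pages per second

# (name, filter factor, index leaf pages), in order of increasing FF
INDEXES = [
    ("zipcode", Fraction(1, 200), 500_000),
    ("hobby", Fraction(1, 100), 50_000),
    ("age", Fraction(1, 50), 50_000),
    ("incomeclass", Fraction(1, 10), 50_000),
]


def choose_indexes(table_rows, table_pages, indexes):
    """Return (chosen index names, rows left to fetch, data I/O seconds)."""
    data_cost = Fraction(table_pages, SEQ_PREFETCH)  # full table scan
    rows = Fraction(table_rows)
    chosen = []
    for name, ff, leaf_pages in indexes:
        # Leaf pages scanned for this predicate, read with sequential prefetch.
        index_cost = leaf_pages * ff / SEQ_PREFETCH
        new_rows = rows * ff
        # Assume each remaining row is on its own page, read by list prefetch.
        new_data_cost = new_rows / LIST_PREFETCH
        if index_cost >= data_cost - new_data_cost:
            break  # diminishing returns: index costs more than it saves
        chosen.append(name)
        rows, data_cost = new_rows, new_data_cost
    return chosen, rows, data_cost


chosen, rows, data_cost = choose_indexes(TABLE_ROWS, TABLE_PAGES, INDEXES)
print(chosen, rows, float(data_cost))
```

Under these assumptions the loop stops before incomeclass: its index scan costs 6.25 s but would only cut data I/O from 0.25 s to 0.025 s.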