Download

276 views
233 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
276
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • The problem with frequency values is when the number of possible values is large, it is detrimental to attempt to collect all skewed values. Collecting more than 100 values can result in increased bind/prepare times. So while frequency statistics are beneficial for low cardinality columns, or columns with very few skewed values, they do not help with large number of skewed values, or when skew occurs across data ranges. Range frequency or histograms, allow you to collect range skew. RUNSTATS will attempt to get same number of rows in each bucket. So that each histogram (bucket) has an “equal-depth”.
  • RUNSTATS will try to avoid big gaps inside quantiles. If you see a big gap, the gap could be a good choice to split. Gaps are best to have “between” quantiles, and not “within”. Null value is interesting for optimizer to know, and is therefore stored in it’s own quantile. Using RUNSTATS to collect histogram will collect the quantile for Null value be default. When collection frequency values, it depends on whether Null is included in the top n frequent values.
  • Optimizer doesn’t know that 200655 is not a valid value. So there is a big gap between 200612 and 200701. Optimizer will assume the data is evenly distributed between high2key and low2key, 10% of rows are estimated to be returned in this example.
  • Using histogram, we collect frequencies for two ranges (buckets). 200601 to 200612 in one bucket, 200701 to 200712 in another bucket. The result shows 45% of rows are estimated to be returned for this example. Where is the histogram captured in the catalog? In table SYSCOLDIST, there is a column TYPE. When the TYPE is H, the row represent the histogram.
  • Download

    1. 1.  IBM Corporation 2003 Tampa Bay Relational Users Group IBM Silicon Valley Lab, U.S.A.
    2. 2. Filter factor issues <ul><li>Filter factor accuracy important for... </li></ul><ul><ul><li>Index matching </li></ul></ul><ul><ul><ul><li>Accurately estimate index cost </li></ul></ul></ul><ul><ul><li>Total index filtering </li></ul></ul><ul><ul><ul><li>Estimate table access cost via index(es) </li></ul></ul></ul><ul><ul><ul><li>Choose how to use index (prefetch?) </li></ul></ul></ul><ul><ul><li>Total table level filtering </li></ul></ul><ul><ul><ul><li>Efficient join order </li></ul></ul></ul><ul><ul><ul><li>Efficient join method </li></ul></ul></ul><ul><ul><ul><li>Appropriate sorts </li></ul></ul></ul>
    3. 3. Terminology <ul><li>Correlation </li></ul><ul><ul><li>When data on two columns is not independent </li></ul></ul><ul><ul><li>Eg. CITY, STATE </li></ul></ul><ul><ul><ul><li>Every city does not exist in every state. </li></ul></ul></ul><ul><li>Data Skew (or skew) </li></ul><ul><ul><li>Describes situation where data is non-uniformly distributed </li></ul></ul><ul><ul><li>Data can be point-skewed on a value or skewed over a range </li></ul></ul><ul><ul><li>Eg. Gender </li></ul></ul><ul><ul><ul><li>Domain (M, F) </li></ul></ul></ul><ul><ul><ul><li>35% = M, 65% = F </li></ul></ul></ul>
    4. 4. Terminology (cont.) <ul><li>MFREQ: Multi-column frequency </li></ul><ul><ul><li>Frequency on concatenated column group </li></ul></ul><ul><ul><li>MFREQ(C1,C2,C3) </li></ul></ul><ul><li>MCARD: Multi-column cardinality </li></ul><ul><ul><li>Multi-column cardinality on a column group </li></ul></ul><ul><ul><li>MCARD(C1,C2,C3) </li></ul></ul>
    5. 5. Selectivity statistics <ul><li>Single column </li></ul><ul><ul><li>Cardinality </li></ul></ul><ul><ul><li>HIGH2KEY/LOW2KEY </li></ul></ul><ul><ul><li>Frequency </li></ul></ul><ul><li>Multi-column </li></ul><ul><ul><li>Cardinality </li></ul></ul><ul><ul><li>Frequency </li></ul></ul><ul><ul><li>Histogram </li></ul></ul>
    6. 6. Single column cardinality <ul><li>Single column cardinality </li></ul><ul><ul><li>Number of distinct values for a column </li></ul></ul><ul><ul><li>Assumes uniform distribution </li></ul></ul><ul><ul><li>Stored as </li></ul></ul><ul><ul><ul><li>SYSCOLUMNS.COLCARDF </li></ul></ul></ul><ul><ul><ul><li>SYSINDEXES.FIRSTKEYCARDF </li></ul></ul></ul><ul><ul><li>Used when better statistics can't be used... </li></ul></ul><ul><ul><ul><li>Host variables, parameter markers, special registers </li></ul></ul></ul><ul><ul><ul><li>No other statistics available </li></ul></ul></ul>
    7. 7. RUNSTATS column cardinality <ul><li>How to collect </li></ul><ul><ul><li>RUNSTATS command: </li></ul></ul><ul><ul><li>RUNSTATS TABLESPACE (DBNAME.TSNAME) </li></ul></ul><ul><ul><li>TABLE (ALL or PAT_TABLE) </li></ul></ul><ul><ul><li>COLUMN(ALL or <list of columns>) </li></ul></ul><ul><ul><li>Leading column of index when RUNSTATS on index performed </li></ul></ul><ul><ul><li>RUNSTATS INDEX (PAT_INDEX) </li></ul></ul><ul><ul><li>RUNSTATS TABLESPACE (DBNAME.TSNAME) </li></ul></ul><ul><ul><li>INDEX(ALL) </li></ul></ul>
    8. 8. Single column cardinality <ul><li>For equals predicate, filter factor = 1/COLCARDF </li></ul><ul><ul><li>Index pages --> probe + matching FF * NLEAF </li></ul></ul><ul><ul><ul><li>3 + (1/5) * 10,000 = 2003 index pages </li></ul></ul></ul><ul><ul><li>Index record ids processed = CARDF * matching index filtering </li></ul></ul><ul><ul><ul><li>100,000 * (1/5) = 20,000 </li></ul></ul></ul><ul><ul><li>Rows returned = CARDF * total filtering </li></ul></ul><ul><ul><ul><li>100,000 * (1/5) = 20,000 </li></ul></ul></ul>Select C2 from T1 Where C1 = ? Index I1: C1,C2,C3 T1 CARDF = 100,000 C1 COLCARDF = 5 I1 NLEAF = 10,000 I1 NLEVELS = 3
    9. 9. Single column cardinality <ul><li>Two matching predicates, multiply filter factors </li></ul><ul><ul><li>Index pages --> probe + matching FF * NLEAF </li></ul></ul><ul><ul><ul><li>3 + [(1/5) * (1/10)] * 10,000 = ~203 index pages </li></ul></ul></ul><ul><ul><li>Index record ids processed = CARDF * matching index filtering </li></ul></ul><ul><ul><ul><li>100,000 * [(1/5) * (1/10)] =~ 2,000 </li></ul></ul></ul><ul><ul><li>qualified rows = CARDF * total filtering </li></ul></ul><ul><ul><ul><li>100,000 * [(1/5) * (1/10)] =~ 2,000 rows </li></ul></ul></ul>Select C3 from T1 Where C1 = ? AND C2 = ? Index I1: C1,C2,C3 C1 COLCARDF = 5 C2 COLCARDF = 10 FULLKEYCARDF 65,000
    10. 10. Single column cardinality <ul><li>Range predicate with parameter marker </li></ul><ul><ul><li>Use default interpolation filter factor chart </li></ul></ul><ul><ul><ul><li>COLCARDF 10,121 --> FF = 1/100 </li></ul></ul></ul><ul><ul><li>In reality, could qualify anywhere from all to no rows </li></ul></ul><ul><ul><ul><li>Here's another sample predicate: </li></ul></ul></ul><ul><ul><ul><ul><li>BIRTH_DATE <= ? </li></ul></ul></ul></ul><ul><ul><ul><li>How many people in room born before parameter marker? </li></ul></ul></ul><ul><ul><ul><ul><li>What if value is '1930-01-01'? </li></ul></ul></ul></ul><ul><ul><ul><ul><li>What if value is '1980-01-01? </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Cannot accurately estimate without literal value </li></ul></ul></ul></ul>Select C2 from T1 Where C1 > ? Index I1: C1,C2,C3 C1 COLCARDF = 10,121
    11. 11. Range predicate interpolation <ul><li>Table 104. Default filter factors for interpolation </li></ul><ul><ul><li>COLCARDF Factor for Op Factor for LIKE/BETWEEN </li></ul></ul><ul><ul><li>>=100,000,000 1/10,000 3/100,000 </li></ul></ul><ul><li>>=10,000,000 1/3,000 1/10,000 </li></ul><ul><li>>=1,000,000 1/1,000 3/10,000 </li></ul><ul><li>>=100,000 1/300 1/1,000 </li></ul><ul><li>>=10,000 1/100 3/1,000 </li></ul><ul><li>>=1,000 1/30 1/100 </li></ul><ul><li>>=100 1/10 3/100 </li></ul><ul><li>>=0 1/3 1/10 </li></ul><ul><li>Note: Op is one of these operators: <, <=, >, >=. </li></ul><ul><li>COMMENT: This is DB2’s documented guess for an impossible to estimate </li></ul><ul><li>Filter factor. </li></ul>
    12. 12. Single column cardinality <ul><li>Matching cost </li></ul><ul><ul><li>Index pages --> probe + matching FF * NLEAF </li></ul></ul><ul><ul><ul><li>probe + (1/10) * 10,000 =~ 1,003 pages </li></ul></ul></ul><ul><ul><li>Index rows processed = CARDF * Matching FF </li></ul></ul><ul><ul><ul><li>1,000,000 * (1/10) = 100,000 rows </li></ul></ul></ul><ul><li>Screening </li></ul><ul><ul><li>Rows to access table for = CARDF * (Matching and screening FF) </li></ul></ul><ul><ul><ul><li>1,000,000 * [(1/10) * (1/100)] =~ 1,000 rows??? </li></ul></ul></ul>CARDF = 1 million NPAGES/F = 100,000 NLEAF = 10,000 C1 COLCARDF 10 C3 COLCARDF 10,121 clusterratiof 50 Select C4 from T1 Where C1 = ? AND C3 > ? Index I1: C1,C2,C3
    13. 13. HIGH2KEY/LOW2KEY <ul><li>HIGH2KEY/LOW2KEY </li></ul><ul><ul><li>Single column statistic </li></ul></ul><ul><ul><ul><li>SYSCOLUMNS.HIGH2KEY </li></ul></ul></ul><ul><ul><ul><li>SYSCOLUMNS.LOW2KEY </li></ul></ul></ul><ul><ul><li>When used? </li></ul></ul><ul><ul><ul><li>Interpolation used to estimate range predicates </li></ul></ul></ul><ul><ul><ul><li>Like, between, <, <=, >, >= </li></ul></ul></ul><ul><ul><ul><li>Literal value must be known </li></ul></ul></ul><ul><ul><ul><li>As domain statistics when COLCARDF = 1 or 2 </li></ul></ul></ul><ul><ul><ul><li>Can be used in combination with single column frequencies for more accurate estimate. </li></ul></ul></ul><ul><ul><ul><li>DB2 Interpolation: Technique to estimate the percentage of rows which qualify based on known high / low values. </li></ul></ul></ul>
    14. 14. RUNSTATS HIGH2KEY / LOW2KEY <ul><li>How to collect </li></ul><ul><ul><li>Whenever single column cardinality collected, HIGH2KEY / LOW2KEY also collected. </li></ul></ul><ul><ul><li>Reference RUNSTATS COLUMN CARDINALITY slide </li></ul></ul>
    15. 15. Linear interpolation <ul><li>Matching cost - same as before </li></ul><ul><li>Screening </li></ul><ul><ul><li>Rows to access table for = CARDF * (Matching + screening FF) </li></ul></ul><ul><ul><li>1,000,000 * [(1/10) * (screening FF)] =~ ??? </li></ul></ul><ul><ul><li>Interpolation for C3 > 50 </li></ul></ul><ul><ul><ul><li>(HIGH2KEY - LITVALUE) / (HIGH2KEY - LOW2KEY) </li></ul></ul></ul><ul><ul><ul><li>(100 - 50) / (100 - 0) = 50/100 = 0.5 </li></ul></ul></ul><ul><ul><li>1,000,000 * [(1/10) * (0.5)] =~ 50,000 rows (vs 1000 with def FF) </li></ul></ul>CARDF = 1 million NPAGES/F = 100,000 C3 COLCARDF 10,241 LOW2KEY 0 HIGH2KEY 100 clusterratiof 50 Select C4 from T1 Where C1 = ? AND C3 > 50 Index I1: C1,C2,C3
    16. 16. Single column frequencies <ul><li>Single column frequencies </li></ul><ul><ul><li>SYSCOLDIST.FREQUENCYF </li></ul></ul><ul><ul><ul><li>TYPE = 'F', NUMCOLUMNS = 1 </li></ul></ul></ul><ul><ul><li>Provides non-uniform distribution information </li></ul></ul><ul><ul><ul><li>Data skew </li></ul></ul></ul><ul><ul><li>When used? </li></ul></ul><ul><ul><ul><li>Literal value must be known </li></ul></ul></ul><ul><ul><ul><li>Equals, is null, in </li></ul></ul></ul><ul><ul><ul><li>Like, between, <, <=, >, >= </li></ul></ul></ul><ul><ul><ul><li>Used in conjunction with other complementary statistics </li></ul></ul></ul>
    17. 17. RUNSTATS Frequency <ul><li>Leading indexed column </li></ul><ul><ul><li>PAT_INDEX (C1,C2,C3) </li></ul></ul><ul><ul><li>Top 10 values </li></ul></ul><ul><ul><ul><li>RUNSTATS INDEX (PAT_INDEX) </li></ul></ul></ul><ul><ul><li>Top 20 values </li></ul></ul><ul><ul><ul><li>RUNSTATS INDEX </li></ul></ul></ul><ul><ul><ul><li>(PAT_INDEX FREQVAL NUMCOLS(1) COUNT(20)) </li></ul></ul></ul><ul><ul><li>Top 0 values (purge frequencies) </li></ul></ul><ul><ul><ul><li>(PAT_INDEX FREQVAL NUMCOLS(1) COUNT(0)) </li></ul></ul></ul>
    18. 18. RUNSTATS Frequency <ul><li>How to collect </li></ul><ul><ul><li>RUNSTATS COLGROUP allows collection on almost any column </li></ul></ul><ul><ul><ul><li>RUNSTATS TABLESPACE DB1.TS1 </li></ul></ul></ul><ul><ul><ul><li>TABLE (PAT_TABLE) COLUMNS(C1,C2,C3) </li></ul></ul></ul><ul><ul><ul><li>COLGROUP (C1) FREQVAL COUNT(1) MOST </li></ul></ul></ul><ul><ul><ul><li>COLGROUP (C2) FREQVAL COUNT(10) LEAST </li></ul></ul></ul><ul><ul><ul><li>COLGROUP (C3) FREQVAL COUNT(20) BOTH </li></ul></ul></ul><ul><ul><li>Eliminate existing frequencies: </li></ul></ul><ul><ul><ul><li>COLGROUP (C3) FREQVAL COUNT(0) MOST </li></ul></ul></ul>
    19. 19. Single column frequency <ul><li>Value which exists a lot </li></ul><ul><ul><li>Index pages --> probe + matching FF * NLEAF </li></ul></ul><ul><ul><ul><li>3 + (0.75) * 10,000 =~ 7,503 index pages </li></ul></ul></ul><ul><ul><li>Index record ids processed = CARDF * matching index filtering </li></ul></ul><ul><ul><ul><li>100,000 * (0.75) =~ 75,000 </li></ul></ul></ul><ul><ul><li>Rows returned = CARDF * total filtering </li></ul></ul><ul><ul><ul><li>100,000 * (0.75) =~ 75,000 </li></ul></ul></ul>Select C4 from T1 Where C1 = 'A' Index I1: C1,C2,C3 T1 CARDF = 100,000 C1 COLCARDF = 5 I1 NLEAF = 10,000 I1 NLEVELS = 3 C1 FREQ 'A' 0.75 'B' 0.15 'C' 0.05 'D' 0.03 'E' 0.02 clusterratiof 50
    20. 20. Single column frequency <ul><li>Looking for value not in the domain.... </li></ul><ul><ul><li>Index pages --> probe + matching FF * NLEAF </li></ul></ul><ul><ul><ul><li>3 + (~0) * 10,000 =~ 1 index page </li></ul></ul></ul><ul><ul><li>Filter factor without the frequencies </li></ul></ul><ul><ul><ul><li>1/5 = 0.20 </li></ul></ul></ul>Select C4 from T1 Where C1 = 'Z' Index I1: C1,C2,C3 T1 CARDF = 100,000 C1 COLCARDF = 5 I1 NLEAF = 10,000 I1 NLEVELS = 3 C1 FREQ 'A' 0.75 'B' 0.15 'C' 0.05 'D' 0.03 'E' 0.02 clusterratiof 0.50 ?
    21. 21. Single column frequency <ul><li>Some in domain, some not... </li></ul><ul><ul><li>Index pages --> probe + matching FF * NLEAF </li></ul></ul><ul><ul><ul><li>3 + (0.05 + 0.0 + 0.02) * 10,000 = ~ 700 index page </li></ul></ul></ul><ul><ul><li>Rows returned = CARDF * total filtering </li></ul></ul><ul><ul><ul><li>100,000 * (0.05 + 0.0 + 0.02) = ~ 7000 </li></ul></ul></ul><ul><ul><li>Without frequencies filter factor = 3/5 = 0.60 versus 0.07 </li></ul></ul>Select C2 from T1 Where C1 in ('C', 'Z', 'E') Index I1: C1,C2,C3 T1 CARDF = 100,000 C1 COLCARDF = 5 I1 NLEAF = 10,000 I1 NLEVELS = 3 C1 FREQ 'A' 0.75 'B' 0.15 'C' 0.05 'D' 0.03 'E' 0.02
    22. 22. Single column histograms <ul><li>Single column frequencies </li></ul><ul><ul><li>SYSCOLDIST.FREQUENCYF </li></ul></ul><ul><ul><ul><li>TYPE = ‘H', NUMCOLUMNS = 1 </li></ul></ul></ul><ul><ul><li>Provides non-uniform distribution information </li></ul></ul><ul><ul><ul><li>Range skew </li></ul></ul></ul><ul><ul><li>When used? </li></ul></ul><ul><ul><ul><li>Literal value must be known </li></ul></ul></ul><ul><ul><ul><li>Equals, is null, in </li></ul></ul></ul><ul><ul><ul><li>Like, between, <, <=, >, >= </li></ul></ul></ul><ul><ul><ul><li>Used in conjunction with other complementary statistics </li></ul></ul></ul>
    23. 23. Histograms <ul><li>How to collect </li></ul><ul><ul><li>RUNSTATS COLGROUP allows collection on almost any column </li></ul></ul><ul><ul><ul><li>RUNSTATS TABLESPACE DB1.TS1 </li></ul></ul></ul><ul><ul><ul><li>TABLE (PAT_TABLE) COLUMNS(C1,C2,C3) </li></ul></ul></ul><ul><ul><ul><li>COLGROUP (C2) HISTOGRAM NUMQUANTILES 20 </li></ul></ul></ul><ul><ul><ul><li>INDEX (I1 KEYCARD </li></ul></ul></ul><ul><ul><ul><li>HISTOGRAM NUMCOLS 1 NUMQUANTILES 30 </li></ul></ul></ul><ul><ul><ul><li>,I2 KEYCARD </li></ul></ul></ul><ul><ul><ul><li>HISTOGRAM NUMCOLS 1 NUMQUANTILES 50 </li></ul></ul></ul><ul><ul><ul><li>) </li></ul></ul></ul><ul><ul><li>- 20 quantiles for column C2. </li></ul></ul><ul><ul><li>30 quantiles for leading column of index I1. </li></ul></ul><ul><ul><li>50 quantiles for leading column of index I3. </li></ul></ul><ul><ul><li>Eliminate existing histograms: </li></ul></ul><ul><ul><ul><li>COLGROUP (C3) HISTOGRAM COUNT(0) MOST </li></ul></ul></ul>
    24. 24. RUNSTATS Histogram Statistics <ul><li>RUNSTATS will produce equal-depth histogram </li></ul><ul><ul><li>Each quantile (range) will have approx same number of rows </li></ul></ul><ul><ul><ul><li>Not same number of values </li></ul></ul></ul><ul><ul><li>Another term is range frequency </li></ul></ul><ul><li>Example </li></ul><ul><ul><ul><li>1, 3, 3, 4, 4, 6, 7, 8, 9, 10, 12, 15 (sequenced) </li></ul></ul></ul><ul><ul><li>Lets cut that into 3 quantiles. </li></ul></ul><ul><ul><ul><li>1, 3, 3, 4 ,4 6,7,8,9 10,12,15 </li></ul></ul></ul>3/12 3 15 10 3 4/12 4 9 6 2 5/12 3 4 1 1 Frequency Cardinality High Value Low Value Seq No
    25. 25. RUNSTATS Histogram Statistics Notes <ul><li>RUNSTATS </li></ul><ul><ul><li>Maximum 100 quantiles for a column </li></ul></ul><ul><ul><li>Same value columns WILL be in the same quantile </li></ul></ul><ul><ul><li>Quantiles will be similar size but: </li></ul></ul><ul><ul><ul><li>Will try to avoid big gaps inside quantiles </li></ul></ul></ul><ul><ul><ul><li>Highvalue and lowvalue may have separate quantiles </li></ul></ul></ul><ul><ul><ul><li>Null WILL have a separate quantile </li></ul></ul></ul><ul><li>Supports column groups as well as single columns </li></ul><ul><li>Think “frequencies” for high cardinality columns </li></ul>
    26. 26. Histogram Statistics Example <ul><li>Customer uses INTEGER (or VARCHAR) for YEAR-MONTH </li></ul><ul><ul><ul><li>Assuming data for 2006 & 2007 </li></ul></ul></ul><ul><ul><ul><ul><li>FF = (high-value – low-value) / (high2key – low2key) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>FF = (200612 – 200601) / (200711 – 200602) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>10% of rows estimated to return </li></ul></ul></ul></ul>WHERE YEARMONTH BETWEEN 200601 AND 200612 Data assumed as evenly distributed between low and high range
    27. 27. Histogram Statistics Example <ul><li>Example (cont.) </li></ul><ul><ul><li>Data only exists in ranges 200601-12 & 200701-12 </li></ul></ul><ul><ul><ul><li>Collect via histograms </li></ul></ul></ul><ul><ul><ul><ul><li>45% of rows estimated to return </li></ul></ul></ul></ul>No data between 200613 & 200700 WHERE YEARMONTH BETWEEN 200601 AND 200612
    28. 28. Single column recommendations <ul><li>Cardinality </li></ul><ul><ul><li>Collect on all columns used in where clause </li></ul></ul><ul><ul><li>Used regardless of literal value known </li></ul></ul><ul><li>Interpolation (HIGH2KEY/LOW2KEY) </li></ul><ul><ul><li>Collected with column statistics </li></ul></ul><ul><ul><li>Consider REOPT(VARS) </li></ul></ul><ul><ul><li>Dynamic </li></ul></ul><ul><ul><ul><li>V8 - REOPT(ONCE) </li></ul></ul></ul><ul><ul><ul><li>DB2 9 – REOPT(AUTO) </li></ul></ul></ul><ul><li>Frequency </li></ul><ul><ul><li>Optimizer requires literal value to use </li></ul></ul><ul><ul><li>Used for most predicate types </li></ul></ul><ul><ul><li>Collect all values for low COLCARDF columns </li></ul></ul><ul><ul><li>Useful for indexed and non-indexed columns </li></ul></ul><ul><li>Histograms </li></ul><ul><ul><li>Optimizer needs literal value to use </li></ul></ul><ul><ul><li>Can be used in virtually all scenarios frequencies are used, also join predicates </li></ul></ul><ul><ul><li>Collect histograms rather than frequencies for RANGE predicates </li></ul></ul>
    29. 29. Filter factor issues <ul><li>Filter factor accuracy important for... </li></ul><ul><ul><li>Index matching </li></ul></ul><ul><ul><ul><li>Estimate index cost </li></ul></ul></ul><ul><ul><li>Total index filtering </li></ul></ul><ul><ul><ul><li>Estimate table access cost via index(es) </li></ul></ul></ul><ul><ul><ul><li>Choose how to use index (prefetch?) </li></ul></ul></ul><ul><ul><li>Table filtering </li></ul></ul><ul><ul><ul><li>Efficient join order </li></ul></ul></ul><ul><ul><ul><li>Efficient join method </li></ul></ul></ul><ul><ul><ul><li>Appropriate sorts </li></ul></ul></ul>
    30. 30. Multi-column cardinalities <ul><li>Multi-column cardinalities (MCARD) </li></ul><ul><ul><li>Stored in a few places... </li></ul></ul><ul><ul><ul><li>SYSINDEXES.FULLKEYCARDF </li></ul></ul></ul><ul><ul><ul><li>SYSCOLDIST.CARDF </li></ul></ul></ul><ul><ul><ul><ul><li>TYPE = 'C', NUMCOLUMNS > 1 </li></ul></ul></ul></ul><ul><ul><li>Assumes uniform distribution </li></ul></ul><ul><ul><li>When used? </li></ul></ul><ul><ul><ul><li>Primarily for indexes </li></ul></ul></ul><ul><ul><ul><li>Literal values not necessary </li></ul></ul></ul><ul><ul><ul><li>KEYCARD for partially matching indexes </li></ul></ul></ul><ul><ul><ul><ul><li>Collect for all indexes with 3 or more columns </li></ul></ul></ul></ul><ul><ul><ul><li>Collect to support multi-column frequencies </li></ul></ul></ul><ul><ul><ul><li>Collect for all multi-column join situations </li></ul></ul></ul>
    31. 31. RUNSTATS KEYCARD <ul><li>How to collect </li></ul><ul><ul><li>V7 RUNSTATS only collects KEYCARD on leading column of index </li></ul></ul><ul><ul><li>By default, RUNSTATS only collects FIRST/FULLKEYCARDF </li></ul></ul><ul><ul><li>INDEX PAT_INDEX (C1,C2,C3,C4) </li></ul></ul><ul><ul><li>RUNSTATS INDEX(PAT_INDEX KEYCARD ) </li></ul></ul><ul><ul><li>MCARD on leading concatenated column groups: </li></ul></ul><ul><ul><li>MCARD(C1,C2), MCARD(C1,C2,C3) </li></ul></ul>
    32. 32. RUNSTATS COLGROUP <ul><li>DB2 V8 allows collection of MCARD on any column group </li></ul><ul><ul><li>RUNSTATS TABLESPACE DB1.TS1 </li></ul></ul><ul><ul><li>TABLE(PAT_TABLE) COLUMN(C1,C2,C3,C4) </li></ul></ul><ul><ul><li>COLGROUP(C1,C4) </li></ul></ul><ul><ul><li>Specifying COLGROUP with multiple columns collects multi-column cardinality on the group. </li></ul></ul>
    33. 33. Multi-column cardinality <ul><li>Useful for local predicates </li></ul><ul><ul><li>SELECT name, address, ... </li></ul></ul><ul><ul><li>FROM CUST </li></ul></ul><ul><ul><li>WHERE City = ? </li></ul></ul><ul><ul><li>AND State = ? </li></ul></ul><ul><ul><li>AND Last_name = ? </li></ul></ul>INDEX Index column 1stkey KEYCARD FULLKEY IX1 City, State, Zip 10,000 12,000 1,000,000 IX2 Last_name 20,000 N/A 20,000 COLCARDF STATE 50 IX1 Filter Factor without KEYCARD 1 / 500,000 IX1 Filter Factor WITH KEYCARD 1 / 12,000 Last_name 1 / 20,000 keycard!!!
    34. 34. Multi-column cardinality <ul><li>Useful for join predicates </li></ul><ul><ul><li>SELECT cols </li></ul></ul><ul><ul><li>FROM T1, T2 </li></ul></ul><ul><ul><li>WHERE T1.C1 = T2.C1 </li></ul></ul><ul><ul><li>AND T1.C2 = T2.C2 </li></ul></ul><ul><ul><li>INDEX I1 ( C1,C2 ) on table T1 </li></ul></ul><ul><ul><li>INDEX I2 ( C1,C2 ,C3) on table T2 </li></ul></ul><ul><ul><li>KEYCARD is necessary on index I2 to accurately estimate T1  T2 join size. </li></ul></ul>
    35. 35. Multi-column cardinality <ul><li>Matching + Screening </li></ul><ul><ul><li>SELECT cols </li></ul></ul><ul><ul><li>FROM T1 </li></ul></ul><ul><ul><li>WHERE T1.C1 = ? </li></ul></ul><ul><ul><li>AND T1.C3 = ? </li></ul></ul><ul><ul><li>AND T1.C4 = ? </li></ul></ul><ul><ul><li>INDEX I2 ( C1 ,C2, C3 ) on table T1 </li></ul></ul><ul><ul><li>COLCARDF / FIRSTKEYCARDF for matching only </li></ul></ul><ul><ul><li>COLGROUP (C1,C3)  matching + screen </li></ul></ul><ul><ul><li>COLGROUP (C1,C3,C4)  table level filtering </li></ul></ul>
    36. 36. Filter factor issues (reminder) <ul><li>Filter factor accuracy important for... </li></ul><ul><ul><li>Index matching </li></ul></ul><ul><ul><ul><li>Estimate index cost </li></ul></ul></ul><ul><ul><li>Total index filtering </li></ul></ul><ul><ul><ul><li>Estimate table access cost via index(es) </li></ul></ul></ul><ul><ul><ul><li>Choose how to use index (prefetch?) </li></ul></ul></ul><ul><ul><li>Table filtering </li></ul></ul><ul><ul><ul><li>Efficient join order </li></ul></ul></ul><ul><ul><ul><li>Efficient join method </li></ul></ul></ul><ul><ul><ul><li>Appropriate sorts </li></ul></ul></ul>
    37. 37. Multi-column frequency <ul><li>Multi-column frequencies </li></ul><ul><ul><li>Very similar to single column frequencies </li></ul></ul><ul><ul><ul><li>Distribution statistics concatenated column group values </li></ul></ul></ul><ul><ul><ul><li>Identifies multi-column skewed distributions </li></ul></ul></ul><ul><ul><li>Stored in </li></ul></ul><ul><ul><ul><li>SYSCOLDIST.FREQUENCYF </li></ul></ul></ul><ul><ul><ul><li>TYPE = ‘F’ </li></ul></ul></ul><ul><ul><ul><li>NUMCOLUMNS > 1 </li></ul></ul></ul>
    38. 38. RUNSTATS INDEX <ul><li>How to collect </li></ul><ul><ul><li>V7 RUNSTATS only collects on leading concatenated column of index </li></ul></ul><ul><ul><li>RUNSTATS does NOT collect multi-column frequencies by default. </li></ul></ul><ul><ul><li>Must be explicitly requested. </li></ul></ul><ul><ul><li>INDEX (I1) columns ( C1,C2 ,C3) </li></ul></ul><ul><ul><li>Collect top 15 values for column group ( C1,C2 ) </li></ul></ul><ul><ul><li>RUNSTATS INDEX (I1 FREQVAL NUMCOLS(2) COUNT(15) ) </li></ul></ul><ul><ul><li>Eliminate frequencies on column group ( C1,C2 ): </li></ul></ul><ul><ul><li>RUNSTATS INDEX (I1 FREQVAL NUMCOLS(2) COUNT(0) ) </li></ul></ul>
    39. 39. RUNSTATS COLGROUP <ul><li>DB2 V8 allows collection of multi-column frequencies on almost any column group </li></ul><ul><ul><li>Examples </li></ul></ul><ul><ul><li>RUNSTATS TABLESPACE DB1.TS1 </li></ul></ul><ul><ul><li>TABLE (T1) COLUMN(C1,C2,C3) </li></ul></ul><ul><ul><li>COLGROUP(C1,C3) FREQVAL COUNT(10) MOST </li></ul></ul><ul><ul><li>COLGROUP(C2,C3) FREQVAL COUNT(1) LEAST </li></ul></ul><ul><ul><li>Eliminate frequencies on column group ( C1,C3 ): </li></ul></ul><ul><ul><li>COLGROUP (C1,C3) FREQVAL COUNT(0) </li></ul></ul>
    40. 40. Multi-column frequency <ul><li>Multi-column frequencies </li></ul><ul><ul><li>Limited use </li></ul></ul><ul><ul><ul><li>Boolean equal predicates only </li></ul></ul></ul><ul><ul><ul><li>Always collect supporting multi-column cardinality </li></ul></ul></ul><ul><ul><li>Collect single column frequencies for </li></ul></ul><ul><ul><ul><li>Range predicates </li></ul></ul></ul><ul><ul><ul><li>In-lists </li></ul></ul></ul><ul><ul><ul><li>Single column predicates </li></ul></ul></ul><ul><ul><ul><li>other non-equal predicates </li></ul></ul></ul>
    41. 41. Multi-column frequency Gender COLCARDF = 2 Category COLCARDF = 4 (Gender,Category) MCARD = 8 Category Gender FREQUENCYF 1/MCARD Women's Health F 0.2375 0.125 Women's Health M 0.0125 0.125 Men's Health M 0.2375 0.125 Men's Health F 0.0125 0.125 Hockey M 0.15 0.125 Hockey F 0.10 0.125 Soccer F 0.15 0.125 Soccer M 0.10 0.125
    42. 42. Multi-column frequency trap <ul><li>Assumptions </li></ul><ul><ul><li>Table T1 cardinality 150,000,000 </li></ul></ul><ul><ul><li>Index I1 (GENDER, ACCT_NUM) </li></ul></ul><ul><ul><li>COLCARDF GENDER 2 </li></ul></ul><ul><ul><li>COLCARDF ACCT_NUM 120,000,000 </li></ul></ul><ul><li>SQL </li></ul><ul><ul><li>SELECT * </li></ul></ul><ul><ul><li>FROM T1 </li></ul></ul><ul><ul><li>WHERE GENDER = ‘M’ </li></ul></ul><ul><li>Global RUNSTATS with FREQVAL used </li></ul><ul><ul><li>RUNSTATS INDEX (ALL) FREQVAL NUMCOLS (6) COUNT(10) </li></ul></ul><ul><ul><li>(Don’t do this!) </li></ul></ul>
    43. 43. Useless multi-column frequency GENDER ACCT_NUM FREQUENCYF M 1541235 0.000000008 F 12351235 0.000000008 F 81235135 0.000000008 M 423613246 0.000000008 F 32151234 0.000000008 M 76823414 0.000000008 M 43451235 0.000000008 F 12351235 0.000000008 <ul><li>Useless because… </li></ul><ul><ul><li>Frequencies on GENDER alone either non-existant, or collected long ago and becoming stale </li></ul></ul><ul><ul><li>Multi-column frequency on (GENDER,ACCT_NUM) not used since only one column exists </li></ul></ul><ul><ul><li>Even if frequency could be used – ACCT_NUM is so selective, the frequencies are too diluted to tell us anything about GENDER alone </li></ul></ul>
    44. 44. Multi-column considerations <ul><li>Cardinality </li></ul><ul><ul><li>Collect KEYCARD for all indexes with 3 or more columns </li></ul></ul><ul><ul><li>Literal values not required </li></ul></ul><ul><li>Frequencies </li></ul><ul><ul><li>Not useful when literal values aren't known </li></ul></ul><ul><ul><li>Dynamic SQL prepared with parameter markers </li></ul></ul><ul><ul><li>Static SQL with hostvars (consider REOPT) </li></ul></ul><ul><ul><li>Special registers (consider REOPT) </li></ul></ul><ul><ul><li>Collect for specific cases </li></ul></ul><ul><li>Pay special attention to... </li></ul><ul><ul><li>Low cardinality column groups </li></ul></ul><ul><ul><li>Volatile data </li></ul></ul>
    45. 45. Join considerations <ul><li>Which table should be outer? </li></ul><ul><ul><li>Hmmm, low cardinality predicates.... </li></ul></ul><ul><ul><li>Significant opportunity for misestimation on both T1 and T2 – could lead to poor choice. </li></ul></ul><ul><li>The more tables in the join and the more predicates involved - the more important this becomes </li></ul>Select T1.C4, T2.C6 from T1, T2 Where T1.C1 = T2.C1 AND T1.C2 = 50 AND T2.C3 = 'A' AND T2.C4 = 'B' Unique indexes T1.C1 T2.C1 T1 & T2 CARDF = 10 million T1.C2 COLCARDF = 30 T2.C3 COLCARDF = 10 T2.C4 COLCARDF = 15
    46. 46. One more time <ul><li>Filter factor accuracy important for... </li></ul><ul><ul><li>Index matching </li></ul></ul><ul><ul><ul><li>Estimate index cost </li></ul></ul></ul><ul><ul><li>Total index filtering </li></ul></ul><ul><ul><ul><li>Estimate table access cost via index(es) </li></ul></ul></ul><ul><ul><ul><li>Choose how to use index (prefetch?) </li></ul></ul></ul><ul><ul><li>Table filtering </li></ul></ul><ul><ul><ul><li>Efficient join order </li></ul></ul></ul><ul><ul><ul><li>Efficient join method </li></ul></ul></ul><ul><ul><ul><li>Appropriate sorts </li></ul></ul></ul>

    ×