
Histograms: Pre-12c and now

Published in the Summer 2015 edition of Oracle Scene. Republished with permission of UK Oracle User Group.


  1. Histograms: Pre-12c & Now
Anju Garg, Corporate Trainer

Histograms are used by the optimizer to compute the selectivity of filter and join predicates when the data distribution is skewed. Prior to 12c, two types of histograms could be created: frequency histograms and height-balanced histograms. 12c introduces top frequency and hybrid histograms, which are designed to overcome the limitations of their precursors. This article discusses the need for histograms, the interpretation of the various types of histograms, and the evolution of histograms from 11g to 12c.

Need for Histograms

When a SQL statement is issued, the optimizer generates an execution plan based on the information available to it. If the data is uniformly distributed across the values in a column and table statistics have been gathered, the optimizer estimates cardinality (row count) accurately and makes the correct decisions about the access method, join order and join method. But if the data distribution is skewed, the optimizer might estimate the cardinality incorrectly and choose a bad execution plan. For example, consider a table HR.HIST with a skewed data distribution in column ID, as shown in Figure 1.1.

FIGURE 1.1

Pre-12c Histograms

Prior to Oracle 12c, two types of histograms could be created (as shown in Figure 1.2):
- Frequency histograms
- Height-balanced histograms

FIGURE 1.2
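To make the need for histograms concrete, here is a minimal Python sketch of the misestimate that skew can cause. It assumes that, without a histogram, an equality predicate is estimated as (number of rows) / NDV, i.e. a uniform distribution. The figures (120 rows, 26 distinct IDs, 50 rows with ID = 8) are the ones this article reports for HR.HIST; the variable names are illustrative only.

```python
# Figures quoted in this article for HR.HIST (column ID).
num_rows = 120          # total rows in the table
ndv = 26                # number of distinct ID values
actual_rows_id_8 = 50   # rows with ID = 8, the heavily skewed value

# Assumption: with no histogram, every value is treated as equally likely,
# so the estimate for an equality predicate is num_rows / NDV.
uniform_estimate = num_rows / ndv

print(f"uniform estimate per value: {uniform_estimate:.1f} rows")
print(f"actual rows for ID = 8:     {actual_rows_id_8}")
print(f"underestimated by a factor of {actual_rows_id_8 / uniform_estimate:.1f}")
```

An estimate of roughly 5 rows instead of 50 is the kind of gap that can push the optimizer towards an index access path when a full table scan would be cheaper.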
  2. Frequency Histograms

A frequency histogram is a frequency distribution that records each distinct value and its exact cardinality. A frequency histogram is created when
- the requested number of buckets (Nb) >= the number of distinct values (NDV), and
- NDV <= 254 (2,048 in 12c).

A frequency histogram with 26 buckets for the ID column can be created as shown in Table 2.1.

TABLE 2.1

    SQL> exec dbms_stats.gather_table_stats -
         (ownname => 'HR', tabname => 'HIST', method_opt => 'FOR COLUMNS ID', cascade => true);

    SQL> select table_name, column_name, histogram, num_distinct, num_buckets
         from dba_tab_col_statistics
         where table_name = 'HIST' and column_name = 'ID';

    TABLE_NAME COLUMN_NAME     HISTOGRAM       NUM_DISTINCT NUM_BUCKETS
    ---------- --------------- --------------- ------------ -----------
    HIST       ID              FREQUENCY                 26          26

The histogram can be viewed in DBA_HISTOGRAMS (as shown in Table 2.2).

TABLE 2.2

    SQL> select ENDPOINT_VALUE, ENDPOINT_NUMBER
         from dba_histograms
         where table_name = 'HIST' and column_name = 'ID';

    ENDPOINT_VALUE ENDPOINT_NUMBER
    -------------- ---------------
                 1               4
                 2               6
                 3               7
                 4               9
                 5              10
                 6              12
                 7              15
                 8              65
                 9              68
                10              70
                11              76
                12              82
                13              88
                14              91
                15              96
                16              99
                17             102
                18             103
                19             104
                20             109
                21             111
                22             112
                23             113
                24             115
                25             118
                26             120

FIGURE 2.2

Interpreting Frequency Histogram

Table 2.2 shows that a frequency histogram with 26 buckets, one for each distinct value, has been created.
• ENDPOINT_VALUE - The value in a bucket
• ENDPOINT_NUMBER - Cumulative frequency

Thus, the exact count for each distinct value can be derived: it is the value's ENDPOINT_NUMBER minus the ENDPOINT_NUMBER of the preceding value. For ID = 8 this gives 65 - 15 = 50. Accordingly, the optimizer makes an accurate estimate of 50 rows for ID = 8 and uses a full table scan (FTS) access path as desired (Table 2.3), even though column ID is indexed.
TABLE 2.3

    SQL> explain plan for select * from hr.hist where id = 8;
    SQL> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    --------------------------------------------------------------------------
    Plan hash value: 538080257

    --------------------------------------------------------------------------
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |      |    50 | 50200 |     7   (0)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| HIST |    50 | 50200 |     7   (0)| 00:00:01 |
    --------------------------------------------------------------------------
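The interpretation of a frequency histogram can be sketched in a few lines of Python. The pairs below are copied from Table 2.2; the function name `frequencies` is illustrative and is not part of any Oracle API. The sketch only subtracts consecutive cumulative ENDPOINT_NUMBER values to recover the row count per value.

```python
# (ENDPOINT_VALUE, ENDPOINT_NUMBER) pairs copied from Table 2.2.
frequency_histogram = [
    (1, 4), (2, 6), (3, 7), (4, 9), (5, 10), (6, 12), (7, 15), (8, 65),
    (9, 68), (10, 70), (11, 76), (12, 82), (13, 88), (14, 91), (15, 96),
    (16, 99), (17, 102), (18, 103), (19, 104), (20, 109), (21, 111),
    (22, 112), (23, 113), (24, 115), (25, 118), (26, 120),
]


def frequencies(histogram):
    """Turn cumulative ENDPOINT_NUMBER values into per-value row counts."""
    counts = {}
    previous = 0
    for value, cumulative in histogram:
        counts[value] = cumulative - previous
        previous = cumulative
    return counts


counts = frequencies(frequency_histogram)
print(counts[8], counts[15], counts[3])  # 50 5 1
```

The 50 rows derived for ID = 8 match the Rows column of the plan in Table 2.3.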
  3. Thus, prior to 12c, frequency histograms could be used to estimate frequencies accurately only if NDV <= 254.

Height-balanced Histograms

A height-balanced histogram is created if NDV > 254 or Nb < NDV. This histogram distributes the rows evenly across the histogram buckets, so all buckets hold almost exactly the same number of rows. A height-balanced histogram is much less precise and can't really capture information about more than 127 popular values.

To create a height-balanced histogram, specify a number of buckets (20) that is smaller than NDV (26):

    DB11g> exec dbms_stats.gather_table_stats -
           (ownname => 'HR', tabname => 'HIST', method_opt => 'FOR COLUMNS ID size 20', cascade => true);

A height-balanced histogram has been created because the number of buckets (20) < NDV (26) (Table 2.4).

TABLE 2.4

    DB11g> select table_name, column_name, histogram, num_distinct, num_buckets
           from dba_tab_col_statistics
           where table_name = 'HIST' and column_name = 'ID';

    TABLE_NAME COLUMN_NAME     HISTOGRAM       NUM_DISTINCT NUM_BUCKETS
    ---------- --------------- --------------- ------------ -----------
    HIST       ID              HEIGHT BALANCED           26          20

The height-balanced histogram that has been created can be viewed in Table 2.5.

TABLE 2.5

    DB11g> select ENDPOINT_VALUE, ENDPOINT_NUMBER
           from dba_histograms
           where table_name = 'HIST' and column_name = 'ID';

    ENDPOINT_VALUE ENDPOINT_NUMBER
    -------------- ---------------
                 1               0
                 2               1
                 6               2
                 8              10
                 9              11
                11              12
                12              13
                13              14
                14              15
                15              16
                17              17
                20              18
                24              19
                26              20

    14 rows selected.

FIGURE 2.3

Interpreting Height-balanced Histogram

• Bucket size = Total no. of rows / Nb = 120 / 20 = 6
• ENDPOINT_NUMBER - A number uniquely identifying a bucket
• For the bucket with ENDPOINT_NUMBER = 0, ENDPOINT_VALUE = the lowest value (1 here)
• For buckets with ENDPOINT_NUMBER > 0, ENDPOINT_VALUE = the largest value stored in that bucket

Note that when storing the histogram, Oracle doesn't store repetitions of end point values.
If multiple buckets have the same end point, only one bucket is stored, with the highest of their end point numbers. For example, 8 buckets (3 to 10) have the value 8 as their end point. The histogram stores only one entry for them, with the highest ENDPOINT_NUMBER, i.e. 10.

The optimizer decides the popularity of a value by the number of buckets that have the value as their end point. Since value 8 is the end point of multiple buckets, it is considered a popular value. The cardinality of a popular value is derived as the product of the bucket size and the number of buckets that have the value as their end point. For example, cardinality for value 8 = no. of buckets having 8 as end point * bucket size = 8 * 6 = 48 (actual = 50).
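The popularity rule described above can be sketched in Python as follows. The pairs are copied from Table 2.5 and the row and bucket counts from the text; the helper name `buckets_per_endpoint` is illustrative. This is a simplified model of the arithmetic only, not the optimizer's actual code.

```python
# (ENDPOINT_VALUE, ENDPOINT_NUMBER) pairs copied from Table 2.5.
height_balanced = [
    (1, 0), (2, 1), (6, 2), (8, 10), (9, 11), (11, 12), (12, 13),
    (13, 14), (14, 15), (15, 16), (17, 17), (20, 18), (24, 19), (26, 20),
]
NUM_ROWS = 120
NUM_BUCKETS = 20


def buckets_per_endpoint(histogram):
    """Count how many buckets end on each stored end point value.

    Repeated end points are stored once with their highest bucket number,
    so the gap to the previous ENDPOINT_NUMBER gives the bucket count.
    """
    spans = {}
    previous = 0
    for value, endpoint_number in histogram:
        spans[value] = endpoint_number - previous
        previous = endpoint_number
    return spans


bucket_size = NUM_ROWS / NUM_BUCKETS            # 120 / 20 = 6 rows per bucket
spans = buckets_per_endpoint(height_balanced)

# A value is treated as popular when it is the end point of several buckets.
popular_estimates = {
    value: span * bucket_size for value, span in spans.items() if span > 1
}
print(popular_estimates)  # {8: 48.0}
```

Only ID = 8 qualifies as popular here; every other stored end point closes exactly one bucket.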
  4. TABLE 2.6

    DB11g> explain plan for select * from hr.hist where id = 8;
    DB11g> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    --------------------------------------------------------------------------
    Plan hash value: 538080257

    --------------------------------------------------------------------------
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |      |    48 | 48192 |     7   (0)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| HIST |    48 | 48192 |     7   (0)| 00:00:01 |
    --------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------
       1 - filter("ID"=8)

If we search for an unpopular value, i.e. a value that is not an end point or is the end point of only one bucket, the optimizer calculates the cardinality as (number of rows in table) * density, where density is calculated by the optimizer using an internal algorithm based on factors such as the number of buckets and the NDV.

For example, consider two unpopular values:
- ID = 15 occurs 5 times and is the end point of one bucket
- ID = 3 occurs once and is not an end point

The number of rows estimated for both unpopular values is the same, i.e. 3 (Table 2.7 and Table 2.8).
TABLE 2.7

    DB11g> explain plan for select * from hr.hist where id = 15;
    DB11g> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------
    Plan hash value: 4058847011

    ----------------------------------------------------------------------------------------
    | Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
    ----------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT            |          |     3 |  3012 |     2   (0)| 00:00:01 |
    |   1 |  TABLE ACCESS BY INDEX ROWID| HIST     |     3 |  3012 |     2   (0)| 00:00:01 |
    |*  2 |   INDEX RANGE SCAN          | HIST_IDX |     3 |       |     1   (0)| 00:00:01 |
    ----------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - access("ID"=15)

TABLE 2.8

    DB11g> explain plan for select * from hr.hist where id = 3;
    DB11g> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------
    Plan hash value: 4058847011

    ----------------------------------------------------------------------------------------
    | Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
    ----------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT            |          |     3 |  3012 |     2   (0)| 00:00:01 |
    |   1 |  TABLE ACCESS BY INDEX ROWID| HIST     |     3 |  3012 |     2   (0)| 00:00:01 |
    |*  2 |   INDEX RANGE SCAN          | HIST_IDX |     3 |       |     1   (0)| 00:00:01 |
    ----------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - access("ID"=3)

Hence, it can be inferred that a height-balanced histogram estimates the cardinality of a value purely from its popularity, which depends on the number of buckets that have the value as their end point.
Issues with Histograms in 11g

- Frequency histograms are accurate, but can be created only for NDV <= 254.
- Height-balanced histograms may cause the optimizer to choose a suboptimal plan in cases where a value is the end point of only one bucket but almost fills up another bucket. In such a scenario the value might be considered unpopular.
  5. In 12c, frequency histograms can be created for up to 2,048 distinct values, which means accurate cardinality estimates are now available for a much larger range of NDVs. Moreover, two new types of histograms have been introduced, top frequency and hybrid, which aim to resolve the misestimates caused by height-balanced histograms.

Top Frequency Histograms

If a small number of distinct values dominate the data set, the database performs a full table scan and creates a top frequency histogram from that small number of extremely popular distinct values. A top frequency histogram can produce a better histogram for highly popular values by ignoring statistically insignificant unpopular values.

Whether the data is dominated by popular values is decided using a threshold p, defined as (1 - (1/Nb)) * 100, where Nb = number of buckets. If the percentage of rows occupied by the top Nb most frequent values is equal to or greater than threshold p, a top frequency histogram is created; otherwise a hybrid histogram is created.

Threshold p for 20 buckets can be calculated as:

    p = (1 - (1/Nb)) * 100 = (1 - (1/20)) * 100 = 95.0

There are 120 rows in table HR.HIST. Hence a top frequency histogram will be created if the 20 most popular values occupy at least 95% of the rows, i.e. 114 rows. As can be seen from Table 3.1, the 20 most frequent IDs account for exactly 114 rows. Hence a top frequency histogram is created in this case (Table 3.2) when statistics are gathered with number of buckets = 20 and ESTIMATE_PERCENT = AUTO_SAMPLE_SIZE (the default).
TABLE 3.1

    SQL> select sum(cnt)
         from (select id, count(*) cnt
               from hr.hist
               group by id
               order by count(*) desc)
         where rownum <= 20;

      SUM(CNT)
    ----------
           114

TABLE 3.2

    DB12c> exec dbms_stats.gather_table_stats -
           (ownname => 'HR', tabname => 'HIST', method_opt => 'FOR COLUMNS ID size 20', cascade => true);

    DB12c> select table_name, column_name, histogram, num_distinct, num_buckets
           from dba_tab_col_statistics
           where table_name = 'HIST' and column_name = 'ID';

    TABLE_NAME COLUMN_NAME     HISTOGRAM       NUM_DISTINCT NUM_BUCKETS
    ---------- --------------- --------------- ------------ -----------
    HIST       ID              TOP-FREQUENCY             26          20

FIGURE 3.1

The top frequency histogram can be queried from dba_histograms as in Table 3.3.
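The threshold test can be sketched in Python as follows. This is a simplified model under stated assumptions: statistics are gathered on 12c with the default AUTO_SAMPLE_SIZE, and only the bucket-count and threshold rules quoted in this article are applied. The function name `choose_histogram` is hypothetical; the row counts per ID are the differences of consecutive ENDPOINT_NUMBER values in Table 2.2. Integer arithmetic is used so that the boundary case (exactly 95%) is compared exactly.

```python
# Rows per ID value (IDs 1..26), derived from the cumulative counts in Table 2.2.
rows_per_id = [4, 2, 1, 2, 1, 2, 3, 50, 3, 2, 6, 6, 6, 3, 5, 3, 3, 1, 1, 5,
               2, 1, 1, 2, 3, 2]


def choose_histogram(row_counts, num_buckets):
    """Pick a 12c histogram type from per-value row counts (simplified model)."""
    total_rows = sum(row_counts)
    ndv = len(row_counts)
    if ndv <= num_buckets:
        return "FREQUENCY"
    top_rows = sum(sorted(row_counts, reverse=True)[:num_buckets])
    # top_rows / total_rows >= 1 - 1/Nb, rearranged to avoid floating point.
    if top_rows * num_buckets >= total_rows * (num_buckets - 1):
        return "TOP-FREQUENCY"
    return "HYBRID"


print(choose_histogram(rows_per_id, 20))   # TOP-FREQUENCY (114 of 120 rows = 95%)
print(choose_histogram(rows_per_id, 26))   # FREQUENCY (Nb >= NDV)

# Hypothetical variant: the same data with only 30 rows for ID = 8.
less_skewed = list(rows_per_id)
less_skewed[7] = 30
print(choose_histogram(less_skewed, 20))   # HYBRID (94 of 100 rows < 95%)
```

The first result agrees with the TOP-FREQUENCY value reported in Table 3.2.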
  6. TABLE 3.3

    DB12c> select ENDPOINT_VALUE, ENDPOINT_NUMBER
           from dba_histograms
           where table_name = 'HIST' and column_name = 'ID';

    ENDPOINT_VALUE ENDPOINT_NUMBER
    -------------- ---------------
                 1               4
                 2               6
                 4               8
                 6              10
                 7              13
                 8              63
                 9              66
                10              68
                11              74
                12              80
                13              86
                14              89
                15              94
                16              97
                17             100
                20             105
                21             107
                24             109
                25             112
                26             114

    20 rows selected.

FIGURE 3.2

Interpreting Top Frequency Histogram

• ENDPOINT_VALUE represents the key value (ID)
• ENDPOINT_NUMBER represents the cumulative frequency
• Since NDV (26) > Nb (20), only the 20 most frequently occurring values are captured
• The frequencies of the 6 least frequently occurring values (bottom 5%) have not been stored

It can be seen that a top frequency histogram makes an accurate cardinality estimate for both ID = 15 (Table 3.4) and ID = 3 (Table 3.5), which were considered unpopular values in the height-balanced histogram.

TABLE 3.4

    DB12c> explain plan for select * from hr.hist where id = 15;
    DB12c> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    ------------------------------------------------------------------------------------------------
    Plan hash value: 3950962134

    ------------------------------------------------------------------------------------------------
    | Id  | Operation                           | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                    |          |     5 |  5020 |     2   (0)| 00:00:01 |
    |   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| HIST     |     5 |  5020 |     2   (0)| 00:00:01 |
    |*  2 |   INDEX RANGE SCAN                  | HIST_IDX |     5 |       |     1   (0)| 00:00:01 |
    ------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - access("ID"=15)

TABLE 3.5

    DB12c> explain plan for select * from hr.hist where id = 3;
    DB12c> select * from table(dbms_xplan.display);

    PLAN_TABLE_OUTPUT
    ------------------------------------------------------------------------------------------------
    Plan hash value: 3950962134

    ------------------------------------------------------------------------------------------------
    | Id  | Operation                           | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
    ------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT                    |          |     1 |  1004 |     2   (0)| 00:00:01 |
    |   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| HIST     |     1 |  1004 |     2   (0)| 00:00:01 |
    |*  2 |   INDEX RANGE SCAN                  | HIST_IDX |     1 |       |     1   (0)| 00:00:01 |
    ------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - access("ID"=3)

Thus the problem with height-balanced histograms, of not being able to estimate the frequency of unpopular values accurately, has been resolved by top frequency histograms in cases where a small number of distinct values dominate the data set. This histogram is gathered using a full table scan. The occurrences of popular values are captured accurately at the expense of not capturing any data for the least frequently occurring values.
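How the entries of a top frequency histogram relate to the underlying data can be sketched in Python. The sketch keeps the Nb most frequent values and accumulates their counts in value order; it is a simplified illustration with a hypothetical function name, not Oracle's implementation, and it ignores any special handling of ties or of the minimum and maximum values. The row counts per ID are the ones implied by Table 2.2.

```python
# Rows per ID value (IDs 1..26), derived from the cumulative counts in Table 2.2.
rows_per_id = dict(zip(range(1, 27), [
    4, 2, 1, 2, 1, 2, 3, 50, 3, 2, 6, 6, 6, 3, 5, 3, 3, 1, 1, 5,
    2, 1, 1, 2, 3, 2,
]))


def top_frequency_histogram(counts, num_buckets):
    """Keep the num_buckets most frequent values; return (value, cumulative) pairs."""
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:num_buckets]
    histogram = []
    cumulative = 0
    for value, count in sorted(top):      # end points are stored in value order
        cumulative += count
        histogram.append((value, cumulative))
    return histogram


histogram = top_frequency_histogram(rows_per_id, 20)
for value, cumulative in histogram:
    print(f"{value:>14} {cumulative:>15}")

dropped = sorted(set(rows_per_id) - {value for value, _ in histogram})
print("values without a bucket:", dropped)  # [3, 5, 18, 19, 22, 23]
```

For this data set the output reproduces the 20 rows of Table 3.3, and the six values left out are exactly those that occur once.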
  7. Hybrid Histograms

A hybrid histogram is so called because it combines the characteristics of both height-balanced histograms and frequency histograms. As we saw earlier, the height-balanced histogram may produce inaccurate estimates for:
• a value that is not an end point
• a value that is an end point of only one bucket
• a value that is an end point of multiple buckets and almost fills up the last bucket

A hybrid histogram attempts to overcome these shortcomings through the following features:
• For each end point in the histogram, it stores the ENDPOINT_REPEAT_COUNT value, which is the number of times the end point value is repeated. Thus, it has an accurate frequency for end point values.
• In a height-balanced histogram, a value with a frequency greater than the bucket size could be spread across multiple buckets. A hybrid histogram instead stores all occurrences of every value in the same bucket, i.e. a value cannot span multiple buckets. As a result, it can capture more end points.
• As in a height-balanced histogram, a bucket in a hybrid histogram can contain more than one value.

A side effect of this implementation is variable bucket size. Since each value, possibly with a different frequency, is contained entirely in one bucket, and one bucket can hold more than one value, buckets of different sizes may result.

A histogram with 20 buckets will be created as a hybrid histogram if the percentage of rows having the 20 most popular IDs is less than threshold p for 20 buckets.

    p = (1 - (1/Nb)) * 100 = (1 - (1/20)) * 100 = 95.0

After 20 rows with ID = 8 are deleted from table HR.HIST (Table 3.6), the table qualifies for hybrid histogram creation: the 20 most frequent IDs now account for 94 rows (Table 3.7), which is less than 95% of the 100 remaining rows, i.e. 95 rows.
TABLE 3.6

    DB12c> delete from hr.hist where id = 8 and rownum <= 20;
    DB12c> commit;
    DB12c> select count(*) from hr.hist;

      COUNT(*)
    ----------
           100

TABLE 3.7

    DB12c> select sum(cnt)
           from (select id, count(*) cnt
                 from hr.hist
                 group by id
                 order by count(*) desc)
           where rownum <= 20;

      SUM(CNT)
    ----------
            94

It can be seen that a hybrid histogram with 20 buckets has been created (Table 3.8 and Table 3.9).

TABLE 3.8

    DB12c> exec dbms_stats.gather_table_stats -
           (ownname => 'HR', tabname => 'HIST', method_opt => 'FOR COLUMNS ID size 20', cascade => true);

    DB12c> select table_name, column_name, histogram, num_distinct, num_buckets
           from dba_tab_col_statistics
           where table_name = 'HIST' and column_name = 'ID';

    TABLE_NAME COLUMN_NAME     HISTOGRAM       NUM_DISTINCT NUM_BUCKETS
    ---------- --------------- --------------- ------------ -----------
    HIST       ID              HYBRID                    26          20

TABLE 3.9

    DB12c> select ENDPOINT_VALUE, ENDPOINT_NUMBER, ENDPOINT_REPEAT_COUNT RPT_CNT
           from dba_histograms
           where table_name = 'HIST' and column_name = 'ID';

    ENDPOINT_VALUE ENDPOINT_NUMBER    RPT_CNT
    -------------- --------------- ----------
                 1               4          4
                 3               7          1
                 5              10          1
                 7              15          3
                 8              45         30
                10              50          2
                11              56          6
                12              62          6
                13              68          6
                14              71          3
                15              76          5
                16              79          3
                17              82          3
                19              84          1
                20              89          5
                21              91          2
                22              92          1
                23              93          1
                24              95          2
                26             100          2

    20 rows selected.
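The rows of Table 3.9 can be unpacked with a short Python sketch. It reads ENDPOINT_NUMBER as a cumulative row count, so the difference between consecutive values gives the size of each bucket, and ENDPOINT_REPEAT_COUNT gives the exact number of rows for the end point itself; whatever remains belongs to the other values that share the bucket. The function and key names are illustrative only.

```python
# (ENDPOINT_VALUE, ENDPOINT_NUMBER, ENDPOINT_REPEAT_COUNT) copied from Table 3.9.
hybrid_histogram = [
    (1, 4, 4), (3, 7, 1), (5, 10, 1), (7, 15, 3), (8, 45, 30), (10, 50, 2),
    (11, 56, 6), (12, 62, 6), (13, 68, 6), (14, 71, 3), (15, 76, 5),
    (16, 79, 3), (17, 82, 3), (19, 84, 1), (20, 89, 5), (21, 91, 2),
    (22, 92, 1), (23, 93, 1), (24, 95, 2), (26, 100, 2),
]


def describe_buckets(histogram):
    """Split each bucket into rows for its end point and rows for other values."""
    buckets = []
    previous = 0
    for value, cumulative, repeat_count in histogram:
        bucket_size = cumulative - previous
        buckets.append({
            "endpoint": value,
            "bucket_size": bucket_size,
            "endpoint_rows": repeat_count,
            "other_rows": bucket_size - repeat_count,
        })
        previous = cumulative
    return buckets


buckets = describe_buckets(hybrid_histogram)
for bucket in buckets:
    print(bucket)

sizes = sorted({bucket["bucket_size"] for bucket in buckets})
print("distinct bucket sizes:", sizes)  # [1, 2, 3, 4, 5, 6, 30]
```

The bucket ending at 8 holds 30 rows, all of them for ID = 8, while the bucket ending at 26 holds 5 rows of which only 2 belong to ID = 26. This illustrates the variable bucket size described earlier.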
  8. Interpreting Hybrid Histogram

• ENDPOINT_VALUE: The largest value in a bucket
• ENDPOINT_NUMBER: Cumulative frequency. The difference between two consecutive ENDPOINT_NUMBER values gives the bucket size.
• ENDPOINT_REPEAT_COUNT: Frequency of the end point

Based on the above information, the data has been arranged in buckets as shown in Figure 3.3.

FIGURE 3.3

It can be seen that the hybrid histogram captures more end points (20 = Nb) than the height-balanced histogram (14) and can estimate their cardinality accurately.

Thus, it is evident that hybrid histograms have features of both frequency and height-balanced histograms.

Features similar to frequency histograms:
• All occurrences of a value are placed in one bucket
• ENDPOINT_NUMBER stores the cumulative frequency

Features similar to height-balanced histograms:
• One bucket can contain multiple values

Summary

• In 12c, a frequency histogram can be created for NDV <= 2,048.
• Top frequency and hybrid histograms are designed to overcome the flaws of height-balanced histograms.
• Top frequency and hybrid histograms are created only if ESTIMATE_PERCENT = AUTO_SAMPLE_SIZE.
• Top frequency histograms accurately estimate the frequencies of only the most frequently occurring values, when a small number of values dominate the data set.
• Hybrid histograms have features of both frequency and height-balanced histograms.
• Hybrid histograms capture more end points than height-balanced histograms and estimate their frequency accurately.

References

• http://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL366
• http://jimczuprynski.files.wordpress.com/2014/04/czuprynski-select-q2-2014.pdf
• http://jonathanlewis.wordpress.com/2013/09/01/histograms/

ABOUT THE AUTHOR

Anju Garg, Corporate Trainer

Anju Garg is an Oracle ACE Associate with over 12 years of experience in the IT industry in various roles.
Since 2010, she has been involved in teaching and has trained more than 100 DBAs from across the world in various core DBA technologies such as RAC, Data Guard, Performance Tuning, SQL statement tuning and Database Administration. Anju is passionate about learning and has a keen interest in RAC and Performance Tuning, sharing her knowledge via her technical blog.
Blog: http://oracleinaction.com
