Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Histograms in 12c era

Oracle database 12c introduced several improvements on histograms, this session focuses on what's changed and how those changes address past concerns

  • Login to see the comments

  • Be the first to like this

Histograms in 12c era

  1. 1. Histograms life in a pre and post 12c era Mauro Pagano
  2. 2. Agenda • Concepts • Life pre-12c • Available histograms • Known challenges • Life post-12c • Available histograms • How the old challenges are solved 2
  3. 3. What is a histogram? • “Special” type of column statistics • Provides info about data distribution in a column • Allows the optimizer to get (usually) more accurate estimations • Gathered *automatically* starting 10g 3
  4. 4. Why histograms have bad reputation? • Not always very clear where to put them • Histograms + bind peeking => plan instability in 10g • Adaptive Cursor Sharing (usually) solves the problem in 11g/12c but has a “warm-up” phase • Old histograms had limitations 4
  5. 5. Do we need a histogram here? 5 0 5 10 15 20 25 1 2 3 4 5 6 7 8 9 10
  6. 6. Do we need a histogram here? 6 0 50 100 150 200 250 1 2 3 4 5 6 7 8 9 10
  7. 7. Do we need a histogram here? 7 0 20 40 60 80 100 120 1 2 3 4 5 6 7 8 9 10
  8. 8. Life pre-12c – Available histograms • Two histogram type available before 12c • Frequency histogram • Very accurate • Available only when number of distinct values (NDV) small • Height-Balanced histogram • Less accurate • Always available but not always wanted 8
  9. 9. Life pre-12c – Frequency histogram • Each column value gets its own bucket • Available only when NDV <= num buckets (up to 254) • Accurate sel/card estimation, just compute the bucket size • Bucket size = endpoint_number – previous endpoint_number • Value in range with no bucket gets 50% smallest bucket 9
  10. 10. Life pre-12c – Frequency histogram select count(*),n1 from test_freq group by n1 COUNT(*) N1 -------- ---- 300 0 50 1 100 2 150 3 50 4 5 5 345 6 10 select endpoint_number, endpoint_number - lag( endpoint_number,1,0) over (order by endpoint_number) mybucketsize, endpoint_value from user_tab_histograms where table_name = 'TEST_FREQ' and column_name = 'N1'; ENDPOINT_NUMBER MYBUCKETSIZE ENDPOINT_VALUE --------------- ------------ -------------- 300 300 0 350 50 1 450 100 2 600 150 3 650 50 4 655 5 5 1000 345 6
  11. 11. Life pre-12c – Height-balanced histogram • Available only when NDV > num_buckets (or NDV > 254) • Range [min,max] divided into 254 buckets of equal size • Values spanning two or more buckets are considered popular • Good accuracy for popular values, everybody else gets density 11
  12. 12. Life pre-12c – Height-balanced histogram select count(*),n1 from test_hb group by n1 COUNT(*) N1 ---------- --- 50 0 10 1 100 2 150 3 50 4 540 5 100 6 12 select endpoint_number en, endpoint_value ep, case when endpoint_number - lag(endpoint_number) over (order by endpoint_number) > 1 then 'YES' else 'NO' end popular from user_tab_histograms where table_name = 'TEST_HB' and column_name = 'N1'; EN EP POP ---- ---- --- 0 0 NO 1 3 NO 3 5 YES 4 6 NO
  13. 13. Old limitations – Almost popular values in HB histograms • In HB a popular value accounts for at least 0.78% of the data • Every other value gets a much smaller selectivity (density) • “Almost popular value” is a value close to span two buckets • Example: 1B rows table, 20M NDV • Popular value from aprox 3.8M rows up • Everybody else estimated as aprox 50k rows • Almost popular value could have 3M rows 13
  14. 14. Old limitations – 254 may not be enough • Freq histograms provides MUCH better estimations than HB • But Freq only available when NDV <= 254 then we go HB • Even with HB 254 might not enough to describe a distribution with no big skewness 14
  15. 15. Old limitations – Buckets on first 32 chars • Histogram buckets computed on substr(col,1,32) • Incorrect buckets for values that start with same 32 chars • Incorrect selectivity caused by incorrect bucket size 15 C1 ---------------------------------------- /home/mpagano/Oracle/files/test/q1.sql /home/mpagano/Oracle/files/test/q2.sql /home/mpagano/Oracle/files/test/q3.sql /home/mpagano/Oracle/files/test/q4.sql /home/mpagano/Oracle/files/test/q5.sql /home/mpagano/Oracle/files/test/q6.sql /home/mpagano/Oracle/files/test/q7.sql /home/mpagano/Oracle/files/test/q8.sql /home/mpagano/Oracle/files/test/q9.sql COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM ------------- ------------ ----------- ---------- C1 10000 1 FREQUENCY
  16. 16. It’s time for AUTO_SAMPLE_SIZE • In 10g recommendation was to use a fixed percentage • In 11g new NDV algorithm gives much better results • Several new features requires AUTO_SAMPLE_SIZE • Incremental stats gathering • Top Frequency Histograms • Hybrid Histograms 16
  17. 17. Life post-12c Available histograms • Three histogram type available starting 12c • Frequency histogram • Top Frequency Histogram • Accurate as a freq histogram for almost every value • Hybrid histogram • Combines Freq and HB histograms approach 17
  18. 18. Life post-12c – Common improvements • Number of buckets increased to 2048 • Rarely useful when using new methods • Buckets build on first 64 chars instead of 32 • Reduce impact on having values starting with same chars • Buckets number identified during first scan of the table • Same as how NDV is computed in 11g 18
  19. 19. Life post-12c – Top Frequency Histogram What’s the idea • Each value in old Frequency histogram is important even if it’s very unpopular • Old Frequency histo provide very accurate info for each value • But limited to only 254 (2048) values • What if most of the data belong to those 254 values? 19
  20. 20. Life post-12c – Top Frequency Histogram How it works • Frequency histogram created for the Top N values • Each Top N value gets its own bucket => accurate estimations • Every other value not statistically relevant is not stored • Density used to all the non-popular values 20
  21. 21. Life post-12c – Top Frequency Histogram When it’s created • estimate_percent => AUTO_SAMPLE_SIZE • Number of Distinct Values in the column > Number buckets • The top N values accounts for (1-(1/N)) * 100 of the data • Example: Number of rows 1M, N 200, NDV 10k (1-1/200)*100 = 99.5% so when top 200 values >= 99.5% of the data 21
  22. 22. Life post-12c – Top Frequency Histogram 22 select count(*),n1 from test_top_freq group by n1 COUNT(*) N1 -------- ---- 300 0 50 1 100 2 150 3 50 4 5 5 345 6 Top Frequency => (1-(1/6))*100 => 83.3% => 833 rows select endpoint_number, endpoint_number - lag( endpoint_number,1,0) over (order by endpoint_number) mybucketsize, endpoint_value from user_tab_histograms where table_name = 'TEST_TOP_FREQ' and column_name = 'N1'; ENDPOINT_NUMBER MYBUCKETSIZE ENDPOINT_VALUE --------------- ------------ -------------- 300 300 0 350 50 1 450 100 2 600 150 3 650 50 4 995 345 6
  23. 23. Life post-12c – Hybrid Histogram What’s the idea • Biggest limitation in HB is almost popular value => poor plans • Split the range in buckets of equal size is unfair • Need to account for popularity • What if we can “adjust” the bucket size based on popularity? 23
  24. 24. Life post-12c – Hybrid Histogram How it works • Create HB histogram but no value spans more than one bucket • For each endpoint value track how popular the value is • Accurate estimation for endpoint values • Non endpoint values get the density 24
  25. 25. Life post-12c – Hybrid Histogram When it’s created 25 • estimate_percent => AUTO_SAMPLE_SIZE • Number of Distinct Values in the column > Number buckets • Top Frequency not available • Example: Number of rows 1M, N 200, NDV 10k (1-1/200)*100 = 99.5% so when top 200 values < 99.5% of the data
  26. 26. Life post-12c – Hybrid Histogram 26 select count(*),n1 from test_hybrid group by n1 COUNT(*) N1 -------- ---- 118 0 51 1 89 2 113 3 59 4 66 5 126 6 131 7 131 8 116 9 select endpoint_number en, endpoint_value ep, endpoint_repeat_count from user_tab_histograms where table_name = 'TEST_HYBRID' and column_name = 'N1'; EN EP ERC ---------- ---------- ------- 118 0 118 371 3 113 430 4 59 496 5 66 622 6 126 753 7 131 1000 9 116
  27. 27. Histogram selection • Assuming default parameters then usually: • If NDV <= 254 then Frequency Histogram • Else if top values account for most of the data then Top Freq Histogram • Else Hybrid Histogram • No HB histogram considered when using AUTO_SAMPLE_SIZE • Max number of buckets is 254 in AUTO (manual limit is 2048) 27
  28. 28. SQLT and histograms - Frequency 28
  29. 29. SQLT and histograms – Top Frequency 29
  30. 30. SQLT and histograms – Hydrid 30
  31. 31. Summary • Three types of histograms in 12c • introduced two new types (top frequency and hybrid) • Obsolete height-balanced • Leveraged only when using AUTO_SAMPLE_SIZE • Most of the old limitations have been addressed • Additional granularity (2048 buckets) available if needed 31
  32. 32. 32
  33. 33. Reference • Oracle® Database SQL Tuning Guide 12c Release 1 (12.1) • Master Note: Optimizer Statistics (Doc ID 1369591.1) • How to Gather Optimizer Statistics on 12c (Doc ID 1445302.1) 33
  34. 34. Contact info • Email mauro.pagano@gmail.com • Twitter @mautro • Blog http://mauro-pagano.com 34

×