Cost Based Optimizer - Part 2 of 2


This is a presentation that describes how Oracle uses histograms to make decisions on SQL query execution. To see the actual webinar and demo, go

Published in: Technology, Business

  • Note that without properly collected statistics, the CBO will do one of two things:
      • If no statistics exist for any object used in the SQL statement, the CBO may use rule-based optimization (prior to v10) or use dynamic sampling.
      • If statistics exist for some objects in the SQL statement but not for others, the CBO may use a set of default statistics for the objects without statistics, or use dynamic sampling.
    CBO default statistics for objects without collected stats (prior to v10; in v10, dynamic sampling is typically used instead of defaults):

    TABLE SETTING                   DEFAULT STATISTICS
    cardinality                     number of blocks * (block size - cache layer) / average row length
    average row length              100 bytes
    number of blocks                100, or actual value based on the extent map
    remote cardinality (distrib)    2000 rows
    remote average row length       100 bytes

    INDEX SETTING                   DEFAULT STATISTICS
    levels                          1
    leaf blocks                     25
    leaf blocks/key                 1
    data blocks/key                 1
    distinct keys                   100
    clustering factor               800
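The default table cardinality formula in the note above can be worked through with a small Python sketch. This is only an illustration of the arithmetic, not Oracle's implementation: the function name is made up, and the `cache_layer` value of 24 bytes is a hypothetical placeholder, since the note does not give the real figure.

```python
def default_table_cardinality(num_blocks, block_size, cache_layer, avg_row_len=100):
    """Row-count estimate for a table with no statistics, per the note's formula:
    number of blocks * (block size - cache layer) / average row length."""
    usable_bytes = num_blocks * (block_size - cache_layer)
    return usable_bytes // avg_row_len


# Defaults from the note: 100 blocks, 100-byte rows.
# block_size=8192 and cache_layer=24 are assumptions for illustration only.
estimate = default_table_cardinality(num_blocks=100, block_size=8192, cache_layer=24)
print(estimate)  # 8168
```

With these assumed inputs, a table that really holds millions of rows would be costed as if it held about eight thousand, which is one way missing statistics lead to poor plans.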
  • Plot A illustrates a situation in which the execution plan does not change, but the query response time varies significantly as the number of rows in the table changes. This kind of thing occurs when an application chooses a TABLE ACCESS (FULL) execution plan for a growing table. It’s what causes RBO-based applications to appear fast in a small development environment, but then behave poorly in the production environment.
    Plot B illustrates the marginal improvement that’s achievable, for example, by distributing an inefficient application’s workload more uniformly across the disks in a disk array. Notice that the execution plan (or “shape of the performance curve”) isn’t necessarily changed by such an operation (although, if the output of dbms_stats.gather_system_statistics changes as a result of the configuration change, then the plan might change). The performance for a given number of rows might change, however, as the plot here indicates.
    Plot C illustrates what is commonly the most profound type of performance change: an execution plan change. This situation can be caused by a change to any of the CBO’s inputs. For example, an accidental deletion of a segment’s statistics can change a plan from a nice fast plan (depicted by the green curve, which is O(log n)) to a horrifically slow plan (depicted by the red curve, which is O(n^2)). The phenomenon illustrated in plot C is what has happened when a query that was fast last week now runs for 14 hours without completing before you finally give up and kill the session.
  • Since the CBO determines the selectivity of predicates that appear in queries, it is important that there be adequate information for the CBO to make its estimates properly. By gathering histogram data, the CBO can make improved selectivity estimates in the presence of data skew, resulting in better execution plans for non-uniform data distributions. The histogram approach provides an efficient and compact way to represent data distributions. Selectivity estimates are used to decide when to use an index and the order in which to join tables. Many table columns are not uniformly distributed. Therefore, the normal calculations for selectivity may not be accurate without the use of histograms.
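To make the effect of skew on selectivity concrete, here is a minimal Python sketch. It assumes the simplest possible model: without a histogram, an equality predicate is estimated as 1 / (number of distinct values), and with a histogram the actual per-value counts are used. The function names and the `status_counts` data are invented for illustration and are not Oracle internals.

```python
def uniform_selectivity(num_distinct):
    """Estimate for `col = value` when values are assumed evenly distributed."""
    return 1.0 / num_distinct


def histogram_selectivity(value_counts, value):
    """Estimate for `col = value` when per-value row counts are known."""
    total_rows = sum(value_counts.values())
    return value_counts.get(value, 0) / total_rows


# Hypothetical skewed column: 10 open orders, 990 closed orders.
status_counts = {"open": 10, "closed": 990}

print(uniform_selectivity(len(status_counts)))          # 0.5 for either value
print(histogram_selectivity(status_counts, "open"))     # 0.01
print(histogram_selectivity(status_counts, "closed"))   # 0.99
```

Under the uniform assumption both values look identical (half the table), so the same plan would be chosen for a predicate that matches 10 rows and one that matches 990.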
  • Height-balanced histograms put approximately the same number of values into each interval, so that the endpoints of the interval are determined by the number of values in that interval. Only the last (largest) value in each bucket appears as the bucket (endpoint) value. A height-balanced histogram will be created if the number of histogram buckets ( SIZE ) is smaller than the number of distinct values in the column.
    Frequency histograms (sometimes called value-based histograms) are created when the number of histogram buckets ( SIZE ) specified is greater than or equal to the number of distinct column values. In frequency histograms, every individual value in the column has a corresponding bucket, and the bucket (endpoint) number is the cumulative row count up to and including that value, so the difference between consecutive endpoint numbers gives the repetition count of each value.
    The type of histogram is stored in the HISTOGRAM column of the *TAB_COL_STATISTICS views. The column can have values of HEIGHT BALANCED , FREQUENCY , or NONE . The SIZE of a histogram can be set by you or automatically by Oracle when the histogram is collected. The default SIZE (when no SIZE is specified) is 75. The maximum SIZE is 254 in the Oracle releases discussed here.
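The choice between the two histogram types described above can be sketched in Python. This is a simplified model under stated assumptions, not Oracle's algorithm: the function names are invented, the frequency histogram is stored as (value, cumulative count) pairs, and the height-balanced histogram keeps only the largest value in each bucket (Oracle also records the column minimum, which is omitted here).

```python
from collections import Counter


def frequency_histogram(values):
    """One entry per distinct value: (value, cumulative row count)."""
    counts = Counter(values)
    running_total = 0
    histogram = []
    for value in sorted(counts):
        running_total += counts[value]
        histogram.append((value, running_total))
    return histogram


def height_balanced_histogram(values, buckets):
    """Largest value in each of `buckets` roughly equal-sized slices of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    endpoints = []
    for b in range(1, buckets + 1):
        last_index = (b * n + buckets - 1) // buckets - 1  # ceil(b * n / buckets) - 1
        endpoints.append(ordered[last_index])
    return endpoints


def build_histogram(values, size):
    """Frequency if NDV <= SIZE, otherwise height-balanced, as the note describes."""
    ndv = len(set(values))
    if ndv <= size:
        return "FREQUENCY", frequency_histogram(values)
    return "HEIGHT BALANCED", height_balanced_histogram(values, size)


column = [1] * 5 + [2] * 3 + [3] * 2   # hypothetical column with 3 distinct values
print(build_histogram(column, 10))     # ('FREQUENCY', [(1, 5), (2, 8), (3, 10)])
print(build_histogram(column, 2))      # ('HEIGHT BALANCED', [1, 3])
```

In the height-balanced case, a value that appears as the endpoint of more than one bucket is a "popular" value, which is how skew shows up in that representation.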
  • DBMS_STATS Constants
      • SIZE REPEAT: Causes the histograms to be created with the same options as the last time you created them. It reads the data dictionary to figure out what to do.
      • SIZE AUTO: Oracle looks at the data and, using a magical, undocumented and changing algorithm, figures out all by itself which columns to gather stats on and how many buckets to use. It will collect histograms in memory only for those columns which are used by your applications (those columns appearing in a predicate involving equality, range, or LIKE operators). It knows that a particular column was used by an application because, at parse time, it stores workload information in the SGA. It will then store a histogram in the data dictionary only if the column has skewed data (and is worthy of a histogram).
      • SIZE SKEWONLY: When you collect histograms with the SIZE option set to SKEWONLY , it collects histogram data in memory for all specified columns (if you do not specify any, all columns are used). Once an "in-memory" histogram is computed for a column, it is stored in the data dictionary only if it has "popular" values (multiple endpoints with the same value, which is what is meant by "there is skew in the data").
  • In Oracle version 8, the use of bind variables in a predicate effectively disables the use of histograms. This is because the optimizer needs to know the value ( WHERE col = 'x' ) in order to check the histogram statistics for the selectivity of that value. When a bind variable is used, it is not actually bound into the query until execution time. Since the execution plan is determined in the parse phase, the optimizer won't know the value and thus can't use the histogram to make its decision.
    In Oracle version 9, the optimizer behavior regarding bind variables changed slightly. In version 9, when a query is initially parsed, the optimizer will "peek" at the value of the bind variable and use the value it finds to make decisions. Does that make the situation better or worse? It depends. Let's say that when the query is initially parsed, it has a bind variable value of 1 being used in the predicate. If the column has a histogram and the histogram indicates that selectivity is low for that value (few rows match), then the optimizer will likely choose to use an index on that column if one is available. Everything works well, performance is sub-second and everyone is happy.
    Now, what happens if the query is executed a second time but passes the value of 0 in the bind variable (and the selectivity for the value 0 is high, meaning lots of rows match)? The original plan is still used and the query will attempt to use the same index. If there are thousands of records in the row source, it is likely that the index scan will perform significantly worse than simply doing a full table scan. In this case, everything works but performance stinks and complaints arise.
    So, what do you do? For some, the best solution is to not use bind variables when you have a column with a limited number of skewed values, and to just hard-code the value you need. The best way to know what to do is to test different approaches to find what works best for your environment.
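The bind-peeking scenario in this note can be simulated with a toy Python model. Everything here is an assumption made for illustration: the class name, the 10% selectivity threshold for choosing an index, and the plan labels are invented, and the real optimizer's costing is far more involved. The sketch only shows the mechanism the note describes: the plan is chosen from the first peeked value and then reused.

```python
class PeekingOptimizer:
    """Toy model: picks a plan on the first (hard) parse by peeking at the bind value."""

    def __init__(self, value_counts, index_threshold=0.10):
        self.value_counts = value_counts          # stands in for a frequency histogram
        self.total_rows = sum(value_counts.values())
        self.index_threshold = index_threshold    # hypothetical cut-off, not an Oracle setting
        self.plan_cache = {}                      # SQL text -> cached plan

    def selectivity(self, value):
        return self.value_counts.get(value, 0) / self.total_rows

    def execute(self, sql_text, bind_value):
        if sql_text not in self.plan_cache:
            # Hard parse: peek at the bind value and consult the histogram.
            if self.selectivity(bind_value) <= self.index_threshold:
                self.plan_cache[sql_text] = "INDEX RANGE SCAN"
            else:
                self.plan_cache[sql_text] = "TABLE ACCESS FULL"
        # Later executions reuse the cached plan regardless of the bind value.
        return self.plan_cache[sql_text]


sql = "SELECT * FROM t WHERE status = :b1"
optimizer = PeekingOptimizer({0: 9900, 1: 100})  # value 0 is the popular one

first = optimizer.execute(sql, 1)    # few rows match, so an index plan is chosen
second = optimizer.execute(sql, 0)   # many rows match, but the cached plan is reused
print(first, "|", second)            # INDEX RANGE SCAN | INDEX RANGE SCAN
```

If the first execution had instead passed 0, the cached plan would be a full table scan for every later value, which is the same problem in the other direction.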
  • The RBO workaround is forgivable because it’s all the RBO environment could offer as an option. The CBO technique shown here is particularly bad because it makes the application less flexible and therefore less able to respond appropriately to system changes. Ideally, if you (the developer) already know that data for certain columns tends to skew, you can write code to account for it. A good guideline to follow is to look at the number of distinct values in the column. If the column has only a few distinct values, then hard-coding the value will allow the optimizer to correctly choose the plan based on histogram data. If there are a lot of distinct values, but you know in advance the actual skewed values, you could write conditional code to use a bind variable in all cases except when the known skewed values are requested. In that case, the conditional code would branch to a SQL statement version which hard-codes the skewed value under those circumstances.
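The conditional approach described in this note (a bind variable in the general case, a hard-coded literal for values known to be skewed) might look roughly like the following Python sketch. The table name `orders`, the column `status`, the function name, and the set of skewed values are all hypothetical; the function only builds the SQL text and bind dictionary and does not talk to a database.

```python
# Hypothetical: values the developer already knows to be heavily skewed.
KNOWN_SKEWED_STATUSES = frozenset({0})


def build_status_query(status, known_skewed=KNOWN_SKEWED_STATUSES):
    """Return (sql_text, binds) for a lookup by status.

    Known skewed values get their own statement with a literal, so the optimizer
    can consult the histogram for exactly that value. Everything else shares one
    statement that uses a bind variable.
    """
    if status in known_skewed:
        # The literal comes from a fixed whitelist, never from free-form user input.
        return f"SELECT * FROM orders WHERE status = {int(status)}", {}
    return "SELECT * FROM orders WHERE status = :status", {"status": status}


print(build_status_query(0))  # literal version for the skewed value
print(build_status_query(7))  # shared bind-variable version
```

Restricting literals to a small, fixed set keeps the number of distinct statements in the shared pool bounded, which is the reason the note suggests this only when the skewed values are known in advance.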
  • Cost Based Optimizer - Part 2 of 2

    1. Cost Based Optimizer – 2 of 2. Hotsos Enterprises, Ltd., Grapevine, Texas. Oracle. Performance. Now.
    2. Agenda
        • Cost Based Optimizer and its impact on performance
        • Skewed Data
        • Histograms
        • Impact
            • Performance (Logical I/O Impact)
            • Performance (Join Strategy)
            • Bind Variables
            • Cardinality and Cost
        • Conclusion
    3. Cost Based Optimizer
    4. Cost Based Optimizer (CBO)
        • In reality, the CBO is a complex piece of decision-making software
            • Uses several database initialization parameters
                • These are listed in the 10053 trace file
            • Uses several session-level parameters
                • These are parameters at the session level that override the database initialization parameters
            • Uses statistics about the objects (tables, indexes)
            • Uses hints to the optimizer
            • Uses statistics about the system (CPU, disk, etc.)
            • Uses this information to decide on the “best way” to generate an execution plan
            • Uses information about the skew of a column, if that information has been gathered
    5. CBO will be part of your life if you keep working with Oracle.
        • The cost-based query optimizer (CBO)…
            • Uses data from a variety of sources
            • Estimates the costs of several execution plans
            • Chooses the plan it estimates to be the least expensive
        • Characteristics
            • Adapts to changing circumstances
            • Frustrating if you don’t know what it considers as input
                • Works great if you know how to use it
                • But produces very poor results if you lie to it
            • The only query optimizer supported by Oracle Corporation from release 10 onward
    6. The cost-based query optimizer chooses the plan that it computes as having the lowest estimated cost.
        • Don’t assume the following are identical
            • CBO’s estimated cost of an execution plan
            • The actual cost of an execution plan
        • CBO’s cost estimate can be imperfect
            • Are your CBO inputs perfect?
            • CBO isn’t perfect, but by 9.2 it’s almost always good enough
        • Without properly collected statistics, the CBO will
            • use RBO if no statistics exist on any object in the statement
            • use default statistics if statistics exist for a single object in the statement but not others
            • use dynamic sampling to generate statistics (based on parameter setting and Oracle version)
    7. Cost Based Optimizer
    8. Execution plan changes can result in profoundly different application performance.
        • Table size change
        • Device latency change
        • Execution plan change
        • Type C performance changes are the most profound
        [Plots, annotated “size change” and “performance change”]
    9. Recap
        • The CBO is a complex piece of software
        • It uses several data points to calculate the cost of each execution plan and chooses the plan with the lowest cost
        • It is dynamic and adapts to changing data better than the Rule Based Optimizer
        • A good understanding of the Cost Based Optimizer is imperative for understanding the rationale behind some of its choices
    10. Skewed Data
    11. Skewed Data
        • Skewed data is data whose distribution is not uniform
        • A good example is the owner column of dba_objects
        • The column is highly skewed
        • SELECT owner, COUNT(*) FROM dba_objects GROUP BY owner;
    12. Some kinds of data skew naturally; some don’t.
        • Guaranteed to be skewed
            • E.g., status attribute (open | closed) of a sales order table
        • Possibly not skewed
            • E.g., sale date attribute of a sales order table
    13. Histograms
    14. What are the costs and benefits of histograms?
        • Benefits of histograms
            • CBO sometimes needs the information to make good decisions
        • Costs of histograms
            • Computing histograms consumes extra computing capacity during statistics collection
            • Some CPU time and extra latching are required during plan determination for the optimizer to consider histograms
    15. Histograms provide the optimizer with better information from which to derive an execution plan for a query.
        • A histogram is a graphic representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose heights represent corresponding frequencies
        • Oracle implements histograms in two ways
            • Height-balanced: created if column NDV > SIZE
            • Frequency: created if column NDV <= SIZE
    16. Types of Histograms
        • Frequency
            • Every distinct value in the column has a count of how many times that value occurs
        • Height Balanced Histograms
            • Every bucket holds approximately the same number of rows; each bucket covers a range of column values, identified by its endpoint value
    17. Frequency Histogram
    18. Height Balanced Histogram
    19. Histograms can be gathered by setting the METHOD_OPT parameter.
        • For a specific column:
            FOR COLUMNS column_x SIZE <n|REPEAT|AUTO|SKEWONLY>
        • For all the columns in a table:
            FOR ALL COLUMNS
        • For only the columns that have an index:
            FOR ALL INDEXED COLUMNS
        EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname=>'OP', tabname=>'my_table', method_opt=>'FOR COLUMNS column_x SIZE 10')
    20. Histograms are not useful in all cases.
        • Histograms are not useful for columns with the following characteristics:
            • All (or most) predicates on the column use bind variables
            • The column data is uniformly distributed
            • The column is unique and is used only with equality predicates
            • Data distribution changes frequently and statistics aren't collected to match
    21. Even in the most recent Oracle versions, histogram optimization doesn’t completely work with bind variables.
        • Oracle version 8
            • Use of bind variables prohibits histogram optimization
        • Oracle version 9 and above
            • Oracle query optimizer “peeks” at bind value to use histogram optimization
            • But only on initial hard parse of a query
    22. Be prepared for how application developers might have worked around skew problems.
        • The old-fashioned RBO technique
            • Create the index
            • Hard-code the selective query with “ status=1 ”
            • Hard-code the un-selective query with “ status+0=1 ”
        • A CBO technique
            • Create the index
            • Hard-code the selective query with /*+ index(t) */
            • Hard-code the un-selective query with /*+ full(t) */
            • Don’t resort to either of these!
    23. Where Histogram Information is Stored
        • DBA_TAB_HISTOGRAMS
        • DBA_TAB_COL_STATISTICS
    25. Impact: Performance in Terms of Logical I/Os
    26. Demo Cardinality
    27. Demo Join Cardinality
    28. Recap
        • Histograms can be really useful when gathered on skewed columns
        • Histograms are specific to your data and version
        • Test it out and prove that gathering histograms is beneficial
        • Be careful with bind variables, as histograms may not be used when they are substituted for literals