Firebird: cost-based optimization and statistics, by Dmitry Yemanov (in English)



A basic introduction to the internal mechanisms of the Firebird optimizer: how it works, how it decides which index to use, why it sometimes fails, and what you can do to improve performance. This presentation will not answer all of these questions, but it gives you a basic understanding of the optimizer's internals. It is aimed at experienced developers rather than a general audience.



  1. 1. Cost-based Optimization and Statistics in Firebird<br />Dmitry Yemanov<br />The Firebird Project<br /><br />
  2. 2. Introduction<br /><ul><li>Optimizer decides how to find all the information required in the most efficient way it can
  3. 3. Different queries and/or fetch strategies may benefit from different data access paths
  4. 4. Some information must be available to help the optimizer estimate the best access path
  5. 5. Optimization strategies
  6. 6. Rule-based (heuristics)
  7. 7. Cost-based (statistics)</li></li></ul><li>Rule-based Optimization<br /><ul><li>Heuristic definitions
  8. 8. Indexed retrieval is better than a full table scan (and an indexed loop join is better than a merge join)
  9. 9. B-tree has three levels of depth
  10. 10. Compound indices are better than simple ones
  11. 11. Drawbacks
  12. 12. Indices could be bad for some operations
  13. 13. User intentions are not taken into account
  14. 14. Not ready for “ad hoc” queries</li></li></ul><li>Cost-based Optimization<br /><ul><li>Key points
  15. 15. Every operation has an associated cost value
  16. 16. Cost value is calculated using statistical data
  17. 17. Cost is aggregated from bottom up in the access path
  18. 18. Drawbacks
  19. 19. Complex implementation
  20. 20. Slow optimization process
  21. 21. Requires up-to-date statistics</li></li></ul><li>Basic Terms<br /><ul><li>Selectivity
  22. 22. Represents a fraction of rows from a row set
  23. 23. Lies in the value range 0.0 to 1.0
  24. 24. Cardinality
  25. 25. Represents number of rows in a row set
  26. 26. Base cardinality is the number of rows in a base table</li></li></ul><li>Understanding of Cost<br /><ul><li>Cost
  27. 27. Is a function of the estimated cardinalities
  28. 28. Represents computational complexity of the retrieval
  29. 29. Measurement
  30. 30. Cost value linearly depends on the number of logical reads required to perform an operation
  31. 31. Logical read is equal to a single page fetch
  32. 32. Cost value may also take into account auxiliary steps such as external sorting</li></li></ul><li>Cost Measurement (example)<br /><ul><li>Full table scan
  33. 33. cost = base cardinality
  34. 34. Unique index scan
  35. 35. cost = b-tree level + 1
  36. 36. Range index scan
  37. 37. cost = b-tree level + N + selectivity * base cardinality (N represents the number of required leaf page fetches and thus depends on the average key length)</li></li></ul><li>Cost Aggregation (example)<br />Final Row Set<br />cost = 9000<br />SELECT *<br />FROM T1<br /> JOIN T2 ON T1.PK = T2.FK<br />WHERE T1.VAL + T2.VAL &lt; 100<br />ORDER BY T1.NUM<br />Sort<br />cost = 9000<br />Filter<br />cost = 7000<br />Loop Join<br />cost = 6000<br />Full Scan<br />cost = 1000<br />Index Scan<br />cost = 5<br />
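The three cost formulas from the "Cost Measurement" slide, and the bottom-up aggregation in the example plan, can be sketched in a few lines of Python. This is only an illustration, not Firebird's actual code: the function and parameter names are made up for this sketch, and the loop join formula (outer cost plus one inner probe per outer row) is an assumption that happens to reproduce the slide's figures of 1000, 5 and 6000. The slide does not say how the Filter and Sort costs are derived, so they are left out.

```python
def full_scan_cost(base_cardinality):
    # Slide: cost = base cardinality
    return base_cardinality


def unique_index_scan_cost(btree_level):
    # Slide: cost = b-tree level + 1
    return btree_level + 1


def range_index_scan_cost(btree_level, leaf_pages, selectivity, base_cardinality):
    # Slide: cost = b-tree level + N + selectivity * base cardinality,
    # where N (leaf_pages here) is the number of leaf page fetches.
    return btree_level + leaf_pages + selectivity * base_cardinality


def loop_join_cost(outer_cost, outer_cardinality, inner_cost):
    # ASSUMPTION (not stated on the slide): the inner stream is probed
    # once for every row produced by the outer stream.
    return outer_cost + outer_cardinality * inner_cost


# Values taken from the slide's example plan.
t1_rows = 1000                       # full scan of T1: cost = 1000
outer_cost = full_scan_cost(t1_rows)
inner_cost = 5                       # index scan of T2, taken as given
join_cost = loop_join_cost(outer_cost, t1_rows, inner_cost)
print(join_cost)
```

Under that assumption the join node costs 1000 + 1000 * 5 = 6000, which matches the "Loop Join" figure in the diagram.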
  38. 38. Statistics<br /><ul><li>Information describing data amounts and distribution of values on different levels (table, index, column)
  39. 39. Stored in a database or estimated at runtime
  40. 40. Collected by request or automatically</li></li></ul><li>Core Statistics<br /><ul><li>Number of Rows in a Table (Base Cardinality)
  41. 41. Small tables: number of used record slots on the data pages
  42. 42. Large tables: number of used data pages / average record length
  43. 43. Estimated at runtime via scanning pointer or data pages</li></li></ul><li>Core Statistics (continued)<br /><ul><li>Index Selectivity
  44. 44. 1 / number of distinct keys in the index
  45. 45. Maintained per segment: (A), (A, B), (A, B, C)
  46. 46. Assumes uniform distribution of values
  47. 47. Calculated during index creation or upon request (SET STATISTICS statement)
  48. 48. Stored on the index root page
  49. 49. Visible in RDB$INDICES and RDB$INDEX_SEGMENTS</li></li></ul><li>Decisions Based on Core Statistics<br /><ul><li>Full Table Scan over Indexed Retrieval
  50. 50. Selectivity close to 1.0 suggests a full scan
  51. 51. What Indices to Use
  52. 52. Compare index selectivities and index scan costs
  53. 53. Consider segment operations for compound indices
  54. 54. Calculate selectivities for AND and OR operations
  55. 55. Order of Streams in Loop Joins
  56. 56. Calculate costs for different join orders and choose the best one</li></li></ul><li>Advanced Statistics<br /><ul><li>Table level
  57. 57. Average page fill factor
  58. 58. Average row length (both help with a better base cardinality estimation)
  59. 59. Number of rows (avoids the runtime page scan)</li></li></ul><li>Advanced Statistics (continued)<br /><ul><li>Index level
  60. 60. B-tree depth
  61. 61. Average key length (both help with a better cost estimation for index scans)
  62. 62. Clustering factor (allows index navigation to be preferred over an external sort under some conditions; could also be used to avoid filling the sparse bitmap)</li></li></ul><li>Clustering Factor<br />Index Key 1<br />Data Page 12<br />Index Key 2<br />Data Page 25<br />Data Page 12<br />Index Key 3<br />Data Page 28<br />Data Page 13<br />Index Key 4<br />Data Page 44<br />Data Page 14<br />Index Key 5<br />Data Page 57<br />Bad Clustering Factor<br />Good Clustering Factor<br />
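One way to put a number on the good and bad cases in the diagram is to count how often the data page changes while walking the index in key order. The sketch below uses that definition, which is common in other database systems; whether Firebird would compute its clustering factor this way is an assumption, and the page lists are hypothetical values that only loosely echo the page numbers on the slide.

```python
def clustering_factor(data_pages_in_key_order):
    """Count data page switches while reading rows in index key order.

    ASSUMPTION: this page-switch count is one common definition of a
    clustering factor; the slides do not define the exact formula.
    """
    switches = 0
    previous = None
    for page in data_pages_in_key_order:
        if page != previous:
            switches += 1       # a different page must be fetched
            previous = page
    return switches


# Hypothetical page sequences for five consecutive index keys.
well_clustered = [12, 12, 13, 13, 14]   # neighbouring keys share pages
scattered = [12, 25, 28, 44, 57]        # every key lands on another page

print(clustering_factor(well_clustered))
print(clustering_factor(scattered))
```

A lower count means that rows adjacent in the index are also adjacent on disk, so navigating the index costs few page fetches and can compete with an external sort.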
  63. 63. Advanced Statistics (continued)<br /><ul><li>Column level
  64. 64. Selectivity (core feature, required to estimate costs)
  65. 65. Number of NULLs (useful for selectivity estimations for IS [NOT] NULL)
  66. 66. Value distribution histogram (allows selectivity estimations for non-uniform value distributions)</li></li></ul><li>Sample Histograms<br />1. Non-Selective Column<br />&apos;A&apos;<br />&apos;B&apos;<br />&apos;C&apos;<br />&apos;D&apos;<br />2. Selective Column<br />1 5 5 5 10 20 50 50 80 100<br />
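To see why a histogram helps with non-uniform data, compare the uniform estimate (1 / number of distinct values, as used for index selectivity earlier in the deck) with a frequency-based estimate. The sketch below treats the ten numbers from the "Selective Column" sample as raw column values, which is an assumption, since the slide may equally intend them as bucket boundaries. The function names are invented for this illustration.

```python
from collections import Counter

# The ten numbers shown on the slide, read here as raw column values
# (ASSUMPTION: they could also be histogram bucket boundaries).
values = [1, 5, 5, 5, 10, 20, 50, 50, 80, 100]


def uniform_selectivity(column_values):
    # Uniform-distribution estimate: every distinct value is assumed
    # to match the same fraction of rows.
    return 1 / len(set(column_values))


def histogram_selectivity(column_values, key):
    # Frequency-based estimate: the fraction of rows actually equal to key.
    counts = Counter(column_values)
    return counts[key] / len(column_values)


print(uniform_selectivity(values))          # same estimate for any key
print(histogram_selectivity(values, 5))     # a frequent value
print(histogram_selectivity(values, 80))    # a rare value
```

With seven distinct values the uniform estimate is 1/7 for every key, while the frequency-based estimate is 3/10 for the value 5 and 1/10 for the value 80, so the uniform assumption underestimates the frequent value and overestimates the rare one.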
  67. 67. Decisions Based on Advanced Statistics<br /><ul><li>Sort Aggregation vs Hash Aggregation
  68. 68. Selectivity of columns being grouped by
  69. 69. Loop Join vs Merge Join vs Hash Join
  70. 70. Cardinality of tables and filtering predicates
  71. 71. Index Usage
  72. 72. Number of NULLs or histogram
  73. 73. Index Navigation vs External Sorting
  74. 74. Clustering factor</li></li></ul><li>The Firebird<br />