3. 3
Query optimizer needs histograms
●
Possible query plans:
– orders(in given period)→lineitem
– lineitem(price >1000)→orders
●
The speed of each plan depends on how many rows match the conditions (aka condition selectivity)
●
Selectivity estimates are provided by a histogram:
– selectivity(tbl.col BETWEEN c1 AND c2)
– selectivity(tbl.col > c3)
– selectivity(tbl.col = c4)
– ...
●
MySQL/MariaDB: use index as a histogram
– Expensive
Example query:
select *
from
  orders, lineitem
where
  o_orderkey=l_orderkey and
  o_orderdate between $DATE1 and $DATE2 and
  l_extendedprice > 1000
[Diagram: join of orders and lineitem]
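To make the plan-choice point concrete, here is a minimal Python sketch. The table sizes and selectivities are made-up illustration values and `rows_from_first_table` is a hypothetical helper; a real optimizer's cost model is far more detailed than "fewest rows from the first table wins".

```python
# Toy illustration: which table should the join start from?
# All numbers below are hypothetical, not measurements.

def rows_from_first_table(table_rows: int, selectivity: float) -> float:
    """Estimated number of rows that survive the condition on the first table."""
    return table_rows * selectivity

# Hypothetical table sizes and condition selectivities.
orders_rows, sel_orderdate = 1_500_000, 0.01   # o_orderdate between ...
lineitem_rows, sel_price = 6_000_000, 0.90     # l_extendedprice > 1000

plans = {
    "orders -> lineitem": rows_from_first_table(orders_rows, sel_orderdate),
    "lineitem -> orders": rows_from_first_table(lineitem_rows, sel_price),
}
best_plan = min(plans, key=plans.get)
print(best_plan, plans)
```

With different selectivities the other plan can win, which is why the estimates matter.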
4. 4
Basic histogram: Equi-width
●
Equi-width
●
Pre-defined bucket bounds
●
Accuracy is determined by max(bucket_size), which can be large:
– A few “outlier” values in peripheral buckets
– Most values are in a few very popular buckets
●
What if densely-populated regions had more buckets?
[Figure: equi-width histogram, x-axis from 10 to 80]
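A small Python sketch of the problem, assuming made-up skewed data (most values between 30 and 39, a few outliers at both ends). The function name and the data are illustrative only.

```python
from collections import Counter

def equi_width_histogram(values, lo, hi, n_buckets):
    """Count values per bucket; all buckets have the same width."""
    width = (hi - lo) / n_buckets
    counts = Counter()
    for v in values:
        # Clamp so that v == hi lands in the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return [counts[i] for i in range(n_buckets)]

# Hypothetical skewed data: 950 values in 30..39, plus outliers at 10 and 79.
data = [30 + (i % 10) for i in range(950)] + [10] * 25 + [79] * 25
hist = equi_width_histogram(data, lo=10, hi=80, n_buckets=7)
print(hist)                    # rows per bucket
print(max(hist) / len(data))   # share of rows in the largest bucket
```

Here one bucket ends up holding almost all rows, so the histogram says very little about where inside 30..39 the values are.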
5. 5
Equi-height histogram
●
aka “Height-balanced”
●
Let all buckets have the same number of rows
– Densely populated areas get more buckets
– Sparsely populated areas get fewer
●
Accuracy: max bucket_size = n_rows / n_buckets
●
Can be collected with one pass over a sorted dataset
●
Widely used in databases
[Figure: histograms over the value range 10 to 80; each equi-height bucket holds the same share of rows (labelled 15%)]
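A minimal Python sketch of equi-height collection and of a range estimate built on it. The linear interpolation inside a bucket is an assumption of this sketch, not necessarily what any particular server does, and the uniform data is made up.

```python
def equi_height_bounds(sorted_values, n_buckets):
    """Pick the value found at each bucket boundary of the sorted data."""
    n = len(sorted_values)
    bounds = [sorted_values[(i * n) // n_buckets] for i in range(n_buckets)]
    bounds.append(sorted_values[-1])  # upper bound of the last bucket
    return bounds

def selectivity_less_than(bounds, c):
    """Estimate the fraction of rows with value < c, interpolating linearly
    inside the bucket that contains c (an assumption of this sketch)."""
    n_buckets = len(bounds) - 1
    if c <= bounds[0]:
        return 0.0
    if c > bounds[-1]:
        return 1.0
    for i in range(n_buckets):
        lo, hi = bounds[i], bounds[i + 1]
        if c <= hi:
            inside = (c - lo) / (hi - lo) if hi > lo else 0.0
            return (i + inside) / n_buckets
    return 1.0

data = sorted(range(10, 81))   # hypothetical uniform data: 10..80
bounds = equi_height_bounds(data, n_buckets=7)
print(bounds)
print(selectivity_less_than(bounds, 45))
```

Each full bucket contributes exactly 1/n_buckets, so the error of a range estimate is bounded by the size of the one bucket that the constant falls into.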
6. 6
Summary so far
●
Query optimizer uses histograms to get condition
selectivity estimates
– Essential for good query plans
●
Equi-height (or Height-balanced) histograms are good
– Accurate
– Easy to collect
– Widely used in databases
10. 10
Histograms in MariaDB 10.4: internals
●
Works well for
– dense datatypes
– range conditions (column > const)
●
Using fractions loses precision: not so good for
– column = ‘[popular] value’
– Data with very popular values that are close to unpopular ones
– VARCHARs
●
long values with a common prefix
[Figure: equi-height histogram over the value range 10 to 80, each bucket labelled 15%]
Stored as:
min_value=10
max_value=80
bucket bounds as fractions of the range: 0.0, 0.42, 0.6, 0.7, 0.85, 0.91, 0.93, 1.0
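The precision loss can be sketched in Python. This assumes that a bound is stored as a fixed-point fraction of the [min_value, max_value] range in 1 byte (SINGLE_PREC_HB) or 2 bytes (DOUBLE_PREC_HB); `encode_bound` and `decode_bound` are hypothetical helpers, and the server's exact rounding may differ.

```python
def encode_bound(value, min_value, max_value, n_bytes):
    """Store a bucket bound as a fixed-point fraction of [min_value, max_value]."""
    steps = 2 ** (8 * n_bytes) - 1   # 255 for 1 byte, 65535 for 2 bytes
    fraction = (value - min_value) / (max_value - min_value)
    return round(fraction * steps)

def decode_bound(code, min_value, max_value, n_bytes):
    """Map the stored code back to an (approximate) column value."""
    steps = 2 ** (8 * n_bytes) - 1
    return min_value + (max_value - min_value) * code / steps

min_value, max_value = 10, 80
for n_bytes in (1, 2):
    a = encode_bound(39.4, min_value, max_value, n_bytes)
    b = encode_bound(39.5, min_value, max_value, n_bytes)
    print(n_bytes, a, b, a == b)
```

With 1 byte the two nearby values collapse into the same code, so the histogram cannot tell them apart. With 2 bytes they are still distinct, but values that are closer together (or strings that share a long prefix) will collide in the same way.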
11. 11
Precision issue (MDEV-26125)
●
Test dataset: 1M rows; the distribution of “country” matches real countries’ populations.
set histogram_type=double_prec_hb;
analyze table generated_pop persistent for all;
analyze select * from generated_pop where country='China';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 22.66 | 21.04 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Chile';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 22.66 | 0.25 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Sweden';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 4.69 | 0.15 | Using where |
..-+---------+------------+----------+------------+-------------+
Here filtered is the estimated selectivity and r_filtered is the observed selectivity (both in %).
– OK for China: the estimate is close
– Very inaccurate for its “neighbor” Chile!
– Better for Sweden
12. 12
Histograms in MariaDB 10.8 (originally 10.7)
●
Started as GSoC 2021 project by Michael Okoko
– Got the basic variant working
– Then, improved by Sergei Petrunia
●
Initially targeted MariaDB 10.7, missed it.
●
Now it’s getting into MariaDB 10.8
13. 13
Histograms in MariaDB 10.8
●
New histogram_type=JSON_HB
– The new default type
– The old SINGLE_PREC_HB and DOUBLE_PREC_HB types are still available
●
Height-balanced*
●
Uses JSON for storage
– Stores exact bucket endpoints, not fractions.
14. 14
JSON_HB under the hood
{
  "target_histogram_size": 254,
  "collected_at": "2022-01-16 18:29:46",
  "collected_by": "10.8.0-MariaDB-log",
  "histogram_hb": [
    {
      "start": "Afghanistan",
      "size": 0.003938,
      "ndv": 2
    },
    ...
    {
      "start": "Chad",
      "size": 0.002678,
      "ndv": 2
    },
    {
      "start": "China",
      "size": 0.210369,
      "ndv": 1
    },
    ...
  ]
}
●
start – start value of the bucket
●
size – fraction of table rows in the bucket
– May vary
– So, not really “equi-height”
– The reason: “popular” values are in their own buckets
●
ndv – Number of Distinct Values in the bucket
– Important special case: ndv=1
●
Dataset: 1M-row population table; the “country” distribution matches real countries’ populations.
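A hedged Python sketch of how an equality estimate could be derived from these fields. The formula size / ndv (rows spread evenly over the distinct values of a bucket) is an assumption of this sketch; it is consistent with the numbers on these slides but is not taken from the server source. Only the three buckets shown in the JSON are used, and Python's string ordering stands in for the column's collation.

```python
import bisect

# Only the three buckets shown on the slide; the elided ones are left out.
buckets = [
    {"start": "Afghanistan", "size": 0.003938, "ndv": 2},
    {"start": "Chad", "size": 0.002678, "ndv": 2},
    {"start": "China", "size": 0.210369, "ndv": 1},
]
starts = [b["start"] for b in buckets]

def equality_selectivity(value):
    """Assumed formula: rows in the bucket are spread evenly over its ndv values."""
    idx = bisect.bisect_right(starts, value) - 1
    if idx < 0:
        return 0.0  # value sorts before the first bucket
    bucket = buckets[idx]
    return bucket["size"] / bucket["ndv"]

print(equality_selectivity("China"))  # ndv=1: the bucket size is the estimate
print(equality_selectivity("Chile"))  # falls into the bucket starting at "Chad"
```

'Chile' sorts between 'Chad' and 'China', so it lands in the two-value bucket and gets half of that bucket's size, about 0.13%. With ndv=1, 'China' gets its bucket's full size, about 21.04%.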
15. 15
JSON_HB fixes MDEV-26125
●
The same 1M population dataset, with JSON histogram:
set histogram_type=json_hb;
analyze table generated_pop persistent for all;
analyze select * from generated_pop where country='China';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 21.04 | 21.04 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Chile';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 0.13 | 0.25 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Sweden';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 0.08 | 0.15 | Using where |
..-+---------+------------+----------+------------+-------------+
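– China: accurate, as before (21.04 estimated vs. 21.04 observed)
– Chile: much better (0.13 estimated vs. 0.25 observed)
– Sweden: also better (0.08 estimated vs. 0.15 observed)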
17. 17
Histograms in MySQL 8.0
●
Also stored in JSON
●
Two types: “singleton” or “equi-height”*
●
Different syntax to collect:
ANALYZE TABLE t UPDATE HISTOGRAM ON col ...
●
Different
– collection algorithm
– estimation formulas
18. 18
Histograms in MySQL 8.0
A bucket is a JSON array of [min, max, cumulative_fraction, ndv]:
●
min, max – base64-encoded
●
cumulative_fraction
– Buckets may have different sizes:
– Popular values are in their own buckets
– “Each value should be only in one bucket” (not in two adjacent buckets)
●
Should we also do this?
– Suboptimal choice of bucket bounds: https://bugs.mysql.com/bug.php?id=104789
●
ndv
{
  "buckets": [
    [
      "base64:type254:QWZnaGFuaXN0YW4=",
      "base64:type254:QW5kb3JyYQ==",
      0.009284244225931855,
      4
    ],
    [
      "base64:type254:QW5nb2xh",
      "base64:type254:QXVzdHJhbGlh",
      0.02095815229819346,
      5
    ],
    ...
  ],
  "data-type": "string",
  "null-values": 0.0,
  "collation-id": 255,
  "last-updated": "2022-01-16 16:55:...",
  "sampling-rate": 0.0873173443400688,
  "histogram-type": "equi-height",
  "number-of-buckets-specified": 100
}
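To read such a histogram by hand, the endpoints have to be base64-decoded and the cumulative fractions turned into per-bucket sizes. A minimal Python sketch over the two buckets shown above; `decode_endpoint` is a hypothetical helper, and the UTF-8 decoding is an assumption that holds for these ASCII values.

```python
import base64

# The two buckets shown above, copied verbatim.
buckets = [
    ["base64:type254:QWZnaGFuaXN0YW4=", "base64:type254:QW5kb3JyYQ==",
     0.009284244225931855, 4],
    ["base64:type254:QW5nb2xh", "base64:type254:QXVzdHJhbGlh",
     0.02095815229819346, 5],
]

def decode_endpoint(encoded: str) -> str:
    """Strip the 'base64:typeNNN:' prefix and decode the payload."""
    payload = encoded.split(":", 2)[2]
    return base64.b64decode(payload).decode("utf-8")

decoded = []
previous_cumulative = 0.0
for lo, hi, cumulative, ndv in buckets:
    size = cumulative - previous_cumulative  # per-bucket fraction of rows
    decoded.append((decode_endpoint(lo), decode_endpoint(hi), size, ndv))
    previous_cumulative = cumulative

for row in decoded:
    print(row)
```

The first bucket spans 'Afghanistan' to 'Andorra' and the second 'Angola' to 'Australia'; the second one holds roughly 1.17% of the rows.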
19. 19
MDEV-26125 on MySQL 8
●
The same generated 1M world population dataset.
analyze table generated_pop update histogram on Country [with 100 buckets];
explain select count(*) from generated_pop where country='China';
..+--------+----------+-------------+
| rows | filtered | Extra |
..+--------+----------+-------------+
| 995148 | 21.25 | Using where | -- 21.04
..+--------+----------+-------------+
explain select count(*) from generated_pop where country='Chile';
..+--------+----------+-------------+
| rows | filtered | Extra |
..+--------+----------+-------------+
| 995148 | 0.0 | Using where | -- 0.25
..+--------+----------+-------------+
explain select count(*) from generated_pop where country='Sweden';
..-+--------+----------+-------------+
| rows | filtered | Extra |
..-+--------+----------+-------------+
| 995148 | 0.0 | Using where | -- 0.15
..-+--------+----------+-------------+
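Estimates vs. observed values (shown after --): good for China, bad for Chile and Sweden.
●
Increasing @@histogram...mem_size didn’t help
●
Setting the number of buckets higher than the number of countries did help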
20. 20
MDEV-26125 on MySQL 8
●
Trying a few more conditions
●
Histogram has 100 buckets
Selectivity in %:
Condition                  real    MySQL-8  MariaDB
Country='China'            21.04   20.92    21.04
Country='Chile'             0.25    0.00     0.25
Country like 'Chile%'       0.25   11.11     0.45
Country='Sweden'            0.17    0.00     0.15
Country='Pakistan'          2.58    2.59     2.58
Country='Egypt'             0.50    1.15     1.14
Country='Neverland'         0.00    0.00     0.09
Country='Roman Empire'      0.00    0.00     0.17
●
MySQL: 0.00 for unpopular values!
– ‘Egypt’ is near the threshold
●
Chile vs Chile%
●
Non-existent countries: 0.00 in MySQL vs a bigger number in MariaDB
●
Different assumptions about the data?
●
MySQL handles different datatypes differently
21. 21
Comparison
●
Differences
– ANALYZE TABLE Syntax
– JSON representation
– Ways to pick bucket bounds
– Estimation formulas
– Data used for collection:
MariaDB: the whole dataset, or an n% sample
MySQL: a sample limited by @@histogram_generation_max_mem_size
22. 22
Comparison
●
Similarities
– Both are JSON
– Both are Height-Balanced* histograms
●
Special treatment for common values
– Both use Bernoulli sampling (read + discard)
●
Expensive, can’t have auto-update
●
Good enough?
– Tell us
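Why Bernoulli sampling is expensive can be seen in a short Python sketch: every row is read, and each one is kept independently with the given probability. `bernoulli_sample` is an illustrative helper, not server code.

```python
import random

def bernoulli_sample(rows, sample_fraction, seed=None):
    """Read every row and keep each one independently with probability
    sample_fraction; the rest are discarded after being read."""
    rng = random.Random(seed)
    rows_read = 0
    sample = []
    for row in rows:
        rows_read += 1
        if rng.random() < sample_fraction:
            sample.append(row)
    return sample, rows_read

sample, rows_read = bernoulli_sample(range(100_000), sample_fraction=0.1, seed=42)
print(len(sample), rows_read)
```

The sample is about 10% of the table, but the number of rows read equals the full table size, which is the "read + discard" cost mentioned above.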
24. 24
Combining multiple conditions
●
Combined selectivity
– If independent: sel1*sel2
– If dependent:
●
Full overlap: MIN(sel1, sel2)
●
No overlap: 0.0
●
Can cause *huge* mis-estimates!
●
Solution: multi-column statistics?
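Example (sel1 is the selectivity of the shipdate condition, sel2 of the item_name condition):
select ...
from order_items
where shipdate='2021-12-15' AND item_name='christmas light'
(compare with item_name='swimsuit')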
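The three cases can be put side by side in a small Python sketch. The selectivities and the table size are hypothetical; the formulas are the ones listed on the slide.

```python
def combined_selectivity_bounds(sel1: float, sel2: float) -> dict:
    """Selectivity of (cond1 AND cond2) under the three assumptions on the slide."""
    return {
        "independent": sel1 * sel2,
        "full_overlap": min(sel1, sel2),
        "no_overlap": 0.0,
    }

# Hypothetical selectivities for the two conditions.
estimates = combined_selectivity_bounds(sel1=0.01, sel2=0.002)
table_rows = 10_000_000  # hypothetical table size
for assumption, sel in estimates.items():
    print(f"{assumption}: {sel * table_rows:.0f} rows")
```

With these inputs the independence assumption and the full-overlap case differ by a factor of 100, and the no-overlap case gives zero rows. Single-column histograms alone cannot tell which case applies.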
26. 26
Combining multi-column statistics
●
Multi-column statistics can “overlap”
●
Combining is hard: can’t assume independence.
– For more details on this, check out my talk at MariaDB Server Fest 2021.
●
MariaDB already hits the issue when using multi-column indexes as histograms
– See MDEV-23707, MDEV-22537, and linked issues.
– Working on fixing this.
Example condition: col1=10 AND col2='foo' AND col3<='BBB' AND col4 IN (1,2,3)
[Figure: Stat 1 to Stat 4, multi-column statistics that each cover some of these conditions and overlap]