Improved histograms in MariaDB 10.8

Sergei Petrunia
MariaDB devroom
FOSDEM 2021
Improved histograms
in MariaDB 10.8
FOSDEM 2022
Sergei Petrunia
MariaDB developer

2
Background info
●
What histograms are used for
●
Types of histograms

3
Query optimizer needs histograms
●
Possible query plans:
– orders(in given period)→lineitem
– lineitem(price >1000)→orders
●
The speed depends on how many rows match
the conditions (aka condition selectivity)
●
Selectivity is provided by histogram:
– selectivity(tbl.col BETWEEN c1 AND c2)
– selectivity(tbl.col > c3)
– selectivity(tbl.col = c4)
– ...
●
MySQL/MariaDB: use index as a histogram
– Expensive
select *
from
orders, lineitem
where
o_orderkey=l_orderkey and
o_orderdate between $DATE1 and $DATE2 and
l_extendedprice > 1000
orders
lineitem
join

4
Basic histogram: Equi-width
●
Equi-width
●
Pre-defined bucket
bounds
●
Accuracy is determined
by max(bucket_size).
It can be large:
– A few “outlier” values in peripheral buckets
– Most values are in a few very popular buckets
●
What if densely-populated regions had more buckets?
10 20 30 40 50 60 70 80

5
Equi-height histogram
●
aka “Height-balanced”
●
Let all buckets have the same
number of rows
– Densely populated areas
get more buckets
– Sparsely populated get less
●
Accuracy: max bucket_size
– n_rows / n_buckets
●
Can be collected with one
pass over sorted dataset
●
Widely used in databases
10 20 30 40 50 60 70 80
10 20 30 40 50 60 70 80
15% 15% 15% 15% 15%

6
Summary so far
●
Query optimizer uses histograms to get condition
selectivity estimates
– Essential for good query plans
●
Equi-height (or Height-balanced) histograms are good
– Accurate
– Easy to collect
– Widely used in databases

8
Histograms in MariaDB
●
Introduced in MariaDB 10.0
●
Improved in MariaDB 10.4
●
Height-Balanced
●
Very compact, store fractions.

9
Histograms in MariaDB 10.4: internals
●
Stores
– min_value, max_value
– bucket bounds are stored as
“fractions”:
– 0.0 <= fraction <= 1.0
●
Fractions are fixed-point numbers
– SINGLE_PREC_HB – 1 byte/value
– DOUBLE_PREC_HB – 2 bytes/value
●
DECODE_HISTOGRAM()
10 20 30 40 50 60 70 80
10 20 30 40 50 60 70 80
0.0, 0.42, 0.6, 0.7,
0.85,
0.91,
0.93,
1.0
min_value=10
max_value=80
15% 15% 15% 15%
fraction(X)=
X−min_value
max_value−min_value

10
Histograms in MariaDB 10.4: internals
●
Works well for
– dense datatypes
– range conditions (column > const)
●
Using fractions loses precision: not so
good for
– column = ‘[popular] value’
– Data with very popular values that
are close to unpopular
VARCHARs
●
long values with common prefix
10 20 30 40 50 60 70 80
10 20 30 40 50 60 70 80
0.0, 0.42, 0.6, 0.7,
0.85,
0.91,
0.93,
1.0
min_value=10
max_value=80
15% 15% 15% 15%

11
Precision issue (MDEV-26125)
●
1M population, “country” matches real countries’ population.
set histogram_type=double_prec_hb;
analyze table generated_pop persistent for all;
analyze select * from generated_pop where country='China';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 22.66 | 21.04 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Chile';
..-+---------+------------+----------+------------+-------------+
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 22.66 | 0.25 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Sweden';
..-+---------+------------+----------+------------+-------------+
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 4.69 | 0.15 | Using where |
..-+---------+------------+----------+------------+-------------+
Ok, for China it’s close
estimated selectivity
observed selectivity
Very inaccurate for its
”neighbor” Chile!
Better for Sweden

12
Histograms in MariaDB 10.7 10.8
●
Started as GSoC 2021 project by Michael Okoko
– Got the basic variant working
– Then, improved by Sergei Petrunia
●
Initially targeted MariaDB 10.7, missed it.
●
Now it’s getting into MariaDB 10.8

13
Histograms in MariaDB 10.8
●
New histogram_type=JSON_HB
– The new default type
– Old SINGLE|DOUBLE_PREC_HB is available
●
Height-balanced*
●
Uses JSON for storage
– Stores exact bucket endpoints, not fractions.

14
JSON_HB under the hood
{
"target_histogram_size": 254,
"collected_at": "2022-01-16 18:29:46",
"collected_by": "10.8.0-MariaDB-log",
"histogram_hb": [
{
"start": "Afghanistan",
"size": 0.003938,
"ndv": 2
},
...
{
"start": "Chad",
"size": 0.002678,
"ndv": 2
},
{
"start": "China",
"size": 0.210369,
"ndv": 1
},
...
]
}
●
start – Start value
●
size – fraction of table rows in the bucket
– May vary
– So, not really “equi-height”
– The reason: “popular” values are in their
own buckets
●
ndv – Number of Distinct Values
– Important special case: ndv=1
●
1M population, “country” matches real countries’ population.

15
JSON_HB fixes MDEV-26125
●
The same 1M population dataset, with JSON histogram:
set histogram_type=json_hb;
analyze table generated_pop persistent for all;
analyze select * from generated_pop where country='China';
..-+---------+------------+----------+------------+-------------+
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 21.04 | 21.04 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Chile';
..-+---------+------------+----------+------------+-------------+
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 0.13 | 0.25 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Sweden';
..-+---------+------------+----------+------------+-------------+
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 0.08 | 0.15 | Using where |
..-+---------+------------+----------+------------+-------------+
Same as before for
China
Much better for Chile
Also better for Sweden

16
How does it compare
to histograms in MySQL 8?

17
Histograms in MySQL 8.0
●
Also stored in JSON
●
Two types: “singleton” or “equi-height”*
●
Different syntax to collect:
ANALYZE TABLE t UPDATE HISTOGRAM ON (col) ...
●
Different
– collection algorithm
– estimation formulas

18
Histograms in MySQL 8.0
A bucket is a JSON array:
●
min, max – base64 encoded.
●
cumulative_fraction
– Buckets may have different sizes:
– Popular values are in their own buckets
– “Each value should be only in one bucket”
(not in two adjacent buckets)
●
Should we also do this?
– Suboptimal choice of bucket bounds:
https://bugs.mysql.com/bug.php?id=104789
●
ndv
{
"buckets": [
[
"base64:type254:QWZnaGFuaXN0YW4=",
"base64:type254:QW5kb3JyYQ==",
0.009284244225931855,
4
],
[
"base64:type254:QW5nb2xh",
"base64:type254:QXVzdHJhbGlh",
0.02095815229819346,
5
],
...
]
"data-type": "string",
"null-values": 0.0,
"collation-id": 255,
"last-updated": "2022-01-16 16:55:...",
"sampling-rate": 0.0873173443400688,
"histogram-type": "equi-height",
"number-of-buckets-specified": 100
}

19
MDEV-26125 on MySQL 8
●
The same generated 1M world population dataset.
analyze table generated_pop update histogram on Country [with 100 buckets];
explain select count(*) from generated_pop where country='China';
..+--------+----------+-------------+
| rows | filtered | Extra |
..+--------+----------+-------------+
| 995148 | 21.25 | Using where | -- 21.04
..+--------+----------+-------------+
explain select count(*) from generated_pop where country='Chile';
..+--------+----------+-------------+
..+--------+----------+-------------+
| 995148 | 0.0 | Using where | -- 0.25
..+--------+----------+-------------+
explain select count(*) from generated_pop where country='Sweden';
..-+--------+----------+-------------+
..-+--------+----------+-------------+
| 995148 | 0.0 | Using where | -- 0.15
..-+--------+----------+-------------+
Good
Bad
Bad
●
Increasing
@@histogram...mem_size
didn’t help
●
Setting buckets > countries
did help

20
MDEV-26125 on MySQL 8
●
Trying a few more conditions
●
Histogram has 100 buckets
real MySQL-8 MariaDB
Country='China' 21.04 20.92 21.04
Country='Chile' 0.25 0.00 0.25
Country like 'Chile%' 0.25 11.11 0.45
Country='Sweden' 0.17 0.00 0.15
Country='Pakistan' 2.58 2.59 2.58
Country='Egypt' 0.50 1.15 1.14
Country='Neverland' 0.00 0.00 0.09
Country='Roman Empire' 0.00 0.00 0.17
●
0.00 for unpopular values!
– ‘Egypt’ is near the
threshold
●
Chile vs Chile%
●
Non-existant countries: 0.00
vs bigger number
●
Differenent assumptions
about the data?
●
MySQL handles different
datatypes differently

21
Comparison
●
Differences
– ANALYZE TABLE Syntax
– JSON representation
– Ways to pick bucket bounds
– Estimation formulas
– MariaDB: can use the whole dataset, or a n% sample
MySQL: samples @@histogram_generation_max_mem_size

22
Comparison
●
Similarities
– Both are JSON
– Both Height-Balanced* histograms
●
Special treatment for common values
– Both use Bernoulli sampling (read + discard)
●
Expensive, can’t have auto-update
●
Good enough?
– Tell us

23
Statistics use in the optimizer

24
Combining multiple conditions
●
Combined selectivity
– if independent: sel1*sel2
– If dependent:
●
Full overlap: MIN(sel1, sel2)
●
No overlap: 0.0
●
Can cause *huge* mis-estimates!
●
Solution: multi-column statistics?
select ...
from order_items
where shipdate='2021-12-15' AND item_name='christmas light'
'swimsuit'
sel1 sel2

25
Combining multi-column statistics
col1=10 AND col2='foo' AND col3<='BBB' AND col4 IN (1,2,3)

26
Combining multi-column statistics
●
Multi-column statistics can “overlap”
●
Combining is hard: can’t assume independence.
– For more details on this, check out my talk at MariaDB Server Fest 2021.
●
MariaDB already hits the issue when using multi-column indexes as histograms
– See MDEV-23707, MDEV-22537, and linked issues.
– Working on fixing this.
col1=10 AND col2='foo' AND col3<='BBB' AND col4 IN (1,2,3)
Stat 1
Stat 2
Stat 3
Stat 4

Improved histograms in MariaDB 10.8

Recommended

Recommended

More Related Content

Similar to Improved histograms in MariaDB 10.8

Similar to Improved histograms in MariaDB 10.8 (20)

More from Sergey Petrunya

More from Sergey Petrunya (20)

Recently uploaded

Recently uploaded (20)

Improved histograms in MariaDB 10.8