SlideShare a Scribd company logo
1 of 27
Download to read offline
Sergei Petrunia
MariaDB devroom
FOSDEM 2021
Improved histograms
in MariaDB 10.8
FOSDEM 2022
Sergei Petrunia
MariaDB developer
2
Background info
●
What histograms are used for
●
Types of histograms
3
Query optimizer needs histograms
●
Possible query plans:
– orders(in given period)→lineitem
– lineitem(price >1000)→orders
●
The speed depends on how many rows match
the conditions (aka condition selectivity)
●
Selectivity is provided by histogram:
– selectivity(tbl.col BETWEEN c1 AND c2)
– selectivity(tbl.col > c3)
– selectivity(tbl.col = c4)
– ...
●
MySQL/MariaDB: use index as a histogram
– Expensive
select *
from
orders, lineitem
where
o_orderkey=l_orderkey and
o_orderdate between $DATE1 and $DATE2 and
l_extendedprice > 1000
orders
lineitem
join
4
Basic histogram: Equi-width
●
Equi-width
●
Pre-defined bucket
bounds
●
Accuracy is determined
by max(bucket_size).
It can be large:
– A few “outlier” values in peripheral buckets
– Most values are in a few very popular buckets
●
What if densely-populated regions had more buckets?
10 20 30 40 50 60 70 80
5
Equi-height histogram
●
aka “Height-balanced”
●
Let all buckets have the same
number of rows
– Densely populated areas
get more buckets
– Sparsely populated get less
●
Accuracy: max bucket_size
– n_rows / n_buckets
●
Can be collected with one
pass over sorted dataset
●
Widely used in databases
10 20 30 40 50 60 70 80
10 20 30 40 50 60 70 80
15% 15% 15% 15% 15%
6
Summary so far
●
Query optimizer uses histograms to get condition
selectivity estimates
– Essential for good query plans
●
Equi-height (or Height-balanced) histograms are good
– Accurate
– Easy to collect
– Widely used in databases
7
Histograms in MariaDB
8
Histograms in MariaDB
●
Introduced in MariaDB 10.0
●
Improved in MariaDB 10.4
●
Height-Balanced
●
Very compact, store fractions.
9
Histograms in MariaDB 10.4: internals
●
Stores
– min_value, max_value
– bucket bounds are stored as
“fractions”:
– 0.0 <= fraction <= 1.0
●
Fractions are fixed-point numbers
– SINGLE_PREC_HB – 1 byte/value
– DOUBLE_PREC_HB – 2 bytes/value
●
DECODE_HISTOGRAM()
10 20 30 40 50 60 70 80
10 20 30 40 50 60 70 80
0.0, 0.42, 0.6, 0.7,
0.85,
0.91,
0.93,
1.0
min_value=10
max_value=80
15% 15% 15% 15%
fraction(X)=
X−min_value
max_value−min_value
10
Histograms in MariaDB 10.4: internals
●
Works well for
– dense datatypes
– range conditions (column > const)
●
Using fractions loses precision: not so
good for
– column = ‘[popular] value’
– Data with very popular values that
are close to unpopular
VARCHARs
●
long values with common prefix
10 20 30 40 50 60 70 80
10 20 30 40 50 60 70 80
0.0, 0.42, 0.6, 0.7,
0.85,
0.91,
0.93,
1.0
min_value=10
max_value=80
15% 15% 15% 15%
11
Precision issue (MDEV-26125)
●
1M population, “country” matches real countries’ population.
set histogram_type=double_prec_hb;
analyze table generated_pop persistent for all;
analyze select * from generated_pop where country='China';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 22.66 | 21.04 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Chile';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 22.66 | 0.25 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Sweden';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 4.69 | 0.15 | Using where |
..-+---------+------------+----------+------------+-------------+
Ok, for China it’s close
estimated selectivity
observed selectivity
Very inaccurate for its
”neighbor” Chile!
Better for Sweden
12
Histograms in MariaDB 10.7 10.8
●
Started as GSoC 2021 project by Michael Okoko
– Got the basic variant working
– Then, improved by Sergei Petrunia
●
Initially targeted MariaDB 10.7, missed it.
●
Now it’s getting into MariaDB 10.8
13
Histograms in MariaDB 10.8
●
New histogram_type=JSON_HB
– The new default type
– Old SINGLE|DOUBLE_PREC_HB is available
●
Height-balanced*
●
Uses JSON for storage
– Stores exact bucket endpoints, not fractions.
14
JSON_HB under the hood
{
"target_histogram_size": 254,
"collected_at": "2022-01-16 18:29:46",
"collected_by": "10.8.0-MariaDB-log",
"histogram_hb": [
{
"start": "Afghanistan",
"size": 0.003938,
"ndv": 2
},
...
{
"start": "Chad",
"size": 0.002678,
"ndv": 2
},
{
"start": "China",
"size": 0.210369,
"ndv": 1
},
...
]
}
●
start – Start value
●
size – fraction of table rows in the bucket
– May vary
– So, not really “equi-height”
– The reason: “popular” values are in their
own buckets
●
ndv – Number of Distinct Values
– Important special case: ndv=1
●
1M population, “country” matches real countries’ population.
15
JSON_HB fixes MDEV-26125
●
The same 1M population dataset, with JSON histogram:
set histogram_type=json_hb;
analyze table generated_pop persistent for all;
analyze select * from generated_pop where country='China';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 21.04 | 21.04 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Chile';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 0.13 | 0.25 | Using where |
..-+---------+------------+----------+------------+-------------+
analyze select * from generated_pop where country='Sweden';
..-+---------+------------+----------+------------+-------------+
| rows | r_rows | filtered | r_filtered | Extra |
..-+---------+------------+----------+------------+-------------+
| 1000000 | 1000000.00 | 0.08 | 0.15 | Using where |
..-+---------+------------+----------+------------+-------------+
Same as before for
China
Much better for Chile
Also better for Sweden
16
How does it compare
to histograms in MySQL 8?
17
Histograms in MySQL 8.0
●
Also stored in JSON
●
Two types: “singleton” or “equi-height”*
●
Different syntax to collect:
ANALYZE TABLE t UPDATE HISTOGRAM ON (col) ...
●
Different
– collection algorithm
– estimation formulas
18
Histograms in MySQL 8.0
A bucket is a JSON array:
●
min, max – base64 encoded.
●
cumulative_fraction
– Buckets may have different sizes:
– Popular values are in their own buckets
– “Each value should be only in one bucket”
(not in two adjacent buckets)
●
Should we also do this?
– Suboptimal choice of bucket bounds:
https://bugs.mysql.com/bug.php?id=104789
●
ndv
{
"buckets": [
[
"base64:type254:QWZnaGFuaXN0YW4=",
"base64:type254:QW5kb3JyYQ==",
0.009284244225931855,
4
],
[
"base64:type254:QW5nb2xh",
"base64:type254:QXVzdHJhbGlh",
0.02095815229819346,
5
],
...
]
"data-type": "string",
"null-values": 0.0,
"collation-id": 255,
"last-updated": "2022-01-16 16:55:...",
"sampling-rate": 0.0873173443400688,
"histogram-type": "equi-height",
"number-of-buckets-specified": 100
}
19
MDEV-26125 on MySQL 8
●
The same generated 1M world population dataset.
analyze table generated_pop update histogram on Country [with 100 buckets];
explain select count(*) from generated_pop where country='China';
..+--------+----------+-------------+
| rows | filtered | Extra |
..+--------+----------+-------------+
| 995148 | 21.25 | Using where | -- 21.04
..+--------+----------+-------------+
explain select count(*) from generated_pop where country='Chile';
..+--------+----------+-------------+
| rows | filtered | Extra |
..+--------+----------+-------------+
| 995148 | 0.0 | Using where | -- 0.25
..+--------+----------+-------------+
explain select count(*) from generated_pop where country='Sweden';
..-+--------+----------+-------------+
| rows | filtered | Extra |
..-+--------+----------+-------------+
| 995148 | 0.0 | Using where | -- 0.15
..-+--------+----------+-------------+
Good
Bad
Bad
●
Increasing
@@histogram...mem_size
didn’t help
●
Setting buckets > countries
did help
20
MDEV-26125 on MySQL 8
●
Trying a few more conditions
●
Histogram has 100 buckets
real MySQL-8 MariaDB
Country='China' 21.04 20.92 21.04
Country='Chile' 0.25 0.00 0.25
Country like 'Chile%' 0.25 11.11 0.45
Country='Sweden' 0.17 0.00 0.15
Country='Pakistan' 2.58 2.59 2.58
Country='Egypt' 0.50 1.15 1.14
Country='Neverland' 0.00 0.00 0.09
Country='Roman Empire' 0.00 0.00 0.17
●
0.00 for unpopular values!
– ‘Egypt’ is near the
threshold
●
Chile vs Chile%
●
Non-existant countries: 0.00
vs bigger number
●
Differenent assumptions
about the data?
●
MySQL handles different
datatypes differently
21
Comparison
●
Differences
– ANALYZE TABLE Syntax
– JSON representation
– Ways to pick bucket bounds
– Estimation formulas
– MariaDB: can use the whole dataset, or a n% sample
MySQL: samples @@histogram_generation_max_mem_size
22
Comparison
●
Similarities
– Both are JSON
– Both Height-Balanced* histograms
●
Special treatment for common values
– Both use Bernoulli sampling (read + discard)
●
Expensive, can’t have auto-update
●
Good enough?
– Tell us
23
Statistics use in the optimizer
24
Combining multiple conditions
●
Combined selectivity
– if independent: sel1*sel2
– If dependent:
●
Full overlap: MIN(sel1, sel2)
●
No overlap: 0.0
●
Can cause *huge* mis-estimates!
●
Solution: multi-column statistics?
select ...
from order_items
where shipdate='2021-12-15' AND item_name='christmas light'
'swimsuit'
sel1 sel2
25
Combining multi-column statistics
col1=10 AND col2='foo' AND col3<='BBB' AND col4 IN (1,2,3)
26
Combining multi-column statistics
●
Multi-column statistics can “overlap”
●
Combining is hard: can’t assume independence.
– For more details on this, check out my talk at MariaDB Server Fest 2021.
●
MariaDB already hits the issue when using multi-column indexes as histograms
– See MDEV-23707, MDEV-22537, and linked issues.
– Working on fixing this.
col1=10 AND col2='foo' AND col3<='BBB' AND col4 IN (1,2,3)
Stat 1
Stat 2
Stat 3
Stat 4
27
Thanks for your attention!

More Related Content

Similar to Improved histograms in MariaDB 10.8

Understanding Optimizer-Statistics-for-Developers
Understanding Optimizer-Statistics-for-DevelopersUnderstanding Optimizer-Statistics-for-Developers
Understanding Optimizer-Statistics-for-Developers
Enkitec
 

Similar to Improved histograms in MariaDB 10.8 (20)

Optimizer features in recent releases of other databases
Optimizer features in recent releases of other databasesOptimizer features in recent releases of other databases
Optimizer features in recent releases of other databases
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
 
Dun ddd
Dun dddDun ddd
Dun ddd
 
Building Machine Learning Pipelines
Building Machine Learning PipelinesBuilding Machine Learning Pipelines
Building Machine Learning Pipelines
 
A few things about the Oracle optimizer - 2013
A few things about the Oracle optimizer - 2013A few things about the Oracle optimizer - 2013
A few things about the Oracle optimizer - 2013
 
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
 
MySQL SQL Tutorial
MySQL SQL TutorialMySQL SQL Tutorial
MySQL SQL Tutorial
 
Building ML Pipelines
Building ML PipelinesBuilding ML Pipelines
Building ML Pipelines
 
MySQL and GIS Programming
MySQL and GIS ProgrammingMySQL and GIS Programming
MySQL and GIS Programming
 
Adapting to Adaptive Plans on 12c
Adapting to Adaptive Plans on 12cAdapting to Adaptive Plans on 12c
Adapting to Adaptive Plans on 12c
 
Riyaj: why optimizer_hates_my_sql_2010
Riyaj: why optimizer_hates_my_sql_2010Riyaj: why optimizer_hates_my_sql_2010
Riyaj: why optimizer_hates_my_sql_2010
 
Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStoreBig Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
Why Use EXPLAIN FORMAT=JSON?
 Why Use EXPLAIN FORMAT=JSON?  Why Use EXPLAIN FORMAT=JSON?
Why Use EXPLAIN FORMAT=JSON?
 
MariaDB: Engine Independent Table Statistics, including histograms
MariaDB: Engine Independent Table Statistics, including histogramsMariaDB: Engine Independent Table Statistics, including histograms
MariaDB: Engine Independent Table Statistics, including histograms
 
Output drops due to qo s on cisco 2960 3560 3750 switches
Output drops due to qo s on cisco 2960 3560 3750 switchesOutput drops due to qo s on cisco 2960 3560 3750 switches
Output drops due to qo s on cisco 2960 3560 3750 switches
 
Understanding Optimizer-Statistics-for-Developers
Understanding Optimizer-Statistics-for-DevelopersUnderstanding Optimizer-Statistics-for-Developers
Understanding Optimizer-Statistics-for-Developers
 
How to use histograms to get better performance
How to use histograms to get better performanceHow to use histograms to get better performance
How to use histograms to get better performance
 
Using histograms to get better performance
Using histograms to get better performanceUsing histograms to get better performance
Using histograms to get better performance
 
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
Understand the Query Plan to Optimize Performance with EXPLAIN and EXPLAIN AN...
 

More from Sergey Petrunya

More from Sergey Petrunya (20)

New optimizer features in MariaDB releases before 10.12
New optimizer features in MariaDB releases before 10.12New optimizer features in MariaDB releases before 10.12
New optimizer features in MariaDB releases before 10.12
 
MariaDB's join optimizer: how it works and current fixes
MariaDB's join optimizer: how it works and current fixesMariaDB's join optimizer: how it works and current fixes
MariaDB's join optimizer: how it works and current fixes
 
JSON Support in MariaDB: News, non-news and the bigger picture
JSON Support in MariaDB: News, non-news and the bigger pictureJSON Support in MariaDB: News, non-news and the bigger picture
JSON Support in MariaDB: News, non-news and the bigger picture
 
Optimizer Trace Walkthrough
Optimizer Trace WalkthroughOptimizer Trace Walkthrough
Optimizer Trace Walkthrough
 
ANALYZE for Statements - MariaDB's hidden gem
ANALYZE for Statements - MariaDB's hidden gemANALYZE for Statements - MariaDB's hidden gem
ANALYZE for Statements - MariaDB's hidden gem
 
MariaDB 10.4 - что нового
MariaDB 10.4 - что новогоMariaDB 10.4 - что нового
MariaDB 10.4 - что нового
 
MariaDB Optimizer - further down the rabbit hole
MariaDB Optimizer - further down the rabbit holeMariaDB Optimizer - further down the rabbit hole
MariaDB Optimizer - further down the rabbit hole
 
Query Optimizer in MariaDB 10.4
Query Optimizer in MariaDB 10.4Query Optimizer in MariaDB 10.4
Query Optimizer in MariaDB 10.4
 
Lessons for the optimizer from running the TPC-DS benchmark
Lessons for the optimizer from running the TPC-DS benchmarkLessons for the optimizer from running the TPC-DS benchmark
Lessons for the optimizer from running the TPC-DS benchmark
 
MariaDB 10.3 Optimizer - where does it stand
MariaDB 10.3 Optimizer - where does it standMariaDB 10.3 Optimizer - where does it stand
MariaDB 10.3 Optimizer - where does it stand
 
MyRocks in MariaDB | M18
MyRocks in MariaDB | M18MyRocks in MariaDB | M18
MyRocks in MariaDB | M18
 
New Query Optimizer features in MariaDB 10.3
New Query Optimizer features in MariaDB 10.3New Query Optimizer features in MariaDB 10.3
New Query Optimizer features in MariaDB 10.3
 
MyRocks in MariaDB
MyRocks in MariaDBMyRocks in MariaDB
MyRocks in MariaDB
 
Histograms in MariaDB, MySQL and PostgreSQL
Histograms in MariaDB, MySQL and PostgreSQLHistograms in MariaDB, MySQL and PostgreSQL
Histograms in MariaDB, MySQL and PostgreSQL
 
Say Hello to MyRocks
Say Hello to MyRocksSay Hello to MyRocks
Say Hello to MyRocks
 
Common Table Expressions in MariaDB 10.2
Common Table Expressions in MariaDB 10.2Common Table Expressions in MariaDB 10.2
Common Table Expressions in MariaDB 10.2
 
MyRocks in MariaDB: why and how
MyRocks in MariaDB: why and howMyRocks in MariaDB: why and how
MyRocks in MariaDB: why and how
 
Эволюция репликации в MySQL и MariaDB
Эволюция репликации в MySQL и MariaDBЭволюция репликации в MySQL и MariaDB
Эволюция репликации в MySQL и MariaDB
 
Common Table Expressions in MariaDB 10.2 (Percona Live Amsterdam 2016)
Common Table Expressions in MariaDB 10.2 (Percona Live Amsterdam 2016)Common Table Expressions in MariaDB 10.2 (Percona Live Amsterdam 2016)
Common Table Expressions in MariaDB 10.2 (Percona Live Amsterdam 2016)
 
MariaDB 10.1 - что нового.
MariaDB 10.1 - что нового.MariaDB 10.1 - что нового.
MariaDB 10.1 - что нового.
 

Recently uploaded

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Recently uploaded (20)

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 

Improved histograms in MariaDB 10.8

  • 1. Sergei Petrunia MariaDB devroom FOSDEM 2021 Improved histograms in MariaDB 10.8 FOSDEM 2022 Sergei Petrunia MariaDB developer
  • 2. 2 Background info ● What histograms are used for ● Types of histograms
  • 3. 3 Query optimizer needs histograms ● Possible query plans: – orders(in given period)→lineitem – lineitem(price >1000)→orders ● The speed depends on how many rows match the conditions (aka condition selectivity) ● Selectivity is provided by histogram: – selectivity(tbl.col BETWEEN c1 AND c2) – selectivity(tbl.col > c3) – selectivity(tbl.col = c4) – ... ● MySQL/MariaDB: use index as a histogram – Expensive select * from orders, lineitem where o_orderkey=l_orderkey and o_orderdate between $DATE1 and $DATE2 and l_extendedprice > 1000 orders lineitem join
  • 4. 4 Basic histogram: Equi-width ● Equi-width ● Pre-defined bucket bounds ● Accuracy is determined by max(bucket_size). It can be large: – A few “outlier” values in peripheral buckets – Most values are in a few very popular buckets ● What if densely-populated regions had more buckets? 10 20 30 40 50 60 70 80
  • 5. 5 Equi-height histogram ● aka “Height-balanced” ● Let all buckets have the same number of rows – Densely populated areas get more buckets – Sparsely populated get less ● Accuracy: max bucket_size – n_rows / n_buckets ● Can be collected with one pass over sorted dataset ● Widely used in databases 10 20 30 40 50 60 70 80 10 20 30 40 50 60 70 80 15% 15% 15% 15% 15%
  • 6. 6 Summary so far ● Query optimizer uses histograms to get condition selectivity estimates – Essential for good query plans ● Equi-height (or Height-balanced) histograms are good – Accurate – Easy to collect – Widely used in databases
  • 8. 8 Histograms in MariaDB ● Introduced in MariaDB 10.0 ● Improved in MariaDB 10.4 ● Height-Balanced ● Very compact, store fractions.
  • 9. 9 Histograms in MariaDB 10.4: internals ● Stores – min_value, max_value – bucket bounds are stored as “fractions”: – 0.0 <= fraction <= 1.0 ● Fractions are fixed-point numbers – SINGLE_PREC_HB – 1 byte/value – DOUBLE_PREC_HB – 2 bytes/value ● DECODE_HISTOGRAM() 10 20 30 40 50 60 70 80 10 20 30 40 50 60 70 80 0.0, 0.42, 0.6, 0.7, 0.85, 0.91, 0.93, 1.0 min_value=10 max_value=80 15% 15% 15% 15% fraction(X)= X−min_value max_value−min_value
  • 10. 10 Histograms in MariaDB 10.4: internals ● Works well for – dense datatypes – range conditions (column > const) ● Using fractions loses precision: not so good for – column = ‘[popular] value’ – Data with very popular values that are close to unpopular VARCHARs ● long values with common prefix 10 20 30 40 50 60 70 80 10 20 30 40 50 60 70 80 0.0, 0.42, 0.6, 0.7, 0.85, 0.91, 0.93, 1.0 min_value=10 max_value=80 15% 15% 15% 15%
  • 11. 11 Precision issue (MDEV-26125) ● 1M population, “country” matches real countries’ population. set histogram_type=double_prec_hb; analyze table generated_pop persistent for all; analyze select * from generated_pop where country='China'; ..-+---------+------------+----------+------------+-------------+ | rows | r_rows | filtered | r_filtered | Extra | ..-+---------+------------+----------+------------+-------------+ | 1000000 | 1000000.00 | 22.66 | 21.04 | Using where | ..-+---------+------------+----------+------------+-------------+ analyze select * from generated_pop where country='Chile'; ..-+---------+------------+----------+------------+-------------+ | rows | r_rows | filtered | r_filtered | Extra | ..-+---------+------------+----------+------------+-------------+ | 1000000 | 1000000.00 | 22.66 | 0.25 | Using where | ..-+---------+------------+----------+------------+-------------+ analyze select * from generated_pop where country='Sweden'; ..-+---------+------------+----------+------------+-------------+ | rows | r_rows | filtered | r_filtered | Extra | ..-+---------+------------+----------+------------+-------------+ | 1000000 | 1000000.00 | 4.69 | 0.15 | Using where | ..-+---------+------------+----------+------------+-------------+ Ok, for China it’s close estimated selectivity observed selectivity Very inaccurate for its ”neighbor” Chile! Better for Sweden
  • 12. 12 Histograms in MariaDB 10.7 10.8 ● Started as GSoC 2021 project by Michael Okoko – Got the basic variant working – Then, improved by Sergei Petrunia ● Initially targeted MariaDB 10.7, missed it. ● Now it’s getting into MariaDB 10.8
  • 13. 13 Histograms in MariaDB 10.8 ● New histogram_type=JSON_HB – The new default type – Old SINGLE|DOUBLE_PREC_HB is available ● Height-balanced* ● Uses JSON for storage – Stores exact bucket endpoints, not fractions.
  • 14. 14 JSON_HB under the hood { "target_histogram_size": 254, "collected_at": "2022-01-16 18:29:46", "collected_by": "10.8.0-MariaDB-log", "histogram_hb": [ { "start": "Afghanistan", "size": 0.003938, "ndv": 2 }, ... { "start": "Chad", "size": 0.002678, "ndv": 2 }, { "start": "China", "size": 0.210369, "ndv": 1 }, ... ] } ● start – Start value ● size – fraction of table rows in the bucket – May vary – So, not really “equi-height” – The reason: “popular” values are in their own buckets ● ndv – Number of Distinct Values – Important special case: ndv=1 ● 1M population, “country” matches real countries’ population.
  • 15. 15 JSON_HB fixes MDEV-26125 ● The same 1M population dataset, with JSON histogram: set histogram_type=json_hb; analyze table generated_pop persistent for all; analyze select * from generated_pop where country='China'; ..-+---------+------------+----------+------------+-------------+ | rows | r_rows | filtered | r_filtered | Extra | ..-+---------+------------+----------+------------+-------------+ | 1000000 | 1000000.00 | 21.04 | 21.04 | Using where | ..-+---------+------------+----------+------------+-------------+ analyze select * from generated_pop where country='Chile'; ..-+---------+------------+----------+------------+-------------+ | rows | r_rows | filtered | r_filtered | Extra | ..-+---------+------------+----------+------------+-------------+ | 1000000 | 1000000.00 | 0.13 | 0.25 | Using where | ..-+---------+------------+----------+------------+-------------+ analyze select * from generated_pop where country='Sweden'; ..-+---------+------------+----------+------------+-------------+ | rows | r_rows | filtered | r_filtered | Extra | ..-+---------+------------+----------+------------+-------------+ | 1000000 | 1000000.00 | 0.08 | 0.15 | Using where | ..-+---------+------------+----------+------------+-------------+ Same as before for China Much better for Chile Also better for Sweden
  • 16. 16 How does it compare to histograms in MySQL 8?
  • 17. 17 Histograms in MySQL 8.0 ● Also stored in JSON ● Two types: “singleton” or “equi-height”* ● Different syntax to collect: ANALYZE TABLE t UPDATE HISTOGRAM ON (col) ... ● Different – collection algorithm – estimation formulas
  • 18. 18 Histograms in MySQL 8.0 A bucket is a JSON array: ● min, max – base64 encoded. ● cumulative_fraction – Buckets may have different sizes: – Popular values are in their own buckets – “Each value should be only in one bucket” (not in two adjacent buckets) ● Should we also do this? – Suboptimal choice of bucket bounds: https://bugs.mysql.com/bug.php?id=104789 ● ndv { "buckets": [ [ "base64:type254:QWZnaGFuaXN0YW4=", "base64:type254:QW5kb3JyYQ==", 0.009284244225931855, 4 ], [ "base64:type254:QW5nb2xh", "base64:type254:QXVzdHJhbGlh", 0.02095815229819346, 5 ], ... ] "data-type": "string", "null-values": 0.0, "collation-id": 255, "last-updated": "2022-01-16 16:55:...", "sampling-rate": 0.0873173443400688, "histogram-type": "equi-height", "number-of-buckets-specified": 100 }
  • 19. 19 MDEV-26125 on MySQL 8 ● The same generated 1M world population dataset. analyze table generated_pop update histogram on Country [with 100 buckets]; explain select count(*) from generated_pop where country='China'; ..+--------+----------+-------------+ | rows | filtered | Extra | ..+--------+----------+-------------+ | 995148 | 21.25 | Using where | -- 21.04 ..+--------+----------+-------------+ explain select count(*) from generated_pop where country='Chile'; ..+--------+----------+-------------+ | rows | filtered | Extra | ..+--------+----------+-------------+ | 995148 | 0.0 | Using where | -- 0.25 ..+--------+----------+-------------+ explain select count(*) from generated_pop where country='Sweden'; ..-+--------+----------+-------------+ | rows | filtered | Extra | ..-+--------+----------+-------------+ | 995148 | 0.0 | Using where | -- 0.15 ..-+--------+----------+-------------+ Good Bad Bad ● Increasing @@histogram...mem_size didn’t help ● Setting buckets > countries did help
  • 20. 20 MDEV-26125 on MySQL 8 ● Trying a few more conditions ● Histogram has 100 buckets real MySQL-8 MariaDB Country='China' 21.04 20.92 21.04 Country='Chile' 0.25 0.00 0.25 Country like 'Chile%' 0.25 11.11 0.45 Country='Sweden' 0.17 0.00 0.15 Country='Pakistan' 2.58 2.59 2.58 Country='Egypt' 0.50 1.15 1.14 Country='Neverland' 0.00 0.00 0.09 Country='Roman Empire' 0.00 0.00 0.17 ● 0.00 for unpopular values! – ‘Egypt’ is near the threshold ● Chile vs Chile% ● Non-existant countries: 0.00 vs bigger number ● Differenent assumptions about the data? ● MySQL handles different datatypes differently
  • 21. 21 Comparison ● Differences – ANALYZE TABLE Syntax – JSON representation – Ways to pick bucket bounds – Estimation formulas – MariaDB: can use the whole dataset, or a n% sample MySQL: samples @@histogram_generation_max_mem_size
  • 22. 22 Comparison ● Similarities – Both are JSON – Both Height-Balanced* histograms ● Special treatment for common values – Both use Bernoulli sampling (read + discard) ● Expensive, can’t have auto-update ● Good enough? – Tell us
  • 23. 23 Statistics use in the optimizer
  • 24. 24 Combining multiple conditions ● Combined selectivity – if independent: sel1*sel2 – If dependent: ● Full overlap: MIN(sel1, sel2) ● No overlap: 0.0 ● Can cause *huge* mis-estimates! ● Solution: multi-column statistics? select ... from order_items where shipdate='2021-12-15' AND item_name='christmas light' 'swimsuit' sel1 sel2
  • 25. 25 Combining multi-column statistics col1=10 AND col2='foo' AND col3<='BBB' AND col4 IN (1,2,3)
  • 26. 26 Combining multi-column statistics ● Multi-column statistics can “overlap” ● Combining is hard: can’t assume independence. – For more details on this, check out my talk at MariaDB Server Fest 2021. ● MariaDB already hits the issue when using multi-column indexes as histograms – See MDEV-23707, MDEV-22537, and linked issues. – Working on fixing this. col1=10 AND col2='foo' AND col3<='BBB' AND col4 IN (1,2,3) Stat 1 Stat 2 Stat 3 Stat 4
  • 27. 27 Thanks for your attention!