Let us get started.
I would like to welcome you to this presentation on the MySQL optimizer.
My name is Olav Sandstå and I work in the MySQL optimizer team.
The goal for this session is to give an overview of the MySQL optimizer.
I will first give a short overview of the optimizer and then present each of its main optimizations in more detail.
(The main parts that I will cover are query transformations, access method selection, and the join optimizer. I will also include an overview of how subqueries are optimized.)
I will try to leave some time for questions at the end of the presentation, but if you have short and simple questions, feel free to ask during the presentation.
The query optimizer takes as input a SQL query and produces a plan for how this query should be executed.
For complex queries there are many possible query plans. The goal is that the optimizer should be able to find the best plan.
In order to optimize the query, the optimizer uses information from the data dictionary and statistics from the storage engine about the data in the tables.
At the end of this session you all should know a bit more about what is inside the “optimizer cloud” and how it works.
Let us first start with an overview of the architecture of MySQL to show where the query optimizer fits in.
When a query arrives it first goes through the parser. The second step is the “resolver”, which does name resolution and semantic checks of the tables and columns in the query.
These two steps produce a query tree representation of the SQL query that goes into the optimizer.
The optimizer will optimize the query and produce a query execution plan. This query execution plan will then be executed and data will be read from the storage engine and the result will be returned to the client.
The rest of this presentation will go into more detail about what happens inside the optimizer module.
Here are some initial characteristics of the optimizer:
Let us first go back to the MySQL architecture and start to look at what is inside the optimizer.
Each query that is optimized goes through four main stages. I will show the stages here and briefly mention what they do.
The first phase is the logical transformations. This stage is about simplifying the query and preparing it for later optimizations. You do not need to look at the details in the yellow boxes. I will cover that in detail later in the presentation.
The second phase is to make preparations for the main optimizer. Here we mostly analyze alternative ways of reading data from tables.
The third is the main join optimizer.
And the final phase is to make some final adjustments and optimizations to the query plan.
Let us start with the first stage of the optimizer which is the logical transformations the optimizer does to the query.
The logical transformations the optimizer has are mostly rule based. The reason for doing logical transformations to the query is either to simplify the query or to rewrite it in order to prepare it for the later optimization stages.
Here is a list of the main transformations we do:
To simplify the query conditions we apply a set of rule based transformations. I will show an example of these on the next slide.
We convert outer joins to inner joins
We merge views and derived tables into the main query
And finally, we have a large set of transformations that apply to optimizations of subqueries. I will get back to the subquery optimizations at the end of the presentation.
This example shows a join of two tables with a fairly complex WHERE condition.
The slide shows how we apply different rule based transformations to the WHERE clause in order to simplify it.
……
The final result is easier to optimize and costs less during execution.
8 minutes:
Before going to the next stage of the optimizer, I will spend a few minutes presenting how the optimizer does cost based optimizations.
The general way we do cost based optimization of queries in MySQL is that we:
First calculate the cost for all alternative ways of reading data from tables. This includes looking at all useful ways for using different indexes and different access methods.
Then we build alternative plans for how the query can be executed. This is mostly done by the join optimizer. For each alternative plan we calculate the cost.
Finally, we select the query plan with the lowest cost.
This plan is then executed.
In MySQL the main cost based optimizations are:
Choosing which index and which access method to use for each table
The join order
Which join buffer strategy to use
And for subqueries, which subquery strategy to use.
In order to be able to do cost based optimization we need to have a model for what the cost of a query is. Here is a very simplified view of the cost model in MySQL.
As input it takes a basic operation like reading data from a table or joining two tables. As output it produces an estimate for the cost of doing this operation. In addition to the cost estimate, it will in most cases also produce an estimate for how many rows this operation will produce.
The cost model consists of a set of formulas for calculating cost and record estimates for different operations. In addition to the cost formulas, the cost model consists of a set of “cost constants”. These are the costs of the basic operations that the MySQL server does when executing a query.
The cost model uses information from the data dictionary and statistics from storage engines to do its calculations. The main statistics are the number of keys in a table, cardinality, and range estimates. All of these are produced by the storage engine.
From the data dictionary we use information about records and indexes: like the length of records and keys, uniqueness and if they can be null.
One thing that is new in 5.7 is that the cost model has been made configurable. The cost constants are now stored in database tables and can be changed. I will go into more detail on this on the next slide.
In the MySQL cost model the basic cost unit is the cost for reading a random data page from disk.
All other cost numbers are relative to this cost unit.
The main cost factors we include when estimating the cost for a query is:
IO cost: where we estimate the number of pages we need to read for tables and indexes.
For CPU cost, the main contributions are the cost of evaluating query conditions and of comparing keys and records.
The main cost constants we use in the cost calculations are:
Reading a random page from disk which has a cost of 1.0
Reading a page from the database buffer.
Evaluating the query condition on a record, which has a cost of 0.2
And comparing keys or records which has a cost of 0.1
To hopefully make it a bit more clear how the cost model works, I will show an example. This query can be executed as a table scan or as a range scan if we add a secondary index.
For a table scan, the server must read the entire table and evaluate the query condition for all records.
The second alternative is to run this as a range scan. The optimizer will ask the storage engine for an estimate of how many records are in the range between 20 and 23. The cost model will compute the cost estimate as follows:
The IO cost will be dominated by having to look each record up in the base table. The cost for reading the secondary index will be small compared to all these lookups. So the IO cost will be calculated as the number of pages we will need to read from the base table. One per record.
The cost formula for computing CPU cost is as follows: We first multiply the number of records to read by the cost for evaluating a condition. This corresponds to having to evaluate the range condition on every record. Then we add the same cost a second time. This time it accounts for the CPU cost of evaluating the WHERE condition.
The final choice of whether this query will be done as a table scan or a range scan depends on which of these have the lowest cost estimate.
If we put in the estimated number of records in this range we get the following cost estimates.
The total cost is about 112,000.
If we compare this to the cost for the table scan from the previous slide, which was more than two million, then it is clear that the optimizer will prefer to run this query as a range scan on a secondary index over doing a full table scan.
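To make the table-scan versus range-scan comparison above concrete, here is a minimal sketch of the two cost formulas. The cost constants (1.0 per random page read, 0.2 per condition evaluation) come from the slides; the table sizes (10 million records over 100,000 pages, and an estimated 80,000 records in the range) are illustrative assumptions chosen so the totals match the orders of magnitude mentioned above.

```python
# Sketch of the two cost formulas, assuming illustrative table sizes.
IO_COST = 1.0            # cost of reading a random page from disk
ROW_EVALUATE_COST = 0.2  # cost of evaluating the query condition on one record

def table_scan_cost(pages, records):
    """Read every page of the table, then evaluate the condition on every record."""
    return pages * IO_COST + records * ROW_EVALUATE_COST

def range_scan_cost(records_in_range):
    """One base-table page lookup per record in the range, plus CPU cost for
    evaluating the range condition and the WHERE condition on each record."""
    io = records_in_range * IO_COST
    cpu = 2 * records_in_range * ROW_EVALUATE_COST
    return io + cpu

full = table_scan_cost(pages=100_000, records=10_000_000)  # 2,100,000.0
rng = range_scan_cost(records_in_range=80_000)             # 112,000.0
print(full, rng, "range scan wins" if rng < full else "table scan wins")
```

With these assumed sizes the range scan costs 112,000 against more than two million for the table scan, so the optimizer picks the range scan.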
14 minutes.
Stage 2 of the optimizer is to analyze possible access methods for reading the data needed.
The goal is to find for each table in the query what is the best way to read the data.
The box to the right lists the main access methods that are used by MySQL. I will go into more detail about most of these on the following slides.
For each table in the query we do the following:
Check whether the access method is possible and useful
Estimate the cost of using that access method
Select the access method with the lowest cost
Index lookup, or ref access, is an access method for reading all records with a given key value using an index.
This is the main access method for most of the tables in a join. The first table can be read with different access methods, but if the following tables have useful indexes, then ref access will be used.
In the first example….
In the second example we will be able to use ref access on the second table when doing the join operation.
In the optimizer and in the explain output we distinguish between two different ref access methods, equality ref and normal ref. The first one, equality reference, is used when reading from a unique index. The second one is used when reading from a non-unique index or from a prefix of an index.
One important thing the optimizer does when evaluating access methods is to do ref access analysis.
By analyzing the query and checking which fields have indexes, the optimizer determines which indexes can be used for ref access in a join.
The result of this analysis is a ref access graph as shown in this figure.
The figure is for a join of three tables. Each arrow in the graph shows which fields have an equality relationship and a corresponding index that can be used during the join. This graph is used later by the join optimizer.
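One way to picture the ref access graph is as an adjacency map: each edge records that an equality in the query links a column to an indexed column, so that index becomes usable for ref access once the source table has been read. This is only a toy model with made-up table and index names, not MySQL's internal representation.

```python
# Toy ref-access graph for a three-table join. Each entry maps an
# (table, index) pair to the columns that feed it via equalities,
# e.g. t2.a = t1.a where t2 has an index idx_a on column a.
ref_access_graph = {
    ("t2", "idx_a"): [("t1", "a")],
    ("t3", "idx_b"): [("t2", "b")],
}

def usable_ref_indexes(tables_already_read, graph):
    """Return the indexes usable for ref access given the tables read so far."""
    return [
        (table, index)
        for (table, index), sources in graph.items()
        if all(src_table in tables_already_read for src_table, _ in sources)
    ]

# After reading t1, ref access on t2's idx_a becomes possible:
print(usable_ref_indexes({"t1"}, ref_access_graph))   # [('t2', 'idx_a')]
```

The join optimizer consults exactly this kind of information when it considers a table as the next one in the join order.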
The next important access method is range access.
The range optimizer will, for each index on the table, try to find the minimal range that needs to be read using that index.
Example: In this example the table has an index on key1 and an index on key2. The range optimizer will analyze the query condition and determine which part of the index it has to read using each of these indexes.
The range optimizer is able to use all parts of the WHERE condition that compare an indexed column against a constant value. It supports nested AND and OR conditions.
The result from the range optimizer is a list of ranges that need to be read from each index.
The cost estimate is based on the number of records that need to be read from each range. The index that has the lowest cost estimate will be selected.
For indexes that only contain a single column, range access is fairly easy to estimate. If the index contains multiple parts it gets more complex.
Here is an example of an index that covers three columns (a, b, c) in a table.
The layout of the index is such that it is first sorted on values from the first column a,
Within each a value, we have the corresponding b-values sorted. And similarly for the c values.
The range optimizer is able to find which ranges to read on this multipart index but there are some specific requirements that must be fulfilled.
Conditions on the first column can always be used by the range optimizer.
If the condition on the first index part is an equality condition, the range optimizer can also use the conditions on the second index part.
Here is an example where it can use conditions on the two first columns. The condition on a is an equality condition: a should be either 10, 11, or 13.
The second index part can then be added. And since the condition on the second index part is also an equality condition, b should be 2 or 4, we could have added conditions on c if there were any.
So after having run the range optimizer, the resulting range scan would read the following range from the index when executing this query.
Let us look at another example. The query is almost the same, but this time the condition on the first index part is not an equality condition: a should be larger than 10 and less than 13.
In this case we can use the condition on a as a range criterion, but we cannot use the condition on b.
The resulting range scan produced by the range optimizer is shown below. We see that much more of the index has to be read.
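The rule the two examples illustrate can be stated compactly: key parts are usable as long as all preceding parts have equality conditions, and the first range condition is the last part that can be used. Here is a small sketch of that rule; the 'eq'/'range' labels are a simplification of how conditions are classified.

```python
# Sketch of the multi-part index rule: count how many leading key parts
# the range optimizer can use, given the condition on each index column.
def usable_key_parts(conditions):
    """conditions: one of 'eq', 'range', or None per index column, in order."""
    used = 0
    for cond in conditions:
        if cond == 'eq':
            used += 1          # equality: this part is used, keep going
        elif cond == 'range':
            used += 1          # a range condition is used, but ends the prefix
            break
        else:
            break              # no condition on this part: stop
    return used

# a IN (10, 11, 13) AND b IN (2, 4): equalities on a and b -> both usable
print(usable_key_parts(['eq', 'eq', None]))     # 2
# 10 < a < 13 AND b IN (2, 4): the range on a blocks the condition on b
print(usable_key_parts(['range', 'eq', None]))  # 1
```

This matches the two slides: with equality on a, both a and b narrow the range; with a range on a, only a does.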
When using range access we are only reading from a single index. In some cases we can read from multiple indexes simultaneously and use this to reduce the number of records that need to be read from the table.
This is called index merge. Three index merge strategies are implemented.
Example: A single index cannot handle OR'ed conditions on different columns.
The last access method is called loose index scan. This is an optimization for queries containing group by or distinct.
If we look at the last of the three example queries, we are grouping on a and would like to have the lowest b value. By using loose index scan we can do this very efficiently by just reading the first index entry for each a value, and then “jump” to the next without having to read the index entries in between.
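The jumping behavior described above can be modeled in a few lines: because the index is sorted on (a, b), the first entry of each a-group already holds MIN(b) for that group. The entries below are made-up sample data; a real storage engine would skip ahead in the index rather than iterate over entries.

```python
from itertools import groupby

# Toy model of a loose index scan for "SELECT a, MIN(b) ... GROUP BY a"
# over an index sorted on (a, b): only the first entry per a-group is read.
index_entries = [(1, 2), (1, 5), (1, 9), (2, 1), (2, 7), (3, 4)]

def loose_index_scan(entries):
    """Return the first (a, b) entry of each a-group, i.e. (a, MIN(b))."""
    return [next(group) for _, group in groupby(entries, key=lambda e: e[0])]

print(loose_index_scan(index_entries))   # [(1, 2), (2, 1), (3, 4)]
```

Only three of the six index entries are touched, which is why loose index scan can be so much cheaper than scanning the whole index.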
22 minutes
Then we should be ready for the join optimizer, which is the third stage of the optimizer. The join optimizer does the main job of building the final query plan. It decides the final join order for the tables in the query.
The goal of the join optimizer is to find the best join order for the tables in a query.
With N tables, there are N factorial possible plans. For instance, if you have ten tables, there are 3.6 million possible join orders to consider. In most cases we do not have to evaluate all of them.
This figure shows the alternative plans that we would evaluate for a 4 table join.
Our join optimizer uses a “greedy search strategy” to evaluate all possible join orders.
We start with all 1-table plans.
For each of these we expand the plan by adding the other tables in a depth-first order.
When adding a new table to the plan, we estimate the cost of the plan. If the cost is larger than the cost of the currently best plan, we prune it.
How this works is best illustrated with an example.
When adding a new table to the join we do:
-Select the best access method (using the ref access graph we made earlier)
-Estimate the number of rows
-Calculate the cost of adding this table (both the cost of reading the table and the cost of the join)
-Prune the plan if it is more expensive than the current best plan
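The steps above can be sketched as a small depth-first search with cost pruning. This is a heavily simplified model: the per-table fanout numbers are made up, and the cost is just the accumulated row count, whereas real MySQL derives rows and cost from statistics and the chosen access methods.

```python
# Much-simplified depth-first join-order search with cost pruning.
# Fanout = estimated rows each table contributes per row of the prefix
# (made-up numbers for illustration).
FANOUT = {"t1": 10, "t2": 3, "t3": 7}

def best_join_order(tables):
    best = {"order": None, "cost": float("inf")}

    def search(prefix, remaining, rows, cost):
        if cost >= best["cost"]:
            return                        # prune: worse than current best plan
        if not remaining:
            best["order"], best["cost"] = prefix, cost
            return
        for t in sorted(remaining):       # try each table as the next one
            new_rows = rows * FANOUT[t]   # rows produced by the extended prefix
            search(prefix + [t], remaining - {t}, new_rows, cost + new_rows)

    search([], frozenset(tables), 1, 0.0)
    return best["order"], best["cost"]

print(best_join_order(["t1", "t2", "t3"]))   # (['t2', 't3', 't1'], 234.0)
```

Note how starting with the most selective table (t2) wins, and how whole subtrees of join orders are discarded as soon as their partial cost exceeds the best complete plan found so far.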
In the previous example I showed how the cost estimates for different join orders were used for selecting the best one. This slide shows how we calculate the estimate for how many records we expect to read from a table in a join.
In this slide we have read data from one table and are going to add the next table to the join.
The number of records that we estimate will be read from the second table is computed like this: we start with the number of records read from the first table, then we calculate how many of these will be filtered away by the query conditions on that table, and finally we use the index statistics to get an estimate of how many records we will need to read from the second table.
The inclusion of the condition filter effect is new in MySQL 5.7. This should make the estimate for how many records will be read from the next table in a join more accurate than earlier. In our testing we see that a lot of multi-table join queries get a better join order due to this.
Let us look at how we calculate the “condition filter effect”.
This query has three conditions in the WHERE clause. For each table we find which conditions will be used for filtering away records. We do this by looking at each condition.
In order to have any effect, the condition must:
-reference a field in the table
-not be used by the access method (because then it is already taken into account when calculating the number of records that will be read)
-be compared against an available value: for instance, employee.name = john will always be possible to evaluate when reading the employee table,
-while first_office_id <> id depends on the table order of the join
So when we have determined which conditions should be used for calculating the condition filter effect, we need to find out how many of the records they will filter out.
We base this on the following:
If this is an indexed column and it has a range predicate, then we use the range estimate.
If no range estimate is available, we use index statistics.
For non-indexed columns, we use guesstimates; here are some examples.
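Putting the pieces together, the row estimate for the next table multiplies the rows from the prefix, the combined filter effect of the usable conditions, and the index statistics for the next table. The filter values below (0.1 for an equality, one third for an inequality on a non-indexed column) are commonly cited MySQL 5.7 heuristics; treat them here as illustrative assumptions, along with the example numbers.

```python
# Sketch of combining condition filter effects into a row estimate.
# Default guesstimates for non-indexed columns (assumed values):
FILTER_EQUALITY = 0.1      # col = const
FILTER_INEQUALITY = 1 / 3  # col <> const, col > const, ...

def rows_from_next_table(prefix_rows, records_per_key, filters):
    """prefix_rows * combined filter effect * index fanout.

    filters: filter effects of the conditions on the previous table that are
    not already used by its access method (assumed independent, so they
    multiply).
    """
    combined = 1.0
    for f in filters:
        combined *= f
    return prefix_rows * combined * records_per_key

# 1000 rows so far, one equality filter, and index statistics saying each
# key value matches 5 rows in the next table:
print(rows_from_next_table(1000, 5, [FILTER_EQUALITY]))   # 500.0
```

Without the filter effect the estimate would have been 5000 rows, ten times too high, which is exactly the kind of error that used to lead the join optimizer to a worse join order.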
30 minutes
After the join optimizer is run, the final join order has been decided.
Still, there are some adjustments we can do to improve the query plan.
The main optimizations are:
34 minutes
As I said at the start of my presentation, I would get back to subquery optimizations at the end of this talk.
This is a good example of a transformation where we rewrite the query to a similar query that will be less costly to execute.
40 minutes
That was the end of what the optimizer does.
Let us quickly spend a minute or two looking at what the optimizer has produced.
If you want to see the cost numbers for a given query, we have added this to EXPLAIN in JSON format in 5.7.
Here you see the output from EXPLAIN for the same query we just used in the example of the cost model for range access.