I would like to welcome you to this presentation on the MySQL optimizer.
My name is Olav Sandstå and I work in the MySQL optimizer team.
The goal for this session is to give an overview of how the MySQL optimizer works. I will start with a short overview of the optimizer and then present each of its main optimizations in more detail.
I will try to leave some time for questions at the end of the presentation but if you have short and easy questions feel free to ask during the presentation.
The query optimizer takes as input a SQL query and produces a plan for how this query should be executed. For complex queries there are many possible query plans. The goal is that the optimizer should be able to find the best query plan.
In order to optimize the query, the optimizer uses information from the data dictionary and statistics from the storage engine about the data in the tables.
At the end of this session you all should know a bit more about what happens inside the “optimizer cloud” and how it works.
Let us first start with an overview of the architecture of MySQL to show where the query optimizer fits in.
When a query arrives it first goes through the parser. The second step is the "resolver", which does name resolution and semantic checks of the tables and columns in the query. These two steps produce a query tree representation of the SQL query that goes into the optimizer.
The optimizer will optimize the query and produce a query execution plan. This query execution plan will then be executed: data will be read from the storage engine and the result will be returned to the client. The rest of this presentation will go into more detail about what happens inside the optimizer module.
Here are some initial characteristics of the optimizer:
Now we should be ready to look at what is inside of the optimizer. Let us first go back to the MySQL architecture I presented a few slides earlier.
Each query that is optimized goes through four main stages. I will show the stages here and briefly mention what they do.
The first phase is the logical transformations. This stage is about simplifying the query and preparing it for later optimizations. You do not need to look at the details in the yellow boxes; I will cover them in detail later in the presentation.
The second phase is to make preparations for the main optimizer. Here we mostly analyze alternative ways of reading data from tables. The third is the main join optimizer. And the final phase is to make some final adjustments and optimizations to the query plan.
Let us start with the first stage of the optimizer which is the logical transformations the optimizer does to the query.
The logical transformations the optimizer has are mostly rule based. The reason for doing logical transformations to the query is either to simplify the query or re-write it in order to prepare it for the later optimizations.
Here is a list of the main transformations we do:
To simplify the query conditions we apply a set of rule based transformations. I will show an example of these on the next slide.
We convert outer joins to inner joins
We merge views and derived tables into the main query
And finally, we have a large set of transformations that optimize subqueries. I will get back to the subquery optimizations at the end of the presentation.
This example shows a join of two tables with a fairly complex WHERE condition.
The slide shows how we apply different rule based transformations to the WHERE clause in order to simplify it.
The final result is easier to optimize and costs less during execution.
Before going to the next stage of the optimizer, I will spend a few minutes presenting how the optimizer does cost based optimizations.
The general way we do cost based optimization of queries in MySQL is as follows. First, we calculate the cost for all alternative ways of reading data from tables. This includes looking at all useful ways of using different indexes and different access methods. Then we build alternative plans for how the query can be executed; this is mostly done by the join optimizer. For each alternative plan we calculate the cost. Finally, we select the query plan with the lowest cost.
This plan is then executed.
In MySQL the main cost based optimizations are: choosing which index and which access method to use for each table, the join order, which join buffer strategy to use, and, for subqueries, which subquery strategy to use.
In order to be able to do cost based optimization we need to have a model for what the cost of a query is. Here is a very simplified view of the cost model in MySQL.
As input it takes a basic operation like reading data from a table or joining two tables. As output it produces an estimate for the cost of doing this operation. In addition to the cost estimate, it will in most cases also produce an estimate for how many rows this operation will produce.
The cost model consists of a lot of formulas for calculating cost and record estimates for different operations. In addition to the cost formulas, the cost model consists of a set of "cost constants". These are the costs of the basic operations that the MySQL server does when executing a query.
The cost model uses information from the data dictionary and statistics from storage engines to do its calculations. The main statistics are the number of rows in a table, cardinality, and range estimates. All of these are produced by the storage engine.
From the data dictionary we use information about records and indexes: the length of records and keys, uniqueness, and whether they can be NULL.
One thing that is new in 5.7 is that the cost model has been made configurable. The cost constants are now stored in database tables and can be changed. I will go into more detail on this on the next slide.
In the MySQL cost model the basic cost unit is the cost for reading a random data page from disk. All other cost numbers are relative to this cost unit.
The main cost factors we include when estimating the cost for a query are:
IO cost, where we estimate the number of pages we need to read for tables and indexes. CPU cost, where the main contributions are the cost of evaluating query conditions and of comparing keys and records.
The main cost constants we use in the cost calculations are:
- reading a random page from disk, which has a cost of 1.0
- reading a page from the database buffer
- evaluating the query condition on a record, which has a cost of 0.2
- comparing keys or records, which has a cost of 0.1
To hopefully make it a bit more clear how the cost model works, I will show an example. This query can be executed as a table scan or as a range scan if we add a secondary index.
For a table scan, the server must read the entire table and evaluate the query condition for all records. The IO cost estimate for this is based on the number of pages in the table, and the CPU cost is based on having to evaluate the query condition on each record.
The second alternative is to run this as a range scan. The optimizer will ask the storage engine for an estimate of how many records are in the range between 20 and 23. The cost model will compute the cost estimate as follows:
The IO cost will be dominated by having to look each record up in the base table. The cost of reading the secondary index will be small compared to all these lookups, so the IO cost is calculated as the number of pages we will need to read from the base table: one per record. The cost formula for computing the CPU cost is as follows: we first multiply the number of records to read by the cost of evaluating a condition. This corresponds to having to evaluate the range condition on every record. Then we add the same cost a second time; this time it accounts for the CPU cost of evaluating the WHERE condition.
The final choice of whether this query will be done as a table scan or a range scan depends on which of these have the lowest cost estimate.
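The comparison above can be sketched in a few lines. This is an illustrative model only, using the cost constants mentioned earlier (1.0 per page read, 0.2 per condition evaluation); the table sizes and range estimate are made-up numbers, not values from any real table.

```python
# Illustrative sketch of the table scan vs. range scan cost comparison.
PAGE_READ_COST = 1.0      # reading a random page from disk
ROW_EVALUATE_COST = 0.2   # evaluating a condition on one record

def table_scan_cost(pages_in_table, records_in_table):
    # Read every page, evaluate the WHERE condition on every record.
    io_cost = pages_in_table * PAGE_READ_COST
    cpu_cost = records_in_table * ROW_EVALUATE_COST
    return io_cost + cpu_cost

def range_scan_cost(records_in_range):
    # One page read per record for the lookup into the base table;
    # the cost of reading the secondary index itself is ignored here.
    io_cost = records_in_range * PAGE_READ_COST
    # Evaluate the range condition plus the WHERE condition per record.
    cpu_cost = records_in_range * ROW_EVALUATE_COST * 2
    return io_cost + cpu_cost

# Hypothetical table: 1000 pages, 100000 records, 100 records in range.
print(table_scan_cost(1000, 100000))   # 21000.0
print(range_scan_cost(100))            # 140.0
```

With these numbers the range scan wins easily, but if the range had matched a large fraction of the table, the per-record page lookups would make the table scan cheaper.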
The second stage of the optimizer is to analyze possible access methods for reading the data needed by the query.
The goal is to find, for each table in the query, the best way to read its data.
For each table in the query we do the following:
- check if an access method is possible and useful
- estimate the cost of using that access method
- select the access method with the lowest cost
The box on the right of the slide lists the main access methods that are used by MySQL. I will go into more detail about most of these on the following slides.
Index lookup, or ref access, is an access method for reading all records with a given key value using an index.
This is the main access method for most of the tables in a join. The first table can be read with different access methods, but if the following tables have useful indexes, then ref access will be used.
In the first example we will use ref access to look up all records where the key value is seven.
In the second example we will be able to use ref access on the second table when doing the join operation.
In the optimizer and in the explain output we distinguish between two different ref access methods, equality ref and normal ref. The first one, equality reference, is used when reading from a unique index. The second one is used when reading from a non-unique index or from a prefix of an index.
One important thing the optimizer does when evaluating access methods is to do ref access analysis.
This example shows a query that joins three tables. The query finds the name of the capital and the languages that are used in each country.
By analyzing the query and checking which columns have indexes, the optimizer determines which indexes can be used for ref access in this query. The optimizer looks at the query conditions. If we start with the first one, we see that country_id can be used for joining both the city and country tables using index lookup. In the same way, by looking at the second query condition, we see that country and language can be joined using index lookup. And based on this, we can also use country_id for index lookup between the city and language tables.
Finally, the last condition in the WHERE clause tells us that we can use index lookup on city_id in the city table when coming from the country table.
This graph is used later by the join optimizer to know which tables can be added using index lookup.
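The ref access graph can be pictured as a set of directed edges, where an edge means "once the first table has been read, the second can be joined via an index lookup on this column". The sketch below is a simplified model of that idea, assuming the three-table example above (city, country, language); the edge list is illustrative, not taken from the actual World database schema.

```python
# Illustrative ref-access graph for the city/country/language example.
# Each edge: (table already read, table reachable by index lookup, column).
ref_access_edges = [
    ("city",     "country",  "country_id"),
    ("country",  "city",     "country_id"),
    ("country",  "language", "country_id"),
    ("language", "country",  "country_id"),
    ("city",     "language", "country_id"),
    ("language", "city",     "country_id"),
    ("country",  "city",     "city_id"),   # capital city lookup
]

def lookup_candidates(tables_already_joined):
    """Tables that can be added next using an index lookup."""
    return sorted({to for frm, to, _ in ref_access_edges
                   if frm in tables_already_joined
                   and to not in tables_already_joined})

print(lookup_candidates({"city"}))  # ['country', 'language']
```

The join optimizer consults exactly this kind of structure when it extends a partial plan: any table with an incoming edge from the tables already joined can be added cheaply with ref access.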
The next important access method is range access.
The range optimizer will, for each index on the table, try to find the minimal range that needs to be read using that index.
In this example the table has an index on key1 and an index on key2. The range optimizer will analyze the query condition and determine which part of the index it has to read using each of these indexes.
The range optimizer is able to use all parts of the query condition that compare an indexed column against a constant value. It supports nested AND and OR conditions.
The result from the range optimizer is a list of ranges that need to be read from each index. The cost estimate is based on the number of records that need to be read from each range. The index that has the lowest cost estimate will be selected.
For indexes that contain only a single column, range access is fairly easy to estimate. If the index contains multiple parts, it gets more complex.
Here is an example of an index that covers three columns (a, b, c) in a table. The layout of the index is such that it is first sorted on values from the first column a. Within each a-value, the corresponding b-values are sorted, and similarly for the c-values.
The range optimizer is able to find which ranges to read on this multipart index but there are some specific requirements that must be fulfilled.
Conditions on the first column can always be used by the range optimizer. If the condition on the first index part is an equality condition, the range optimizer can also use the conditions on the second index part.
Here is an example where it can use conditions on the first two columns. The condition on a is an equality condition: a should be either 10, 11 or 13. The second index part can then be added. And since the condition on the second index part is also an equality condition, b should be 2 or 4, we could have added conditions on c if there were any.
So after having run the range optimizer, the resulting range scan would read the following range from the index when executing this query.
Let us look at another example. The query is almost the same, but this time the condition on the first index part is not an equality condition: a should be larger than 10 and less than 13. In this case we can use the condition on a as a range criterion, but we cannot use the condition on b.
The resulting range scan produced by the range optimizer is shown below. We see that much more of the index has to be read.
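The prefix rule from the two examples can be sketched as a tiny function. This is a simplified model, not the server's actual implementation: each index part is classified as having an equality condition ("eq"), a range condition ("range"), or no condition (None), and a condition on part i+1 is only usable if part i has an equality condition.

```python
# Sketch of the multipart-index prefix rule used by the range optimizer.
def usable_keyparts(conditions):
    """conditions: per index part, "eq", "range", or None.
    Returns how many leading index parts the range scan can use."""
    used = 0
    for cond in conditions:
        if cond is None:
            break          # a gap in the prefix ends the usable parts
        used += 1
        if cond != "eq":
            break          # a range condition ends the usable prefix
    return used

# a IN (10, 11, 13) AND b IN (2, 4) AND c > 5: all three parts usable.
print(usable_keyparts(["eq", "eq", "range"]))  # 3
# a > 10 AND a < 13 AND b IN (2, 4): only the condition on a is usable.
print(usable_keyparts(["range", "eq"]))        # 1
```

This is why the second example has to read a much larger part of the index: the range condition on a stops the condition on b from narrowing the scan.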
When using range access we are only reading from a single index. In some cases we can read from multiple indexes simultaneously and use this to reduce the number of records that need to be read from the table.
This is called index merge. Three index merge strategies are implemented.
Example: a single index cannot handle OR-ed conditions on different columns.
The last access method is called loose index scan. This is an optimization for queries containing GROUP BY or DISTINCT where the GROUP BY/DISTINCT is on a prefix of a multipart index.
If we look at the last of the three example queries, we are grouping on a and would like to have the lowest b-value. By using loose index scan we can do this very efficiently by just reading the first index entry for each a-value and then "jumping" to the next a-value without having to read the index entries in between.
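The "jump" behavior can be sketched as follows. This is a toy model of the idea, assuming a query like SELECT a, MIN(b) ... GROUP BY a: since the index entries are sorted on (a, b), the first entry within each group of equal a-values already holds the minimum b, so the executor never touches the entries in between.

```python
# Toy sketch of loose index scan for "SELECT a, MIN(b) ... GROUP BY a".
def loose_index_scan(index_entries):
    """index_entries: list of (a, b) tuples in index (sorted) order."""
    result = []
    i = 0
    while i < len(index_entries):
        a, b = index_entries[i]
        result.append((a, b))  # first entry for this a-value: minimal b
        # "Jump" past all remaining entries with the same a-value.
        while i < len(index_entries) and index_entries[i][0] == a:
            i += 1
    return result

entries = [(1, 2), (1, 5), (2, 1), (2, 9), (2, 9), (3, 4)]
print(loose_index_scan(entries))  # [(1, 2), (2, 1), (3, 4)]
```

In the real server the jump is an index seek to the next distinct prefix value, so the cost grows with the number of groups rather than the number of index entries.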
Then we should be ready for the join optimizer, which is the third stage of the optimizer. The join optimizer does the main job of deciding the final join order for the tables in the query.
The goal of the join optimizer is to find the best join order for the tables in the query.
With N tables, there are N factorial possible plans. For instance, if you have ten tables, there are about 3.6 million possible join orders to consider. In most cases we do not have to evaluate all possible join orders. This figure shows the alternative plans that we would evaluate for a 4-table join.
Our join optimizer uses a "greedy search strategy" to evaluate the possible join orders. We start with all 1-table plans. For each of these we expand the plan by adding the other tables in a depth-first order. When adding a new table to the plan, we estimate the cost of this plan. If the cost is larger than the cost of the currently best plan, we prune this branch.
How this works is best illustrated with an example.
When adding a new table to the join we:
- select the best access method (using the ref access graph we made earlier)
- estimate the number of rows
- calculate the cost of adding this table (both the cost of reading the table and the cost of the join)
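The depth-first enumeration with cost pruning can be illustrated with a toy example. The cost model here is deliberately simplified and made up for illustration: each table contributes a fixed fanout, and the cost of a plan is the sum of the intermediate result sizes. The real server uses the full cost model described earlier.

```python
# Toy sketch of the join optimizer's depth-first search with pruning.
def best_join_order(fanout):
    """fanout: dict mapping table name -> rows produced per input row."""
    tables = list(fanout)
    best = {"order": None, "cost": float("inf")}

    def expand(order, rows, cost):
        if cost >= best["cost"]:
            return  # prune: already worse than the best complete plan
        if len(order) == len(tables):
            best["order"], best["cost"] = order, cost
            return
        for t in tables:
            if t not in order:
                new_rows = rows * fanout[t]
                expand(order + [t], new_rows, cost + new_rows)

    expand([], 1, 0)
    return best["order"], best["cost"]

# Hypothetical fanouts for three tables:
order, cost = best_join_order({"t1": 100, "t2": 10, "t3": 1})
print(order, cost)  # ['t3', 't2', 't1'] 1011
```

Even in this toy model you can see the general heuristic at work: putting the most selective table first keeps the intermediate results, and therefore the total cost, small.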
It is important to note that we here select the best plan based on a cost estimate. In order to get good cost estimates for each alternative plan we need to be able to estimate the number of records that each step in the join will produce.
In the previous example I showed how the cost estimates for different join orders were used to select the best one. This slide shows how we calculate the estimate for how many records we expect to read from a table in a join. In this slide we have read data from one table and are going to add the next table to the join.
The number of records we estimate will be read from the second table is computed like this. We start with the number of records read from the first table, then we calculate how many of these will be filtered away by the query conditions on that table, and finally we use index statistics to get an estimate of how many records we will need to read from the second table.
The inclusion of the condition filter effect is new in MySQL 5.7. This should make the estimate for how many records will be read from the next table in a join more accurate than before. In our testing we see that a lot of multi-table join queries get a better join order due to this.
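The row estimate described above is just a product of three factors. The sketch below is illustrative; the function name and the numbers are made up for the example, not taken from the server code.

```python
# Sketch of the row estimate when adding the next table to a join:
# rows from the prefix, reduced by the condition filter on the prefix,
# multiplied by the index statistics (records per key value) for the
# lookup into the next table.
def rows_from_next_table(prefix_rows, condition_filter, records_per_key):
    return prefix_rows * condition_filter * records_per_key

# 1000 rows read from the first table, conditions keep 10% of them,
# and the index on the next table returns about 5 rows per key value:
print(rows_from_next_table(1000, 0.1, 5))  # 500.0
```

Without the condition filter factor (as in versions before 5.7), this estimate would be 5000 rows, which is why the filter effect can change which join order looks cheapest.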
Let us look at how we calculate the "condition filter effect" for a table.
This query has three conditions in the WHERE clause. For each table we find which conditions will be used for filtering away records. We do this by looking at each condition. In order to have any effect, the condition must:
- reference a field in the table
- not be used by the access method (because then it is already taken into account when calculating the number of records that will be read)
- be compared against an available value: for instance, employee.name = 'John' can always be evaluated when reading the employee table, while first_office_id <> id depends on the table order of the join
So when we have determined which conditions should be used for calculating the condition filter effect, we need to find out how many of the records they will filter out.
We base this on the following:
If the column is indexed and has a range predicate, then we use the range estimate. If no range estimate is available, we use index statistics. For non-indexed columns, we use guesstimates; here are some examples.
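Combining the per-condition estimates into one filter effect for a table can be sketched as follows. The per-condition default values shown are illustrative guesstimates chosen for this example; they are not the exact constants used by the server.

```python
# Sketch of combining per-condition filter estimates for one table.
# Illustrative guesstimate defaults for non-indexed columns:
DEFAULT_FILTER = {
    "eq": 0.1,        # col = const
    "lt_gt": 1 / 3,   # col < const, col > const
    "between": 0.1,   # col BETWEEN lo AND hi
}

def condition_filter_effect(filters):
    """filters: per-condition fraction of rows that pass.
    The conditions are assumed independent, so the effects multiply."""
    effect = 1.0
    for f in filters:
        effect *= f
    return effect

# Two applicable conditions on the table: an equality and an inequality.
print(condition_filter_effect([DEFAULT_FILTER["eq"], DEFAULT_FILTER["lt_gt"]]))
```

The resulting fraction is what gets multiplied into the row estimate for the next table in the join, as shown on the previous slide.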
After the join optimizer is run, the final join order has been decided.
Still, there are some adjustments we can make to improve the query plan.
The main optimizations are:
As I said at the start of my presentation, I would get back to subquery optimizations at the end of this talk.
This is a good example of a transformation where we re-write the query to a similar query that will be less costly to execute.
That was the end of what the optimizer does.
Let us quickly spend a few minutes looking at what the optimizer has produced.
If you want to see the cost numbers for a given query, we have added this to EXPLAIN in JSON format in 5.7.
Here you see the EXPLAIN output for the same query we used earlier in the cost model example for range access.
If you need to understand why the optimizer selects a given plan, then optimizer trace can be used. In the optimizer trace you will find all the main steps and decisions done by the optimizer.
To get the optimizer trace you first enable optimizer tracing, then run your query, and finally read the trace from the information schema table named OPTIMIZER_TRACE.
This example shows a part of the optimizer trace for this query. The part shown is the result of analyzing which access method to use: table scan, covering index scan, or range scan.
There might be cases where the optimizer fails to find the best query plan. Luckily, there are some ways you can deal with that and force it to select a better plan.
This slide lists some of the options that can be used: