1. Prof. Neeraj Bhargava
Pooja Dixit
Department of Computer Science
School of Engineering & System Science
MDS University, Ajmer, Rajasthan, India
2. Query optimization is a function of many relational database management systems. The
query optimizer attempts to determine the most efficient way to execute a given query
by considering the possible query plans.
Generally, the query optimizer cannot be accessed directly by users: once queries are
submitted to the database server, and parsed by the parser, they are then passed to the
query optimizer where optimization occurs.
Query optimization involves three concepts:
◦ Query: A query is a request for information from a database.
◦ Query Plans: A query plan (or query execution plan) is an ordered set of steps used to
access data in an SQL relational database management system.
◦ Query Optimization: A single query can be executed through different algorithms or
rewritten in different forms and structures. Hence the question of query
optimization arises: which of these forms or pathways is the most efficient?
The optimizer answers it by comparing the possible query plans.
3. Importance: The goal of query optimization is to reduce the system resources required to
fulfill a query, and ultimately provide the user with the correct result set faster.
◦ First, it provides the user with faster results, which makes the application seem faster to the user.
◦ Secondly, it allows the system to service more queries in the same amount of time, because each
request takes less time than unoptimized queries.
◦ Thirdly, query optimization ultimately reduces the amount of wear on the hardware (e.g. disk drives),
and allows the server to run more efficiently (e.g. lower power consumption, less memory usage).
There are broadly two ways a query can be optimized:
◦ Analyze and transform equivalent relational expressions: Try to minimize the tuple and column counts
of the intermediate and final query results (discussed here).
◦ Use different algorithms for each operation: These underlying algorithms determine how tuples are
accessed from the data structures they are stored in (through indexing, hashing, and data retrieval)
and hence influence the number of disk and block accesses (discussed in query processing).
4. Query processing refers to the range of activities involved in extracting data from a database.
The activities include translation of queries in high-level database languages into
expressions that can be used at the physical level of the file system, a variety of query-
optimizing transformations, and actual evaluation of queries.
Overview
The steps involved in processing a query appear in Figure. The basic steps are:
◦ Parsing and translation.
◦ Optimization.
◦ Evaluation.
5. Before query processing can begin, the system must translate the query into a usable form. A
language such as SQL is suitable for human use, but is ill suited to be the system’s internal
representation of a query. A more useful internal representation is one based on the
extended relational algebra.
Given a query, there are generally a variety of methods for computing the answer. For
example, we have seen that, in SQL, a query could be expressed in several different ways.
Each SQL query can itself be translated into a relational-algebra expression in one of several
ways. Furthermore, the relational-algebra representation of a query specifies only partially
how to evaluate a query; there are usually several ways to evaluate relational-algebra
expressions. As an illustration, consider the query:
select salary from instructor where salary < 75000;
This query can be translated into either of the following relational-algebra expressions:
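The expressions themselves did not survive in this transcript. Assuming the usual selection (σ) and projection (Π) operators of relational algebra, two equivalent forms consistent with the query would be:

```latex
\[
\sigma_{\mathit{salary} < 75000}\bigl(\Pi_{\mathit{salary}}(\mathit{instructor})\bigr)
\qquad\text{or}\qquad
\Pi_{\mathit{salary}}\bigl(\sigma_{\mathit{salary} < 75000}(\mathit{instructor})\bigr)
\]
```

The first projects the salary column and then filters; the second filters the tuples and then projects. Both return the same set of salaries.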
6. Further, we can execute each relational-algebra operation by one of several different
algorithms. For example, to implement the preceding selection, we can search every tuple in
instructor to find tuples with salary less than 75000. If a B+-tree index is available on the
attribute salary, we can use the index instead to locate the tuples.
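The two ways of implementing the selection can be sketched in Python. This is only an illustration: the instructor rows are made-up data, and a sorted list searched with binary search stands in for the B+-tree index, which a real database would keep on disk.

```python
from bisect import bisect_left

# Made-up instructor tuples (name, salary), used only for illustration.
instructor = [
    ("Srinivasan", 65000), ("Wu", 90000), ("Mozart", 40000),
    ("Einstein", 95000), ("El Said", 60000), ("Katz", 75000),
]

def select_by_scan(rows, limit):
    """Examine every tuple and keep the salaries below the limit."""
    return [salary for _, salary in rows if salary < limit]

# Assumption: a sorted list of salaries plays the role of the index on salary.
salary_index = sorted(salary for _, salary in instructor)

def select_by_index(index, limit):
    """Find the cut-off position by binary search, then read the prefix."""
    return index[:bisect_left(index, limit)]

scan_result = select_by_scan(instructor, 75000)
index_result = select_by_index(salary_index, 75000)
```

Both functions return the same salaries. The scan touches every tuple, while the index lookup finds the boundary in a logarithmic number of comparisons and reads only the qualifying entries.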
A Query-Evaluation Plan
7. A sequence of primitive operations that can be used to evaluate a query is a query-
execution plan or query-evaluation plan.
The query-execution engine takes a query-evaluation plan, executes that plan, and
returns the answers to the query.
There are two methods of query optimization. The query optimizer uses them to
decide which plan or expression to use for evaluating the query:
1. Cost-based optimization (physical)
2. Heuristic optimization (logical)
8. Cost-Based Optimization (also known as Cost-Based Query
Optimization or the CBO optimizer) is an optimization technique in
Spark SQL that uses table statistics to determine the most efficient
query execution plan of a structured query (given the logical query
plan).
In Spark SQL, cost-based optimization is disabled by default. The
spark.sql.cbo.enabled configuration property controls whether
the CBO is enabled and used for query optimization.
Cost-Based Optimization uses logical optimization rules (e.g.
CostBasedJoinReorder) to optimize the logical plan of a structured
query based on statistics.
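The idea of reordering joins from statistics can be sketched with a toy cost model in plain Python. This is not Spark's implementation. The table names, row counts, selectivities, and the cost formula (sum of estimated intermediate result sizes) are all assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical table statistics: estimated row counts.
ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 20}

# Hypothetical join selectivities; a missing pair means no join predicate,
# so joining the two tables directly is a cross product (selectivity 1.0).
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "regions"}): 1 / 20,
}

def estimate_cost(order):
    """Toy cost: sum of estimated intermediate sizes of a left-deep join."""
    joined = [order[0]]
    size = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        selectivity = 1.0
        for previous in joined:
            selectivity *= SELECTIVITY.get(frozenset({previous, table}), 1.0)
        size = size * ROWS[table] * selectivity
        cost += size
        joined.append(table)
    return cost

def best_join_order(tables):
    """Enumerate every left-deep order and keep the cheapest estimate."""
    return min(permutations(tables), key=estimate_cost)

best = best_join_order(list(ROWS))
```

Under these made-up statistics the cheapest orders join the two small tables first and bring in the large orders table last. Exhaustive enumeration is used here only because there are three tables; the number of orders grows factorially with the table count.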
9. Heuristic Based Optimization
◦ Heuristic based optimization uses rule-based optimization approaches
for query optimization. These algorithms have polynomial time and
space complexity, which is lower than the exponential complexity of
exhaustive search-based algorithms. However, these algorithms do not
necessarily produce the best query plan.
◦ Some of the common heuristic rules are:
Perform select and project operations before join operations. This is
done by moving the select and project operations down the query
tree, which reduces the number of tuples available for the join.
Perform the most restrictive select/project operations first, before
the other operations.
Avoid cross-product operations, since they result in very large
intermediate tables.
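The first rule can be illustrated with a small Python sketch. It assumes a naive nested-loop join and uses made-up instructor and department rows. The join is run once with the selection applied afterwards and once with the selection pushed below it.

```python
# Made-up rows, used only for illustration.
instructors = [
    {"name": "Srinivasan", "dept": "Comp. Sci.", "salary": 65000},
    {"name": "Wu", "dept": "Finance", "salary": 90000},
    {"name": "Mozart", "dept": "Music", "salary": 40000},
    {"name": "Einstein", "dept": "Physics", "salary": 95000},
]
departments = [
    {"dept": "Comp. Sci.", "building": "Taylor"},
    {"dept": "Finance", "building": "Painter"},
    {"dept": "Music", "building": "Packard"},
]

def nested_loop_join(left, right, key):
    """Join on an equal key and count how many tuple pairs were examined."""
    pairs_examined = 0
    result = []
    for l_row in left:
        for r_row in right:
            pairs_examined += 1
            if l_row[key] == r_row[key]:
                result.append({**l_row, **r_row})
    return result, pairs_examined

def low_salary(row):
    return row["salary"] < 75000

# Select after the join: every instructor takes part in the join.
joined, late_pairs = nested_loop_join(instructors, departments, "dept")
late_result = [row for row in joined if low_salary(row)]

# Select before the join: only qualifying instructors reach the join.
filtered = [row for row in instructors if low_salary(row)]
early_result, early_pairs = nested_loop_join(filtered, departments, "dept")
```

Both plans produce the same rows, but pushing the selection down halves the number of tuple pairs the join has to examine in this example (6 instead of 12).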
10. External sorting is a technique for data that resides in secondary memory: the data is
loaded into main memory part by part, and each part is sorted there. The sorted parts are
written to intermediate files, and finally these files are merged to produce the sorted data.
External sorting therefore makes it possible to sort a very large amount of data. Because
all the data cannot be accommodated in main memory at once, part of it must be kept on
secondary storage such as a hard disk or compact disc.
External sorting is required whenever the data to be sorted does not fit into main
memory. It consists of two phases:
Sorting phase: Chunks of the data that fit in main memory are sorted and written to
intermediate files.
Merge phase: The sorted intermediate files are combined into a single larger file.
11. One of the best-known examples of external sorting is external merge sort.
External merge sort
External merge sort is a technique in which the data is split across intermediate files,
each intermediate file is sorted independently, and the sorted files are then merged to
produce the sorted data.
For example, suppose 10,000 records have to be sorted with the external merge sort
method, and main memory can hold 500 records at a time, organized as blocks of
100 records each.
12. In this example, 5 blocks (500 records) are sorted at a time and written to an
intermediate file. This process is repeated 20 times to cover all 10,000 records. We then
merge pairs of intermediate files in main memory until a single sorted output remains.
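The example above can be sketched in Python, assuming integer records and temporary files standing in for the intermediate files. The helper names are hypothetical. For brevity the merge phase uses the standard library's heapq.merge to merge all runs in a single multiway pass, which is a substitution for the pairwise merging described above.

```python
import heapq
import tempfile

def write_sorted_run(chunk):
    """Sort one memory-sized chunk and write it to a temporary run file."""
    run = tempfile.TemporaryFile(mode="w+")
    run.writelines(f"{value}\n" for value in sorted(chunk))
    run.seek(0)  # rewind so the run can be read back during the merge
    return run

def read_run(run):
    """Stream the records of a run file one at a time."""
    for line in run:
        yield int(line)

def external_merge_sort(records, memory_capacity):
    # Sorting phase: one sorted run per memory-sized chunk.
    # (The input list stands in for data that would really live on disk.)
    runs = [
        write_sorted_run(records[start:start + memory_capacity])
        for start in range(0, len(records), memory_capacity)
    ]
    # Merge phase: stream all runs through a multiway merge.
    merged = list(heapq.merge(*(read_run(run) for run in runs)))
    for run in runs:
        run.close()
    return merged, len(runs)

records = list(range(10_000, 0, -1))  # 10,000 records in reverse order
sorted_records, run_count = external_merge_sort(records, memory_capacity=500)
```

With 10,000 records and room for 500 in memory, the sorting phase produces 20 runs, matching the count in the example.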
Two-Way Merge Sort
Two-way merge sort is a technique that works in two stages:
◦ Stage 1: Break the records into blocks and sort each individual block with
the help of two input tapes.
◦ Stage 2: Merge the sorted blocks to create a single sorted file with the help
of two output tapes.
Two-way merge sort therefore uses two input tapes and two
output tapes for sorting the data.
13. Algorithm for Two-Way Merge Sort:
Step 1) Divide the elements into blocks of size M. Sort each block and write it
to disk as a run.
Step 2) Merge two runs:
◦ Read the first value of each of the two runs.
◦ Compare the two values and select the smaller one.
◦ Write the selected record to the output tape and read the next value from the same run.
Step 3) Repeat step 2 to get longer and longer runs on alternate tapes.
In the end, a single sorted list remains.
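The steps above can be sketched in Python. In-memory lists stand in for the runs on tape or disk, and the function names are hypothetical; the point is only the pass structure.

```python
def merge_two_runs(first, second):
    """Step 2: repeatedly compare the heads of two runs and emit the smaller."""
    merged = []
    i = j = 0
    while i < len(first) and j < len(second):
        if first[i] <= second[j]:
            merged.append(first[i])
            i += 1
        else:
            merged.append(second[j])
            j += 1
    merged.extend(first[i:])   # one run is exhausted; copy the rest
    merged.extend(second[j:])
    return merged

def two_way_merge_sort(records, block_size):
    # Step 1: split into blocks of size M and sort each block into a run.
    runs = [
        sorted(records[start:start + block_size])
        for start in range(0, len(records), block_size)
    ]
    merge_passes = 0
    # Step 3: keep merging pairs of runs until a single run remains.
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                next_runs.append(merge_two_runs(runs[k], runs[k + 1]))
            else:
                next_runs.append(runs[k])  # odd run out is carried forward
        runs = next_runs
        merge_passes += 1
    return (runs[0] if runs else []), merge_passes

data = list(range(10_000, 0, -1))
result, passes = two_way_merge_sort(data, block_size=500)
```

For 10,000 records and blocks of 500, the 20 initial runs shrink to 10, 5, 3, 2, and finally 1, so five merge passes are needed after the initial run-creation pass.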
14. Analysis
This algorithm requires about log2(N/M) merge passes in addition to the
initial run-creation pass. Each pass processes all N records, so the
time complexity is O(N log(N/M)).
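As a worked instance, take the figures from the earlier example (N = 10,000 records, M = 500 records per initial run) and assume a base-2 logarithm, since each pass merges runs in pairs:

```latex
\[
\frac{N}{M} = \frac{10\,000}{500} = 20 \ \text{initial runs},
\qquad
\left\lceil \log_2 20 \right\rceil = 5 \ \text{merge passes},
\]
\[
\text{total passes} = 1 + 5 = 6,
\qquad
\text{records processed} = 6 \times 10\,000 = 60\,000 .
\]
```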