Chapter 4 - Query Processing and Optimization.pptx
1.
1
Chapter 4
Query Processingand Optimization
Query processing :
• In general, a query is a form of questioning, in a line of inquiry
• A query language is a language in which user requests
information from the database
The aim of query processing is to find information in one or
more databases and deliver it to the user
Query optimization:
It is the process of choosing a suitable execution strategy for
processing a query
Two internal representations of a query:
Query Tree
Query graph
2.
Query Processing andOptimization
Scanner: identify language components.
keywords, attribute, relation names
Parser : check query system
Validation: check attributes & relations
Query tree (query graph) : internal
representation
Execution strategy: plan
Query optimization : choose a strategy
(reasonably efficient strategy)
Query Processing (cont…)
Query Processing can be divided into four main phases:
Decomposition:
Optimization
Code generation, and
Execution
Decomposition: it is the process of transforming a high
level query into a relational algebra query, and to check
that the query is syntactically and semantically correct.
Query decomposition consists of parsing and validation
4
5.
Query Processing (cont…)
Typical stages in query decomposition are:
Analysis: lexical and syntactical analysis of the query
(correctness).
Query tree will be built for the query containing leaf
node for base relations, one or many non-leaf
nodes for relations produced by relational algebra
operations and root node for the result of the query.
Sequence of operation is from the leaves to the
root.
Normalization: convert the query into a normalized
form. The predicate WHERE will be converted to
Conjunctive () or Disjunctive () Normal form.
5
6.
Query Processing (cont…)
Typical stages in query decomposition(cont…)
Semantic Analysis: to reject normalized queries that are not
correctly formulated or contradictory.
Incorrect if components do not contribute to generate
result.
Contradictory if the predicate can not be satisfied by any
tuple.
Simplification: to detect redundant qualifications, eliminate
common sub-expressions, and transform the query to a
semantically equivalent but more easily and effectively
computed form.
Query Restructuring When more than one translation is
possible, use transformation rules to re arranging nodes so
that the most restrictive condition will be executed first.
6
7.
7
Transformation Rules forRelational Algebra
Relational algebra operations:
Select, Project , Join, Union, Intersection, Cartesian Poduct
-
refer the text book for detail understanding
General Transformation rules for relational
algebra
1. Cascade of s: A conjunctive selection condition can be
broken up into a cascade (sequence) of individual s
operations:
s c1 AND c2 AND ... AND cn(R) = sc1 (sc2 (...(scn(R))...) )
2. Commutativity of s:
The s operation is commutative:
sc1 (sc2(R)) = sc2 (sc1(R))
3. Cascade of p: In a cascade (sequence) of p operations, all
but the last one can be ignored:
pList1 (pList2 (...(pListn(R))...) ) = pList1(R)
8.
8
Transformation Rules forRelational Algebra
(cont…)
4. Commuting s with p:
If the selection condition c involves only the
attributes A1, ..., An in the projection list, the two
operations can be commuted:
pA1, A2, ..., An (sc (R)) = sc (pA1, A2, ..., An (R))
5. Commuting of and X: both are commutative
R c S = S c R, RXS=SXR
9.
9
Transformation Rules forRelational Algebra
(cont…)
6. Commuting p with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and
B1, ..., Bm are attributes of S and the join condition c
involves only attributes in L, the two operations can be
commuted as follows:
pL ( R C S ) = (pA1, ..., An (R)) C (p B1, ..., Bm (S))
7. Converting a (s, x) sequence into : If the condition c of
a s that follows a x Corresponds to a join condition,
convert the (s, x) sequence into a as follows:
(sC (R x S)) = (R C S)
10.
Transformation Rules forRelational
Algebra (cont…)
Reading Assignment
8. Commutativity of THETA JOIN/Cartesian Product
R X S is equivalent to S X R
Also holds for Equi-Join and Natural-Join
(R c1S)= (S c1R)
9. Commutativity of SELECTION with THETA JOIN
If the predicate c1 involves only attributes of one of the
relations (R) being joined, then the Selection and Join
operations commute
10. Commuting SELECTION with SET OPERATIONS
11. Commuting PROJECTION with UNION
10
11.
11
Query Processing
QueryBlock:
It is the basic unit that can be translated into the
algebraic operators
A query block contains a single SELECT-FROM-
WHERE expression, as well as GROUP BY and
HAVING clause if these are part of the block.
Nested queries within a query are identified as
separate query blocks.
Example: Find the name of employees whose
salary is below the average salary of all employees
who are working at department number 5
12.
12
Query Processing (cont…)
SELECTFNAME, LNAME
FROM EMPLOYEE
WHERE SALARY < C ( SELECT AVG (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);
SELECT AVG (SALARY)
FROM EMPLOYEE
WHERE DNO = 5
SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE SALARY < C
πFNAME, LNAME (σSALARY<C(EMPLOYEE)) ℱAVGSALARY (σDNO=5 (EMPLOYEE))
13.
13
Basic algorithms forexecuting query
operations
External sorting:
Refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in
main memory, such as most database files
External Sorting Uses Sort-Merge Strategy:
Starts by sorting small sub files (runs) of the main file
and then merges the sorted runs, creating larger sorted
sub files that are merged in turn
Sorting Phase:
Number of subfiles (runs) nR = (b/nB)
Merging phase:
Degree of merging(dM) = Min (nB-1, nR); nP =
(logdM(nR))
14.
14
Sort-Merge strategy (cont…)
nR: number of initial runs;
b: number of file blocks;
nB: available buffer space;
dM: degree of merging;
P: number of passes
The size of a run and number of initial run depends on the
number of file blocks (b) and available buffer space (nB)
Example: if nB=5 blocks and size of the file=1024 blocks,
nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks
except the last run which will have 4 blocks.
Hence, after the sort phase, 205 sorted runs are stored
as temporary subfiles on disk
15.
15
Sort-Merge strategy (cont…)
In the merging phase, the sorted runs are merged during
one or more passes.
The degree of merging (dM) is the number of runs that
can be merged in each pass.
dM=min((nB-1) and nR))
The number of passes=[(logdM
(nR)
)]
In each pass, one buffer block is needed to hold one block
from each of the runs being merged and one block is
needed for containing one block of the merge result
16.
16
Sort-Merge strategy (cont…)
In the above example, dM=4(four way merging)
Hence, the 205 initial sorted runs would be merged into:
52 at the end of the first pass
13 at the end of the second pass
4 at the end of the third pass
1 at the end of the fourth pass
The minimum dM of 2 gives the worst case performance
of the algorithm, which is
2*b +2(*(b*(log2
(nR)
)
The first term represents the number of block access for
the sort phase and the second term represents the
number of block access for the merge phase
17.
Use the followinglogical model to understand the
discussions in this chapter
17
18.
Implementing SELECT operation
18
There are many options for executing a SELECT
operation
Some options depend on the file having specific
access paths and may apply only to certain types of
selection conditions
Examples: Use the logical model provided in the
previous slide to understand the following operations:
(OP1) σSsn=12345689 (EMPLOYEE)
equality comparison on key attribute
(OP2) σDNUMBER > 5 (DEPARMENT)
nonequality comparison on key attribute
(OP3) σDNO=5 (EMPLOYEE)
equality comparison on non key attribute
(OP4) σDNO=5 AND SALARY >30000 AND SEX=F(EMPLOYEE)
conjunctive condition
(OP5) σSsn=123456789 AND PNO=10 (WORKS_ON)
conjunctive condition and composite key
19.
19
Implementing the SELECTOperation (cont…)
Search Methods for implementing Simple Selection:
S1 Linear search (brute force):
Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
S2 Binary search:
If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used. An example
is OP1 if IDNo is the ordering attribute for EMPLOYEE file
S3 Using a primary index or hash key to retrieve a
single record:
If the selection condition involves an equality comparison on a
key attribute with a primary index (or a hash key), use the
primary index (or the hash key) to retrieve the record.
For Example, OP1 use primary index to retrieve the record
20.
20
Implementing the SELECTOperation (cont…)
Search Methods for implementing Simple Selection(cont…)
S4 Using a primary index to retrieve multiple records:
If the comparison condition is >, ≥, <, or ≤ on a key
field with a primary index, use the index to find the
record satisfying the corresponding equality condition,
then retrieve all subsequent records in the (ordered)
file. (see OP2)
S5 Using a clustering index to retrieve multiple
records:
If the selection condition involves an equality
comparison on a non-key attribute with a clustering
index, use the clustering index to retrieve all the
records satisfying the selection condition. (See OP3)
21.
21
Implementing the SELECTOperation
(cont…)
S6: using a secondary index on an equality
comparison:
• This search method can be used to retrieve a single record if
the indexing field is a key (has unique values) or to retrieve
multiple records if the indexing field is not a key
• Search Methods for implementing complex Selection
S7: Conjunctive selection using an individual index :
If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S5, use that
condition to retrieve the records and then check whether
each retrieved record satisfies the remaining simple
conditions in the conjunctive condition
22.
22
Implementing the SELECTOperation (summarized)
Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
If an access path exists, the method corresponding
to that access path is used;
Otherwise, the “brute force” linear search approach
of method S1 is used
For conjunctive selection conditions, whenever
more than one of the attributes involved in the
conditions have an access path, query optimization
should be done to choose the access path that
retrieves the fewest records in the most efficient way
23.
23
Implementing the SELECTOperation
(summarized)
Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical
connective rather than AND
Compared to conjunctive selection, It is much harder to
process and optimize
Example:s DNO=5 OR SALARY>3000 OR SEX=‘F‘ (EMPLOYEE)
Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions
Hence, if any of the individual conditions does not have
an access path, we are compelled to use the brute force
approach
24.
24
Implementing the JOINOperation:
The join operation is one of the most time
consuming operation in query processing
Join
two–way join: a join on two files
e.g. R A=B S
multi-way joins: joins involving more than two files
e.g. R A=B S C=D T
Examples
(OP6): EMPLOYEE DNO=DNUMBER DEPARTMENT
(OP7): DEPARTMENT MGRSSN=SSN EMPLOYEE
25.
25
Implementing the JOINOperation(cont…)
Methods for implementing joins:
J1 Nested-loop join (brute force):
For each record t in R (outer loop), retrieve every
record s from S (inner loop) and test whether the
two records satisfy the join condition t[A] = s[B]
J2 Single-loop join (Using an access structure to
retrieve the matching records):
If an index (or hash key) exists for one of the two
join attributes — say, B of S — retrieve each record
t in R, one at a time, and then use the access
structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].
26.
26
Algorithms for SELECTand JOIN
Operations (cont…)
Implementing the JOIN Operation (cont...):
Factors affecting JOIN performance
Available buffer space
Join selection factor
Choice of inner VS outer relation
Selectivity
Is ratio of the number of records (tuples) that satisfy
the condition to the total number of records (tuples) in
the file (relation
27.
Algorithms for SELECTand JOIN
Operations (cont…)
is a number between zero and 1
zero selectivity means no records satisfy the
condition and 1 means all the records satisfy the
condition.
Although exact selectivities of all conditions may
not be available, estimates of selectivities are
often kept in the DBMS catalog and are used by
the optimizer
27
28.
28
Algorithms for PROJECTand SET
Operations
Algorithm for PROJECT operations ()
<attribute list>(R)
1. If <attribute list> has a key of relation R, extract all tuples
from R with only the values for the attributes in <attribute
list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.
Methods to remove duplicate tuples
1. Sorting
2. Hashing
29.
Query Optimization
Giventhe database structure, the challenge of query
optimization is to find the sequence of steps that produces
the answer to user request in the most efficient manner
The performance of a query is affected by the tables or
queries that underlies the query and by the complexity of the
query.
Given a request for data manipulation or retrieval, an
optimizer will choose an optimal plan for evaluating the
request from among the available alternative strategies.
There are many ways (access paths) for accessing desired
file/record. The optimizer tries to select the most efficient
(cheapest) access path for accessing the data.
DBMS is responsible to pick the best execution strategy
based on various considerations.
29
30.
Approaches to QueryOptimization
Heuristics Approach
The heuristic approach uses the knowledge of the
characteristics of the relational algebra operations and
the relationship between the operators to optimize the
query.
Thus the heuristic approach of optimization will make
use of:
Properties of individual operators
Association between operators
Query Tree: a graphical representation of the
operators, relations, attributes and predicates and
processing sequence during query processing.
30
31.
31
Using Heuristics inQuery Optimization
Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query
The main heuristic is to apply first the operations that
reduce the size of intermediate results
E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
32.
32
Using Heuristics inQuery Optimization
(cont…)
Heuristic Optimization of Query Trees:
The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
Example: Find the name of all employees who are working
in AQUARIUS project and born before ‘1957-12-31’
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON,
PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND ESSN=SSN
AND BDATE < ‘1957-12-31’;
33.
Steps in TypicalHeuristic optimization
1. Deconstruct conjunctive selections into a sequence of
single selection operations (Equiv. rule 1.).
2. Move selection operations down the query tree for the
earliest possible execution
3. Execute first those selection and join operations that will
produce the smallest relations
4. Replace Cartesian product operations that are followed by
a selection condition by join operations
5. Deconstruct and move as far down the tree as possible
lists of projection attributes, creating new projections
where needed
33
34.
34
Steps in convertinga query tree during heuristic optimization:
Initial query tree for the query Q on slide 32
35.
35
Steps in convertinga query tree during heuristic optimization:
Moving the select operation down the tree
36.
36
Steps in convertinga query tree during heuristic optimization:
Applying the more restrictive select operation first
37.
37
Steps in convertinga query tree during heuristic optimization:
Replacing Cartesian product and select with join
38.
38
Steps in convertinga query tree during heuristic optimization:
Moving project operations down the query tree
39.
39
Heuristics in QueryOptimization (cont…)
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and join operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)
40.
40
Using Selectivity andCost Estimates in
Query Optimization
Cost-based query optimization:
Estimate and compare the costs of executing a query
using different execution strategies and choose the
strategy with the lowest cost estimate
Cost Components for Query Execution
1. Access cost to secondary storage
2. Computation cost
3. Memory usage cost
4. Communication cost
Note: Different database systems may focus on different
cost components.
41.
41
Using Selectivity andCost Estimates in
Query Optimization (cont…)
Catalog Information Used in Cost Functions
Information about the size of a file
number of records (tuples) (r),
record size (R),
number of blocks (b)
blocking factor (bfr)
Information about indexes and indexing attributes of a file
Number of levels (x) of each multilevel index
Number of first-level index blocks (bI1)
Number of distinct values (d) of an attribute
Selectivity (sl) of an attribute
42.
42
Semantic Query Optimization
Semantic Query Optimization:
Uses constraints specified on the database schema in order to
modify one query into another query that is more efficient to
execute
Read more about:
Sematic query optimization and
Semantic Query Optimization (SQO) Techniques