Distributed Database Systems
Autumn, 2008
Chapter 7
Overview of Query Processing
SQL: Non-Procedural Language of RDB
• Tuple calculus
◦ { t | F(t) } where:
▪ t : tuple variable
▪ F(t) : well-formed formula
• Example
◦ Get the number and name of all managers
$\{\, t[\mathrm{ENO}, \mathrm{ENAME}] \mid t \in \mathrm{EMP} \wedge t[\mathrm{TITLE}] = \text{"MANAGER"} \,\}$
SQL: Non-Procedural Language of RDB
• Domain calculus
◦ $\{\, x_1, x_2, \ldots, x_n \mid F(x_1, x_2, \ldots, x_n) \,\}$ where:
▪ $x_i$ : domain variables
▪ $F(x_1, x_2, \ldots, x_n)$ : well-formed formula
• Example
◦ { x, y | E(x, y, "manager") }
Variables are position sensitive!
SQL: Non-Procedural Language of RDB
• SQL is a tuple calculus language
SELECT ENO, ENAME
FROM EMP
WHERE TITLE = 'manager'
End users use non-procedural languages to express queries.
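The correspondence between the SQL statement and the tuple calculus expression can be sketched with a set comprehension, which likewise states what is wanted rather than how to fetch it. The `Emp` type and the sample rows are hypothetical; the slides list no EMP tuples.

```python
from typing import NamedTuple


class Emp(NamedTuple):
    eno: str
    ename: str
    title: str


# Hypothetical sample rows, for illustration only.
EMP = [
    Emp("E1", "Alice", "manager"),
    Emp("E2", "Bob", "engineer"),
    Emp("E3", "Carol", "manager"),
]

# { t[ENO, ENAME] | t in EMP and t[TITLE] = "manager" }
managers = {(t.eno, t.ename) for t in EMP if t.title == "manager"}
```

The comprehension names a tuple variable `t`, a range (`EMP`), and a formula (`t.title == "manager"`), mirroring the three parts of the calculus expression.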
Query Processor
• The query processor transforms queries into procedural operations to access data
Query Processor
• A distributed query processor has to deal with
◦ query decomposition, and
◦ data localization
7.1 Query Processing Problems
• A centralized query processor must
◦ transform the calculus query into algebra operations, and
◦ choose the best execution plan
• Example:
SELECT ENAME
FROM E, G
WHERE E.ENO = G.ENO
AND RESP = 'manager'
7.1 Query Processing Problems
• Relational Algebra 1
$\Pi_{ENAME}\bigl(\sigma_{RESP = \text{"Manager"} \,\wedge\, E.ENO = G.ENO}(E \times G)\bigr)$
• Relational Algebra 2
$\Pi_{ENAME}\bigl(E \bowtie_{ENO} \sigma_{RESP = \text{"Manager"}}(G)\bigr)$
Execution plan 2 is better because it consumes fewer resources: it avoids the Cartesian product of E and G.
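A minimal sketch of why the second plan is cheaper, comparing the size of the intermediate result each plan builds. The sample relations and the helper names `plan1` and `plan2` are hypothetical; real systems use far more refined join algorithms.

```python
from itertools import product

# Hypothetical sample data: E(ENO, ENAME) and G(ENO, RESP).
E = [(f"E{i}", f"name{i}") for i in range(1, 9)]
G = [(f"E{i}", "manager" if i % 4 == 0 else "engineer") for i in range(1, 9)]


def plan1(E, G):
    """Cartesian product first, then selection, then projection."""
    pairs = list(product(E, G))  # |E| * |G| intermediate tuples
    names = {e[1] for e, g in pairs if g[1] == "manager" and e[0] == g[0]}
    return names, len(pairs)


def plan2(E, G):
    """Selection on G first, then join on ENO, then projection."""
    manager_enos = {g[0] for g in G if g[1] == "manager"}  # reduced operand
    names = {e[1] for e in E if e[0] in manager_enos}
    return names, len(manager_enos)


names1, size1 = plan1(E, G)
names2, size2 = plan2(E, G)
```

Both plans return the same names, but with 8 tuples in each relation the first materializes 64 intermediate tuples while the second keeps only the 2 selected ones.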
7.1 Query Processing Problems
• In a DDB, the query processor must also consider the communication cost and select the best site.
• Same query as in the last example, but G and E are distributed.
• Simple plan:
◦ Transport all fragments to the query site and execute there. This causes too much network traffic and is very costly.
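The traffic difference can be made concrete with a rough count of tuples sent over the network. Everything here is an assumption made for illustration: the fragment names, sites, and cardinalities are hypothetical, and the cost unit is simply "tuples transferred".

```python
# Hypothetical layout: fragment -> (site, tuples) for E,
# and fragment -> (site, tuples, tuples with RESP = "manager") for G.
E_FRAGS = {"E1": ("site1", 200), "E2": ("site2", 200)}
G_FRAGS = {"G1": ("site3", 500, 10), "G2": ("site4", 500, 10)}


def ship_everything():
    """Simple plan: move every fragment to the query site."""
    e_tuples = sum(n for _, n in E_FRAGS.values())
    g_tuples = sum(n for _, n, _ in G_FRAGS.values())
    return e_tuples + g_tuples


def reduce_then_ship():
    """Select managers at the G sites, join at the E sites, ship only results."""
    selected = sum(m for _, _, m in G_FRAGS.values())  # sent to the E sites
    joined = selected  # assumes each selected G tuple matches exactly one E tuple
    return selected + joined  # plus the join results sent to the query site
```

Under these assumed sizes the simple plan moves 1400 tuples, while selecting before shipping moves 40.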
7.1 Query Processing Problems
• Distributed Query Example
◦ Distribution of E and G
7.1 Query Processing Problems
• Distributed Query Example
◦ Query
$\Pi_{ENAME}\bigl(E \bowtie_{ENO} \sigma_{RESP = \text{"Manager"}}(G)\bigr)$
7.1 Query Processing Problems
• Distributed Query Example
◦ Optimized Processing
7.2 Objectives of Query Processing
• Two-fold objectives:
◦ Transformation, and
◦ Optimization
7.2 Objectives of Query Processing
• Costs to be considered for optimization:
◦ CPU time,
◦ I/O time, and
◦ Communication time
WAN: communication time is dominant
LAN: all three costs carry comparable weight
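One common way to combine the three components is a weighted sum. The sketch below assumes that form; the `Weights` class, the `WAN` and `LAN` weight values, and the normalized work units are hypothetical and chosen only to show how the network type shifts the balance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    cpu: int   # time per unit of CPU work
    io: int    # time per unit of I/O work
    comm: int  # time per unit of data communicated


# Hypothetical weights in arbitrary time units.
WAN = Weights(cpu=1, io=1, comm=100)
LAN = Weights(cpu=1, io=1, comm=1)


def total_cost(w, cpu_units, io_units, comm_units):
    """Weighted sum of the three cost components."""
    return w.cpu * cpu_units + w.io * io_units + w.comm * comm_units


def comm_share(w, cpu_units, io_units, comm_units):
    """Fraction of the total cost that is due to communication."""
    return w.comm * comm_units / total_cost(w, cpu_units, io_units, comm_units)
```

For the same plan with 10 units of each kind of work, communication accounts for about 98% of the cost under the assumed WAN weights and for one third under the LAN weights.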
7.3 Complexity of Relational Algebra Operations
• Complexity is measured in terms of n, the cardinality of the relations, assuming tuples are sorted on the comparison attributes.
Operation                                  Complexity
σ, Π (without duplicate elimination)       O(n)
Π (with duplicate elimination), GROUP      O(n log n)
⋈, ⋉, ÷, ∪, ∩, −                           O(n log n)
×                                          O(n²)
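Two of these bounds can be illustrated directly. The sketch below uses one possible technique for each operation (sorting for duplicate elimination, nested loops for the product); the function names and sample rows are hypothetical.

```python
def project_distinct(rows, cols):
    """Projection with duplicate elimination: sort, then drop adjacent repeats."""
    projected = sorted(tuple(r[c] for c in cols) for r in rows)  # O(n log n)
    out = []
    for t in projected:  # single O(n) scan over the sorted tuples
        if not out or out[-1] != t:
            out.append(t)
    return out


def cartesian_product(r, s):
    """Every pairing is produced, so the output alone has |r| * |s| tuples."""
    return [a + b for a in r for b in s]


rows = [("E1", "manager"), ("E2", "engineer"), ("E3", "manager")]
titles = project_distinct(rows, [1])
pairs = cartesian_product(rows, rows)
```

The sort dominates the cost of the projection, whereas the product cannot do better than quadratic because its result has n² tuples when both operands have n.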
7.4 Characterization of Query Processor
7.4.1 Languages
• For users:
◦ calculus- or algebra-based languages
• For the query processor:
◦ map the input into an internal form of algebra augmented with communication primitives
7.4.2 Types of Optimization
• Exhaustive search
◦ Workable for a small solution space
• Heuristics
◦ Perform σ and Π first, use semi-joins, etc.; for a large solution space
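A minimal sketch of exhaustive search over join orders, assuming a toy cost model in which the cost of a left-deep plan is the sum of its intermediate result sizes. The relation `J`, the cardinalities in `CARD`, and the selectivities in `SEL` are all hypothetical.

```python
from itertools import permutations

# Hypothetical cardinalities and pairwise join selectivities.
CARD = {"E": 400, "G": 1000, "J": 50}
SEL = {
    frozenset(("E", "G")): 0.001,
    frozenset(("G", "J")): 0.01,
    frozenset(("E", "J")): 1.0,  # no join predicate: Cartesian product
}


def plan_cost(order):
    """Sum of the estimated intermediate result sizes of a left-deep plan."""
    joined = [order[0]]
    size = float(CARD[order[0]])
    cost = 0.0
    for r in order[1:]:
        size *= CARD[r]
        for s in joined:
            size *= SEL[frozenset((s, r))]
        joined.append(r)
        cost += size
    return cost


# Exhaustive search: evaluate every permutation and keep the cheapest.
best = min(permutations(CARD), key=plan_cost)
```

With three relations there are only 3! = 6 orders to cost, but the count grows factorially with the number of relations, which is why heuristics take over for large solution spaces.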
7.4.3 Optimization Timing
• Static
◦ Done at compile time using statistics; appropriate for exhaustive search, since the query is optimized once but executed many times.
• Dynamic
◦ Done at execution time; accurate, but repeated for every execution and therefore expensive.
7.4.4 Statistics
• Facts about
◦ Cardinalities
◦ Attribute value distributions
◦ Sizes of relations, etc.
• Provided to the query optimizer and periodically updated.
7.4.5 Decision Site
• Query optimization may be done by
◦ A single site – centralized approach, or
◦ All the sites involved – distributed approach, or
◦ Hybrid – one site makes the major decisions in cooperation with other sites making local decisions
7.4.6 Exploitation of the Network Topology
• WAN
◦ communication cost is dominant
• LAN
◦ communication cost is comparable to I/O cost. Broadcasting capability, star networks, and satellite networks should be considered.
7.4.7 Exploitation of Replicated Fragments
Use replicated fragments to minimize communication costs.
7.4.8 Use of Semi-joins
Reduce the size of the operand relations to cut communication costs, when the semi-join overhead is not significant.
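A minimal sketch of a semi-join reduction, assuming E is stored at one site and G at another, with the join performed at G's site. The sample rows and the function `semijoin_plan` are hypothetical, and the network cost is counted simply as the number of values and tuples shipped.

```python
# Hypothetical relations: E(ENO, ENAME) at site A, G(ENO, RESP) at site B.
E = [
    ("E1", "Alice"), ("E2", "Bob"), ("E3", "Carol"),
    ("E4", "Dave"), ("E5", "Eve"), ("E6", "Frank"),
]
G = [("E1", "manager"), ("E3", "engineer")]


def semijoin_plan(E, G):
    """Join E and G at site B, shipping only the E tuples that can match."""
    g_keys = {eno for eno, _ in G}                # projection of G on ENO, sent B -> A
    e_reduced = [e for e in E if e[0] in g_keys]  # E semi-join G, computed at A
    shipped = len(g_keys) + len(e_reduced)        # overhead plus reduced operand
    result = [
        (eno, ename, resp)
        for eno, ename in e_reduced
        for g_eno, resp in G
        if g_eno == eno
    ]                                             # final join at B
    return result, shipped


result, shipped = semijoin_plan(E, G)
```

Here 4 items cross the network instead of the 6 tuples of E. The 2 join-attribute values sent first are the overhead; if few E tuples were eliminated, that overhead would outweigh the saving.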
7.5 Layers of Query Processing
Generic Layering Scheme for Distributed Query Processing
7.5.1 Query Decomposition
• Decompose the calculus query into an algebra query using global conceptual schema information.
Step 1 – calculus normalization
Step 2 – semantic analysis to reject incorrect queries
Step 3 – simplification to eliminate redundant components
Step 4 – translation of the calculus query into an optimized algebra query
7.5.2 Data Localization
The distributed query is mapped into a fragment query, which is then simplified to produce a good one.
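A minimal sketch of the two localization steps, assuming E is horizontally fragmented by ENO range. The fragment names, the ranges in `FRAGMENTS`, and the integer ENO values are hypothetical; the slides do not give a fragmentation schema.

```python
# Hypothetical horizontal fragmentation of E: fragment -> inclusive ENO range.
FRAGMENTS = {"E1": (1, 3), "E2": (4, 6), "E3": (7, 9)}


def localize(fragments):
    """Replace the global relation by the union of all its fragments."""
    return list(fragments)


def simplify(fragment_names, lo, hi):
    """Drop fragments whose ENO range cannot overlap the selection [lo, hi]."""
    return [
        f for f in fragment_names
        if FRAGMENTS[f][0] <= hi and lo <= FRAGMENTS[f][1]
    ]


fragment_query = localize(FRAGMENTS)          # E = E1 ∪ E2 ∪ E3
reduced_query = simplify(fragment_query, 5, 5)  # selection ENO = 5
```

For the selection ENO = 5, only fragment E2 can hold qualifying tuples, so the other two fragments (and any transfers they would need) drop out of the query.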
7.5.3 Global Query Optimization
• Find an execution strategy close to optimal.
• Find the best ordering of operations in the fragment query, including communication operations.
• A cost function, expressed in time units, is required.
7.5.4 Local Query Optimization
Centralized system algorithms (to be discussed in Chapter 9)
7.6 Conclusions
• Query processor – must be able to find a good execution plan for a calculus query, such that CPU time, I/O time, and communication time are minimized.
• Method: layering of
◦ decomposition
◦ localization
◦ global query optimization
◦ local query optimization