Distributed Database Systems
Distributed Query Processing

Katja Hose, Ralf Schenkel
Max-Planck-Institut für Informatik, Cluster of Excellence MMCI
November 10, 2011
November 17, 2011

Contents I
1 Motivation
2 Detour on centralized query processing
    Translating SQL into relational algebra
    Phases of centralized query processing
    Query parsing
    Query transformation
    Query optimization
3 Basics of distributed query processing
    Phases of distributed query processing
    Introduction
    Meta data management
    Data localization
4 Global query optimization
    Main questions
Katja Hose Distributed Database Systems November 10, 2011 1 / 167 Katja Hose Distributed Database Systems November 10, 2011 2 / 167
Contents II
    Global query optimizer
    Distributed cost model
    Join order optimization
    Total time models
    Response time models
5 Summary

Motivation

The task of query processing is to answer user queries.
Example
    How many students are at Saarland University?
    Answer: 18,000
Additional constraints
    Low response times
    High query throughput
    Efficient hardware usage
    ...

Motivation

Differences to centralized query processing
    Considering the physical data distribution during query optimization
    Considering communication costs
Assumptions
    Data is distributed among multiple nodes
    Existence of a global conceptual schema, which is used by all nodes
    Queries are formulated on the global schema

Detour on centralized query processing

Translating SQL into relational algebra

SQL query structure:
    select distinct a1, ..., an
    from R1, ..., Rk
    where F
Algorithm:
1 Translating the from clause
    Let $R_1, \dots, R_k$ be the relations in the from clause of the query
    Construct expression:
    $R = \begin{cases} R_1 & \text{if } k = 1 \\ ((\dots(R_1 \times R_2) \times \dots) \times R_k) & \text{otherwise} \end{cases}$

Translating SQL into relational algebra

Algorithm:
2 Translating the where clause
    Let $F$ be the predicate in the where clause of the query (if a where clause exists)
    Construct expression:
    $W = \begin{cases} R & \text{if there is no where clause} \\ \sigma_F(R) & \text{otherwise} \end{cases}$
3 Translating the select clause
    Let $a_1, \dots, a_n$ (or "*") be the projection in the select clause of the query
    Construct expression:
    $S = \begin{cases} W & \text{if the projection is "*"} \\ \pi_{a_1, \dots, a_n}(W) & \text{otherwise} \end{cases}$
Output: $S$
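
The three translation steps (from clause, where clause, select clause) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `translate`, the helper `wrap`, and the ASCII operator spellings `x`, `sigma[...]`, and `pi[...]` are hypothetical, and predicates are treated as opaque strings rather than parsed.

```python
def wrap(expr):
    """Add parentheses unless the expression is already fully parenthesized."""
    return expr if expr.startswith("(") else f"({expr})"


def translate(select_attrs, from_relations, where_pred=None):
    """Build a relational algebra expression (as a string) for a
    select-from-where query, following the three steps of the algorithm."""
    if not from_relations:
        raise ValueError("the from clause needs at least one relation")
    # Step 1: from clause -> R1 if k = 1, else a left-deep chain of products.
    r = from_relations[0]
    for rel in from_relations[1:]:
        r = f"({r} x {rel})"
    # Step 2: where clause -> W = R without a predicate, sigma_F(R) otherwise.
    w = r if where_pred is None else f"sigma[{where_pred}]{wrap(r)}"
    # Step 3: select clause -> S = W for "*", pi_{a1..an}(W) otherwise.
    if select_attrs == "*":
        return w
    return f"pi[{', '.join(select_attrs)}]{wrap(w)}"


expr = translate(
    ["e.EName", "s.Salary"],
    ["Employees", "Salary"],
    "e.Title = s.Title and s.Salary >= 60000",
)
print(expr)
```

The result is the naive plan only: a projection over a selection over a Cartesian product, with no optimization applied.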

Translating SQL into relational algebra

Example query
    select distinct e.EName, s.Salary
    from Employees e, Salary s
    where e.Title = s.Title and s.Salary ≥ 60,000
Step 1 (from clause, $k = 2$):
    $R = \text{Employees} \times \text{Salary}$
Step 2 (where clause present):
    $W = \sigma_{e.\text{Title} = s.\text{Title} \wedge s.\text{Salary} \geq 60{,}000}(R)$
Step 3 (projection is not "*"):
    $S = \pi_{e.\text{EName}, s.\text{Salary}}(W)$

Phases of centralized query processing

Workflow for centralized query processing
    [Figure: workflow for centralized query processing]

Query parsing

Transform a declarative query into an internal representation
    Query formulated using a declarative query language, e.g., SQL
    The parser translates the query into an internal representation
        Called naive query plan
        Plan described by an operator tree of relational algebra operators

Example
    Database managing information about employees and projects
        Employees(EID, EName, Title)
        Assignment(ENo, PNo, Duration)
    Query: return the names of all employees working for project 'P1'
        SELECT EName
        FROM Employees e, Assignment a
        WHERE e.EID = ENo AND PNo = 'P1'

Example
Query
    SELECT EName
    FROM Employees e, Assignment a
    WHERE e.EID = ENo AND PNo = 'P1'
Translation into relational algebra
    $\pi_{\text{EName}}(\sigma_{\text{PNo} = \text{'P1'} \wedge \text{Employees.EID} = \text{Assignment.ENo}}(\text{Employees} \times \text{Assignment}))$
In contrast to the SQL statement, the algebra statement already contains the required basic evaluation operators.

Operator tree
    [Figure: operator tree of the expression above, with $\pi_{\text{EName}}$ at the root, the selection below it, and the Cartesian product of Employees and Assignment at the bottom]

Query transformation

Workflow for centralized query processing
    [Figure: workflow for centralized query processing]

Query transformation
Steps
1 Name resolution
    Transforming object names into internal names
2 Semantic analysis
    Checking for global relations and attributes, view expansion, global access control
3 Normalization
    Transforming predicates into a canonical format
4 Simple algebraic rewriting
    Application of heuristics to eliminate bad plans

Semantic analysis
    Check if the global schema defines all attributes and relations referenced in the query
    If the query is formulated on a view, replace references to relations/attributes with references to global relations/attributes
    Perform simple integrity checks, e.g., are the types of attributes used in comparison predicates of the same type?
    Initial check if the query has the rights to access referenced relations/attributes

Normalization
Objective
    Simplification of the following optimization by transforming the query into a canonical format
Selection and join predicates
    Conjunctive normal form vs. disjunctive normal form
    Conjunctive normal form:
        $(p_{11} \vee p_{12} \vee \dots \vee p_{1n}) \wedge \dots \wedge (p_{m1} \vee p_{m2} \vee \dots \vee p_{mn})$
    Disjunctive normal form:
        $(p_{11} \wedge p_{12} \wedge \dots \wedge p_{1n}) \vee \dots \vee (p_{m1} \wedge p_{m2} \wedge \dots \wedge p_{mn})$
    Transformation based on equivalence rules for logical operators

Normalization
Equivalence rules
    $p_1 \wedge p_2 \Leftrightarrow p_2 \wedge p_1$ and $p_1 \vee p_2 \Leftrightarrow p_2 \vee p_1$
    $p_1 \wedge (p_2 \wedge p_3) \Leftrightarrow (p_1 \wedge p_2) \wedge p_3$ and $p_1 \vee (p_2 \vee p_3) \Leftrightarrow (p_1 \vee p_2) \vee p_3$
    $p_1 \wedge (p_2 \vee p_3) \Leftrightarrow (p_1 \wedge p_2) \vee (p_1 \wedge p_3)$ and $p_1 \vee (p_2 \wedge p_3) \Leftrightarrow (p_1 \vee p_2) \wedge (p_1 \vee p_3)$
    $\neg(p_1 \wedge p_2) \Leftrightarrow \neg p_1 \vee \neg p_2$ and $\neg(p_1 \vee p_2) \Leftrightarrow \neg p_1 \wedge \neg p_2$
    $\neg(\neg p_1) \Leftrightarrow p_1$

Normalization
Example
    SELECT EName
    FROM Employees e, Assignment a
    WHERE e.EID = a.ENo AND Duration ≥ 3 AND (PNo = 'P1' OR PNo = 'P2')
Selection condition in disjunctive normal form
    (EID = ENo ∧ Duration ≥ 3 ∧ PNo = 'P1') ∨
    (EID = ENo ∧ Duration ≥ 3 ∧ PNo = 'P2')
Selection condition in conjunctive normal form
    EID = ENo ∧ Duration ≥ 3 ∧ (PNo = 'P1' ∨ PNo = 'P2')
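
The step from conjunctive to disjunctive normal form is a repeated application of the distributive law. A minimal sketch, assuming the predicates are opaque strings without negation and that a CNF is given as a list of clauses (the function name `cnf_to_dnf` and this list encoding are illustrative choices, not part of the lecture):

```python
from itertools import product


def cnf_to_dnf(cnf):
    """Convert a CNF, given as a list of clauses (each clause a list of
    atoms read as a disjunction), into a DNF, given as a list of
    conjunctions (each a list of atoms read as a conjunction)."""
    conjunctions = []
    # Distributive law: choose one atom from every clause in all possible ways.
    for choice in product(*cnf):
        # Idempotence (p AND p = p): drop repeated atoms, keep their order.
        conj = list(dict.fromkeys(choice))
        if conj not in conjunctions:
            conjunctions.append(conj)
    return conjunctions


cnf = [["EID = ENo"], ["Duration >= 3"], ["PNo = 'P1'", "PNo = 'P2'"]]
dnf = cnf_to_dnf(cnf)
for conj in dnf:
    print(" AND ".join(conj))
```

The number of conjunctions is the product of the clause sizes, so the DNF can grow exponentially with the number of disjunctive clauses.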

Simple algebraic rewriting
Simple optimizations that are always beneficial regardless of system state
    Elimination of redundant predicates
    Simplification of expressions
    Unnesting of subqueries and views
Tasks
    Recognize and simplify all expressions/operations/subqueries that are "obviously" unnecessary, redundant, or contradictory.
    Do not consider system state information, e.g., size of tables, existence of indexes, etc.

Query optimization

Workflow for centralized query processing
    [Figure: workflow for centralized query processing]

Query optimization
Steps
1 Algebraic optimization
    Find a good relational algebra operator tree
        Heuristic query optimization
        Cost-based query optimization
        Statistical query optimization
2 Physical optimization
    Find suitable algorithms for implementing the operations

Heuristics
Use simple heuristics which usually lead to better performance
    The optimal plan is not required, but the really bad ones should be avoided
Heuristics
    Break selections
        Complex selection criteria should be broken into multiple parts
    Push projection and push selection
        Cheap selections and projections should be performed as early as possible to reduce the sizes of intermediate results
    Force joins
        In most cases, using a join is much cheaper than using a Cartesian product and a selection

Algebraic optimization rules

Operator $\bowtie$ is commutative:
    $r_1 \bowtie r_2 \Leftrightarrow r_2 \bowtie r_1$
Operator $\bowtie$ is associative:
    $(r_1 \bowtie r_2) \bowtie r_3 \Leftrightarrow r_1 \bowtie (r_2 \bowtie r_3)$
For operator $\pi$ in combination with another operator $\pi$, the "outer" parameter dominates the "inner" one:
    $\pi_X(\pi_Y(r_1)) \Leftrightarrow \pi_X(r_1)$ if $X \subseteq Y$
Cascaded selections $\sigma$ can be combined using logical and ($\wedge$). The order of the selections is arbitrary:
    $\sigma_{F_1}(\sigma_{F_2}(r_1)) \Leftrightarrow \sigma_{F_1 \wedge F_2}(r_1) \Leftrightarrow \sigma_{F_2}(\sigma_{F_1}(r_1))$
    Exploiting commutativity of $\wedge$

Algebraic optimization rules

Operators $\pi$ and $\sigma$ commute if predicate $F$ is defined based on the projection attributes:
    $\sigma_F(\pi_X(r_1)) \Leftrightarrow \pi_X(\sigma_F(r_1))$ if $attr(F) \subseteq X$
Alternatively, change in ordering possible if the projection is extended by all necessary attributes:
    $\pi_{X_1}(\sigma_F(r_1)) \Leftrightarrow \pi_{X_1}(\sigma_F(\pi_{X_1, X_2}(r_1)))$ if $attr(F) \supseteq X_2$
    Here $X_2$ holds the attributes of $F$ that are missing from $X_1$, so that $attr(F) \subseteq X_1 \cup X_2$

Operators $\sigma$ and $\bowtie$ commute if all selection attributes are contained in the same relation:
    $\sigma_F(r_1 \bowtie r_2) \Leftrightarrow \sigma_F(r_1) \bowtie r_2$ if $attr(F) \subseteq R_1$
A selection predicate can be split up in conjunction with a join ($F = F_1 \wedge F_2$) if the attributes referred to by $F_1$ and $F_2$ are contained in different relations:
    $\sigma_F(r_1 \bowtie r_2) \Leftrightarrow \sigma_{F_1}(r_1) \bowtie \sigma_{F_2}(r_2)$ if $attr(F_1) \subseteq R_1$ and $attr(F_2) \subseteq R_2$
In any case, part of a selection can be split up by separating predicates: $F_1$ references attributes of $R_1$ only, $F_2$ contains the remaining predicates referencing attributes of both relations:
    $\sigma_F(r_1 \bowtie r_2) \Leftrightarrow \sigma_{F_2}(\sigma_{F_1}(r_1) \bowtie r_2)$ if $attr(F_1) \subseteq R_1$
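
The rule that a selection commutes with a join when its predicate only references attributes of one relation can be checked on a toy instance. The relations, attribute names, and helper functions below are hypothetical illustration data; relations are modeled as lists of dictionaries and the join is a natural join on the shared attribute `ENo`.

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of dicts."""
    out = []
    for t1 in r1:
        for t2 in r2:
            common = set(t1) & set(t2)
            if all(t1[a] == t2[a] for a in common):
                out.append({**t1, **t2})
    return out


def select(pred, r):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in r if pred(t)]


def as_set(r):
    """Order-insensitive view of a relation, used for comparisons."""
    return {tuple(sorted(t.items())) for t in r}


employees = [
    {"ENo": 1, "EName": "Ann"},
    {"ENo": 2, "EName": "Bob"},
]
assignment = [
    {"ENo": 1, "PNo": "P1", "Duration": 5},
    {"ENo": 2, "PNo": "P2", "Duration": 2},
    {"ENo": 2, "PNo": "P1", "Duration": 4},
]


def f1(t):
    # Predicate that references attributes of the assignment relation only.
    return t["PNo"] == "P1"


late = select(f1, natural_join(assignment, employees))   # sigma after the join
early = natural_join(select(f1, assignment), employees)  # sigma pushed down
print(as_set(late) == as_set(early))
```

Both variants return the same tuples, but the pushed-down variant joins two assignment tuples instead of three, which is the size reduction the push-selection heuristic aims for.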

Algebraic optimization rules

Commutativity of $\sigma$ and $\cup$:
    $\sigma_F(r_1 \cup r_2) \Leftrightarrow \sigma_F(r_1) \cup \sigma_F(r_2)$
Commutativity of $\sigma$ and $-$:
    $\sigma_F(r_1 - r_2) \Leftrightarrow \sigma_F(r_1) - \sigma_F(r_2)$
or in case $F$ only references tuples in $r_1$:
    $\sigma_F(r_1 - r_2) \Leftrightarrow \sigma_F(r_1) - r_2$

Commutativity of $\pi$ and $\bowtie$:
    $\pi_X(r_1 \bowtie r_2) \Leftrightarrow \pi_X(\pi_{Y_1}(r_1) \bowtie \pi_{Y_2}(r_2))$
with
    $Y_1 = (X \cap R_1) \cup (R_1 \cap R_2)$
and
    $Y_2 = (X \cap R_2) \cup (R_1 \cap R_2)$
Pushing a projection is possible if all $Y_i$ are defined in such a way that they preserve all attributes necessary to perform the join.

Algebraic optimization rules
Further rules
    Commutativity of $\pi$ and $\cup$:
        $\pi_X(r_1 \cup r_2) \Leftrightarrow \pi_X(r_1) \cup \pi_X(r_2)$
    Distributive law for $\bowtie$ and $\cup$, distributive law for $\bowtie$ and $-$
    Commutativity of renaming $\beta$ with other operators, ...
    Idempotence, e.g., $A \vee A \Leftrightarrow A$
    Operations involving empty relations
    Commutative and associative laws for $\bowtie$, $\cup$ and $\cap$

Heuristic algebraic optimization – Example
Use algebraic optimization heuristics
    Force join
    Push selection and projection

Cost-based algebraic query optimization
Most non-distributed RDBMS strongly rely on cost-based optimizations
    Aim for better optimized plan with respect to system and data characteristics
    Join order optimization
Basic approach
    Establish a cost model for various operations
    Enumerate all query plans and compute costs
    Pick the best query plan
Usually, dynamic programming techniques are used to keep computational effort manageable

Physical query optimization
Physical optimization
    Input:
        Optimized query plan consisting of algebra operators
    Choose an algorithm to compute a particular algebra operator
        Join: Block-Nested-Loop join, hash join, merge join, ...
        Select: Full table scan, index lookup, ad-hoc index generation & lookup, ...
Tasks
    Translating a query plan into an execution plan
    Physical and algebraic optimization are often interleaved
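
The basic approach of cost-based optimization (cost model, plan enumeration, choice of the cheapest plan) can be sketched for join ordering as follows. All statistics are hypothetical values chosen for illustration, the cost model (sum of intermediate result sizes of a left-deep plan) is an assumption, and the enumeration is exhaustive rather than the dynamic programming mentioned above.

```python
from itertools import permutations

# Hypothetical statistics, not taken from the lecture: relation
# cardinalities and join selectivities between pairs of relations.
card = {"Employees": 400, "Assignment": 1000, "Projects": 50}
sel = {
    frozenset(("Employees", "Assignment")): 1 / 400,
    frozenset(("Assignment", "Projects")): 1 / 50,
}


def plan_cost(order):
    """Assumed cost model: sum of the intermediate result sizes of a
    left-deep plan that joins the relations in the given order."""
    joined = [order[0]]
    size = card[order[0]]
    cost = 0.0
    for rel in order[1:]:
        size *= card[rel]
        for other in joined:
            # A pair without a join predicate behaves like a Cartesian
            # product, i.e., it contributes selectivity 1.
            size *= sel.get(frozenset((rel, other)), 1.0)
        joined.append(rel)
        cost += size
    return cost


def best_plan(relations):
    """Enumerate all left-deep join orders and pick the cheapest one."""
    return min(permutations(relations), key=plan_cost)


best = best_plan(card)
print(best, plan_cost(best))
```

With n relations there are n! left-deep orders, which is why real optimizers avoid recomputing shared sub-plans and prune early. Under these statistics, every order that starts with the Cartesian product of Employees and Projects is far more expensive than the orders that join through Assignment.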

Query optimization example
    Output: query execution plan

Basics of distributed query processing

Phases of distributed query processing

Workflow for distributed query processing
    [Figure: workflow for distributed query processing]

Introduction

Basic considerations
Distributed query processing
    Shares the properties of centralized query processing
    Similar problem but with different objectives and constraints
Objectives for centralized query processing
    Minimize the number of disk accesses
    Minimize computational time
Objectives for distributed query processing
    Minimize resource consumption
    Minimize response time
    Maximize throughput

Basic considerations
Costs are more difficult to predict
    Join selectivity: is it worthwhile to push down a selection?
    Data is distributed: difficult to get meaningful statistics
    Network latency is very hard to predict
    Current workload at nodes, load shedding
Additional cost factors and constraints
    Extension of relational algebra (sending/receiving data)
    Data localization (which node holds relevant data)
    Replication and caching (where to compute an operation)
    Network models
    Response-time models
    Data and structural heterogeneity (federated databases ...)

Consequences
Optimization is much more difficult than in the central case
    Statistics and costs change over time, e.g., workload at a node, network load
    More conflicting optimization goals
        Increase throughput → reduce replication and parallelization
        Reduce query response time → increase parallelization
    More cost factors and constraints
Consequences
    Adaptive query plans (create an initial plan and optimize it on-the-fly)
    Do not aim for the best plan, but for a good plan

Example
Query
    Return the names of all employees working for project 'P1'
    $\pi_{\text{EName}}(\pi_{\text{EID}, \text{EName}}(\text{Employees}) \bowtie_{\text{Employees.EID} = \text{Assignment.ENo}} \pi_{\text{ENo}}(\sigma_{\text{PNo} = \text{'P1'}}(\text{Assignment})))$
Problems
    Relations are fragmented and distributed among five nodes
    The Employees relation uses primary horizontal fragmentation
        One fragment located at node 1, the other at node 2, no replication
    The Assignment relation uses derived horizontal fragmentation
        One fragment located at node 3, the other at node 4, no replication
    The query originates from node 5

Example
Cost model and statistics
    Accessing a tuple costs 1 unit (acc)
    Transferring a tuple costs 10 units (trans)
    There are 400 employees and 1000 assignments
    20 assignments for project 'P1'
    All tuples are uniformly distributed, i.e., nodes 3 and 4 provide 10 assignments for project 'P1' each
    There are local indexes on attribute PNo at nodes 3 and 4 (as well as indexes on primary keys at all nodes)
    Direct tuple access is possible on local sites, no scanning
    All nodes can directly communicate with each other
    Simplification: no costs for unions and projections

Example
Simple execution plan - Version A
    Transfer all data to Node 5
Simple execution plan - Version B
    Ship intermediate results

Example
    Costs plan A: 23,000 units
    Costs plan B: 440 units
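
The slides state only the two totals. The breakdown below is an assumption that reproduces them from the cost model: for plan A, both relations are shipped to node 5, where the selection scans all assignments and a nested-loop join compares the 20 matches with all 400 employees; for plan B, the indexed selection runs at nodes 3 and 4, the matches are shipped to nodes 1 and 2, joined there via primary-key lookups, and the results are shipped to node 5. All constant names are illustrative.

```python
ACC, TRANS = 1, 10                  # cost units per tuple access / transfer
EMPLOYEES, ASSIGNMENTS = 400, 1000  # relation sizes from the example
P1_ASSIGNMENTS = 20                 # assignments for project 'P1'

# Plan A (assumed breakdown): transfer all data to node 5, evaluate there.
plan_a = (
    EMPLOYEES * TRANS                     # ship both Employees fragments
    + ASSIGNMENTS * TRANS                 # ship both Assignment fragments
    + ASSIGNMENTS * ACC                   # scan for PNo = 'P1' at node 5
    + P1_ASSIGNMENTS * EMPLOYEES * ACC    # nested-loop join at node 5
)

# Plan B (assumed breakdown): ship intermediate results only.
plan_b = (
    P1_ASSIGNMENTS * ACC      # indexed selection at nodes 3 and 4
    + P1_ASSIGNMENTS * TRANS  # ship the matches to nodes 1 and 2
    + P1_ASSIGNMENTS * ACC    # primary-key lookups of the matching employees
    + P1_ASSIGNMENTS * TRANS  # ship the join results to node 5
)

print(plan_a, plan_b)
```

Under this breakdown, communication dominates both plans: 14,000 of the units of plan A and 400 of the units of plan B are transfer costs.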

Important aspects of distributed query processing
    Meta data management
    Data localization
    Global query optimization
    Post-processing

Meta data management

Workflow for distributed query processing
    [Figure: workflow for distributed query processing]

Meta data management
Prerequisites to perform query optimization
    Meta data must be available
    Meta data is stored in the catalog
    Catalog provides information about the data distribution
Use this information to decide, for instance, if it is worthwhile to execute a selection very early.

Meta data management
Typical contents of a catalog for distributed database management systems
    Database schema
        Definitions of tables, views, constraints, keys, ...
    Partitioning schema
        Information about how the schema is partitioned and how tables can be reconstructed
    Allocation schema
        Information about which fragment can be found at which node (including information about replication)
    Network information
        Information about node connections, network model
    Additional physical information
        Information about indexes, data statistics (histograms, etc.), hardware resources (processing & storage), ...

Meta data management
Where to store the catalog in a distributed system?
    Central node
        Simple solution, bottleneck
    Replicated at all nodes
        Updates are expensive
    Fragmented
        In rare cases, the catalog may become very large
        Catalog has to be fragmented and allocated
    Caching
        Replicate only needed parts of a central catalog, anticipate potential inconsistencies

Meta data management
Centralized catalog
    One instance of the global catalog at a central node
    Advantages
        No need to update copies
        Little memory consumption
    Disadvantages
        Communication with central node for each query
        Central node potentially represents a bottleneck

Meta data management
Replicated catalog
    Full copy of the global catalog at each node
    Advantages
        Little communication overhead for queries
        Good availability
    Disadvantages
        High update costs

Meta data management
Fragmented catalog
    Partitioning the global catalog and assigning partitions to nodes
    Advantages
        Sharing load among nodes
        Reducing update overhead
    Disadvantages
        Localizing necessary partitions of the global catalog

Meta data management
Caching catalog data
    Caching non-local catalog data
    Advantages
        Avoiding remote access to frequently needed catalog data
        Reducing communication overhead
    Disadvantages
        Coherency control
        Invalidating cached copies in the presence of updates

Meta data management
Caching catalog data
    Explicit invalidation
        Owner of catalog data remembers nodes with local copies
        In case of updates: sending an invalidation message to nodes with local copies
    Implicit invalidation
        Identifying old catalog data during runtime (adding version numbers and time stamps to query messages)

Data localization

Workflow for distributed query processing
    [Figure: workflow for distributed query processing]
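
Implicit invalidation of cached catalog data can be illustrated with version numbers alone. The sketch below is a simplified, single-process model under stated assumptions: the classes `CatalogOwner` and `CachingNode`, their methods, and the retry-once policy are hypothetical, and time stamps as well as real network messages are left out.

```python
class CatalogOwner:
    """Node that owns a catalog entry and assigns its version numbers."""

    def __init__(self, entry):
        self.entry = entry
        self.version = 1

    def update(self, entry):
        # No invalidation messages are sent; only the version is bumped.
        self.entry = entry
        self.version += 1

    def execute(self, query, used_version):
        # Implicit invalidation: an outdated version is detected at runtime
        # from the version number carried in the query message.
        if used_version != self.version:
            return ("stale", self.entry, self.version)
        return ("ok", f"executed {query}", self.version)


class CachingNode:
    """Node that plans queries using a cached copy of the catalog entry."""

    def __init__(self, owner):
        self.owner = owner
        self.cached = owner.entry
        self.version = owner.version

    def run(self, query):
        status, payload, version = self.owner.execute(query, self.version)
        if status == "stale":
            # Refresh the cached copy, then retry once with the new version.
            self.cached, self.version = payload, version
            status, payload, version = self.owner.execute(query, self.version)
        return payload


owner = CatalogOwner({"Projects1": "node 1"})
node = CachingNode(owner)
owner.update({"Projects1": "node 2"})   # the cached copy is now outdated
result = node.run("q1")
print(result, node.version, node.cached)
```

Compared with explicit invalidation, the owner does not need to remember which nodes hold copies; the price is that a query planned with outdated catalog data is only rejected once it reaches the owner.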

Data localization
Objective
    Creating subqueries in consideration of the data distribution
Assumptions
    Fragmentation is defined by fragmentation expressions
    Each fragment is allocated only at one node (no replication)
    Fragmentation expressions and locations of the fragments are stored in the catalog
Main tasks
    Replace access to global relations with accesses to the fragments
    Insert reconstruction expression into algebra query
    Basic algebraic simplifications of the query

Example – horizontal reduction
Schema
    $\text{Projects}_1 = \sigma_{\text{Budget} \leq 150{,}000}(\text{Projects})$
    $\text{Projects}_2 = \sigma_{150{,}000 < \text{Budget} \leq 200{,}000}(\text{Projects})$
    $\text{Projects}_3 = \sigma_{\text{Budget} > 200{,}000}(\text{Projects})$
Reconstruction expression (horizontal fragmentation)
    $\text{Projects} = \text{Projects}_1 \cup \text{Projects}_2 \cup \text{Projects}_3$
Example query
    $\sigma_{\text{Location} = \text{'Saarbr.'} \wedge \text{Budget} \leq 100{,}000}(\text{Projects})$
After replacing references to global relations
    $\sigma_{\text{Location} = \text{'Saarbr.'} \wedge \text{Budget} \leq 100{,}000}(\text{Projects}_1 \cup \text{Projects}_2 \cup \text{Projects}_3)$
Further optimization is possible!

Query simplification – horizontal reduction
Objective
    Eliminate non-necessary subqueries
Horizontal reduction rule
    Given fragments of $R$ as $F_R = \{R_1, \dots, R_n\}$ with $R_i = \sigma_{p_i}(R)$
    All fragments $R_i$ for which $\sigma_{p_s}(R_i) = \emptyset$ can be removed, with $p_s$ denoting the query's selection predicate
    $\sigma_{p_s}(R_i) = \emptyset \Leftarrow \forall x \in R: \neg(p_s(x) \wedge p_i(x))$
    The selection with the query predicate $p_s$ on fragment $R_i$ is empty if $p_s$ contradicts the fragmentation predicate $p_i$ of $R_i$, i.e., $p_s$ and $p_i$ are never true at the same time for any tuple

Example – horizontal reduction
Query with fragmentation expression
    $\sigma_{\text{Location} = \text{'Saarbr.'} \wedge \text{Budget} \leq 100{,}000}(\text{Projects}_1 \cup \text{Projects}_2 \cup \text{Projects}_3)$
Fragment definitions
    $\text{Projects}_1 = \sigma_{\text{Budget} \leq 150{,}000}(\text{Projects})$
    $\text{Projects}_2 = \sigma_{150{,}000 < \text{Budget} \leq 200{,}000}(\text{Projects})$
    $\text{Projects}_3 = \sigma_{\text{Budget} > 200{,}000}(\text{Projects})$
Because of
    $\sigma_{\text{Budget} \leq 100{,}000}(\text{Projects}_2) = \emptyset$, $\sigma_{\text{Budget} \leq 100{,}000}(\text{Projects}_3) = \emptyset$
We obtain the reduced query
    $\sigma_{\text{Location} = \text{'Saarbr.'}}(\sigma_{\text{Budget} \leq 100{,}000}(\text{Projects}_1))$
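
For range predicates on a single attribute, the contradiction test of the horizontal reduction rule becomes an interval-overlap test. The sketch below assumes that every predicate on Budget can be written as a half-open interval (low, high]; the dictionary `fragments` and both function names are illustrative, and the Location predicate is ignored because the fragmentation does not depend on it.

```python
import math

# Fragmentation predicates on Budget as half-open intervals (low, high].
fragments = {
    "Projects1": (-math.inf, 150_000),
    "Projects2": (150_000, 200_000),
    "Projects3": (200_000, math.inf),
}


def may_overlap(a, b):
    """True if the intervals (a_low, a_high] and (b_low, b_high] share a value."""
    return max(a[0], b[0]) < min(a[1], b[1])


def horizontal_reduction(frags, query_interval):
    """Keep only the fragments whose fragmentation predicate does not
    contradict the query's selection predicate."""
    return [name for name, iv in frags.items() if may_overlap(iv, query_interval)]


# Query predicate Budget <= 100,000 corresponds to the interval (-inf, 100000].
relevant = horizontal_reduction(fragments, (-math.inf, 100_000))
print(relevant)  # ['Projects1']
```

A query on budgets between 180,000 and 250,000 would instead keep the second and third fragment, so the same test also covers queries that span several fragments.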

Query simplification – join reduction
Join Reductions
    Larger joins are replaced by multiple partial joins on fragments
        Distributive law: $(R_1 \cup R_2) \bowtie S = (R_1 \bowtie S) \cup (R_2 \bowtie S)$
    Eliminate all union fragments that will return an empty result
Expectations
    Elimination of partial joins producing empty results
        Depends on fragmentation optimality
    Many joins on small relations have lower resource costs than one large join
        Depends on fragmentation and applied join algorithms
    Smaller joins can be executed in parallel
        Might decrease response time but might also increase communication costs

Example – join reduction
Schema
    Projects(PNo, PName, Budget, Location)
        $\text{Projects}_1 = \sigma_{\text{PNo} = \text{'P1'} \vee \text{PNo} = \text{'P2'}}(\text{Projects})$
        $\text{Projects}_2 = \sigma_{\text{PNo} = \text{'P3'}}(\text{Projects})$
        $\text{Projects}_3 = \sigma_{\text{PNo} = \text{'P4'}}(\text{Projects})$
    Assignment(ENo, PNo, Duration)
        $\text{Assignment}_1 = \sigma_{\text{PNo} = \text{'P1'} \vee \text{PNo} = \text{'P2'}}(\text{Assignment})$
        $\text{Assignment}_2 = \sigma_{\text{PNo} = \text{'P3'} \vee \text{PNo} = \text{'P4'}}(\text{Assignment})$
Example query
    select * from Projects p, Assignment a where p.PNo = a.PNo
In relational algebra
    $\text{Projects} \bowtie \text{Assignment}$

Example – join reduction
Query
    $\text{Projects} \bowtie \text{Assignment}$
After replacing global relations with reconstruction expressions
    $(\text{Projects}_1 \cup \text{Projects}_2 \cup \text{Projects}_3) \bowtie (\text{Assignment}_1 \cup \text{Assignment}_2)$
After applying the distributive law
    $(\text{Projects}_1 \bowtie \text{Assignment}_1) \cup (\text{Projects}_1 \bowtie \text{Assignment}_2) \cup$
    $(\text{Projects}_2 \bowtie \text{Assignment}_1) \cup (\text{Projects}_2 \bowtie \text{Assignment}_2) \cup$
    $(\text{Projects}_3 \bowtie \text{Assignment}_1) \cup (\text{Projects}_3 \bowtie \text{Assignment}_2)$
Further optimization is possible!

Query simplification – join reduction
Join reduction rule
    Given fragments of $R$ as $F_R = \{R_1, \dots, R_n\}$ and fragments of $S$ as $F_S = \{S_1, \dots, S_m\}$
    Apply distributive law, e.g.:
        $(R_1 \cup R_2) \bowtie (S_1 \cup S_2) = (R_1 \bowtie S_1) \cup (R_1 \bowtie S_2) \cup (R_2 \bowtie S_1) \cup (R_2 \bowtie S_2)$
    All partial joins between fragments $R_i$ and $S_j$ for which $R_i \bowtie S_j = \emptyset$ can be removed
        $R_i \bowtie S_j = \emptyset \Leftarrow \forall x \in R_i, y \in S_j: \neg(p_i(x) \wedge p_j(y))$
    The join between fragments $R_i$ and $S_j$ is empty if their respective fragmentation predicates (on the join attribute) contradict, i.e., there is no tuple combination $x$ and $y$ such that both partitioning predicates are fulfilled at the same time.
Distributed Database Systems Distributed Database Systems
Basics of distributed query processing Basics of distributed query processing
Data localization Data localization
Example – join reduction
Query with fragmentation expression
(Projects1 ⋈ Assignment1) ∪ (Projects1 ⋈ Assignment2) ∪
(Projects2 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2) ∪
(Projects3 ⋈ Assignment1) ∪ (Projects3 ⋈ Assignment2)
Some of these partial joins are empty, e.g.:
Projects1 ⋈ Assignment2 = ∅
Because their fragmentation expressions contradict:
Projects1 = σ_{PNo='P1' ∨ PNo='P2'}(Projects) and
Assignment2 = σ_{PNo='P3' ∨ PNo='P4'}(Assignment)
Reduced query
(Projects1 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2) ∪
(Projects3 ⋈ Assignment2)

Query simplification – join reduction for horizontal fragmentation
The easiest join reduction case follows from derived horizontal fragmentation
  For each fragment of the first relation, there is exactly one matching fragment of the second relation
  Simply use the information contained in the reconstruction expression instead of comparing the reconstruction predicates to each other
Join reduction for arbitrary horizontal partitioning might not be beneficial
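The join reduction example can be sketched in a few lines of Python. This is only an illustration, not the lecture's algorithm: each fragmentation predicate on the join attribute PNo is modelled as the set of PNo values a fragment may contain, and a partial join is kept only if the two sets overlap. The predicates of Projects1 and Assignment2 are taken from the slide; those of Projects2, Projects3 and Assignment1 are assumptions chosen to be consistent with the reduced query.

```python
from itertools import product

# Fragmentation predicates on PNo, modelled as sets of admissible values.
projects = {
    "Projects1": {"P1", "P2"},    # from the slide
    "Projects2": {"P3"},          # assumed
    "Projects3": {"P4"},          # assumed
}
assignment = {
    "Assignment1": {"P1", "P2"},  # assumed
    "Assignment2": {"P3", "P4"},  # from the slide
}


def surviving_joins(left, right):
    """Return the fragment pairs whose predicates do not contradict."""
    return [
        (l_name, r_name)
        for (l_name, l_pred), (r_name, r_pred) in product(left.items(), right.items())
        if l_pred & r_pred  # empty intersection: the partial join is empty
    ]


reduced = surviving_joins(projects, assignment)
for l_name, r_name in reduced:
    print(f"{l_name} ⋈ {r_name}")
```

Under these assumptions, three of the six partial joins remain, matching the reduced query on the slide.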
Query simplification – join reduction for derived horizontal fragmentation
Example
Projects(PNo, PName, Budget, Location)
Projects1 = σ_{PNo='P1' ∨ PNo='P2'}(Projects)
Projects2 = σ_{PNo='P3' ∨ PNo='P4'}(Projects)
Assignment(ENo, PNo, Duration)
Assignment1 = Assignment ⋉ Projects1
Assignment2 = Assignment ⋉ Projects2
Query in relational algebra
Projects ⋈ Assignment

After replacing global relations with reconstruction expressions
(Projects1 ∪ Projects2) ⋈ (Assignment1 ∪ Assignment2)
After applying the distributive law
(Projects1 ⋈ Assignment1) ∪ (Projects1 ⋈ Assignment2) ∪
(Projects2 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2)
Reduced query (using information about fragmentation of relation Assignment directly)
(Projects1 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2)
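The following Python sketch checks the reduction for derived horizontal fragmentation on a tiny data set. The tuples are hypothetical and the relations are trimmed to the attributes needed here; only the fragmentation scheme (selection on PNo for Projects, semijoin with the Projects fragments for Assignment) follows the slides.

```python
# Hypothetical tuples: Projects as (PNo, PName), Assignment as (ENo, PNo).
projects = [("P1", "DB"), ("P2", "IR"), ("P3", "ML"), ("P4", "OS")]
assignment = [("E1", "P1"), ("E2", "P3"), ("E3", "P4"), ("E4", "P2")]

# Primary horizontal fragmentation of Projects on PNo.
projects1 = [p for p in projects if p[0] in {"P1", "P2"}]
projects2 = [p for p in projects if p[0] in {"P3", "P4"}]


def semijoin(assign, proj):
    """Assignment tuples that have a join partner in the given Projects fragment."""
    keys = {p[0] for p in proj}
    return [a for a in assign if a[1] in keys]


def join(proj, assign):
    """Equi-join on PNo."""
    return [(p, a) for p in proj for a in assign if p[0] == a[1]]


# Derived horizontal fragmentation of Assignment.
assignment1 = semijoin(assignment, projects1)
assignment2 = semijoin(assignment, projects2)

full = join(projects, assignment)
reduced = join(projects1, assignment1) + join(projects2, assignment2)

print(join(projects1, assignment2))     # cross-fragment join
print(sorted(full) == sorted(reduced))  # reduced query vs. full join
```

The cross-fragment joins are empty by construction, which is why the reduced query can be read off the fragment definitions without comparing predicates.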
Query simplification – vertical reduction
Vertical fragmentation rule
Given fragments of R as FR = {R1, ..., Rn} with Ri = π_{βi}(R) with βi representing the enumeration of a subset of R's attributes
Avoid joining fragments containing "useless" attributes, i.e., fragments containing only attributes that are not referenced in the query and not output in the result

Example – vertical reduction
Schema
Projects(PNo, PName, Budget, Location)
Projects1 = π_{PNo, PName, Location}(Projects)
Projects2 = π_{PNo, Budget}(Projects)
Reconstruction expression
Projects = Projects1 ⋈ Projects2
Example query
π_{PName}(Projects)
After replacing references to global relations
π_{PName}(Projects1 ⋈ Projects2)
After removing unnecessary fragments
π_{PName}(Projects1)
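A minimal Python sketch of the vertical reduction rule: a fragment is kept only if it contributes a non-key attribute that the query uses. The function name `needed_fragments` and the assumption that the key PNo is replicated in every vertical fragment are illustrative choices, not part of the slides.

```python
# Vertical fragments of Projects with their attribute sets (from the slide).
fragments = {
    "Projects1": {"PNo", "PName", "Location"},
    "Projects2": {"PNo", "Budget"},
}
KEY = {"PNo"}  # assumed to be replicated in every vertical fragment


def needed_fragments(fragments, query_attrs, key=KEY):
    """Fragments that must be joined to answer a query over query_attrs."""
    # A fragment is useful if it offers a non-key attribute the query references.
    useful = [
        name for name, attrs in fragments.items() if (attrs - key) & query_attrs
    ]
    # If the query touches only key attributes, any single fragment suffices.
    if not useful and query_attrs:
        useful = [next(iter(fragments))]
    return useful


print(needed_fragments(fragments, {"PName"}))
print(needed_fragments(fragments, {"PName", "Budget"}))
```

For the example query π_{PName}(Projects) only Projects1 is selected, so the join with Projects2 can be dropped.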
Query simplification – hybrid fragmentation
The reconstruction expression introduces combinations of joins and unions
General guidelines
  Remove empty relations generated by contradicting selections on horizontal fragments
  Remove useless relations generated by vertical fragments
  Break and distribute joins, eliminate empty fragment joins

Qualified relations
Supporting algebraic optimization of queries involving fragments
  Annotating fragments and intermediate relations with predicates
  Estimating the size of a relation
  Extension of relational algebra
Definition: qualified relation
A qualified relation is a pair [R : qR] where R is a relation and qR is a predicate.
Example
Representing horizontal fragments as qualified relations where the qualification predicate corresponds to the fragmentation expression
[Projects : σ_{PNo='P1' ∨ PNo='P2'}]
Qualified relations
Extended relational algebra
(1) E := σ_F [R : qR] → [E : F ∧ qR]
(2) E := π_A [R : qR] → [E : qR]
(3) E := [R : qR] × [S : qS] → [E : qR ∧ qS]
(4) E := [R : qR] − [S : qS] → [E : qR]
(5) E := [R : qR] ∪ [S : qS] → [E : qR ∨ qS]
(6) E := [R : qR] ⋈_F [S : qS] → [E : qR ∧ qS ∧ F]

Qualified relations
Example query
σ_{100.000 ≤ Budget ≤ 200.000}(Projects)
Qualified relations
E1 = σ_{100.000 ≤ Budget ≤ 200.000} [Projects1 : Budget ≤ 150.000]
[E1 : (100.000 ≤ Budget ≤ 200.000) ∧ (Budget ≤ 150.000)]
[E1 : 100.000 ≤ Budget ≤ 150.000]
E2 = σ_{100.000 ≤ Budget ≤ 200.000} [Projects2 : 150.000 < Budget ≤ 200.000]
[E2 : (100.000 ≤ Budget ≤ 200.000) ∧ (150.000 < Budget ≤ 200.000)]
[E2 : 150.000 < Budget ≤ 200.000]
E3 = σ_{100.000 ≤ Budget ≤ 200.000} [Projects3 : Budget > 200.000]
[E3 : (100.000 ≤ Budget ≤ 200.000) ∧ (Budget > 200.000)]
E3 = ∅
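Rule (1) can be tried out on the Budget example with a small Python sketch. It assumes, for illustration only, that every qualification is a single range over Budget, so that the conjunction F ∧ qR becomes an interval intersection; the class `Interval` and its methods are hypothetical helpers, and 100.000 on the slides is read as 100,000.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A range predicate on Budget with open or closed bounds."""
    lo: float = -math.inf
    hi: float = math.inf
    lo_closed: bool = True
    hi_closed: bool = True

    def intersect(self, other):
        """Conjunction of two range predicates (rule 1: F ∧ qR)."""
        if self.lo > other.lo:
            lo, lo_closed = self.lo, self.lo_closed
        elif self.lo < other.lo:
            lo, lo_closed = other.lo, other.lo_closed
        else:  # equal bounds stay closed only if both are closed
            lo, lo_closed = self.lo, self.lo_closed and other.lo_closed
        if self.hi < other.hi:
            hi, hi_closed = self.hi, self.hi_closed
        elif self.hi > other.hi:
            hi, hi_closed = other.hi, other.hi_closed
        else:
            hi, hi_closed = self.hi, self.hi_closed and other.hi_closed
        return Interval(lo, hi, lo_closed, hi_closed)

    def is_empty(self):
        """True if no Budget value can satisfy the predicate."""
        if self.lo > self.hi:
            return True
        return self.lo == self.hi and not (self.lo_closed and self.hi_closed)


query = Interval(100_000, 200_000)                         # 100.000 ≤ Budget ≤ 200.000
fragments = {
    "Projects1": Interval(hi=150_000),                     # Budget ≤ 150.000
    "Projects2": Interval(150_000, 200_000, lo_closed=False),
    "Projects3": Interval(lo=200_000, lo_closed=False),    # Budget > 200.000
}

qualified = {name: query.intersect(q) for name, q in fragments.items()}
relevant = [name for name, q in qualified.items() if not q.is_empty()]
print(relevant)
```

Only fragments whose combined qualification is satisfiable need to be accessed; the qualification of E3 is contradictory, so Projects3 is dropped.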
4 Global query optimization
Main questions
Workflow for distributed query processing

Introduction to global query optimization
Main questions
  When to optimize?
  What criteria to optimize?
  Where to execute the query?
When to optimize?
Full compile time optimization
The full query execution plan is computed at compile time
Assumption
  Applications use canned queries
  Prepared and parameterized SQL statements
Pros
  Queries can be executed directly
Cons
  Complex to model
  Much information unknown or too expensive to gather
  Collecting statistics on all nodes?
  Statistics outdated
  Especially machine load and network properties are very volatile

When to optimize?
Fully dynamic optimization
Each query is optimized individually at runtime
This technique heavily relies on heuristics, learning algorithms, and luck
Pros
  Might produce very good plans
  Uses current network state
  Also usable for ad-hoc queries
Cons
  Result quality might be very unpredictable
  Complex algorithms and heuristics
  Difficult to keep statistics up-to-date
When to optimize?
Semi-dynamic optimization
Pre-optimize the query
During query execution, test whether execution proceeds as predicted during optimization
  e.g., are tuples/fragments delivered in time? Does the network adhere to the predicted properties? Are there any bad network latencies?
If execution shows severe deviations, compute a new query plan for all parts that have not yet been executed
Only makes sense for queries that run for a longer time

When to optimize?
Hierarchical optimization
Plans are created in multiple stages
Global-Local-Plans
  Global query optimizer creates a global query plan
    Focus on data transfer: which intermediate results are to be computed by which node? How should intermediate results be shipped?
  Local query optimizers create local query plans
    Decide on query plan layout, algorithms, indexes, etc. to deliver the requested intermediate result
Two-Step-Plans
When to optimize?
Hierarchical optimization
Plans are created in multiple stages
Global-Local-Plans
Two-Step-Plans
  During compile time, only stable parts of the plan are computed
    Join order, join methods, access paths, etc.
  During query execution, all missing plan elements are added
    Node selection, transfer policies, etc.
  Both steps can be performed using traditional query optimization techniques
    Plan enumeration with dynamic programming
    Complexity is manageable as each optimization problem is much easier than a full optimization
  During runtime optimization, fresh statistics are available
Most distributed database management systems use semi-dynamic or hierarchical optimization techniques (or both)

What criteria to optimize?
Important aspects for global optimization
  Communication operators
  Fragment cardinalities
  Order of operations
  Join ordering
    Because permutations of the joins within the query may lead to improvements of orders of magnitude
Most important alternative optimization criteria
  Query response time
  Resource consumption
  Total query execution costs
  ...
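To illustrate why join ordering matters, the following Python sketch enumerates all left-deep join orders of three relations and compares their costs. Everything in it is an assumption made for illustration: the relation names, cardinalities and selectivities are invented, the cost of a plan is simply the sum of its intermediate result sizes, and exhaustive enumeration stands in for the dynamic programming mentioned on the slides.

```python
from itertools import permutations

# Hypothetical cardinalities and pairwise join selectivities.
cards = {"R": 1000, "S": 100, "T": 10}
selectivity = {
    frozenset({"R", "S"}): 0.01,
    frozenset({"S", "T"}): 0.1,
    frozenset({"R", "T"}): 1.0,  # no join predicate: cross product
}


def plan_cost(order):
    """Sum of intermediate result sizes of a left-deep plan (assumed cost model)."""
    joined = [order[0]]
    size = cards[order[0]]
    cost = 0.0
    for rel in order[1:]:
        size *= cards[rel]
        for prev in joined:  # apply every predicate linking rel to the plan so far
            size *= selectivity[frozenset({prev, rel})]
        cost += size
        joined.append(rel)
    return cost


orders = list(permutations(cards))
best = min(orders, key=plan_cost)
worst = max(orders, key=plan_cost)
print(best, plan_cost(best))
print(worst, plan_cost(worst))
```

Even in this toy setting the cheapest and the most expensive order differ by about a factor of ten, because the expensive plans start with the cross product of R and T.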
Where to execute the query?
Query optimizer has to decide which parts of the query have to be shipped to which node (cost model)
In heavily replicated scenarios, clever hybrid shipping can effectively be used for load balancing
  Move expensive computations to lightly loaded nodes, avoid expensive communication

Global query optimization
Global query optimization deals with finding the "best" ordering of operations in the query (extended by fragmentation expressions and including communication operations) that minimizes a cost function.
Input
  an algebraic query extended by fragmentation expressions
Output
  an algebraic query or query execution plan with communication operations