Distributed_Database_System — Presentation Transcript

    • Distributed Query Processing
      Distributed Database Systems
      Katja Hose, Ralf Schenkel
      Max-Planck-Institut für Informatik, Cluster of Excellence MMCI
      November 10 / November 17, 2011

    • Contents
      1 Motivation
      2 Detour on centralized query processing
        • Translating SQL into relational algebra
        • Phases of centralized query processing
        • Query parsing
        • Query transformation
        • Query optimization
      3 Basics of distributed query processing
        • Phases of distributed query processing
        • Introduction
        • Meta data management
        • Data localization
      4 Global query optimization
        • Main questions
        • Global query optimizer
        • Distributed cost model
        • Join order optimization
        • Total time models
        • Response time models
      5 Summary

    • Motivation
      The task of query processing is to answer user queries.
      Example: How many students are at Saarland University? Answer: 18.000
      Additional constraints:
        • Low response times
        • High query throughput
        • Efficient hardware usage
        • ...
    • Motivation – differences to centralized query processing
      • Considering the physical data distribution during query optimization
      • Considering communication costs
      Assumptions:
      • Data is distributed among multiple nodes
      • Existence of a global conceptual schema, which is used by all nodes
      • Queries are formulated on the global schema

    • Translating SQL into relational algebra
      SQL query structure: SELECT DISTINCT a1, ..., an FROM R1, ..., Rk WHERE p
      Algorithm:
      1 Translating the FROM clause
        Let R1, ..., Rk be the relations in the FROM clause of the query.
        Construct the expression
          R = R1                              if k = 1
          R = ((...(R1 × R2) × ...) × Rk)     otherwise
    • Translating SQL into relational algebra (cont.)
      2 Translating the WHERE clause
        Let F be the predicate in the WHERE clause of the query (if a WHERE clause exists).
        Construct the expression
          W = R         if there is no WHERE clause
          W = σF(R)     otherwise
      3 Translating the SELECT clause
        Let a1, ..., an (or "*") be the projection in the SELECT clause of the query.
        Construct the expression
          S = W                  if the projection is "*"
          S = πa1,...,an(W)      otherwise
        Output: S

    • Translating SQL into relational algebra – example
      Example query:
        SELECT DISTINCT e.EName, s.Salary
        FROM Employees e, Salary s
        WHERE e.Title = s.Title AND s.Salary ≥ 60.000
      Applying the algorithm:
        R = Employees × Salary
        W = σe.Title=s.Title ∧ s.Salary≥60.000(R)
        S = πe.EName,s.Salary(W)

    • Workflow for centralized query processing
      [Figure: workflow for centralized query processing]
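    The three translation steps are mechanical enough to script. Below is a minimal, hypothetical sketch (the Query container and the string-based algebra output are my own illustration, not from the slides):

        from dataclasses import dataclass
        from typing import List, Optional

        @dataclass
        class Query:
            select: List[str]            # projection list, or ["*"]
            from_: List[str]             # relations R1, ..., Rk of the FROM clause
            where: Optional[str] = None  # predicate F of the WHERE clause, if any

        def translate(q: Query) -> str:
            # Step 1: FROM clause -> left-deep Cartesian product
            r = q.from_[0]
            for rel in q.from_[1:]:
                r = f"({r} × {rel})"
            # Step 2: WHERE clause -> selection, only if a predicate exists
            w = f"σ[{q.where}]({r})" if q.where else r
            # Step 3: SELECT clause -> projection, unless it is "*"
            return w if q.select == ["*"] else f"π[{', '.join(q.select)}]({w})"

        print(translate(Query(["e.EName", "s.Salary"],
                              ["Employees", "Salary"],
                              "e.Title = s.Title ∧ s.Salary ≥ 60.000")))
        # π[e.EName, s.Salary](σ[e.Title = s.Title ∧ s.Salary ≥ 60.000]((Employees × Salary)))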
    • Query parsing
      Transform a declarative query into an internal representation:
      • The query is formulated using a declarative query language, e.g., SQL
      • The parser translates the query into an internal representation, called a naive query plan
      • The plan is described by an operator tree of relational algebra operators

    • Query parsing – example
      Database managing information about employees and projects:
        Employees(EID, EName, Title)
        Assignment(ENo, PNo, Duration)
      Query: return the names of all employees working for project 'P1'
        SELECT EName
        FROM Employees e, Assignment a
        WHERE e.EID = ENo AND PNo='P1'
      Translation into relational algebra:
        πEName(σPNo='P1' ∧ Employees.EID=Assignment.ENo(Employees × Assignment))
      In contrast to the SQL statement, the algebra expression already contains the required basic evaluation operators.
      [Figure: operator tree of the example query]
    • Workflow for centralized query processing
      [Figure: workflow for centralized query processing]

    • Query transformation
      Steps:
      1 Name resolution: transforming object names into internal names
      2 Semantic analysis: checking for global relations and attributes, view expansion, global access control
      3 Normalization: transforming predicates into a canonical format
      4 Simple algebraic rewriting: application of heuristics to eliminate bad plans

    • Semantic analysis
      • Check whether the global schema defines all attributes and relations referenced in the query
      • If the query is formulated on a view, replace references to relations/attributes with references to global relations/attributes
      • Perform simple integrity checks, e.g., do the attributes used in comparison predicates have compatible types?
      • Initial check whether the issuer of the query has the rights to access the referenced relations/attributes

    • Normalization
      Objective: simplify the subsequent optimization by transforming the query into a canonical format
      • Applies to selection and join predicates
      • Conjunctive normal form vs. disjunctive normal form
        Conjunctive normal form: (p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
        Disjunctive normal form: (p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)
      • Transformation based on equivalence rules for logical operators
    • Normalization – equivalence rules
      • p1 ∧ p2 ⇐⇒ p2 ∧ p1   and   p1 ∨ p2 ⇐⇒ p2 ∨ p1
      • p1 ∧ (p2 ∧ p3) ⇐⇒ (p1 ∧ p2) ∧ p3   and   p1 ∨ (p2 ∨ p3) ⇐⇒ (p1 ∨ p2) ∨ p3
      • p1 ∧ (p2 ∨ p3) ⇐⇒ (p1 ∧ p2) ∨ (p1 ∧ p3)   and   p1 ∨ (p2 ∧ p3) ⇐⇒ (p1 ∨ p2) ∧ (p1 ∨ p3)
      • ¬(p1 ∧ p2) ⇐⇒ ¬p1 ∨ ¬p2   and   ¬(p1 ∨ p2) ⇐⇒ ¬p1 ∧ ¬p2
      • ¬(¬p1) ⇐⇒ p1

    • Normalization – example
      SELECT EName
      FROM Employees e, Assignment a
      WHERE e.EID = a.ENo AND Duration ≥ 3 AND (PNo='P1' OR PNo='P2')
      Selection condition in disjunctive normal form:
        (EID = ENo ∧ Duration ≥ 3 ∧ PNo='P1') ∨ (EID = ENo ∧ Duration ≥ 3 ∧ PNo='P2')
      Selection condition in conjunctive normal form:
        EID = ENo ∧ Duration ≥ 3 ∧ (PNo='P1' ∨ PNo='P2')

    • Simple algebraic rewriting
      Simple optimizations that are always beneficial regardless of system state:
      • Elimination of redundant predicates
      • Simplification of expressions
      • Unnesting of subqueries and views
      Tasks:
      • Recognize and simplify all expressions/operations/subqueries that are "obviously" unnecessary, redundant, or contradictory
      • Do not consider system state information, e.g., size of tables, existence of indexes, etc.

    • Workflow for centralized query processing
      [Figure: workflow for centralized query processing]
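    The normalization rules above are exactly what sympy's logic module implements, so the example predicate can be checked mechanically (the symbol names j, d, p1, p2 are stand-ins for the atomic predicates):

        from sympy import symbols
        from sympy.logic.boolalg import to_cnf, to_dnf

        # Stand-ins for the atomic predicates of the example query:
        # j: EID = ENo, d: Duration >= 3, p1: PNo='P1', p2: PNo='P2'
        j, d, p1, p2 = symbols("j d p1 p2")

        predicate = j & d & (p1 | p2)
        print(to_dnf(predicate))  # (d & j & p1) | (d & j & p2)
        print(to_cnf(predicate))  # d & j & (p1 | p2)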
    • Query optimization
      Steps:
      1 Algebraic optimization: find a good relational algebra operator tree
        • Heuristic query optimization
        • Cost-based query optimization
        • Statistical query optimization
      2 Physical optimization: find suitable algorithms for implementing the operations

    • Heuristics
      • Use simple heuristics that usually lead to better performance
      • The optimal plan is not required, but the really bad ones should be avoided
      Heuristics:
      • Break selections: complex selection criteria should be broken into multiple parts
      • Push projection and push selection: cheap selections and projections should be performed as early as possible to reduce the sizes of intermediate results
      • Force joins: in most cases, using a join is much cheaper than using a Cartesian product followed by a selection

    • Algebraic optimization rules
      • The join operator is commutative: r1 ⋈ r2 ⇐⇒ r2 ⋈ r1
      • The join operator is associative: (r1 ⋈ r2) ⋈ r3 ⇐⇒ r1 ⋈ (r2 ⋈ r3)
      • Selections σ can be combined using logical and (∧); the order of the selections is arbitrary (exploiting commutativity of ∧):
          σF1(σF2(r1)) ⇐⇒ σF1∧F2(r1) ⇐⇒ σF2(σF1(r1))
      • For a projection π nested inside another projection π, the "outer" parameter dominates the "inner" one:
          πX(πY(r1)) ⇐⇒ πX(r1)   if X ⊆ Y
    • Algebraic optimization rules (cont.)
      • π and σ commute if predicate F is defined on the projection attributes:
          σF(πX(r1)) ⇐⇒ πX(σF(r1))   if attr(F) ⊆ X
      • Alternatively, the order can be changed if the inner projection is extended by all attributes X2 that F references:
          πX1(σF(r1)) ⇐⇒ πX1(σF(πX1,X2(r1)))   with X2 ⊆ attr(F)
      • σ and ⋈ commute if all selection attributes are contained in the same relation:
          σF(r1 ⋈ r2) ⇐⇒ σF(r1) ⋈ r2   if attr(F) ⊆ R1
      • A selection predicate F = F1 ∧ F2 can be split up in conjunction with a join if the attributes referred to by F1 and F2 are contained in different relations:
          σF(r1 ⋈ r2) ⇐⇒ σF1(r1) ⋈ σF2(r2)   if attr(F1) ⊆ R1 and attr(F2) ⊆ R2
      • In any case, part of a selection can be split off by separating the predicates F1 referencing attributes of R1 only; F2 contains the remaining predicates referencing attributes of both relations:
          σF(r1 ⋈ r2) ⇐⇒ σF2(σF1(r1) ⋈ r2)   if attr(F1) ⊆ R1

    • Algebraic optimization rules (cont.)
      • Commutativity of σ and ∪:
          σF(r1 ∪ r2) ⇐⇒ σF(r1) ∪ σF(r2)
      • Commutativity of σ and −:
          σF(r1 − r2) ⇐⇒ σF(r1) − σF(r2)
        or, in case F only references tuples in r1:
          σF(r1 − r2) ⇐⇒ σF(r1) − r2
      • Commutativity of π and ⋈:
          πX(r1 ⋈ r2) ⇐⇒ πX(πY1(r1) ⋈ πY2(r2))
          with Y1 = (X ∩ R1) ∪ (R1 ∩ R2) and Y2 = (X ∩ R2) ∪ (R1 ∩ R2)
        Pushing a projection is possible if all Yi are defined in such a way that they preserve all attributes necessary to perform the join.
    • Algebraic optimization rules – further rules
      • Commutativity of π and ∪: πX(r1 ∪ r2) ⇐⇒ πX(r1) ∪ πX(r2)
      • Distributive law for ⋈ and ∪, distributive law for ⋈ and −
      • Commutativity of renaming β with other operators, ...
      • Idempotence, e.g., A ∨ A ⇐⇒ A
      • Operations involving empty relations
      • Commutative and associative laws for ⋈, ∪, and ∩

    • Heuristic algebraic optimization – example
      Use the algebraic optimization heuristics:
      • Force join
      • Push selection and projection
      [Figure: example operator trees before and after heuristic optimization]

    • Cost-based algebraic query optimization
      • Most non-distributed RDBMS strongly rely on cost-based optimization
      • Aim for a better optimized plan with respect to system and data characteristics
      • Join order optimization
      Basic approach:
      • Establish a cost model for the various operations
      • Enumerate all query plans and compute their costs
      • Pick the best query plan
      Usually, dynamic programming techniques are used to keep the computational effort manageable.

    • Physical query optimization
      Input: an optimized query plan consisting of algebra operators
      Physical optimization chooses an algorithm to compute each algebra operator:
      • Join: block-nested-loop join, hash join, merge join, ...
      • Select: full table scan, index lookup, ad-hoc index generation & lookup, ...
      Tasks:
      • Translating a query plan into an execution plan
      • Physical and algebraic optimization are often interleaved
    • Query optimization example
      Output: query execution plan
      [Figure: example query execution plan]

    • Workflow for distributed query processing
      [Figure: workflow for distributed query processing]
    • Basic considerations
      Distributed query processing:
      • Shares the same properties as centralized query processing
      • Similar problem, but with different objectives and constraints
      Objectives for centralized query processing:
      • Minimize the number of disk accesses
      • Minimize computational time
      Objectives for distributed query processing:
      • Minimize resource consumption
      • Minimize response time
      • Maximize throughput

    • Basic considerations (cont.)
      Costs are more difficult to predict:
      • Join selectivity: is it worthwhile to push down a selection?
      • Data is distributed: difficult to get meaningful statistics
      • Network latency is very hard to predict
      • Current workload at nodes, load shedding
      Additional cost factors and constraints:
      • Extension of relational algebra (sending/receiving data)
      • Data localization (which node holds relevant data)
      • Replication and caching (where to compute an operation)
      • Network models
      • Response-time models
      • Data and structural heterogeneity (federated databases ...)

    • Consequences
      • Optimization is much more difficult than in the centralized case
      • Statistics and costs change over time, e.g., workload at a node, network load
      • More conflicting optimization goals: increasing throughput calls for reducing replication and parallelization, while decreasing query response time calls for increasing parallelization
      • More cost factors and constraints
      Consequences:
      • Adaptive query plans (create an initial plan and optimize it on-the-fly)
      • Do not aim for the best plan, but for a good plan

    • Example
      Query: return the names of all employees working for project 'P1'
        πEName(πEID,EName(Employees) ⋈Employees.EID=Assignment.ENo πENo(σPNo='P1'(Assignment)))
      Problems:
      • The relations are fragmented and distributed among five nodes
      • The Employees relation uses primary horizontal fragmentation: one fragment located at node 1, the other at node 2, no replication
      • The Assignment relation uses derived horizontal fragmentation: one fragment located at node 3, the other at node 4, no replication
      • The query originates from node 5
    • Example – cost model and statistics
      • Accessing a tuple costs 1 unit (acc)
      • Transferring a tuple costs 10 units (trans)
      • There are 400 employees and 1000 assignments
      • 20 assignments for project 'P1'
      • All tuples are uniformly distributed, i.e., nodes 3 and 4 provide 10 assignments for project 'P1' each
      • There are local indexes on attribute PNo at nodes 3 and 4 (as well as indexes on primary keys at all nodes)
      • Direct tuple access is possible on local sites, no scanning
      • All nodes can directly communicate with each other
      • Simplification: no costs for unions and projections

    • Example – simple execution plans
      Version A: transfer all data to node 5
      [Figure: execution plan A]
      Version B: ship intermediate results
      [Figure: execution plan B]
    • Example – costs
      Costs of plan A: 23.000 units
      Costs of plan B: 440 units
      (One plausible accounting for plan B is sketched below.)

    • Important aspects of distributed query processing
      • Meta data management
      • Data localization
      • Global query optimization
      • Post-processing
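    A back-of-the-envelope check of the plan B figure under one plausible accounting of the stated cost model (1 unit per tuple access, 10 units per tuple transfer); the step breakdown is my reading of the example, not spelled out in the transcript:

        acc, trans = 1, 10  # cost units from the example's cost model

        # Plan B: reduce at the data and ship only intermediate results.
        p1_tuples = 20  # 'P1' assignments: 10 at node 3 + 10 at node 4 (via PNo indexes)

        cost_b = (p1_tuples * acc      # index accesses to the 'P1' assignments
                  + p1_tuples * trans  # ship the 20 tuples to the employee nodes 1 and 2
                  + p1_tuples * acc    # look up the matching employees via the EID indexes
                  + p1_tuples * trans) # ship the 20 result tuples to node 5
        print(cost_b)  # 440, matching the figure on the slides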
    • Meta data management
      Prerequisites to perform query optimization:
      • Meta data must be available
      • Meta data is stored in the catalog
      • The catalog provides information about the data distribution
      • This information is used to decide, for instance, whether it is worthwhile to execute a selection very early
      [Figure: workflow for distributed query processing]

    • Meta data management – catalog contents
      Typical contents of a catalog for distributed database management systems:
      • Database schema: definitions of tables, views, constraints, keys, ...
      • Partitioning schema: information about how the schema is partitioned and how tables can be reconstructed
      • Allocation schema: information about which fragment can be found at which node (including information about replication)
      • Network information: information about node connections, network model
      • Additional physical information: information about indexes, data statistics (histograms, etc.), hardware resources (processing & storage), ...

    • Meta data management – catalog placement
      Where to store the catalog in a distributed system?
      • Central node: simple solution, but a bottleneck
      • Replicated at all nodes: updates are expensive
      • Fragmented: in rare cases, the catalog may become very large; the catalog itself then has to be fragmented and allocated
      • Caching: replicate only needed parts of a central catalog, anticipate potential inconsistencies
    • Centralized catalog
      One instance of the global catalog at a central node
      Advantages:
      • No need to update copies
      • Little memory consumption
      Disadvantages:
      • Communication with the central node for each query
      • The central node potentially represents a bottleneck

    • Replicated catalog
      Full copy of the global catalog at each node
      Advantages:
      • Little communication overhead for queries
      • Good availability
      Disadvantages:
      • High update costs

    • Fragmented catalog
      Partitioning the global catalog and assigning partitions to nodes
      Advantages:
      • Sharing load among nodes
      • Reducing update overhead
      Disadvantages:
      • Localizing the necessary partitions of the global catalog

    • Caching catalog data
      Caching non-local catalog data
      Advantages:
      • Avoiding remote access to frequently needed catalog data
      • Reducing communication overhead
      Disadvantages:
      • Coherency control
      • Invalidating cached copies in the presence of updates
    • Caching catalog data (cont.)
      • Explicit invalidation: the owner of catalog data remembers the nodes holding local copies; in case of updates, it sends an invalidation message to those nodes
      • Implicit invalidation: identifying stale catalog data at runtime (adding version numbers and time stamps to query messages)

    • Data localization
      Objective: creating subqueries in consideration of the data distribution
      Assumptions:
      • Fragmentation is defined by fragmentation expressions
      • Each fragment is allocated at only one node (no replication)
      • Fragmentation expressions and the locations of the fragments are stored in the catalog
      Main tasks:
      • Replace access to global relations with accesses to the fragments
      • Insert the reconstruction expression into the algebra query
      • Basic algebraic simplifications of the query

    • Example – horizontal reduction
      Schema:
        Projects1 = σBudget≤150.000(Projects)
        Projects2 = σ150.000<Budget≤200.000(Projects)
        Projects3 = σBudget>200.000(Projects)
      Reconstruction expression (horizontal fragmentation):
        Projects = Projects1 ∪ Projects2 ∪ Projects3
      Example query:
        σLocation='Saarbr.' ∧ Budget≤100.000(Projects)
      After replacing references to global relations:
        σLocation='Saarbr.' ∧ Budget≤100.000(Projects1 ∪ Projects2 ∪ Projects3)
      Further optimization is possible!
    • Query simplification – horizontal reduction
      Objective: eliminate unnecessary subqueries
      Horizontal reduction rule (see the interval sketch below):
      • Given fragments of R as FR = {R1, ..., Rn} with Ri = σpi(R)
      • All fragments Ri for which σps(Ri) = ∅ can be removed, with ps denoting the query's selection predicate
      • Because σps(Ri) = ∅ ⇐ ∀x ∈ R : ¬(ps(x) ∧ pi(x))
      • The selection with the query predicate ps on fragment Ri is empty if ps contradicts the fragmentation predicate pi of Ri, i.e., ps and pi are never true at the same time for any tuple in Ri

    • Example – horizontal reduction
      Query with fragmentation expression:
        σLocation='Saarbr.' ∧ Budget≤100.000(Projects1 ∪ Projects2 ∪ Projects3)
      Fragment definitions:
        Projects1 = σBudget≤150.000(Projects)
        Projects2 = σ150.000<Budget≤200.000(Projects)
        Projects3 = σBudget>200.000(Projects)
      Since
        σBudget≤100.000(Projects2) = ∅ and σBudget≤100.000(Projects3) = ∅
      we obtain the reduced query
        σLocation='Saarbr.'(σBudget≤100.000(Projects1))

    • Query simplification – join reduction
      Join reductions:
      • Larger joins are replaced by multiple partial joins on fragments
      • Distributive law: (R1 ∪ R2) ⋈ S = (R1 ⋈ S) ∪ (R2 ⋈ S)
      • Eliminate all union fragments that will return an empty result
      Expectations:
      • Elimination of partial joins producing empty results – depends on fragmentation optimality
      • Many joins on small relations may have lower resource costs than one large join – depends on fragmentation and the applied join algorithms
      • Smaller joins can be executed in parallel – might decrease response time but might also increase communication costs

    • Example – join reduction
      Schema:
        Projects(PNo, PName, Budget, Location)
        Projects1 = σPNo='P1' ∨ PNo='P2'(Projects)
        Projects2 = σPNo='P3'(Projects)
        Projects3 = σPNo='P4'(Projects)
        Assignment(ENo, PNo, Duration)
        Assignment1 = σPNo='P1' ∨ PNo='P2'(Assignment)
        Assignment2 = σPNo='P3' ∨ PNo='P4'(Assignment)
      Example query:
        SELECT * FROM Projects p, Assignment a WHERE p.PNo = a.PNo
      In relational algebra: Projects ⋈ Assignment
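    Returning to the horizontal reduction rule above: for range predicates like the Budget fragments, the contradiction test σps(Ri) = ∅ reduces to an interval-intersection check; a minimal sketch (the interval encoding is an assumption of mine):

        # Fragments encoded as (lo, hi] Budget intervals; infinities mark open ends.
        fragments = {
            "Projects1": (float("-inf"), 150_000),    # Budget <= 150.000
            "Projects2": (150_000, 200_000),          # 150.000 < Budget <= 200.000
            "Projects3": (200_000, float("inf")),     # Budget > 200.000
        }

        query = (float("-inf"), 100_000)  # ps: Budget <= 100.000

        def overlaps(a, b):
            """Non-empty intersection of two (lo, hi] intervals."""
            return min(a[1], b[1]) > max(a[0], b[0])

        kept = [name for name, rng in fragments.items() if overlaps(rng, query)]
        print(kept)  # ['Projects1'] -- Projects2 and Projects3 contradict ps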
    • Example – join reduction (cont.)
      Query: Projects ⋈ Assignment
      After replacing global relations with reconstruction expressions:
        (Projects1 ∪ Projects2 ∪ Projects3) ⋈ (Assignment1 ∪ Assignment2)
      After applying the distributive law:
        (Projects1 ⋈ Assignment1) ∪ (Projects1 ⋈ Assignment2) ∪
        (Projects2 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2) ∪
        (Projects3 ⋈ Assignment1) ∪ (Projects3 ⋈ Assignment2)
      Further optimization is possible!

    • Query simplification – join reduction
      Join reduction rule:
      • Given fragments of R as FR = {R1, ..., Rn} and fragments of S as FS = {S1, ..., Sn}
      • Apply the distributive law, e.g.:
          (R1 ∪ R2) ⋈ (S1 ∪ S2) = (R1 ⋈ S1) ∪ (R1 ⋈ S2) ∪ (R2 ⋈ S1) ∪ (R2 ⋈ S2)
      • All partial joins between fragments Ri and Sj for which Ri ⋈ Sj = ∅ can be removed:
          Ri ⋈ Sj = ∅ ⇐ ∀x ∈ Ri, y ∈ Sj : ¬(pi(x) ∧ pj(y))
      • The join between fragments Ri and Sj is empty if their respective fragmentation predicates (on the join attribute) contradict, i.e., there is no tuple combination x and y such that both partitioning predicates are fulfilled at the same time

    • Example – join reduction (cont.)
      Some of these partial joins are empty, e.g.:
        Projects1 ⋈ Assignment2 = ∅
      because their fragmentation expressions contradict:
        Projects1 = σPNo='P1' ∨ PNo='P2'(Projects) and Assignment2 = σPNo='P3' ∨ PNo='P4'(Assignment)
      Reduced query (see the sketch below):
        (Projects1 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2) ∪ (Projects3 ⋈ Assignment2)

    • Query simplification – join reduction for horizontal fragmentation
      • The easiest join reduction case follows from derived horizontal fragmentation
      • For each fragment of the first relation, there is exactly one matching fragment of the second relation
      • Simply use the information contained in the reconstruction expression instead of comparing the fragmentation predicates to each other
      • Join reduction for arbitrary horizontal partitioning might not be beneficial
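    The same contradiction test drives join reduction when fragments are defined by lists of PNo values; a sketch using sets (the set encoding is my own):

        # Fragmentation predicates represented as sets of PNo values.
        projects = {"Projects1": {"P1", "P2"}, "Projects2": {"P3"}, "Projects3": {"P4"}}
        assignments = {"Assignment1": {"P1", "P2"}, "Assignment2": {"P3", "P4"}}

        # Keep only partial joins whose fragmentation predicates can both hold,
        # i.e., whose PNo value sets intersect.
        partial_joins = [
            (p, a)
            for p, pvals in projects.items()
            for a, avals in assignments.items()
            if pvals & avals
        ]
        print(partial_joins)
        # [('Projects1', 'Assignment1'), ('Projects2', 'Assignment2'),
        #  ('Projects3', 'Assignment2')]  -- the reduced query from the slides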
    • Join reduction for derived horizontal fragmentation – example
      Schema:
        Projects(PNo, PName, Budget, Location)
        Projects1 = σPNo='P1' ∨ PNo='P2'(Projects)
        Projects2 = σPNo='P3' ∨ PNo='P4'(Projects)
        Assignment(ENo, PNo, Duration)
        Assignment1 = Assignment ⋉ Projects1
        Assignment2 = Assignment ⋉ Projects2
      Query in relational algebra: Projects ⋈ Assignment
      After replacing global relations with reconstruction expressions:
        (Projects1 ∪ Projects2) ⋈ (Assignment1 ∪ Assignment2)
      After applying the distributive law:
        (Projects1 ⋈ Assignment1) ∪ (Projects1 ⋈ Assignment2) ∪
        (Projects2 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2)
      Reduced query (using the information about the fragmentation of relation Assignment directly):
        (Projects1 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2)

    • Query simplification – vertical reduction
      Vertical fragmentation rule:
      • Given fragments of R as FR = {R1, ..., Rn} with Ri = πβi(R), where βi enumerates a subset of R's attributes
      • Avoid joining "useless" fragments, i.e., fragments containing only attributes that are neither referenced in the query nor output in the result

    • Example – vertical reduction
      Schema:
        Projects(PNo, PName, Budget, Location)
        Projects1 = πPNo,PName,Location(Projects)
        Projects2 = πPNo,Budget(Projects)
      Reconstruction expression: Projects = Projects1 ⋈ Projects2
      Example query: πPName(Projects)
      After replacing references to global relations: πPName(Projects1 ⋈ Projects2)
      After removing unnecessary fragments: πPName(Projects1)
    • Query simplification – hybrid fragmentation
      The reconstruction expression introduces combinations of joins and unions.
      General guidelines:
      • Remove empty relations generated by contradicting selections on horizontal fragments
      • Remove useless relations generated by vertical fragments
      • Break and distribute joins, eliminate empty fragment joins

    • Qualified relations
      Supporting algebraic optimization of queries involving fragments:
      • Annotating fragments and intermediate relations with predicates
      • Estimating the size of a relation
      • Extension of relational algebra
      Definition: a qualified relation is a pair [R : qR] where R is a relation and qR is a predicate.
      Example: representing a horizontal fragment as a qualified relation whose qualification predicate corresponds to the fragmentation expression:
        [Projects1 : PNo='P1' ∨ PNo='P2']

    • Qualified relations – extended relational algebra
      (1) E := σF[R : qR] → [E : F ∧ qR]
      (2) E := πA[R : qR] → [E : qR]
      (3) E := [R : qR] × [S : qS] → [E : qR ∧ qS]
      (4) E := [R : qR] − [S : qS] → [E : qR]
      (5) E := [R : qR] ∪ [S : qS] → [E : qR ∨ qS]
      (6) E := [R : qR] ⋈F [S : qS] → [E : qR ∧ qS ∧ F]

    • Qualified relations – example
      Example query: σ100.000≤Budget≤200.000(Projects)
      Qualified relations:
        E1 = σ100.000≤Budget≤200.000[Projects1 : Budget ≤ 150.000]
          [E1 : (100.000 ≤ Budget ≤ 200.000) ∧ (Budget ≤ 150.000)]
          [E1 : 100.000 ≤ Budget ≤ 150.000]
        E2 = σ100.000≤Budget≤200.000[Projects2 : 150.000 < Budget ≤ 200.000]
          [E2 : (100.000 ≤ Budget ≤ 200.000) ∧ (150.000 < Budget ≤ 200.000)]
          [E2 : 150.000 < Budget ≤ 200.000]
        E3 = σ100.000≤Budget≤200.000[Projects3 : Budget > 200.000]
          [E3 : (100.000 ≤ Budget ≤ 200.000) ∧ (Budget > 200.000)]
          E3 = ∅
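    The qualified-relation rules can be exercised with sympy intervals: rule (1) conjoins the selection predicate with the qualification, which for range predicates is an interval intersection, and an empty intersection flags a removable fragment (exactly the E3 case above):

        from sympy import Interval, oo

        query = Interval(100_000, 200_000)  # selection predicate as a Budget interval
        fragments = {
            "Projects1": Interval(-oo, 150_000),            # Budget <= 150.000
            "Projects2": Interval.Lopen(150_000, 200_000),  # 150.000 < Budget <= 200.000
            "Projects3": Interval.open(200_000, oo),        # Budget > 200.000
        }

        # Rule (1): sigma_F[R : qR] -> [E : F ∧ qR]; for range predicates the
        # conjunction is an interval intersection, EmptySet marks a droppable E.
        for name, q_r in fragments.items():
            print(name, query.intersect(q_r))
        # Projects1 Interval(100000, 150000)
        # Projects2 Interval.Lopen(150000, 200000)
        # Projects3 EmptySet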
    • Workflow for distributed query processing
      [Figure: workflow for distributed query processing]

    • Introduction to global query optimization
      Main questions:
      • When to optimize?
      • What criteria to optimize?
      • Where to execute the query?
    • When to optimize? Full compile time optimization
      The full query execution plan is computed at compile time.
      Assumption: applications use canned queries (prepared and parameterized SQL statements)
      Pros:
      • Queries can be executed directly
      Cons:
      • Much information is unknown or too expensive to gather – collecting statistics on all nodes?
      • Statistics become outdated; especially machine load and network properties are very volatile

    • When to optimize? Fully dynamic optimization
      Each query is optimized individually at runtime.
      This technique heavily relies on heuristics, learning algorithms, and luck.
      Pros:
      • Might produce very good plans
      • Uses the current network state
      • Also usable for ad-hoc queries
      Cons:
      • Complex to model
      • Result quality might be very unpredictable
      • Complex algorithms and heuristics
      • Difficult to keep statistics up-to-date

    • When to optimize? Semi-dynamic optimization
      • Pre-optimize the query
      • During query execution, test whether execution runs as expected during optimization: are tuples/fragments delivered in time? does the network adhere to the predicted properties? are there any bad network latencies? etc.
      • If execution shows severe deviations, compute a new query plan for all parts that have not yet been executed
      Only makes sense for queries that run for a longer time.

    • When to optimize? Hierarchical optimization
      Plans are created in multiple stages.
      Global-Local-Plans:
      • The global query optimizer creates a global query plan; focus on data transfer: which intermediate results are to be computed by which node? how should intermediate results be shipped?
      • Local query optimizers create local query plans; they decide on query plan layout, algorithms, indexes, etc. to deliver the requested intermediate result
      Two-Step-Plans: (continued on the next slide)
    • When to optimize? Hierarchical optimization (cont.)
      Two-Step-Plans:
      • During compile time, only the stable parts of the plan are computed: join order, join methods, access paths, etc.
      • During query execution, all missing plan elements are added: node selection, transfer policies, etc.
      • Both steps can be performed using traditional query optimization techniques (plan enumeration with dynamic programming)
      • Complexity is manageable, as each optimization problem is much easier than a full optimization
      • During runtime optimization, fresh statistics are available
      Most distributed database management systems use semi-dynamic or hierarchical optimization techniques (or both).

    • What criteria to optimize?
      Important aspects for global optimization:
      • Communication operators
      • Fragment cardinalities
      • Order of operations
      • Join ordering – because permutations of the joins within the query may lead to improvements of orders of magnitude
      Most important alternative optimization criteria:
      • Query response time
      • Resource consumption
      • Total query execution costs
      • ...

    • Where to execute the query?
      • The query optimizer has to decide which parts of the query have to be shipped to which node (cost model)
      • In heavily replicated scenarios, clever hybrid shipping can effectively be used for load balancing: move expensive computations to lightly loaded nodes, avoid expensive communication

    • Global query optimization
      Global query optimization deals with finding the "best" ordering of operations in the query (extended by fragmentation expressions and including communication operations) that minimizes a cost function.
      Input: an algebraic query extended by fragmentation expressions
      Output: an algebraic query or query execution plan with communication operations
    • Basics of global query optimization
      Objective:
      • Choose a cost-efficient execution plan based on the algebraic query plan given as input
      • Decide which parts of the query have to be transferred to which node
      Prerequisites:
      • Knowledge about fragmentation
      • Knowledge about fragment/relation sizes
      • Knowledge about data distribution
      • Knowledge about costs of operations

    • Optimizer components
      The global optimizer has three main components:
      • The search space: the set of alternative, equivalent execution plans representing the input query
      • The cost model: predicts the costs of a given query execution plan
      • The search strategy: explores the search space and selects the best plan

    • Phases of optimization
      1 Spanning the search space using transformation rules → equivalent query plans
      2 Applying a search strategy and a cost model → choose an efficient plan
      Main focus: join trees and join ordering – O(N!) different join trees by applying commutativity and associativity rules for N relations (see the numbers below)

    • Search space – example
      Query:
        SELECT EName, Title
        FROM Employees e, Assignment a, Project p
        WHERE e.EID = ENo AND a.PNo = p.PNo
      [Figure: equivalent join trees for the example query]
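    To get a feel for the O(N!) size of the join-ordering search space, counting the join orderings alone (ignoring tree shapes, which multiply this further) is already telling:

        from math import factorial

        # Number of join orderings (permutations of N relations).
        for n in (3, 5, 8, 10):
            print(n, factorial(n))
        # 3 6
        # 5 120
        # 8 40320
        # 10 3628800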
    • Search space – tree variants for join order optimization
      Linear join trees:
      • All inner nodes have at least one leaf node (base relation) as child
      • Reduces the search space
      Bushy trees:
      • May have inner nodes with no base relation as child
      • High potential for parallelization
      Trade-off: reducing the size of the search space vs. exhibiting parallelism
      Example with four relations:
        bushy join tree:  (R1 ⋈ R2) ⋈ (R3 ⋈ R4)
        linear join tree: R1 ⋈ (R2 ⋈ (R3 ⋈ R4))

    • Search strategies
      A search strategy needs to reduce the search space:
      • Applying heuristics (similar to centralized algebraic optimization): perform projections and selections when accessing base relations; avoid Cartesian products – enforce joins
      • Applying further heuristics influencing the shape of the join tree: linear vs. bushy trees

    • Deterministic search strategies
      Systematic generation of query plans:
      • Starting with plans accessing the base relations
      • Constructing complex plans by combining simpler plans, e.g., joining one more relation at each step
      Example deterministic search strategies:
      • Dynamic programming: (almost) exhaustive search by building all possible plans (breadth-first); "very bad" partial plans are pruned at an early stage; guarantees to find the best plan; only feasible for a small number (5-6) of relations
      • Greedy algorithm: only one plan is built (depth-first)
      Exhaustive search guarantees finding the best plan.
    • Randomized search strategies
      • One or more start plans are built using a greedy strategy (depth-first search)
      • Start plans are improved by examining "neighbor plans"
      • Neighbor plan: obtained by applying transformation rules, e.g., exchanging two arbitrarily chosen operations
      • Better performance with a higher number of relations
      • No guarantee to find the best plan

    • Distributed cost model
      Components:
      • Cost functions: estimating the costs of executing operations
      • Statistics: data about relation sizes, attribute domains, value distributions, etc.
      • Formulas: determining cardinalities, sizes of intermediate results, etc.

    • Cost functions – total execution time
      Sum of all costs, i.e., the sum of all processing times at all nodes involved in answering the query:
        Ttotal = TCPU · #insts + TI/O · #opsI/O + TMSG · #msgs + TTR · #bytes
      Components of total execution time:
      • Local processing costs/time: Tlocal = TCPU · #insts + TI/O · #opsI/O
      • Communication costs/time: Tcomm = TMSG · #msgs + TTR · #bytes
      where
      • TCPU: time to process a CPU instruction
      • TI/O: time for a disk access
      • TMSG: time to send and receive a message
      • TTR: time to transmit a data unit from one node to another
      • #bytes: the sum of the sizes of all messages
      The coefficients (TCPU, TI/O, TMSG, TTR) characterize a specific distributed database system:
      • WAN (Wide Area Network): communication time is dominant
      • LAN (Local Area Network): local costs also play an important role
      Typical assumption: TTR is constant – although this might not be true for remote nodes.
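    The total-time formula translates directly into code; the coefficient defaults below are made-up illustrative values, not taken from the slides:

        def total_time(insts, ops_io, msgs, n_bytes,
                       t_cpu=1e-9, t_io=5e-3, t_msg=1e-3, t_tr=1e-7):
            """T_total = T_CPU*#insts + T_I/O*#opsI/O + T_MSG*#msgs + T_TR*#bytes."""
            return t_cpu * insts + t_io * ops_io + t_msg * msgs + t_tr * n_bytes

        # e.g. 1M instructions, 100 disk accesses, 4 messages, 50 KB shipped:
        print(f"{total_time(1_000_000, 100, 4, 50_000):.3f} s")  # 0.510 s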
    • Cost functions – total time vs. response time
      Response time: the time that elapses between query initiation and completion, considering parallel local processing and parallel communication:
        Tresponse = TCPU · seq#insts + TI/O · seq#opsI/O + TMSG · seq#msgs + TTR · seq#bytes
      where seq#x represents the maximum number of instructions (insts), I/O operations (opsI/O), messages (msgs), or bytes (bytes) that have to be processed sequentially.
      Example (two senders transfer x and y data units in parallel to a common receiver):
        Tcomm,total = 2 · TMSG + TTR · (x + y)
        Tcomm,response = max{TMSG + TTR · x, TMSG + TTR · y}
      Minimizing response time does not imply that the total time is also minimized!

    • Statistics
      Good statistics are crucial:
      • Most important cost factor: the size of the intermediate results produced during execution
      • Sizes are estimated using statistics and formulas
      • Tradeoff between precision and the costs of managing statistics

    • Typical statistics
      Typical statistics for relation R fragmented as R1, R2, ..., Rr with attributes A1, ..., An:
      • Length of each attribute Ai in bytes: length(Ai)
      • Number of distinct values for each attribute Ai in each fragment Rj: valuesAi,Rj := card(πAi(Rj))
      • Minimum and maximum attribute values: min(Ai) and max(Ai)
      • Number of distinct values (cardinality) of the attribute domains: card(dom[Ai])
      • Number of tuples in each fragment Rj: card(Rj)
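    For the two-sender communication example, the contrast between the two cost functions is easy to check numerically (coefficients and message sizes are illustrative):

        T_MSG, T_TR = 1e-3, 1e-7   # illustrative coefficient values
        x, y = 100_000, 40_000     # data units shipped in parallel by the two senders

        t_total = 2 * T_MSG + T_TR * (x + y)                  # both transfers counted
        t_response = max(T_MSG + T_TR * x, T_MSG + T_TR * y)  # the slower branch decides

        print(f"total={t_total:.3f}  response={t_response:.3f}")
        # total=0.016  response=0.011 -- response time < total time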
    • Additional statistics
      • A histogram for each attribute Ai to approximate the frequency distribution
      • A join selectivity factor for some pairs of relations:
          SFJ(R, S) = card(R ⋈ S) / (card(R) · card(S))
        good (high) selectivity: SFJ = 0.001; bad (low) selectivity: SFJ = 0.5

    • Cardinality estimation
      Assumptions:
      • Independence between attributes
      • Uniform distribution of attribute values
      Selectivity: the ratio between the expected number of result tuples and the number of tuples of the input relation:
        SF = expected result size / cardinality of the input relation
      Example: σF(R) returns 10% of R's tuples → SFS(F, R) = 0.1 (SF: selectivity factor)
      Estimated result size (cardinality of the output relation) of a selection:
        card(σF(R)) = SFS(F, R) · card(R)

    • Selection selectivity
      Selectivity depends on the selection predicate p(A) and the constants v (see the functions below):
        SFS(A = v, R) = 1 / valuesA,R = 1 / card(πA(R))
        SFS(A < v, R) = (v − min(A)) / (max(A) − min(A))
        SFS(A > v, R) = (max(A) − v) / (max(A) − min(A))
        SFS(v1 < A < v2, R) = (v2 − v1) / (max(A) − min(A))
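    The selection-selectivity formulas map one-to-one onto small functions; a sketch assuming the uniform-distribution statistics introduced above (the Budget numbers are illustrative):

        def sf_eq(values_a_r):               # SF_S(A = v) = 1 / values_{A,R}
            return 1 / values_a_r

        def sf_lt(v, a_min, a_max):          # SF_S(A < v)
            return (v - a_min) / (a_max - a_min)

        def sf_range(v1, v2, a_min, a_max):  # SF_S(v1 < A < v2)
            return (v2 - v1) / (a_max - a_min)

        # card(σ_F(R)) = SF_S(F, R) * card(R), e.g. Budget uniform in [0, 400.000]:
        card_r = 1_000
        print(sf_range(100_000, 200_000, 0, 400_000) * card_r)  # 250.0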
    • Selection selectivity (cont.)
      Cardinality: card(σF(R)) = SFS(F, R) · card(R)
      Selectivity of composite predicates:
        SFS(p(Ai) ∧ p(Aj), R) = SFS(p(Ai), R) · SFS(p(Aj), R)
        SFS(p(Ai) ∨ p(Aj), R) = SFS(p(Ai), R) + SFS(p(Aj), R) − (SFS(p(Ai), R) · SFS(p(Aj), R))

    • Projection
      Cardinality:
      • Without duplicate elimination: card(πA(R)) = card(R)
      • With duplicate elimination, if defined on an arbitrary attribute A: card(πA(R)) = valuesA,R
      • With duplicate elimination, if one of the attributes is the primary key: card(πAi,...(R)) = card(R)
      • Cardinalities for projections on arbitrary combinations of attributes are hard to predict because attribute correlations are unknown

    • Cartesian product
      Cardinality: card(R × S) = card(R) · card(S)

    • Joins
      Given: R ⋈ S with R(A, B) and S(B, C), natural join on attribute B
      • Upper bound: the size of the Cartesian product
      • No B values shared between R and S: card(R ⋈ S) = 0
      • Foreign key relationship R.B → S.B: card(R ⋈ S) = card(R)
      • All tuples in R.B and S.B have the same value: card(R ⋈ S) = card(R) · card(S)
    • Joins (cont.)
      Given: R ⋈ S with R(A, B) and S(B, C), natural join on attribute B; upper bound: the size of the Cartesian product
      Estimate:
        card(R ⋈ S) = (card(R) · card(S)) / max{valuesB,R, valuesB,S}
      Store statistics (join selectivity SFJ) for important joins:
        card(R ⋈ S) = SFJ · card(R) · card(S)

    • Union and difference
      Cardinality is difficult to estimate because duplicates are removed.
      Union:
      • Upper bound: card(R ∪ S) = card(R) + card(S)
      • Lower bound: card(R ∪ S) = max{card(R), card(S)}
      Difference:
      • Upper bound: card(R − S) = card(R)
      • Lower bound: card(R − S) = 0

    • Selectivity estimation using histograms
      • In reality, the distribution of attribute values in a relation is often not uniform
      • Histograms consist of a set of buckets bi
      Example: histogram on attribute A of relation R; each bucket bi is defined by
      • Range: rangei – a range of values in the attribute domain dom[A]
      • Frequency: fi – the number of tuples of R where R.A ∈ rangei
      • Distinct values: di – the number of distinct values of A where R.A ∈ rangei

    • Selectivity estimation using histograms – equality predicate
      Given predicate A = v:
      • Identify the bucket bi with v ∈ rangei
      • SFS(A = v, R) = 1 / di
      • card(σA=v(R)) = SFS(A = v, R) · fi = fi / di
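    The general join-cardinality estimate from the top of this passage, with the stored-SFJ shortcut, as a small function (the example numbers are illustrative):

        def est_join_card(card_r, card_s, values_b_r, values_b_s, sf_j=None):
            """card(R ⋈ S) on join attribute B.

            If a measured join selectivity factor SF_J is stored for this pair,
            use card = SF_J * card(R) * card(S); otherwise fall back to
            card(R) * card(S) / max(values_{B,R}, values_{B,S}).
            """
            if sf_j is not None:
                return sf_j * card_r * card_s
            return card_r * card_s / max(values_b_r, values_b_s)

        # 400 employees, 1000 assignments, 400 distinct join values on each side:
        print(est_join_card(400, 1_000, values_b_r=400, values_b_s=400))  # 1000.0
        # consistent with the foreign-key case: every assignment finds its employee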
    • Selectivity estimation using histograms – range predicates
      Given predicate A ≤ v:
      • Identify the buckets that overlap the queried range
      • Sum up the frequencies; bucket i, which only partially overlaps the queried range, contributes proportionally:
          card(σA≤v(R)) = Σj=1..i−1 fj + ((v − min(rangei)) / (max(rangei) − min(rangei))) · fi

    • Phases of optimization (recap)
      1 Spanning the search space using transformation rules → equivalent query plans
      2 Applying a search strategy and a cost model → choose an efficient plan
      Main focus: join trees and join ordering

    • Join order optimization
      Simplifying assumptions:
      • No distinction between fragments and relations
      • Ignoring local processing time
      • Ignoring other operations (selection, projection)
      • No pipelining
      • Ignoring data transfer to the result site

    • Join order optimization – two relations
      Determine the join order for two relations R ⋈ S:
      transfer the smaller relation to minimize the network load.
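    A direct implementation of the two histogram estimators, assuming uniformity inside each bucket; the bucket list is made up:

        def est_eq(buckets, v):
            """card(σ_{A=v}(R)) = f_i / d_i for the bucket containing v."""
            for lo, hi, freq, distinct in buckets:
                if lo <= v < hi:
                    return freq / distinct
            return 0.0

        def est_leq(buckets, v):
            """card(σ_{A<=v}(R)): full buckets below v plus a pro-rata share
            of the bucket that only partially overlaps the range."""
            total = 0.0
            for lo, hi, freq, _ in buckets:
                if hi <= v:                # bucket fully inside the range
                    total += freq
                elif lo <= v:              # partial overlap: scale the frequency
                    total += (v - lo) / (hi - lo) * freq
            return total

        hist = [(0, 100, 40, 10), (100, 200, 25, 5), (200, 300, 35, 7)]
        print(est_eq(hist, 150))   # 25 / 5 = 5.0
        print(est_leq(hist, 150))  # 40 + 0.5 * 25 = 52.5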
Join order optimization for three relations
- Determine the join order for three relations R ⋈_A S ⋈_B T, with R at node_R, S at node_S, and T at node_T
- Possible orders:
  1. node_R: send R to node_S; node_S: compute R' = R ⋈ S, send R' to node_T; node_T: compute R' ⋈ T
  2. node_S: send S to node_R; node_R: compute R' = R ⋈ S, send R' to node_T; node_T: compute R' ⋈ T
  3. node_S: send S to node_T; node_T: compute S' = S ⋈ T, send S' to node_R; node_R: compute S' ⋈ R
  4. node_T: send T to node_S; node_S: compute S' = S ⋈ T, send S' to node_R; node_R: compute S' ⋈ R
  5. node_T: send T to node_S, and node_R: send R to node_S (the two transfers can perhaps exploit parallelism); node_S: compute R ⋈ S ⋈ T

Join order optimization with semijoins
- Considering semijoins for joining two relations R (at node_R) and S (at node_S) results in three alternatives, assuming A is the join attribute:
  1. R ⋈_A S = (R ⋉_A S) ⋈_A S = (R ⋉_A π_A(S)) ⋈_A S
  2. R ⋈_A S = R ⋈_A (S ⋉_A R)
  3. R ⋈_A S = (R ⋉_A S) ⋈_A (S ⋉_A R)
- Workflow for alternative 1:
  - node_S: compute S' = π_A(S), send S' to node_R
  - node_R: compute R' = R ⋉_A S', send R' to node_S
  - node_S: compute R' ⋈_A S
- Transfer costs (neglecting T_MSG): T_TR · card(π_A(S)) + T_TR · card(R ⋉_A S)
- Considering full joins (R ⋈_A S) only and assuming that card(R) < card(S), the complete relation R would have been sent to node_S; costs: T_TR · card(R)

Semijoins vs. joins
- Decision based on the sizes of the base relations and the intermediate results
- Transfer costs of the semijoin: T_TR · card(π_A(S)) + T_TR · card(R ⋉_A S)
- Transfer costs of the standard join: T_TR · card(R)
- The semijoin is preferable if card(π_A(S)) + card(R ⋉_A S) < card(R) (see the sketch below)

Total time models
- Basic strategy: a coordinator (master) site performs an exhaustive search; optimization objective: total time
- Input: relational algebra tree, cost model, statistics, locations of the relations
- Output: optimized query execution plan
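The semijoin-vs-join decision rule can be written directly as a predicate over estimated cardinalities; T_TR cancels out, so only the cardinalities matter. A minimal sketch with invented example estimates:

```python
# Decision rule from the slide: ship pi_A(S) plus the reduced R, or ship R?

def semijoin_preferable(card_pi_a_s, card_r_semijoin_s, card_r):
    """True iff card(pi_A(S)) + card(R semijoin S) < card(R)."""
    return card_pi_a_s + card_r_semijoin_s < card_r

# Invented estimates: |pi_A(S)| = 100, |R semijoin S| = 300, |R| = 1000.
print(semijoin_preferable(100, 300, 1000))  # True: use the semijoin
print(semijoin_preferable(800, 900, 1000))  # False: ship R directly
```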
Total time models: aspects
- Cost model
- Site selection and data transfer
- Join order optimization
- Join implementation

Site selection and data transfer

Query shipping
- The query initiator (the node at which the query is issued and optimized) sends the query to other nodes
- The receiver nodes compute the query result and ship the result back to the initiator

Data shipping
- The query remains at the initiator
- The initiator sends data request messages to other nodes
- The receiver nodes ship all required data to the initiator
- The initiator computes the result

Hybrid shipping
- The initiator sends partial queries to other nodes
- The other nodes execute some parts of the query and send intermediate results to the initiator
- The initiator executes the remaining query operations (post-processing)
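A deliberately simplified sketch of the three policies, not from the slides; all function and table names are invented, and "the network" is whatever crosses the function boundary:

```python
def query_shipping(remote_tables, query):
    # The initiator ships the query; the remote node evaluates it and
    # ships only the final result back.
    return query(remote_tables)

def data_shipping(remote_tables, needed, query):
    # The initiator requests the raw data and evaluates the query locally.
    shipped = {name: remote_tables[name] for name in needed}
    return query(shipped)

def hybrid_shipping(remote_tables, partial_query, post_process):
    # The remote node evaluates a partial query; the initiator
    # post-processes the shipped intermediate result.
    intermediate = partial_query(remote_tables)
    return post_process(intermediate)

tables = {"R": [(1, "a"), (2, "b"), (3, "c")]}
q = lambda ts: [t for t in ts["R"] if t[0] > 1]
print(query_shipping(tables, q))                       # remote does all work
print(data_shipping(tables, ["R"], q))                 # initiator does all work
print(hybrid_shipping(tables, q, lambda ir: ir[:1]))   # work is split
```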
Site selection and data transfer for joins

Problem
- Queries make extensive use of joins, and computing joins is very expensive
- Joins need special attention in distributed systems because of fragments and replication

Basic strategies
- Ship whole: transfer the complete relation
- Fetch as needed: transfer the relation piecewise

Scenario
- 2 nodes: one (node_R) storing relation R, the other (node_S) storing relation S
- The query asks for R ⋈ S

  R: A B      S: B C D      R ⋈ S: A B C D
     3 7         9 8 8             1 1 5 1
     1 1         1 5 1             4 5 7 8
     4 6         9 4 2
     7 7         4 3 3
     4 5         4 2 6
     6 2         5 7 8
     5 7

Ship whole: execution at node_R
- node_R: send a data request message (relation S) to node_S
- node_S: send the requested data (relation S) to node_R
- Total costs: 2 messages, 18 attribute values

Ship whole: execution at node_S
- node_S: send a data request message (relation R) to node_R
- node_R: send the requested data (relation R) to node_S
- Total costs: 2 messages, 14 attribute values
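The ship-whole accounting can be checked mechanically. A sketch on the example tables above, under the slides' counting scheme (one request message plus one data message per shipped relation, cost measured in attribute values):

```python
R = [(3, 7), (1, 1), (4, 6), (7, 7), (4, 5), (6, 2), (5, 7)]             # R(A, B)
S = [(9, 8, 8), (1, 5, 1), (9, 4, 2), (4, 3, 3), (4, 2, 6), (5, 7, 8)]   # S(B, C, D)

def ship_whole_cost(relation):
    messages = 2                              # one request + one data message
    values = sum(len(t) for t in relation)    # attribute values shipped
    return messages, values

print(ship_whole_cost(S))  # (2, 18): execute the join at node_R
print(ship_whole_cost(R))  # (2, 14): execute the join at node_S
```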
Ship whole: execution at a third node node_X
- node_X: send a data request message (relation R) to node_R
- node_X: send a data request message (relation S) to node_S
- node_R: send the requested data (relation R) to node_X
- node_S: send the requested data (relation S) to node_X
- Total costs: 4 messages, 18 + 14 = 32 attribute values

Fetch as needed: execution at node_R
- node_R: send a data request message (tuples of relation S with B = 7) to node_S
- node_S: send the requested data (0 tuples of relation S with B = 7) to node_R
- node_R: send a data request message (tuples of relation S with B = 1) to node_S
- node_S: send the requested data (1 tuple of relation S with B = 1) to node_R
- ...
- Total costs: 7 · 2 = 14 messages, 7 + 2 · 3 = 13 attribute values

Fetch as needed: execution at node_S
- node_S: send a data request message (tuples of relation R with B = 9) to node_R
- node_R: send the requested data (0 tuples of relation R with B = 9) to node_S
- node_S: send a data request message (tuples of relation R with B = 1) to node_R
- node_R: send the requested data (1 tuple of relation R with B = 1) to node_S
- ...
- Total costs: 6 · 2 = 12 messages, 6 + 2 · 2 = 10 attribute values

Ship whole vs. fetch as needed: conclusion
- Fetch as needed results in a high number of messages
- Ship whole results in high amounts of transferred data
- More advanced strategies build on these two basic strategies: the semijoin and the bitvector join
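The fetch-as-needed accounting, sketched on the same example tables: one request/reply pair per tuple of the local relation, each request carrying one join value, each reply carrying the matching tuples.

```python
R = [(3, 7), (1, 1), (4, 6), (7, 7), (4, 5), (6, 2), (5, 7)]             # B at index 1
S = [(9, 8, 8), (1, 5, 1), (9, 4, 2), (4, 3, 3), (4, 2, 6), (5, 7, 8)]   # B at index 0

def fetch_as_needed_cost(local, remote, local_b, remote_b):
    messages, values = 0, 0
    for t in local:
        messages += 2                # request + reply for this tuple's B value
        values += 1                  # the requested join value itself
        matches = [u for u in remote if u[remote_b] == t[local_b]]
        values += sum(len(u) for u in matches)    # shipped matching tuples
    return messages, values

print(fetch_as_needed_cost(R, S, 1, 0))  # (14, 13): executed at node_R
print(fetch_as_needed_cost(S, R, 0, 1))  # (12, 10): executed at node_S
```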
Semijoin
- Requests all join partners in just one step
- Basic consideration: R ⋈ S = R ⋈ (S ⋉ R) = R ⋈ (S ⋉ π_B(R)), with B being the join attribute
- Algorithm:
  - node_R: determine π_B(R) and send the result to node_S
  - node_S: determine S' = S ⋉ π_B(R) = S ⋉ R and send the result to node_R
  - node_R: determine R ⋈ S' = R ⋈ S

Bitvector join
- Also known as the hash filter join
- Avoids transferring all join attribute values to the other node; a bitvector BV[1...n] is transferred instead
- Transformation:
  - Choose an appropriate hash function h
  - Apply h to transform the attribute values to the range [1...n]
  - Set the corresponding bits in the bitvector BV[1...n] to 1
- Algorithm:
  - node_R: determine π_B(R), apply the hash function h to the result, set the corresponding bits in BV to 1, and send BV to node_S
  - node_S: apply h to the join attribute of relation S, determine S' = {t ∈ S | BV[h(t.B)] = 1}, and send S' to node_R
  - node_R: determine R ⋈ S' = R ⋈ S
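Both reducers, sketched on the running example. The hash function h(v) = v mod n and the bitvector size n = 8 are invented; everything else follows the two algorithms above. Note how the B = 9 tuples of S survive the bitvector filter as false positives:

```python
R = [(3, 7), (1, 1), (4, 6), (7, 7), (4, 5), (6, 2), (5, 7)]             # R(A, B)
S = [(9, 8, 8), (1, 5, 1), (9, 4, 2), (4, 3, 3), (4, 2, 6), (5, 7, 8)]   # S(B, C, D)

def semijoin_strategy():
    pi_b_r = {t[1] for t in R}                    # node_R: pi_B(R), shipped
    s_red = [u for u in S if u[0] in pi_b_r]      # node_S: S semijoin R, shipped
    return [t + u[1:] for t in R for u in s_red if t[1] == u[0]]   # node_R

def bitvector_strategy(n=8):
    h = lambda v: v % n                           # assumed hash function
    bv = [False] * n
    for t in R:                                   # node_R: build BV, ship it
        bv[h(t[1])] = True
    s_cand = [u for u in S if bv[h(u[0])]]        # node_S: filter S by BV, ship
    return [t + u[1:] for t in R for u in s_cand if t[1] == u[0]]  # node_R

print(semijoin_strategy())    # [(1, 1, 5, 1), (4, 5, 7, 8)]
print(bitvector_strategy())   # same result; B = 9 hashes to the set bit 1,
                              # so two useless S tuples travel as false positives
```

The semijoin ships exact join values; the bitvector ships only n bits but pays for collisions with unnecessary tuples, which is precisely the trade-off discussed next.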
Bitvector join: conclusions
- Transferring the bitvector reduces the network load
- The bitvector only indicates potential join partners because multiple attribute values might map to the same hash value
- This might result in transferring unnecessary tuples
- Requirements: an appropriate hash function h, and n needs to be large enough to avoid a high number of collisions

Response time models
- Two different response times: When does the first result tuple arrive? When have all result tuples arrived?
- "Classic" cost models consider the total resource consumption of a query
  - Good results for heavy computational load and slow network connections
  - By saving resources, many queries can be executed in parallel (minimum load, maximum throughput)
- Optimization for short response times
  - "Waste" some resources to get query results earlier
  - Take advantage of lightly loaded machines and fast connections
  - Utilize intraquery parallelism

Example situation
- Given relations/fragments A, B, C, and D
- Full replication, i.e., all relations/fragments are available on all nodes
- Compute (A ⋈ B) ⋈ (C ⋈ D)
- Assumptions:
  - Each join costs 20 time units (T_CPU + T_I/O)
  - Transferring an intermediate result costs 10 time units (T_MSG + T_TR)
  - Accessing a relation is free
  - Each node has one computational thread
Example: two plans
- Plan 1: execute all operations on one node; total costs: 60
- Plan 2: compute the joins on different nodes and ship the results; total costs: 80
- Plan 1 is obviously better with respect to total costs
- Response time: 60 for plan 1, 50 for plan 2 (see the sketch below)
- Plan 2 is better with respect to response time because operations can be executed in parallel (exploiting intra-query parallelism)
- Response time can be improved even more by applying pipelining

Pipelining
- Goal: good first-tuple response times by executing queries in a pipelined fashion
- Not pipelined:
  - Each operation is fully completed, and an intermediate result is created
  - The next operation reads the intermediate result and is then fully completed
  - Reading and writing intermediate results costs resources
- Pipelined:
  - Operations do not create intermediate results
  - Each processed tuple is fed directly into the next operation
  - Tuples "flow" through the operations

Pipelining: problems
- Operations have different execution times
- If the execution speeds of the operations in a pipeline differ, tuples are either cached or the pipeline blocks
- Some operations are more suitable for pipelining than others:
  - Good: scan, select, project, union, ...
  - Tricky: join, intersection, ...
  - Very hard: sort
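A sketch of the two plans under the stated cost assumptions (join = 20, transfer = 10, base-table access free, one thread per node); the scheduling model, with both partial joins and both transfers overlapping in plan 2, is our reading of the slides:

```python
JOIN, SHIP = 20, 10

# Plan 1: all three joins on one node, strictly sequential.
plan1_total = 3 * JOIN                           # 60
plan1_response = 3 * JOIN                        # 60: no parallelism

# Plan 2: A JOIN B on node 1, C JOIN D on node 2, final join on node 3.
plan2_total = 2 * JOIN + 2 * SHIP + JOIN         # 80: every unit of work counted
plan2_response = max(JOIN, JOIN) + SHIP + JOIN   # 50: the two joins overlap,
                                                 # and so do the two transfers

print(plan1_total, plan1_response)  # 60 60
print(plan2_total, plan2_response)  # 80 50
```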
Pipelining example
- Simple query: tablescan, selection, projection
- 1000 tuples are scanned, the selectivity is 0.1
- Costs (see the sanity check below):
  - Accessing one tuple during the tablescan: 2 time units
  - Selecting (testing) one tuple: 1 time unit
  - Projecting one tuple: 1 time unit

Non-pipelined execution:

  time   event
  2      first tuple in IR1
  2000   all tuples in IR1
  2001   first tuple in IR2
  3000   all tuples in IR2
  3001   first tuple in the result
  3100   all tuples in the result

Pipelined execution:

  time   event
  2      first tuple finished the tablescan
  3      first tuple finished the selection (if selected ...)
  4      first tuple in the result
  3098   last tuple finished the tablescan
  3099   last tuple finished the selection
  3100   all tuples in the result

Pipelining example: join query
- Two table subsets are joined using a non-pipelined BNL (block-nested-loop) join; both input pipelines run in parallel
- Costs:
  - 1000 tuples are scanned in each pipeline, selectivity 0.1
  - Joining 100 × 100 tuples: 10,000 time units (one time unit per combination)
- Response time:
  - The first tuple arrives at the end of either pipeline after 4 time units
  - All tuples have arrived at the ends of the pipelines after 3,100 time units
  - The final result is available after 13,100 time units
  - No benefit from pipelining with respect to response time: the first result tuple arrives long after step 3,100
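A sanity-check sketch of the two timelines under the stated per-tuple costs; it assumes a single processing thread and that the first scanned tuple passes the selection, as the slide's timeline does:

```python
N = 1000
SCAN, SELECT, PROJECT = 2, 1, 1
selected = int(N * 0.1)                         # 100 tuples survive the selection

# Non-pipelined: every operator runs to completion, writing IR1 and IR2.
ir1_done = N * SCAN                             # 2000: all tuples in IR1
ir2_done = ir1_done + N * SELECT                # 3000: all tuples in IR2
first_np = ir2_done + PROJECT                   # 3001: first tuple in the result
all_np = ir2_done + selected * PROJECT          # 3100: all tuples in the result

# Pipelined: each tuple flows through all operators before the next is scanned.
first_p = SCAN + SELECT + PROJECT               # 4: first tuple in the result
all_p = N * SCAN + N * SELECT + selected * PROJECT   # 3100: same total work

print(first_np, all_np)   # 3001 3100
print(first_p, all_p)     # 4 3100
```

The total work (3100) is identical; pipelining only moves the first result tuple forward, from 3001 to 4.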
Joins and pipelining
- The suboptimal result is caused by the unpipelined join A ⋈ B
- Most traditional join algorithms are unsuitable for pipelining
- Single/semi-pipelined: only one input is a pipeline; the other intermediate result has to be fully available
- Fully pipelined: both inputs are processed in a pipelined fashion

Single-pipelined hash join
- The "classic" join algorithm
- Basic idea: one input relation is read from an intermediate result (A), the other is pipelined through the join operation (B)
- All tuples of A are stored in a hash table
  - A hash function is applied to the join attribute
  - All tuples with the same hash value for the join attribute are in the same bucket
- Every incoming tuple of B (via the pipeline) is hashed by its join attributes
  - Compare the tuple to each tuple in the respective bucket of A
  - Return the tuples with matching join attributes

Double-pipelined hash join
- Dynamically build hash tables for both A and B tuples (memory intensive!)
- Process tuples upon arrival; cache tuples if necessary
- Balance between A and B tuples for better performance; rely on statistics for a good A:B ratio
- If a new tuple of relation A arrives:
  - Insert it into the A hash table
  - Check in the B hash table whether there are join partners
  - If yes, return all combined AB tuples
- If a new B tuple arrives, process it analogously

Double-pipelined hash join: example
- B(31, B2) arrives: insert it into the B hash table
- Find the matching A tuples: A3 is found
- Assuming that A3 matches B2, add AB(A3, B2) to the result
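A compact sketch of the double-pipelined (symmetric) hash join described above, assuming an equi-join on a single attribute; the tuple shapes mirror the slide's example:

```python
from collections import defaultdict

class DoublePipelinedHashJoin:
    def __init__(self):
        self.a_table = defaultdict(list)   # hash table over A's join values
        self.b_table = defaultdict(list)   # hash table over B's join values

    def on_a(self, a_tuple, key):
        """A new A tuple arrives: insert it, then probe B for partners."""
        self.a_table[key].append(a_tuple)
        return [(a_tuple, b) for b in self.b_table[key]]

    def on_b(self, b_tuple, key):
        """A new B tuple arrives: processed analogously."""
        self.b_table[key].append(b_tuple)
        return [(a, b_tuple) for a in self.a_table[key]]

# Tuples may arrive in any interleaving; each match is emitted exactly once,
# by whichever side arrives second.
join = DoublePipelinedHashJoin()
print(join.on_a(("A3", 31), key=31))   # []: no B partner has arrived yet
print(join.on_b((31, "B2"), key=31))   # [(('A3', 31), (31, 'B2'))]
```

Because every arriving tuple is inserted before probing, no match is lost regardless of arrival order, which is what makes the operator fully pipelined.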
Pipelining in distributed setups
- In pipelines, tuples "flow" through the operations; this works well with one processing unit (one node)
- Problem: sending each tuple separately from one node to another might be inefficient
- Communication costs:
  - Setting up the transfer and opening a communication channel
  - Composing a message
  - Transmitting the message: header information and payload (the minimum packet size is bigger than a tuple)
  - Receiving and decoding the message
  - Closing the channel

Tuple blocking
- Minimize the communication overhead by tuple blocking: do not send single tuples, but blocks containing multiple tuples (burst transmission; see the sketch below)
- Packets have to be cached
- The block size should be at least the packet size of the underlying network protocol
- This results in even more cost factors for the cost model

Summary on global query optimization
- Global query optimization has to deal with additional constraints and cost factors compared to "classic" query optimization
- Many steps can be reused from centralized query processing
- Optimization in distributed systems is much more complex (network latency, selectivities, communication costs, response time, etc.)

Summary I
- Detour on centralized query processing: query parsing, query transformation, query optimization
- Basics of distributed query optimization:
  - Network costs, network model, shipping policies
  - Fragmentation and allocation schemes
  - Different optimization goals (response time vs. total time)
  - Meta data management: where to store the global catalog?
  - Data localization: consider the fragmentation
- Distributed query optimization:
  - Very important question: where to execute which parts of the query?
  - When to optimize: compile time vs. dynamic optimization; most common: semi-dynamic and hierarchical optimization
  - Cost model (cost functions, statistics, cardinality estimation, etc.)
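Returning to tuple blocking above: a minimal sketch of batching a tuple stream into blocks before transmission; the block size and the stream are invented for illustration:

```python
def blocks(tuples, block_size):
    """Group a tuple stream into blocks of at most block_size tuples."""
    block = []
    for t in tuples:
        block.append(t)        # cache tuples until the block is full
        if len(block) == block_size:
            yield block        # burst transmission: one message per block
            block = []
    if block:
        yield block            # flush the final, possibly smaller block

# 1000 tuples at 50 tuples per block: 20 messages instead of 1000.
stream = ((i, i % 7) for i in range(1000))
print(sum(1 for _ in blocks(stream, 50)))   # 20
```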
Summary II
- Join order optimization
- Join implementations (ship whole, fetch as needed, semijoin, bitvector join, pipelined hash join, etc.)
- Total time and response time