Distributed Database Systems
  1. Distributed Database Systems: Distributed Query Processing
     Katja Hose, Ralf Schenkel
     Max-Planck-Institut für Informatik, Cluster of Excellence MMCI
     November 10, 2011 / November 17, 2011
     Katja Hose, Distributed Database Systems, November 10, 2011, 1 / 167

     Contents
       1 Motivation
       2 Detour on centralized query processing
           Translating SQL into relational algebra
           Phases of centralized query processing
           Query parsing
           Query transformation
           Query optimization
       3 Basics of distributed query processing
           Phases of distributed query processing
           Introduction
           Meta data management
           Data localization
       4 Global query optimization
           Main questions
           Global query optimizer
           Distributed cost model
           Join order optimization
           Total time models
           Response time models
       5 Summary

     Motivation
       The task of query processing is to answer user queries.
       Example
         How many students are at Saarland University? Answer: 18,000
       Additional constraints
         • Low response times
         • High query throughput
         • Efficient hardware usage
         • ...
  2. Motivation
       Differences to centralized query processing
         • Considering the physical data distribution during query optimization
         • Considering communication costs
       Assumptions
         • Data is distributed among multiple nodes
         • Existence of a global conceptual schema, which is used by all nodes
         • Queries are formulated on the global schema

     2 Detour on centralized query processing

     Translating SQL into relational algebra
       SQL query structure:
         select distinct a1, ..., an
         from R1, ..., Rk
         where p
       Algorithm:
       1 Translating the from clause
         Let R1, ..., Rk be the relations in the from clause of the query.
         Construct expression:
           R = R1                                  if k = 1
           R = ((...(R1 × R2) × ...) × Rk)         otherwise
  3. Translating SQL into relational algebra
       Algorithm:
       2 Translating the where clause
         Let F be the predicate in the where clause of the query (if a where clause exists).
         Construct expression:
           W = R          if there is no where clause
           W = σ_F(R)     otherwise
       3 Translating the select clause
         Let a1, ..., an (or "*") be the projection in the select clause of the query.
         Construct expression:
           S = W                      if the projection is "*"
           S = π_{a1,...,an}(W)       otherwise
         Output: S

     Translating SQL into relational algebra: example query
         select distinct e.EName, s.Salary
         from Employees e, Salary s
         where e.Title = s.Title and s.Salary ≥ 60,000
       Applying the three steps:
         R = Employees × Salary
         W = σ_F(R) with F = (e.Title = s.Title ∧ s.Salary ≥ 60,000)
         S = π_{e.EName, s.Salary}(W)

     Phases of centralized query processing
       [Figure: workflow for centralized query processing]
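The three translation steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the lecture: the names `translate` and `paren` and the string notation `σ_{...}` / `π_{...}` are hypothetical choices made here, and predicates are treated as opaque strings rather than parsed.

```python
def paren(expr):
    """Wrap expr in parentheses unless it is already fully parenthesized.

    Only expressions built by translate() pass through here, so checking
    the first and last character is sufficient for this sketch.
    """
    if expr.startswith("(") and expr.endswith(")"):
        return expr
    return f"({expr})"


def translate(select_list, from_relations, where_predicate=None):
    """Return the canonical (naive) relational algebra expression as a string."""
    if not from_relations:
        raise ValueError("the from clause needs at least one relation")

    # Step 1: from clause -> left-deep chain of Cartesian products.
    r = from_relations[0]
    for rel in from_relations[1:]:
        r = f"({r} × {rel})"

    # Step 2: where clause -> selection on top of R (skipped if absent).
    w = f"σ_{{{where_predicate}}}{paren(r)}" if where_predicate else r

    # Step 3: select clause -> projection on top of W (skipped for "*").
    if select_list == ["*"]:
        return w
    attrs = ", ".join(select_list)
    return f"π_{{{attrs}}}{paren(w)}"


example = translate(
    ["e.EName", "s.Salary"],
    ["Employees", "Salary"],
    "e.Title = s.Title ∧ s.Salary ≥ 60000",
)
print(example)
```

The result is deliberately unoptimized: it is the Cartesian-product plan that the later transformation and optimization phases start from.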
  4. Query parsing
       Transform a declarative query into an internal representation
         • Query formulated using a declarative query language, e.g., SQL
         • The parser translates the query into an internal representation
         • Called naive query plan
         • Plan described by an operator tree of relational algebra operators

     Query parsing: example
       Database managing information about employees and projects
         Employees(EID, EName, Title)
         Assignment(ENo, PNo, Duration)
       Query: return the names of all employees working for project 'P1'
         SELECT EName
         FROM Employees e, Assignment a
         WHERE e.EID = ENo AND PNo='P1'
       Translation into relational algebra
         π_EName(σ_{PNo='P1' ∧ Employees.EID=Assignment.ENo}(Employees × Assignment))
       In contrast to the SQL statement, the algebra statement already contains the required basic evaluation operators.

     Query parsing: operator tree
         π_EName
           |
         σ_{PNo='P1' ∧ Employees.EID=Assignment.ENo}
           |
           ×
          / \
         Employees   Assignment
  5. Query transformation
       [Figure: workflow for centralized query processing]

     Query transformation
       Steps
         1 Name resolution
             Transforming object names into internal names
         2 Semantic analysis
             Checking for global relations and attributes, view expansion, global access control
         3 Normalization
             Transforming predicates into a canonical format
         4 Simple algebraic rewriting
             Application of heuristics to eliminate bad plans

     Semantic analysis
         • Check if the global schema defines all attributes and relations referenced in the query
         • If the query is formulated on a view, replace references to relations/attributes with references to global relations/attributes
         • Perform simple integrity checks, e.g., are the types of attributes used in comparison predicates of the same type?
         • Initial check if the query has the rights to access referenced relations/attributes

     Normalization
       Objective
         • Simplification of the following optimization by transforming the query into a canonical format
       Selection and join predicates
         • Conjunctive normal form vs. disjunctive normal form
         • Conjunctive normal form:
             (p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
         • Disjunctive normal form:
             (p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)
         • Transformation based on equivalence rules for logical operators
  6. Normalization
       Equivalence rules
         • p1 ∧ p2 ⇔ p2 ∧ p1  and  p1 ∨ p2 ⇔ p2 ∨ p1
         • p1 ∧ (p2 ∧ p3) ⇔ (p1 ∧ p2) ∧ p3  and  p1 ∨ (p2 ∨ p3) ⇔ (p1 ∨ p2) ∨ p3
         • p1 ∧ (p2 ∨ p3) ⇔ (p1 ∧ p2) ∨ (p1 ∧ p3)  and  p1 ∨ (p2 ∧ p3) ⇔ (p1 ∨ p2) ∧ (p1 ∨ p3)
         • ¬(p1 ∧ p2) ⇔ ¬p1 ∨ ¬p2  and  ¬(p1 ∨ p2) ⇔ ¬p1 ∧ ¬p2
         • ¬(¬p1) ⇔ p1

     Normalization: example
         SELECT EName
         FROM Employees e, Assignment a
         WHERE e.EID = a.ENo AND Duration ≥ 3 AND (PNo='P1' OR PNo='P2')
       Selection condition in disjunctive normal form
         (EID = ENo ∧ Duration ≥ 3 ∧ PNo='P1') ∨ (EID = ENo ∧ Duration ≥ 3 ∧ PNo='P2')
       Selection condition in conjunctive normal form
         EID = ENo ∧ Duration ≥ 3 ∧ (PNo='P1' ∨ PNo='P2')

     Simple algebraic rewriting
       Simple optimizations that are always beneficial regardless of system state
         • Elimination of redundant predicates
         • Simplification of expressions
         • Unnesting of subqueries and views
       Tasks
         • Recognize and simplify all expressions/operations/subqueries that are "obviously" unnecessary, redundant, or contradictory.
         • Do not consider system state information, e.g., size of tables, existence of indexes, etc.

     Query optimization
       [Figure: workflow for centralized query processing]
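The claim that the disjunctive and conjunctive forms in the normalization example describe the same selection condition can be checked mechanically. The sketch below is an illustration added here, not part of the slides: it treats the four atomic predicates as independent Boolean variables (the hypothetical names `join`, `dur`, `p1`, `p2` stand for EID = ENo, Duration ≥ 3, PNo='P1', PNo='P2') and compares both forms over the full truth table.

```python
from itertools import product


def dnf(join, dur, p1, p2):
    # (EID = ENo ∧ Duration ≥ 3 ∧ PNo='P1') ∨ (EID = ENo ∧ Duration ≥ 3 ∧ PNo='P2')
    return (join and dur and p1) or (join and dur and p2)


def cnf(join, dur, p1, p2):
    # EID = ENo ∧ Duration ≥ 3 ∧ (PNo='P1' ∨ PNo='P2')
    return join and dur and (p1 or p2)


# Compare both forms on all 2^4 truth assignments of the atomic predicates.
equivalent = all(
    dnf(*values) == cnf(*values)
    for values in product([False, True], repeat=4)
)
print(equivalent)
```

The two forms agree on every assignment, which is the distributive law p1 ∧ (p2 ∨ p3) ⇔ (p1 ∧ p2) ∨ (p1 ∧ p3) applied to the example.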
  7. Query optimization
       Steps
         1 Algebraic optimization
             Find a good relational algebra operator tree
             • Heuristic query optimization
             • Cost-based query optimization
             • Statistical query optimization
         2 Physical optimization
             Find suitable algorithms for implementing the operations

     Heuristics
       Use simple heuristics which usually lead to better performance
         • The optimal plan is not needed, but the really bad ones should be avoided
       Heuristics
         • Break selections
             Complex selection criteria should be broken into multiple parts
         • Push projection and push selection
             Cheap selections and projections should be performed as early as possible to reduce the sizes of intermediate results
         • Force joins
             In most cases, using a join is much cheaper than using a Cartesian product and a selection

     Algebraic optimization rules
       Operator ⋈ is commutative:
         r1 ⋈ r2 ⇔ r2 ⋈ r1
       Operator ⋈ is associative:
         (r1 ⋈ r2) ⋈ r3 ⇔ r1 ⋈ (r2 ⋈ r3)
       For operator π in combination with another operator π, the "outer" parameter dominates the "inner" one:
         π_X(π_Y(r1)) ⇔ π_X(r1)  if X ⊆ Y
       Selections σ can be combined using logical and (∧). The order of the selections is arbitrary (exploiting commutativity of ∧):
         σ_F1(σ_F2(r1)) ⇔ σ_{F1 ∧ F2}(r1) ⇔ σ_F2(σ_F1(r1))
  8. Algebraic optimization rules
       Operators π and σ commute if predicate F is defined based on the projection attributes:
         σ_F(π_X(r1)) ⇔ π_X(σ_F(r1))  if attr(F) ⊆ X
       Alternatively, change in ordering possible if the projection is extended by all necessary attributes:
         π_X1(σ_F(r1)) ⇔ π_X1(σ_F(π_{X1,X2}(r1)))  if attr(F) ⊇ X2

       Operators σ and ⋈ commute if all selection attributes are contained in the same relation:
         σ_F(r1 ⋈ r2) ⇔ σ_F(r1) ⋈ r2  if attr(F) ⊆ R1
       A selection predicate can be split up in conjunction with a join (F = F1 ∧ F2) if the attributes referred to by F1 and F2 are contained in different relations:
         σ_F(r1 ⋈ r2) ⇔ σ_F1(r1) ⋈ σ_F2(r2)  if attr(F1) ⊆ R1 and attr(F2) ⊆ R2
       In any case, part of a selection can be split up by separating predicates F1 referencing attributes of R1 only; F2 contains the remaining predicates referencing attributes of both relations:
         σ_F(r1 ⋈ r2) ⇔ σ_F2(σ_F1(r1) ⋈ r2)  if attr(F1) ⊆ R1

       Commutativity of σ and ∪:
         σ_F(r1 ∪ r2) ⇔ σ_F(r1) ∪ σ_F(r2)
       Commutativity of σ and −:
         σ_F(r1 − r2) ⇔ σ_F(r1) − σ_F(r2)
       or in case F only references tuples in r1:
         σ_F(r1 − r2) ⇔ σ_F(r1) − r2

       Commutativity of π and ⋈:
         π_X(r1 ⋈ r2) ⇔ π_X(π_Y1(r1) ⋈ π_Y2(r2))
         with Y1 = (X ∩ R1) ∪ (R1 ∩ R2) and Y2 = (X ∩ R2) ∪ (R1 ∩ R2)
       Pushing a projection is possible if all Yi are defined in such a way that they preserve all attributes necessary to perform the join.
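The rule σ_F(r1 ⋈ r2) ⇔ σ_F(r1) ⋈ r2 (for attr(F) ⊆ R1) can be tried out on toy data. The following is a minimal sketch, assuming relations are represented as lists of dicts and ⋈ is a natural join; the relations `r1`, `r2`, their attributes A, B, C, and the helper names are all invented for this illustration.

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of dicts."""
    result = []
    for t1 in r1:
        for t2 in r2:
            common = t1.keys() & t2.keys()
            if all(t1[a] == t2[a] for a in common):
                result.append({**t1, **t2})
    return result


def select(pred, r):
    """Selection: keep the tuples of r that satisfy pred."""
    return [t for t in r if pred(t)]


def as_set(r):
    """Order-independent representation of a relation for comparison."""
    return {tuple(sorted(t.items())) for t in r}


# Hypothetical toy relations r1(A, B) and r2(B, C).
r1 = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}, {"A": 3, "B": "y"}]
r2 = [{"B": "x", "C": 10}, {"B": "y", "C": 20}, {"B": "z", "C": 30}]


def F(t):
    # Selection predicate that references attribute A of r1 only.
    return t["A"] >= 2


late = select(F, natural_join(r1, r2))    # σ_F(r1 ⋈ r2)
early = natural_join(select(F, r1), r2)   # σ_F(r1) ⋈ r2

print(as_set(late) == as_set(early))
print(len(r1), "join input tuples without pushdown,",
      len(select(F, r1)), "with pushdown")
```

Both plans return the same tuples, but the pushed-down variant feeds fewer tuples into the join, which is the point of the "push selection" heuristic.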
  9. Algebraic optimization rules
       Further rules
         • Commutativity of π and ∪: π_X(r1 ∪ r2) ⇔ π_X(r1) ∪ π_X(r2)
         • Distributive law for ⋈ and ∪, distributive law for ⋈ and −, commutativity of renaming β with other operators, ...
         • Idempotence, e.g., A ∨ A ⇔ A
         • Operations involving empty relations
         • Commutative and associative laws for ⋈, ∪ and ∩

     Heuristic algebraic optimization: example
       Use algebraic optimization heuristics
         • Force join
         • Push selection and projection

     Cost-based algebraic query optimization
       Most non-distributed RDBMS strongly rely on cost-based optimizations
         • Aim for better optimized plan with respect to system and data characteristics
         • Join order optimization
       Basic approach
         • Establish a cost model for various operations
         • Enumerate all query plans and compute costs
         • Pick the best query plan
       Usually, dynamic programming techniques are used to keep computational effort manageable

     Physical query optimization
       Input:
         • Optimized query plan consisting of algebra operators
       Choose an algorithm to compute a particular algebra operator
         • Join: Block-Nested-Loop join, hash join, merge join, ...
         • Select: Full table scan, index lookup, ad-hoc index generation & lookup, ...
       Tasks
         • Translating a query plan into an execution plan
         • Physical and algebraic optimization are often interleaved
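The basic cost-based approach (establish a cost model, enumerate plans, pick the cheapest) can be sketched for join ordering. Everything concrete below is an assumption made for illustration: the cardinalities, the selectivities, the `Projects` relation size, and the toy cost model (sum of intermediate result sizes over left-deep plans). The sketch enumerates all orders exhaustively; it does not implement the dynamic programming that real optimizers use to prune this search.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical statistics, invented for this sketch.
card = {"Employees": 400, "Assignment": 1000, "Projects": 50}
sel = {
    frozenset({"Employees", "Assignment"}): Fraction(1, 400),
    frozenset({"Assignment", "Projects"}): Fraction(1, 50),
}


def result_size(relations):
    """Estimated size of joining the given relations.

    Pairs without a join predicate get selectivity 1 (Cartesian product).
    """
    size = Fraction(1)
    for r in relations:
        size *= card[r]
    for i, a in enumerate(relations):
        for b in relations[i + 1:]:
            size *= sel.get(frozenset({a, b}), 1)
    return size


def plan_cost(order):
    """Toy cost of a left-deep plan: sum of all intermediate result sizes."""
    return sum(result_size(order[:k]) for k in range(2, len(order) + 1))


orders = list(permutations(card))
best = min(orders, key=plan_cost)

for order in orders:
    print(" ⋈ ".join(order), plan_cost(order))
print("chosen:", " ⋈ ".join(best))
```

Orders that join Employees and Projects first have no join predicate between them, so the model charges them for a Cartesian product; this is the kind of plan the "force joins" heuristic is meant to avoid.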
  10. Query optimization example
       Output: query execution plan
       [Figure: query execution plan]

     3 Basics of distributed query processing

     Phases of distributed query processing
       [Figure: workflow for distributed query processing]
  11. Basic considerations
       Distributed query processing
         • Shares the same properties of centralized query processing
         • Similar problem but with different objectives and constraints
       Objectives for centralized query processing
         • Minimize the number of disk accesses
         • Minimize computational time
       Objectives for distributed query processing
         • Minimize resource consumption
         • Minimize response time
         • Maximize throughput

     Basic considerations
       Costs are more difficult to predict
         • Join selectivity: is it worthwhile to push down a selection?
         • Data is distributed: difficult to get meaningful statistics
         • Network latency is very hard to predict
         • Current workload at nodes, load shedding
       Additional cost factors and constraints
         • Extension of relational algebra (sending/receiving data)
         • Data localization (which node holds relevant data)
         • Replication and caching (where to compute an operation)
         • Network models
         • Response-time models
         • Data and structural heterogeneity (federated databases ...)

     Consequences
       Optimization is much more difficult than in the central case
         • Statistics and costs change over time, e.g., workload at a node, network load
         • More conflicting optimization goals
             Increase throughput → reduce replication and parallelization,
             reduce query response time → increase parallelization
         • More cost factors and constraints
       Consequences
         • Adaptive query plans (create an initial plan and optimize it on-the-fly)
         • Do not aim for the best plan, but for a good plan

     Example
       Query
         Return the names of all employees working for project 'P1'
         π_EName(π_{EID,EName}(Employees) ⋈_{Employees.EID=Assignment.ENo} π_ENo(σ_{PNo='P1'}(Assignment)))
       Problems
         • Relations are fragmented and distributed among five nodes
         • The Employees relation uses primary horizontal fragmentation
             One fragment located at node 1, the other at node 2, no replication
         • The Assignment relation uses derived horizontal fragmentation
             One fragment located at node 3, the other at node 4, no replication
         • The query originates from node 5
  12. Example
       [Figure: example setup]

     Example
       Cost model and statistics
         • Accessing a tuple costs 1 unit (acc)
         • Transferring a tuple costs 10 units (trans)
         • There are 400 employees and 1000 assignments
         • 20 assignments for project 'P1'
         • All tuples are uniformly distributed, i.e., nodes 3 and 4 provide 10 assignments for project 'P1' each
         • There are local indexes on attribute PNo at nodes 3 and 4 (as well as indexes on primary keys at all nodes)
             Direct tuple access is possible on local sites, no scanning
         • All nodes can directly communicate with each other
         • Simplification: no costs for unions and projections

     Example
       Simple execution plan, version A: transfer all data to node 5
       [Figure: execution plan A]

     Example
       Simple execution plan, version B: ship intermediate results
       [Figure: execution plan B]
  13. Example
       Costs plan A: 23,000 units
       [Figure: cost breakdown for plan A]

     Example
       Costs plan B: 440 units
       [Figure: cost breakdown for plan B]

     Important aspects of distributed query processing
         • Meta data management
         • Data localization
         • Global query optimization
         • Post-processing
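The plan figures are not part of this transcript, so the per-step breakdown below is a reconstruction and should be read as an assumption: it is one decomposition, using only the stated cost model (acc = 1, trans = 10) and statistics, that reproduces the two totals given on the slides. In particular, the scan and nested-loop join at node 5 for plan A and the one-lookup-per-tuple join for plan B are guesses about what the lost figures showed.

```python
ACC, TRANS = 1, 10            # cost per tuple access / tuple transfer (from the slides)
EMPLOYEES = 400               # tuples in Employees
ASSIGNMENTS = 1000            # tuples in Assignment
P1_ASSIGNMENTS = 20           # assignments for project 'P1' (10 each at nodes 3 and 4)

# Plan A: transfer all data to node 5, evaluate everything there.
# Assumed: node 5 has no indexes on the shipped data.
transfer_a = (EMPLOYEES + ASSIGNMENTS) * TRANS   # ship both relations
select_a = ASSIGNMENTS * ACC                     # scan Assignment for PNo='P1'
join_a = P1_ASSIGNMENTS * EMPLOYEES * ACC        # nested-loop join with Employees
cost_a = transfer_a + select_a + join_a

# Plan B: ship intermediate results.
# Assumed: index lookups on PNo and on the primary key EID.
select_b = P1_ASSIGNMENTS * ACC                  # indexed selection at nodes 3 and 4
ship_b = P1_ASSIGNMENTS * TRANS                  # ship selected tuples to nodes 1 and 2
join_b = P1_ASSIGNMENTS * ACC                    # one direct Employees lookup per tuple
result_b = P1_ASSIGNMENTS * TRANS                # ship join results to node 5
cost_b = select_b + ship_b + join_b + result_b

print("plan A:", cost_a)
print("plan B:", cost_b)
```

Under these assumptions most of plan A's cost comes from shipping and joining data that never contributes to the answer, which is why pushing the selection to the nodes that hold the data pays off so strongly.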
  14. Meta data management
       [Figure: workflow for distributed query processing]

     Meta data management
       Prerequisites to perform query optimization
         • Meta data must be available
         • Meta data is stored in the catalog
         • Catalog provides information about the data distribution
         • Use this information to decide, for instance, if it is worthwhile to execute a selection very early.

     Meta data management
       Typical contents of a catalog for distributed database management systems
         • Database schema
             Definitions of tables, views, constraints, keys, ...
         • Partitioning schema
             Information about how the schema is partitioned and how tables can be reconstructed
         • Allocation schema
             Information about which fragment can be found at which node (including information about replication)
         • Network information
             Information about node connections, network model
         • Additional physical information
             Information about indexes, data statistics (histograms, etc.), hardware resources (processing & storage), ...

     Meta data management
       Where to store the catalog in a distributed system?
         • Central node
             Simple solution, bottleneck
         • Replicated at all nodes
             Updates are expensive
         • Fragmented
             In rare cases, the catalog may become very large
             Catalog has to be fragmented and allocated
         • Caching
             Replicate only needed parts of a central catalog, anticipate potential inconsistencies
  15. Meta data management
       Centralized catalog
         • One instance of the global catalog at a central node
         • Advantages
             No need to update copies
             Little memory consumption
         • Disadvantages
             Communication with central node for each query
             Central node potentially represents a bottleneck

     Meta data management
       Replicated catalog
         • Full copy of the global catalog at each node
         • Advantages
             Little communication overhead for queries
             Good availability
         • Disadvantages
             High update costs

     Meta data management
       Fragmented catalog
         • Partitioning the global catalog and assigning partitions to nodes
         • Advantages
             Sharing load among nodes
             Reducing update overhead
         • Disadvantages
             Localizing necessary partitions of the global catalog

     Meta data management
       Caching catalog data
         • Caching non-local catalog data
         • Advantages
             Avoiding remote access to frequently needed catalog data
             Reducing communication overhead
         • Disadvantages
             Coherency control
             Invalidating cached copies in the presence of updates
  16. Meta data management
       Caching catalog data
         • Explicit invalidation
             Owner of catalog data remembers nodes with local copies
             In case of updates: sending an invalidation message to nodes with local copies
         • Implicit invalidation
             Identifying old catalog data during runtime (adding version numbers and time stamps to query messages)

     Data localization
       [Figure: workflow for distributed query processing]

     Data localization
       Objective
         • Creating subqueries in consideration of the data distribution
       Assumptions
         • Fragmentation is defined by fragmentation expressions
         • Each fragment is allocated only at one node (no replication)
         • Fragmentation expressions and locations of the fragments are stored in the catalog
       Main tasks
         • Replace access to global relations with accesses to the fragments
             Insert reconstruction expression into algebra query
         • Basic algebraic simplifications of the query

     Example: horizontal reduction
       Schema
         Projects1 = σ_{Budget≤150,000}(Projects)
         Projects2 = σ_{150,000<Budget≤200,000}(Projects)
         Projects3 = σ_{Budget>200,000}(Projects)
       Reconstruction expression (horizontal fragmentation)
         Projects = Projects1 ∪ Projects2 ∪ Projects3
       Example query
         σ_{Location='Saarbr.' ∧ Budget≤100,000}(Projects)
       After replacing references to global relations
         σ_{Location='Saarbr.' ∧ Budget≤100,000}(Projects1 ∪ Projects2 ∪ Projects3)
       Further optimization is possible!
  17. Query simplification: horizontal reduction
       Objective
         • Eliminate non-necessary subqueries
       Horizontal reduction rule
         • Given fragments of R as F_R = {R1, ..., Rn} with Ri = σ_pi(R)
         • All fragments Ri for which σ_ps(Ri) = ∅ can be removed, with ps denoting the query's selection predicate
             σ_ps(Ri) = ∅ ⇐ ∀x ∈ R : ¬(ps(x) ∧ pi(x))
         • The selection with the query predicate ps on fragment Ri is empty if ps contradicts the fragmentation predicate pi of Ri, i.e., ps and pi are never true at the same time for any tuple in Ri

     Example: horizontal reduction
       Query with fragmentation expression
         σ_{Location='Saarbr.' ∧ Budget≤100,000}(Projects1 ∪ Projects2 ∪ Projects3)
       Fragment definitions
         Projects1 = σ_{Budget≤150,000}(Projects)
         Projects2 = σ_{150,000<Budget≤200,000}(Projects)
         Projects3 = σ_{Budget>200,000}(Projects)
       Because of
         σ_{Budget≤100,000}(Projects2) = ∅,  σ_{Budget≤100,000}(Projects3) = ∅
       we obtain the reduced query
         σ_{Location='Saarbr.'}(σ_{Budget≤100,000}(Projects1))

     Query simplification: join reduction
       Join reductions
         • Larger joins are replaced by multiple partial joins on fragments
             Distributive law: (R1 ∪ R2) ⋈ S = (R1 ⋈ S) ∪ (R2 ⋈ S)
         • Eliminate all union fragments that will return an empty result
       Expectations
         • Elimination of partial joins producing empty results
             Depends on fragmentation optimality
         • Many joins on small relations have lower resource costs than one large join
             Depends on fragmentation and applied join algorithms
         • Smaller joins can be executed in parallel
             Might decrease response time but might also increase communication costs

     Example: join reduction
       Schema
         Projects(PNo, PName, Budget, Location)
           Projects1 = σ_{PNo='P1' ∨ PNo='P2'}(Projects)
           Projects2 = σ_{PNo='P3'}(Projects)
           Projects3 = σ_{PNo='P4'}(Projects)
         Assignment(ENo, PNo, Duration)
           Assignment1 = σ_{PNo='P1' ∨ PNo='P2'}(Assignment)
           Assignment2 = σ_{PNo='P3' ∨ PNo='P4'}(Assignment)
       Example query
         select * from Projects p, Assignment a where p.PNo = a.PNo
       In relational algebra
         Projects ⋈ Assignment
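The horizontal reduction on the Budget fragments can be sketched as an interval test. This is an illustration under simplifying assumptions, not the lecture's algorithm: each fragmentation predicate and the query predicate are modelled as a half-open interval (lo, hi] on Budget, the Location condition is ignored because no fragment is defined on it, and the names `fragments`, `query`, and `may_overlap` are hypothetical.

```python
import math

# Fragmentation predicates on Budget as half-open intervals (lo, hi].
fragments = {
    "Projects1": (-math.inf, 150_000),   # Budget ≤ 150,000
    "Projects2": (150_000, 200_000),     # 150,000 < Budget ≤ 200,000
    "Projects3": (200_000, math.inf),    # Budget > 200,000
}

# Budget part of the query predicate ps: Budget ≤ 100,000.
query = (-math.inf, 100_000)


def may_overlap(a, b):
    """True if two (lo, hi] intervals share at least one value.

    If they share none, ps and pi contradict each other and the
    selection on that fragment is guaranteed to be empty.
    """
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return lo < hi


relevant = [name for name, interval in fragments.items()
            if may_overlap(interval, query)]
print(relevant)
```

Only the fragments in `relevant` need to be contacted; the selection on every other fragment is empty by construction of the fragmentation predicates.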
  18. Basics of distributed query processing: Data localization
      Katja Hose, Distributed Database Systems, November 10, 2011, slides 69-72 of 167

      Example – join reduction

      Query
          Projects ⋈ Assignment
      After replacing global relations with reconstruction expressions
          (Projects1 ∪ Projects2 ∪ Projects3) ⋈ (Assignment1 ∪ Assignment2)
      After applying the distributive law
          (Projects1 ⋈ Assignment1) ∪ (Projects1 ⋈ Assignment2) ∪
          (Projects2 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2) ∪
          (Projects3 ⋈ Assignment1) ∪ (Projects3 ⋈ Assignment2)
      Further optimization is possible!

      Query simplification – join reduction

      Join reduction rule
        Given fragments of R as F_R = {R_1, ..., R_n} and fragments of S as F_S = {S_1, ..., S_n}.
        Apply the distributive law, e.g.:
          (R1 ∪ R2) ⋈ (S1 ∪ S2) = (R1 ⋈ S1) ∪ (R1 ⋈ S2) ∪ (R2 ⋈ S1) ∪ (R2 ⋈ S2)
        All partial joins between fragments R_i and S_j for which R_i ⋈ S_j = ∅ can be removed.
          R_i ⋈ S_j = ∅ ⇐ ∀x ∈ R_i, y ∈ S_j : ¬(p_i(x) ∧ p_j(y))
        The join between fragments R_i and S_j is empty if their respective fragmentation predicates (on the join attribute) contradict, i.e., there is no tuple combination x and y such that both partitioning predicates are fulfilled at the same time.

      Example – join reduction

      Query with fragmentation expression
          (Projects1 ⋈ Assignment1) ∪ (Projects1 ⋈ Assignment2) ∪
          (Projects2 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2) ∪
          (Projects3 ⋈ Assignment1) ∪ (Projects3 ⋈ Assignment2)
      Some of these partial joins are empty, e.g.:
          Projects1 ⋈ Assignment2 = ∅
      Because their fragmentation expressions contradict:
          Projects1 = σ_{PNo='P1' ∨ PNo='P2'}(Projects) and
          Assignment2 = σ_{PNo='P3' ∨ PNo='P4'}(Assignment)
      Reduced query
          (Projects1 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2) ∪ (Projects3 ⋈ Assignment2)

      Query simplification – join reduction for horizontal fragmentation

        The easiest join reduction case follows from derived horizontal fragmentation
          For each fragment of the first relation, there is exactly one matching fragment of the second relation
          Simply use the information contained in the reconstruction expression instead of comparing the reconstruction predicates to each other
        Join reduction for arbitrary horizontal partitioning might not be beneficial
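The join reduction example can be replayed with a small Python sketch. It assumes that every fragmentation predicate is a disjunction of equality tests on the join attribute PNo, so each fragment can be described by the set of PNo values it admits. The dictionaries and the function `reduce_join` are hypothetical names chosen for illustration.

```python
from itertools import product

# PNo values admitted by each fragmentation predicate (from the example).
PROJECTS = {
    "Projects1": {"P1", "P2"},
    "Projects2": {"P3"},
    "Projects3": {"P4"},
}
ASSIGNMENT = {
    "Assignment1": {"P1", "P2"},
    "Assignment2": {"P3", "P4"},
}


def reduce_join(r_frags: dict, s_frags: dict) -> list:
    """Enumerate all partial joins R_i ⋈ S_j (distributive law) and drop
    those whose predicates on the join attribute contradict each other."""
    return [
        (r_name, s_name)
        for (r_name, r_vals), (s_name, s_vals)
        in product(r_frags.items(), s_frags.items())
        if r_vals & s_vals  # a shared PNo value means the join may be non-empty
    ]


for r_name, s_name in reduce_join(PROJECTS, ASSIGNMENT):
    print(f"{r_name} ⋈ {s_name}")
```

Of the six partial joins produced by the distributive law, the three with disjoint PNo sets are dropped, which leaves the reduced query from the example.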
  19. Basics of distributed query processing: Data localization
      Katja Hose, Distributed Database Systems, November 10, 2011, slides 73-76 of 167

      Query simplification – join reduction for derived horizontal fragmentation

      Example
        Projects(PNo, PName, Budget, Location)
          Projects1 = σ_{PNo='P1' ∨ PNo='P2'}(Projects)
          Projects2 = σ_{PNo='P3' ∨ PNo='P4'}(Projects)
        Assignment(ENo, PNo, Duration)
          Assignment1 = Assignment ⋉ Projects1
          Assignment2 = Assignment ⋉ Projects2
      Query in relational algebra
          Projects ⋈ Assignment
      After replacing global relations with reconstruction expressions
          (Projects1 ∪ Projects2) ⋈ (Assignment1 ∪ Assignment2)
      After applying the distributive law
          (Projects1 ⋈ Assignment1) ∪ (Projects1 ⋈ Assignment2) ∪
          (Projects2 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2)
      Reduced query (using information about fragmentation of relation Assignment directly)
          (Projects1 ⋈ Assignment1) ∪ (Projects2 ⋈ Assignment2)

      Query simplification – vertical reduction

      Vertical fragmentation rule
        Given fragments of R as F_R = {R_1, ..., R_n} with R_i = π_{β_i}(R), with β_i representing the enumeration of a subset of R's attributes.
        Avoid joining fragments containing "useless" attributes, i.e., fragments containing only attributes that are not referenced in the query and not output in the result.

      Example – vertical reduction

      Schema
        Projects(PNo, PName, Budget, Location)
          Projects1 = π_{PNo, PName, Location}(Projects)
          Projects2 = π_{PNo, Budget}(Projects)
      Reconstruction expression
          Projects = Projects1 ⋈ Projects2
      Example query
          π_{PName}(Projects)
      After replacing references to global relations
          π_{PName}(Projects1 ⋈ Projects2)
      After removing unnecessary fragments
          π_{PName}(Projects1)
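The vertical reduction rule can be illustrated with a short Python sketch. It assumes that each vertical fragment repeats the key PNo and that the non-key attributes of different fragments do not overlap. The names `VERTICAL_FRAGMENTS`, `KEY`, and `reduce_vertical` are hypothetical.

```python
# Attribute lists β_i of the vertical fragments from the example.
VERTICAL_FRAGMENTS = {
    "Projects1": {"PNo", "PName", "Location"},
    "Projects2": {"PNo", "Budget"},
}
KEY = {"PNo"}  # key attribute repeated in every fragment


def reduce_vertical(needed: set, fragments: dict, key: set) -> list:
    """Return the fragments that must still be joined to answer a query
    that references or outputs exactly the attributes in `needed`."""
    kept = [
        name for name, attrs in fragments.items()
        if (attrs - key) & needed  # fragment contributes a needed non-key attribute
    ]
    if not kept:
        # Only key attributes are needed: any single fragment suffices.
        kept = [next(iter(fragments))]
    return kept


print(reduce_vertical({"PName"}, VERTICAL_FRAGMENTS, KEY))            # ['Projects1']
print(reduce_vertical({"PName", "Budget"}, VERTICAL_FRAGMENTS, KEY))  # ['Projects1', 'Projects2']
```

For the example query π_{PName}(Projects), only Projects1 remains, so the join with Projects2 is avoided entirely.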
  20. Basics of distributed query processing: Data localization
      Katja Hose, Distributed Database Systems, slides 77-80 of 167 (slide 77: November 10, 2011; slides 78-80: November 17, 2011)

      Query simplification – hybrid fragmentation

        The reconstruction expression introduces combinations of joins and unions
        General guidelines
          Remove empty relations generated by contradicting selections on horizontal fragments
          Remove useless relations generated by vertical fragments
          Break and distribute joins, eliminate empty fragment joins

      Qualified relations

        Supporting algebraic optimization of queries involving fragments
          Annotating fragments and intermediate relations with predicates
          Estimating the size of a relation
          Extension of relational algebra
        Definition: qualified relation
          A qualified relation is a pair [R : q_R] where R is a relation and q_R is a predicate.
        Example
          Representing horizontal fragments as qualified relations where the qualification predicate corresponds to the fragmentation expression
          [Projects : σ_{PNo='P1' ∨ PNo='P2'}]

      Qualified relations – extended relational algebra

        (1) E := σ_F [R : q_R]             → [E : F ∧ q_R]
        (2) E := π_A [R : q_R]             → [E : q_R]
        (3) E := [R : q_R] × [S : q_S]     → [E : q_R ∧ q_S]
        (4) E := [R : q_R] − [S : q_S]     → [E : q_R]
        (5) E := [R : q_R] ∪ [S : q_S]     → [E : q_R ∨ q_S]
        (6) E := [R : q_R] ⋈_F [S : q_S]   → [E : q_R ∧ q_S ∧ F]

      Qualified relations – example

      Example query
          σ_{100.000≤Budget≤200.000}(Projects)
      Qualified relations
          E1 = σ_{100.000≤Budget≤200.000} [Projects1 : Budget ≤ 150.000]
            [E1 : (100.000 ≤ Budget ≤ 200.000) ∧ (Budget ≤ 150.000)]
            [E1 : 100.000 ≤ Budget ≤ 150.000]
          E2 = σ_{100.000≤Budget≤200.000} [Projects2 : 150.000 < Budget ≤ 200.000]
            [E2 : (100.000 ≤ Budget ≤ 200.000) ∧ (150.000 < Budget ≤ 200.000)]
            [E2 : 150.000 < Budget ≤ 200.000]
          E3 = σ_{100.000≤Budget≤200.000} [Projects3 : Budget > 200.000]
            [E3 : (100.000 ≤ Budget ≤ 200.000) ∧ (Budget > 200.000)]
            E3 = ∅
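Rule (1) of the extended algebra, applied to the Budget example, can be sketched as follows. The sketch assumes that qualifications are range predicates on Budget only, so that the conjunction F ∧ q_R is an interval intersection. `Range`, `Qualified`, and `select` are hypothetical names, and `None` stands for a contradictory qualification, i.e. an empty relation.

```python
from dataclasses import dataclass
from typing import Optional
import math


@dataclass(frozen=True)
class Range:
    """Range predicate on Budget. Hypothetical helper."""
    lo: float = -math.inf
    hi: float = math.inf
    lo_open: bool = False
    hi_open: bool = False

    def conj(self, other: "Range") -> Optional["Range"]:
        """Conjunction of two range predicates; None if they contradict."""
        lo, lo_open = max((self.lo, self.lo_open), (other.lo, other.lo_open))
        hi, hi_closed = min((self.hi, not self.hi_open),
                            (other.hi, not other.hi_open))
        if lo > hi or (lo == hi and (lo_open or not hi_closed)):
            return None
        return Range(lo, hi, lo_open, not hi_closed)


@dataclass(frozen=True)
class Qualified:
    """Qualified relation [name : q]; q is None for an empty relation."""
    name: str
    q: Optional[Range]


def select(f: Range, rel: Qualified, name: str) -> Qualified:
    """Rule (1): E := σ_F [R : q_R] → [E : F ∧ q_R]."""
    q = rel.q.conj(f) if rel.q is not None else None
    return Qualified(name, q)


F = Range(lo=100_000, hi=200_000)  # 100.000 <= Budget <= 200.000
E1 = select(F, Qualified("Projects1", Range(hi=150_000)), "E1")
E2 = select(F, Qualified("Projects2", Range(150_000, 200_000, lo_open=True)), "E2")
E3 = select(F, Qualified("Projects3", Range(lo=200_000, lo_open=True)), "E3")
print(E1.q, E2.q, E3.q, sep="\n")
```

E1 and E2 end up with the simplified qualifications from the example, while E3 has no satisfiable qualification and can be dropped from the query.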
  21. Global query optimization
      Katja Hose, Distributed Database Systems, November 17, 2011, slides 81-84 of 167

      Outline
        1 Motivation
        2 Detour on centralized query processing
            Translating SQL into relational algebra
            Phases of centralized query processing
            Query parsing
            Query transformation
            Query optimization
        3 Basics of distributed query processing
            Phases of distributed query processing
            Introduction
            Meta data management
            Data localization
        4 Global query optimization
            Main questions
            Global query optimizer
            Distributed cost model
            Join order optimization
            Total time models
            Response time models
        5 Summary

      Main questions

        Workflow for distributed query processing

        Introduction to global query optimization
          Main questions
            When to optimize?
            What criteria to optimize?
            Where to execute the query?
  22. Global query optimization: Main questions
      Katja Hose, Distributed Database Systems, November 17, 2011, slides 85-88 of 167

      When to optimize? Full compile time optimization

        The full query execution plan is computed at compile time
        Assumption
          Applications use canned queries
          Prepared and parameterized SQL statements
        Pros
          Queries can be executed directly
        Cons
          Complex to model
          Much information unknown or too expensive to gather
            Collecting statistics on all nodes?
          Statistics outdated
            Especially machine load and network properties are very volatile

      When to optimize? Fully dynamic optimization

        Each query is optimized individually at runtime
        This technique heavily relies on heuristics, learning algorithms, and luck
        Pros
          Might produce very good plans
          Uses current network state
          Also usable for ad-hoc queries
        Cons
          Result quality might be very unpredictable
          Complex algorithms and heuristics
          Difficult to keep statistics up-to-date

      When to optimize? Semi-dynamic optimization

        Pre-optimize the query
        During query execution, test whether execution proceeds as predicted during optimization
          e.g., are tuples/fragments delivered in time? Does the network adhere to the predicted properties? Are there any bad network latencies?
        If execution shows severe deviations, compute a new query plan for all parts that have not yet been executed
        Only makes sense for long-running queries

      When to optimize? Hierarchical optimization

        Plans are created in multiple stages
        Global-Local-Plans
          Global query optimizer creates a global query plan
            Focus on data transfer: which intermediate results are to be computed by which node? How should intermediate results be shipped?
          Local query optimizers create local query plans
            Decide on query plan layout, algorithms, indexes, etc. to deliver the requested intermediate result
        Two-Step-Plans
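The control flow of semi-dynamic optimization can be sketched as a simple loop. Everything here is an assumption made for illustration: the plan is a flat list of steps, `predicted` holds the cost estimates from pre-optimization, `observe` returns the cost actually measured for a step, `reoptimize` stands in for a real optimizer, and a "severe deviation" is taken to mean that the observed cost exceeds the prediction by a factor of `tolerance`. None of these names or thresholds come from the slides.

```python
from typing import Callable, Dict, List


def run_semi_dynamic(
    plan: List[str],
    predicted: Dict[str, float],
    observe: Callable[[str], float],
    reoptimize: Callable[[List[str]], List[str]],
    tolerance: float = 2.0,
) -> List[str]:
    """Execute a pre-optimized plan step by step and re-plan the parts
    that have not yet been executed when a step deviates severely."""
    executed: List[str] = []
    remaining = list(plan)
    while remaining:
        step = remaining.pop(0)
        actual = observe(step)  # hypothetical: run the step and measure its cost
        executed.append(step)
        if remaining and actual > tolerance * predicted[step]:
            remaining = reoptimize(remaining)  # new plan for the rest only
    return executed


# Toy data: the first fetch is five times slower than predicted.
predicted = {"fetch_R1": 2.0, "fetch_R2": 1.0, "fetch_R3": 1.0}
observed = {"fetch_R1": 10.0, "fetch_R2": 1.0, "fetch_R3": 1.0}

order = run_semi_dynamic(
    ["fetch_R1", "fetch_R2", "fetch_R3"],
    predicted,
    observed.get,
    lambda rest: list(reversed(rest)),  # stand-in for a real optimizer
)
print(order)  # ['fetch_R1', 'fetch_R3', 'fetch_R2']
```

Steps that have already run are never revisited, which reflects why the approach pays off mainly for queries that run long enough for re-planning to matter.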