Distributed Database Systems

          22-11-2012
You must remember!
You must also remember!
• Relational data languages are based on
  relational algebra
• Relational algebra consists of a set of operators
  on relations, which include:
  – Selection
  – Projection
  – Union
  – Cartesian product
Cartesian Product
• The Cartesian product of two relations, R of
  degree k1 and S of degree k2, is the set of
  (k1+k2)-tuples in which each result tuple is the
  concatenation of one tuple of R with one
  tuple of S, for all tuples of R and S (written R × S)
• Consider the relations EMP and PAY; EMP × PAY
  is:
Cartesian Product (EMP × PAY)
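A minimal sketch of the Cartesian product, assuming small hypothetical instances of EMP (ENO, ENAME, TITLE) and PAY (TITLE, SAL). The rows and the helper name `cartesian_product` are illustrative assumptions, not the instances shown on the slide.

```python
from itertools import product

# Hypothetical sample rows, used only for illustration.
EMP = [("E1", "J. Doe", "Elect. Eng."),
       ("E2", "M. Smith", "Syst. Anal.")]      # degree k1 = 3
PAY = [("Elect. Eng.", 40000),
       ("Syst. Anal.", 34000),
       ("Mech. Eng.", 27000)]                  # degree k2 = 2


def cartesian_product(r, s):
    """Concatenate every tuple of r with every tuple of s."""
    return [t_r + t_s for t_r, t_s in product(r, s)]


EMP_X_PAY = cartesian_product(EMP, PAY)
# Every result tuple has k1 + k2 = 5 attributes, and there are |EMP| * |PAY| of them.
print(len(EMP_X_PAY), len(EMP_X_PAY[0]))
```

Most of the resulting tuples pair an employee with an unrelated title, which is why joins restrict the product with a predicate.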
Joins
• Join is a derivative of Cartesian Product
• There are various forms of joins
  – Join
     • Inner join
           – Theta join
           – Equi-join
     • Outer join
           – Left join
           – Right join
           – Full join
  – Semi join
Theta Join
• A theta-join keeps only those tuples of the
  Cartesian product that satisfy a join predicate
• Consider the theta-join of relations EMP and ASG
  over the join predicate EMP.ENO = ASG.ENO
Equi-Join
• This example demonstrates a special case of
  theta-join called equi-join, in which the join
  predicate uses only equality
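The relationship between theta-join and equi-join can be sketched as follows. The rows are hypothetical, ASG is assumed to carry only (ENO, PNO), and `theta_join` is an illustrative helper that takes the comparison operator as a parameter.

```python
import operator

# Hypothetical sample rows, used only for illustration.
EMP = [("E1", "J. Doe", "Elect. Eng."),
       ("E2", "M. Smith", "Syst. Anal."),
       ("E3", "A. Lee", "Mech. Eng.")]         # (ENO, ENAME, TITLE)
ASG = [("E1", "P1"), ("E2", "P1"), ("E2", "P2")]  # (ENO, PNO)


def theta_join(r, s, r_idx, s_idx, theta):
    """Keep the concatenated tuples whose attributes satisfy theta."""
    return [t_r + t_s
            for t_r in r
            for t_s in s
            if theta(t_r[r_idx], t_s[s_idx])]


# Equi-join: the theta-join whose comparison is equality (EMP.ENO = ASG.ENO).
equi = theta_join(EMP, ASG, 0, 0, operator.eq)

# Any other comparison is still a theta-join, for example EMP.ENO < ASG.ENO.
less_than = theta_join(EMP, ASG, 0, 0, operator.lt)
```

Employee E3 has no assignment in this sample, so it does not appear in the equi-join result.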
Semi-Join
• The semi-join of relation R, defined over the
  set of attributes A, by relation S, defined over
  the set of attributes B, is the subset of the
  tuples of R that participate in the join of R
  with S
• The advantage of semi-join is that it decreases
  the number of tuples that need to be handled
  to form the join
Semi-Join
• In centralized database systems, this is
  important because it usually results in a
  decreased number of secondary storage
  accesses by making better use of the memory.
• It is even more important in distributed
  databases since it usually reduces the amount
  of data that needs to be transmitted between
  sites in order to evaluate a query.
Semi-Join
• To demonstrate the difference between join
  and semi-join, let's consider the semi-join of
  EMP with PAY over the predicate EMP.TITLE =
  PAY.TITLE, that is, $EMP \ltimes_{EMP.TITLE = PAY.TITLE} PAY$
Semi-Join
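A small sketch of the difference between the join and the semi-join of EMP with PAY over EMP.TITLE = PAY.TITLE. The rows are hypothetical, and PAY is deliberately left without entries for two titles so that the filtering effect is visible.

```python
# Hypothetical sample rows, used only for illustration.
EMP = [("E1", "J. Doe", "Elect. Eng."),
       ("E2", "M. Smith", "Syst. Anal."),
       ("E3", "A. Lee", "Mech. Eng."),
       ("E4", "J. Miller", "Programmer")]      # (ENO, ENAME, TITLE)
PAY = [("Elect. Eng.", 40000),
       ("Syst. Anal.", 34000)]                 # (TITLE, SAL)


def join(emp, pay):
    """Equi-join: result tuples carry the attributes of both relations."""
    return [e + p for e in emp for p in pay if e[2] == p[0]]


def semi_join(emp, pay):
    """Semi-join: the EMP tuples that participate in the join, EMP attributes only."""
    titles = {p[0] for p in pay}
    return [e for e in emp if e[2] in titles]


joined = join(EMP, PAY)          # 5 attributes per tuple
reduced = semi_join(EMP, PAY)    # 3 attributes per tuple, a subset of EMP
```

Only the join attribute values of PAY are needed to compute the semi-join, which is what keeps the transmitted data small in a distributed setting.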
Derived Horizontal Fragmentation
• A derived horizontal fragmentation is defined
  on a member relation of a link according to a
  selection operation specified on its owner
• It is important to remember two points
  – First, the link between the owner and the member
    relations is defined as an equi-join
  – Second, an equi-join can be implemented by
    means of semi-join
Derived Horizontal Fragmentation
• Accordingly, given a link L where owner(L) = S
  and member(L) = R, the derived horizontal
  fragments of R are defined as:

      $R_i = R \ltimes S_i, \quad 1 \le i \le w$

• where w is the maximum number of
  fragments that will be defined on R, and

      $S_i = \sigma_{F_i}(S)$

  where $F_i$ is the formula according to which
  the primary horizontal fragment $S_i$ is defined
Derived Horizontal Fragmentation
• To carry out a derived horizontal
  fragmentation, three inputs are needed:
  – The set of partitions of the owner relation (PAY1,
    PAY2)
  – The member relation
  – The set of semi-join predicates between the
    owner and member (EMP.TITLE = PAY.TITLE)
Example
• Consider link L1, where owner(L1) = PAY and
  member(L1) = EMP
• We can group engineers into two groups
  according to their salary: those making less
  than or equal to $30,000, and those making
  more than $30,000
• The two fragments EMP1 and EMP2 are
  defined as:
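Applying the definition of derived fragments (each member fragment is the semi-join of the member with one owner fragment) to this link, the two fragments can be written as below. This is a sketch that assumes PAY1 and PAY2 are the primary fragments of PAY for the two salary ranges just described.

```latex
\begin{align*}
EMP_1 &= EMP \ltimes PAY_1, & PAY_1 &= \sigma_{SAL \le 30000}(PAY) \\
EMP_2 &= EMP \ltimes PAY_2, & PAY_2 &= \sigma_{SAL > 30000}(PAY)
\end{align*}
```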
Example
• The result of this fragmentation is depicted as:
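The fragmentation can be traced on data with a short sketch. The EMP and PAY rows below are hypothetical stand-ins rather than the instances depicted on the slide, and `select` and `semi_join` are illustrative helper names.

```python
# Hypothetical sample rows, used only for illustration.
EMP = [("E1", "J. Doe", "Elect. Eng."),
       ("E2", "M. Smith", "Syst. Anal."),
       ("E3", "A. Lee", "Mech. Eng."),
       ("E4", "J. Miller", "Programmer")]      # (ENO, ENAME, TITLE)
PAY = [("Elect. Eng.", 40000),
       ("Syst. Anal.", 34000),
       ("Mech. Eng.", 27000),
       ("Programmer", 24000)]                  # (TITLE, SAL)


def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def semi_join(member, owner):
    """EMP tuples whose TITLE appears in the given PAY fragment."""
    titles = {o[0] for o in owner}
    return [m for m in member if m[2] in titles]


# Primary horizontal fragments of the owner relation PAY.
PAY1 = select(PAY, lambda p: p[1] <= 30000)
PAY2 = select(PAY, lambda p: p[1] > 30000)

# Derived horizontal fragments of the member relation EMP.
EMP1 = semi_join(EMP, PAY1)
EMP2 = semi_join(EMP, PAY2)
```

With these sample rows, EMP1 holds the employees whose title pays at most $30,000 and EMP2 holds the rest.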
Derived Horizontal Fragmentation
• One potential complication needs attention
• If a database schema has two or more links into
  a relation R, there could be more than one
  possible derived horizontal fragmentation of R
• The choice of candidate fragmentation is
  based on two criteria
  – The fragmentation with better join characteristics
  – The fragmentation used in more applications
The fragmentation used in more applications
• This criterion is quite straightforward if we take into
  consideration the frequency with which
  applications access some data
• Facilitating the accesses of the heavy users
  minimizes their total impact on system performance
The Fragmentation with better join characteristics
• Consider the last example: the effect of this
  fragmentation is that the join of the EMP and
  PAY relations to answer the query is assisted
  – By performing it on smaller relations
  – By potentially performing joins in parallel
The Fragmentation with better join characteristics
• The first point is obvious: the fragments of EMP
  are smaller than EMP itself
• Therefore, it will be faster to join any fragment of
  PAY with any fragment of EMP than to work with
  the relations themselves
• The second point, however, is more important and
  is at the heart of distributed databases
• If, besides executing a number of queries at
  different sites, we can parallelize execution of one
  join query, the response time or throughput of
  the system can be expected to improve
The Fragmentation with better join characteristics
• In the case of joins, this is possible under certain
  circumstances
• Consider the join graph between the fragments of EMP
  and PAY: there is only one link coming in or going out of
  a fragment
• Such a join graph is called a simple graph
• The advantage of a design where the join relationship
  between fragments is simple is that the member and
  owner of a link can be allocated to one site and the joins
  between different pairs of fragments can proceed
  independently and in parallel
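Whether a join graph is simple can be checked mechanically. The sketch below assumes the graph is given as a list of (owner fragment, member fragment) links with unique fragment names; `is_simple` is an illustrative helper, not part of any library.

```python
from collections import Counter


def is_simple(links):
    """A join graph is simple if every fragment takes part in exactly one link."""
    degree = Counter()
    for owner_fragment, member_fragment in links:
        degree[owner_fragment] += 1
        degree[member_fragment] += 1
    return all(count == 1 for count in degree.values())


# Each PAY fragment joins with exactly one EMP fragment: simple.
simple_graph = [("PAY1", "EMP1"), ("PAY2", "EMP2")]

# Hypothetical counter-example: PAY1 has links to two EMP fragments.
not_simple_graph = [("PAY1", "EMP1"), ("PAY1", "EMP2")]

print(is_simple(simple_graph), is_simple(not_simple_graph))
```

In the simple case each (owner, member) pair can be placed at one site and joined there without contacting the other sites.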
The Fragmentation with better join characteristics
• Unfortunately, obtaining simple join graphs may
  not always be possible
• In that case, the next desirable alternative is to
  have a design that results in a partitioned join
  graph
• A partitioned graph consists of two or more
  subgraphs with no links between them
• Fragments so obtained may not be distributed for
  parallel execution as easily as those obtained via
  simple join graphs, but the allocation is still
  possible
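A partitioned join graph can be recognised by counting its connected subgraphs. The following sketch assumes the same (owner fragment, member fragment) link representation with hypothetical fragment names S1, S2, R1, R2, R3; the helper names are illustrative.

```python
def connected_components(links):
    """Group fragments into subgraphs that have no links between them."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                      # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


def is_partitioned(links):
    """Partitioned: two or more subgraphs with no links between them."""
    return len(connected_components(links)) >= 2


# S1 joins with two member fragments, so the graph is not simple,
# but it still splits into two independent subgraphs.
partitioned = [("S1", "R1"), ("S1", "R2"), ("S2", "R3")]

# R1 is linked to both owner fragments, so everything is connected.
connected = [("S1", "R1"), ("S2", "R1")]
```

Each subgraph can be allocated as a unit, although the joins inside a subgraph may involve more than one pair of fragments.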
The Fragmentation with better join characteristics
• Let us continue with the distribution design of the database
  we started before
• We already decided on the fragmentation of relation EMP
  according to the fragmentation of PAY
• Let's now consider ASG and assume that there are two
  applications
   – The first application finds the names of engineers who work at
     certain places; it runs on all three sites and accesses the
     information about the engineers who work on local projects with
     higher probability than those of projects at other locations
   – At each administrative site where employee records are
     managed, users would like to access the responsibilities on the
     projects that these employees work on and learn how they will
     work on those projects
The Fragmentation with better join characteristics
• The first application results in a fragmentation
  of ASG according to the fragments PROJ1,
  PROJ3, PROJ4 and PROJ6 of PROJ obtained
  before
The Fragmentation with better join characteristics
• Therefore, the derived fragmentation of ASG
  according to {PROJ1, PROJ3, PROJ4, PROJ6} is
  defined as:
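Following the rule that each derived fragment is the semi-join of the member relation with one owner fragment, the four fragments can be written as below. The names ASG1 to ASG4 are labels assumed here for illustration.

```latex
\begin{align*}
ASG_1 &= ASG \ltimes PROJ_1 \\
ASG_2 &= ASG \ltimes PROJ_3 \\
ASG_3 &= ASG \ltimes PROJ_4 \\
ASG_4 &= ASG \ltimes PROJ_6
\end{align*}
```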

• The fragment instances are:
The Fragmentation with better join characteristics
• The second query can be specified in SQL as:



• Where i=1 or i=2, depending on the site where
  the query is issued
• The derived fragmentation of ASG according
  to the fragmentation of EMP is defined as:
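With EMP fragmented into EMP1 and EMP2, the same semi-join rule gives the expressions below. This is a sketch; the fragment labels ASG1 and ASG2 are assumed for illustration.

```latex
\begin{align*}
ASG_1 &= ASG \ltimes EMP_1 \\
ASG_2 &= ASG \ltimes EMP_2
\end{align*}
```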
The Fragmentation with better join characteristics
• The example demonstrates two things:
  – Derived fragmentation may follow a chain where
    one relation is fragmented as a result of another
    one's design and it, in turn, causes the
    fragmentation of another relation
      (PAY -> EMP -> ASG)
  – Typically, there will be more than one candidate
    fragmentation for a relation (ASG); the final choice
    of the fragmentation scheme may be a decision
    problem addressed during allocation
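The chain PAY -> EMP -> ASG can be followed end to end in a few lines. The rows are hypothetical, ASG is assumed to carry only (ENO, PNO), and `semi_join` is an illustrative helper that takes the positions of the join attributes.

```python
# Hypothetical sample rows, used only for illustration.
PAY = [("Elect. Eng.", 40000), ("Syst. Anal.", 34000),
       ("Mech. Eng.", 27000), ("Programmer", 24000)]   # (TITLE, SAL)
EMP = [("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal."),
       ("E3", "A. Lee", "Mech. Eng."), ("E4", "J. Miller", "Programmer")]
ASG = [("E1", "P1"), ("E2", "P1"), ("E2", "P3"),
       ("E3", "P2"), ("E4", "P2")]                     # (ENO, PNO)


def semi_join(member, owner, member_idx, owner_idx):
    """Member tuples whose join attribute value occurs in the owner fragment."""
    keys = {o[owner_idx] for o in owner}
    return [m for m in member if m[member_idx] in keys]


# Step 1: primary fragmentation of PAY on salary.
pay_fragments = [[p for p in PAY if p[1] <= 30000],
                 [p for p in PAY if p[1] > 30000]]

# Step 2: EMP is fragmented according to PAY (EMP.TITLE = PAY.TITLE).
emp_fragments = [semi_join(EMP, p, 2, 0) for p in pay_fragments]

# Step 3: ASG is fragmented according to EMP (ASG.ENO = EMP.ENO).
asg_fragments = [semi_join(ASG, e, 0, 0) for e in emp_fragments]
```

Changing the salary boundary in step 1 changes every fragment downstream, which is the practical consequence of such a chain.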
Checking of Correctness
• We should now check the fragmentation
  algorithms discussed so far with respect to
  three correctness criteria
  – Completeness
  – Reconstruction
  – Disjointness
Completeness
• The completeness of a primary horizontal
  fragmentation is based on the selection
  predicates used
• As long as the selection predicates are
  complete, the resulting fragmentation is
  guaranteed to be complete as well
Completeness
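• The completeness of a derived horizontal
  fragmentation is somewhat more difficult to define
• Let R be the member and S the owner of a link, fragmented
  as {R1, ..., Rw} and {S1, ..., Sw}, and let A be the join
  attribute; then for each tuple t of Ri there should be a
  tuple t' of Si such that t[A] = t'[A]
• For example, there should be no ASG tuple which has
  a project number that is not also contained in PROJ;
  this rule is known as referential integrity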
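The referential integrity condition can be tested directly: a member tuple with no matching owner tuple would fall into no derived fragment. The sketch below uses hypothetical PROJ and ASG rows, including one deliberately dangling assignment, and an illustrative helper named `dangling_tuples`.

```python
# Hypothetical sample rows, used only for illustration.
PROJ = [("P1", "Instrumentation"), ("P2", "Database Develop.")]  # (PNO, PNAME)
ASG = [("E1", "P1"), ("E2", "P2"), ("E3", "P9")]                 # (ENO, PNO); P9 is not in PROJ


def dangling_tuples(member, owner, member_idx, owner_idx):
    """Member tuples whose join attribute value has no match in the owner."""
    owner_keys = {o[owner_idx] for o in owner}
    return [m for m in member if m[member_idx] not in owner_keys]


def semi_join(member, owner, member_idx, owner_idx):
    owner_keys = {o[owner_idx] for o in owner}
    return [m for m in member if m[member_idx] in owner_keys]


# One owner fragment per project, then derive the ASG fragments.
proj_fragments = [[p] for p in PROJ]
asg_fragments = [semi_join(ASG, f, 1, 0) for f in proj_fragments]

lost = dangling_tuples(ASG, PROJ, 1, 0)
covered = [t for fragment in asg_fragments for t in fragment]
```

Here the tuple for project P9 appears in `lost` and in none of the fragments, so this derived fragmentation would not be complete.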
Reconstruction
• Reconstruction of a global relation from its
  fragments is performed by the union operator
  in both the primary and the derived horizontal
  fragmentation
• Thus, for a relation R with fragmentation
  $F_R = \{R_1, R_2, \ldots, R_w\}$,

      $R = \bigcup R_i, \quad \forall R_i \in F_R$
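Reconstruction by union can be sketched as follows, treating relations as sets of tuples. The fragments are hypothetical and `reconstruct` is an illustrative helper name.

```python
# Hypothetical fragments of EMP, used only for illustration.
EMP1 = [("E3", "A. Lee", "Mech. Eng."), ("E4", "J. Miller", "Programmer")]
EMP2 = [("E1", "J. Doe", "Elect. Eng."), ("E2", "M. Smith", "Syst. Anal.")]
EMP = EMP2 + EMP1   # the global relation the fragments came from


def reconstruct(fragments):
    """Union of all fragments; duplicates collapse because relations are sets."""
    result = set()
    for fragment in fragments:
        result |= set(fragment)
    return result


rebuilt = reconstruct([EMP1, EMP2])
```

The union returns the global relation only if the fragmentation is complete; a tuple that fell into no fragment cannot be recovered this way.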
Disjointness
• It is easier to establish disjointness of
  fragmentation for primary than for derived
  horizontal fragmentation
• In primary horizontal fragmentation (PHF),
  disjointness is guaranteed as long as
  the minterm predicates determining the
  fragmentation are mutually exclusive
Example
• In derived fragmentation, however, there is a
  semi-join involved that adds considerable
  complexity
• Disjointness can be guaranteed if the join graph is
  simple; otherwise it is necessary to investigate
  actual tuple values
• In general we do not want a tuple of a member
  relation to join with two or more tuples of the
  owner relation when these tuples are in different
  fragments of the owner
Example
• In fragmenting relation PAY, the minterm predicates are M =
  {m1, m2}, where
   m1: SAL <= 30000
   m2: SAL > 30000
• Since m1 and m2 are mutually exclusive, the fragmentation
  of PAY is disjoint
• For relation EMP, however, we require that
   – Each engineer has a single title
   – Each title has a single salary value associated with it
• Since these two rules follow from the semantics of the
  database, the fragmentation of EMP with respect to PAY is
  also disjoint
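The role of the two rules can be seen by breaking one of them. In the hypothetical data below a title is given two salary values that fall on different sides of the boundary, so the same EMP tuple lands in both derived fragments; `are_disjoint` and `derive` are illustrative helper names.

```python
from itertools import combinations


def are_disjoint(fragments):
    """True if no tuple appears in more than one fragment."""
    return all(set(a).isdisjoint(b) for a, b in combinations(fragments, 2))


def derive(emp, pay):
    """Derived fragments of EMP from PAY split at SAL <= 30000 / SAL > 30000."""
    pay_fragments = [[p for p in pay if p[1] <= 30000],
                     [p for p in pay if p[1] > 30000]]
    return [[e for e in emp if e[2] in {p[0] for p in f}]
            for f in pay_fragments]


# Hypothetical sample rows, used only for illustration.
EMP = [("E1", "J. Doe", "Elect. Eng."), ("E3", "A. Lee", "Mech. Eng.")]

# Each title has a single salary: the derived fragments are disjoint.
PAY_OK = [("Elect. Eng.", 40000), ("Mech. Eng.", 27000)]

# "Mech. Eng." has two salaries in different PAY fragments: rule violated.
PAY_BAD = [("Elect. Eng.", 40000), ("Mech. Eng.", 27000), ("Mech. Eng.", 35000)]

print(are_disjoint(derive(EMP, PAY_OK)), are_disjoint(derive(EMP, PAY_BAD)))
```

The PAY fragments themselves stay disjoint in both cases, since m1 and m2 are mutually exclusive; only the derived EMP fragments overlap when a title maps to salaries in both ranges.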
