DDBMS (1).pptanmhfvnmbjk bjhjjjhkjkhjkjk

Chapter 25
Distributed Databases and
Client-Server Architectures
Copyright © 2004 Pearson Education, Inc.

Copyright © 2004 Ramez Elmasri and Shamkant Navathe
Elmasri/Navathe, Fundamentals of Database Systems, Fourth Edition Chapter 25-3
Chapter 25 Outline
1 Distributed Database Concepts
2 Data Fragmentation, Replication and Allocation
3 Types of Distributed Database Systems
4 Query Processing
5 Concurrency Control and Recovery
6 3-Tier Client-Server Architecture

Distributed Database Concepts
It is a system to process Unit of execution (a transaction) in
a distributed manner. That is, a transaction can be executed
by multiple networked computers in a unified manner.
It can be defined as
A distributed database (DDB) is a collection of multiple
logically related database distributed over a computer
network, and a distributed database management system
as a software system that manages a distributed
database while making the distribution transparent to
the user.

Distributed Database System
Advantages
1. Management of distributed data with different
levels of transparency: This refers to the physical
placement of data (files, relations, etc.) which is not
known to the user (distribution transparency).
Communications neteork
Site 5
Site 1
Site 2
Site 4
Site 3

Advantages
The EMPLOYEE, PROJECT, and WORKS_ON tables may be
fragmented horizontally and stored with possible replication as shown
below.
Atlanta
San Francisco
EMPLOYEES - All
PROJECTS - All
WORKS_ON - All
Chicago
(headquarters)
New York
EMPLOYEES - New York
PROJECTS - All
WORKS_ON - New York Employees
EMPLOYEES - San Francisco and LA
PROJECTS - San Francisco
WORKS_ON - San Francisco Employees
Los Angeles
EMPLOYEES - LA
PROJECTS - LA and San Francisco
WORKS_ON - LA Employees
EMPLOYEES - Atlanta
PROJECTS - Atlanta
WORKS_ON - Atlanta Employees

Advantages
• Distribution and Network transparency: Users do not have to
worry about operational details of the network. There is Location
transparency, which refers to freedom of issuing command from
any location without affecting its working. Then there is Naming
transparency, which allows access to any names object (files,
relations, etc.) from any location.
• Replication transparency: It allows to store copies of a data at
multiple sites as shown in the above diagram. This is done to
minimize access time to the required data.
• Fragmentation transparency: Allows to fragment a relation
horizontally (create a subset of tuples of a relation) or vertically
(create a subset of columns of a relation).

Advantages
2. Increased reliability and availability: Reliability refers to
system live time, that is, system is running efficiently most of the
time. Availability is the probability that the system is
continuously available (usable or accessible) during a time
interval. A distributed database system has multiple nodes
(computers) and if one fails then others are available to do the
job.
3. Improved performance: A distributed DBMS fragments the
database to keep data closer to where it is needed most. This
reduces data management (access and modification) time
significantly.
4. Easier expansion (scalability): Allows new nodes (computers)
to be added anytime without chaining the entire configuration.

Data Fragmentation, Replication and
Allocation
Data Fragmentation
Split a relation into logically related and correct parts. A
relation can be fragmented in two ways:
Horizontal fragmentation
It is a horizontal subset of a relation which contain those of
tuples which satisfy selection conditions.
Consider the Employee relation with selection condition (DNO
= 5). All tuples satisfy this condition will create a subset which
will be a horizontal fragment of Employee relation.
A selection condition may be composed of several conditions
connected by AND or OR.
Derived horizontal fragmentation: It is the partitioning of a
primary relation to other secondary relations which are related
with Foreign keys.

Allocation
Vertical fragmentation
It is a subset of a relation which is created by a subset of
columns. Thus a vertical fragment of a relation will contain
values of selected columns. There is no selection condition used
in vertical fragmentation.
Consider the Employee relation. A vertical fragment of can be
created by keeping the values of Name, Bdate, Sex, and
Address.
Because there is no condition for creating a vertical fragment,
each fragment must include the primary key attribute of the
parent relation Employee. In this way all vertical fragments of
a relation are connected.

Allocation
Representation
Horizontal fragmentation
Each horizontal fragment on a relation can be specified by a sCi
(R) operation in the relational algebra.
Complete horizontal fragmentation
A set of horizontal fragments whose conditions C1, C2, …, Cn
include all the tuples in R- that is, every tuple in R satisfies (C1
OR C2 OR … OR Cn).
Disjoint complete horizontal fragmentation: No tuple in R
satisfies (Ci AND Cj) where i ≠ j.
To reconstruct R from horizontal fragments a UNION is
applied.

Allocation
Representation
Vertical fragmentation
A vertical fragment on a relation can be specified by a Li(R)
operation in the relational algebra.
Complete vertical fragmentation
A set of vertical fragments whose projection lists L1, L2, …, Ln
include all the attributes in R but share only the primary key of
R. In this case the projection lists satisfy the following two
conditions:
L1  L2  ...  Ln = ATTRS (R)
Li  Lj = PK(R) for any i j, where ATTRS (R) is the set of
attributes of R and PK(R) is the primary key of R.
To reconstruct R from complete vertical fragments a OUTER
UNION is applied.

Allocation
Mixed (Hybrid) fragmentation
A combination of Vertical fragmentation and Horizontal
fragmentation.
This is achieved by SELECT-PROJECT operations which is
represented by Li(sCi (R)).
If C = True (Select all tuples) and L ≠ ATTRS(R), we get a
vertical fragment, and if C ≠ True and L ≠ ATTRS(R), we get a
mixed fragment.
If C = True and L = ATTRS(R), then R can be considered a
fragment.

Allocation
Fragmentation schema
A definition of a set of fragments (horizontal or vertical or
horizontal and vertical) that includes all attributes and tuples
in the database that satisfies the condition that the whole
database can be reconstructed from the fragments by applying
some sequence of UNION (or OUTER JOIN) and UNION
operations.
Allocation schema
It describes the distribution of fragments to sites of distributed
databases. It can be fully or partially replicated or can be
partitioned.

Allocation
Data Replication
Database is replicated to all sites. In full replication the entire
database is replicated and in partial replication some selected
part is replicated to some of the sites. Data replication is
achieved through a replication schema.
Data Distribution (Data Allocation)
This is relevant only in the case of partial replication or
partition. The selected portion of the database is distributed to
the database sites.

Elmasri/Navathe, Fundamentals of Database Systems, Fourth Edition
pter 25-16
Disadvantages of distributed databases
 Complexity —
– Extra work must be done by the DBAs to ensure that the distributed nature of the
system is transparent.
– Extra work must also be done to maintain multiple disparate systems, instead of one
big one.
– Extra database design work must also be done to account for the disconnected nature
of the database — for example, joins become prohibitively expensive when performed
across multiple systems.
 Economics — increased complexity and a more extensive infrastructure means
extra labour costs.
 Security — remote database fragments must be secured, and they are not
centralized so the remote sites must be secured as well. The infrastructure must
also be secured (eg: by encrypting the network links between remote sites).
 Difficult to maintain integrity — in a distributed database enforcing integrity over
a network may require too much networking resources to be feasible.
 No reliance on central site .

Additional Functions of
Distributed Databases
 Distributed query processing
 Distributed transaction management (e.g.,two-phase commit)
 Replication management
 Distributed database recovery
 Distributed security issues

Allocation of fragments to sites.
(a) Relation fragments at site 2
corresponding to department 5.

Allocation of fragments to sites.
(b) Relation fragments at site 3
corresponding to department 4.

Elmasri/Navathe, Fundamentals of Database Systems, Fourth Edition Slide 25-20
FIGURE 25.4
Complete and disjoint fragments of the WORKS_ON relation. (a)
Fragments of WORKS_ON for employees working in department 5
(C=[ESSN IN (SELECT SSN FROM EMPLOYEE WHERE DNO=5)]).

FIGURE 25.4 (continued)
Complete and disjoint fragments of the WORKS_ON relation. (b)

Complete and disjoint fragments of the WORKS_ON relation. (c)

Works_on tuples for
Employees who work
for dept 5
Works_on tuples for
Employees who work
for dept 4
Projects controlled by
dept5
G1,G2,G3,G4,G7
G4,G5,G6,G2,G8
• Union of G1,G2,G3
• Fragments at site2
• Fragments at site 3
Works_on tuples for
Employees who work
for dept 5
Works_on tuples for
Employees who work
for dept 4
Works_on tuples for
Employees who work
for dept 1
Projects controlled by
dept5
G1,G2,G3,G4,G7
G4,G5,G6,G2,G8

Types of Distributed Database Systems
Homogeneous
All sites of the database system have identical setup, i.e., same database system
software. The underlying operating system may be different. For example, all
sites run Oracle or DB2, or Sybase or some other database system. The
underlying operating systems can be a mixture of Linux, Window, Unix, etc.
The clients thus have to use identical client software.
Communications
neteork
Site 5
Site 1
Site 2
Site 3
Oracle Oracle
Oracle
Oracle
Site 4
Oracle
Linux
Linux
Window
Window
Unix

Heterogeneous
Federated: Each site may run different database system but the data
access is managed through a single conceptual schema. This implies
that the degree of local autonomy is minimum. Each site must adhere
to a centralized access policy. There may be a global schema.
Multidatabase: There is no one conceptual global schema. For data
access a schema is constructed dynamically as needed by the
application software.
Communications
network
Site 5
Site 1
Site 2
Site 3
Network
DBMS
Relational
Site 4
Object
Oriented
Linux
Linux
Unix
Hierarchical
Object
Oriented
Relational
Unix
Window

Homogeneous vs heterogeneous
– Hardware
– Data model: relational, object, hierarchical
– Brand DBMS
– Version of query language
Level of local autonomy
Local vs global schema
– Federated: some global schema
– Multidatabase: no (or limited) global schema

Five level architecture of FDBS
 In this one local schema is the conceptual schema
(complete database definition) of a component
database.
 Component schema is derived by translating the local
schema into a common data model for the FDBS.
 The export schema represents the subset of the
component schema that is available to the FDBS.
 The federated schema is the global schema or view,
which is the result of integrating all the shareable
export schemas.
 The external schema define the schema for a user
group or an application.

FIGURE 25.5
The five-level
schema architecture
in a federated
database system
(FDBS). Source:
Adapted from Sheth
and Larson,
Federated Database
Systems for
Managing
Distributed
Heterogeneous
Autonomous
Databases. ACM
Computing Surveys
(Vol. 22: No. 3,
September 1990).

Data Transfer Costs of Distributed Query Processing
Distributed Query Processing Using Semijoin
Query and Update Decomposition
Query Processing in Distributed
Databases

Examples to illustrate volume of data transferred.

Query Processing in Distributed Databases
Cost of transferring data (files and results) over the network.
Fname Minit Lname SSN Bdate Address Sex Salary Superssn Dno
Issues
This cost is usually high so some optimization is necessary.
Example relations: Employee at site 1 and Department at Site 2
Dname Dnumber Mgrssn Mgrstartdate
Employee at site 1. 10, 000 rows. Row size = 100 bytes. Table size = 106 bytes.
Department at Site 2. 100 rows. Row size = 35 bytes. Table size = 3500 bytes.
Q: For each employee, retrieve employee name and department
nameWhere the employee works.
Q: Fname,Lname,Dname (Employee Dno = Dnumber Department)

Result
Stretagies:
1. Transfer Employee and Department to site 3. Total transfer bytes
= 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3. Query result size = 40 * 10,000 = 400,000 bytes.
Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
The result of this query will have 10,000 tuples, assuming that every
employee is related to a department.
Suppose each result tuple is 40 bytes long. The query is submitted at
site 3 and the result is sent to this site.
Problem: Employee and Department relations are not present at site 3.

Stretagies:
3. Transfer Department relation to site 1, execute the join at site 1,
and send the result to site 3. Total bytes transferred = 400,000 +
3500 = 403,500 bytes.
Optimization criteria: minimizing data transfer.
Preferred approach: strategy 3.
Consider the query
Q’: For each department, retrieve the department name and the
name of the department manager
Relational Algebra expression:
Fname,Lname,Dname (Employee Mgrssn = SSN Department)

The result of this query will have 100 tuples, assuming that every
department has a manager, the execution strategies are:
Stretagies:
1. Transfer Employee and Department to the result site and perorm
the join at site 3. Total bytes transferred = 1,000,000 + 3500 =
1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3. Query result size = 40 * 100 = 4000 bytes. Total
transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1 and
send the result to site 3. Total transfer size = 4000 + 3500 = 7500
bytes.

Preferred strategy: Chose strategy 3.
Now suppose the result site is 2. Possible strategies:
Possible strategies :
1. Transfer Employee relation to site 2, execute the query and present
the result to the user at site 2. Total transfer size = 1,000,000 bytes
for both queries Q and Q’.
2. Transfer Department relation to site 1, execute join at site 1 and
send the result back to site 2. Total transfer size for Q = 400,000 +
3500 = 403,500 bytes and for Q’ = 4000 + 3500 = 7500 bytes.

Semijoin: Objective is to reduce the number of tuples in a relation
before transferring it to another site.
Example execution of Q or Q’:
1. Project the join attributes of Department at site 2, and transfer
them to site 1. For Q, 4 * 100 = 400 bytes are transferred and for
Q’, 9 * 100 = 900 bytes are transferred.
2. Join the transferred file with the Employee relation at site 1, and
transfer the required attributes from the resulting file to site 2.
For Q, 34 * 10,000 = 340,000 bytes are transferred and for Q’, 39 *
100 = 3900 bytes are transferred.
3. Execute the query by joining the transferred file with Department
and present the result to the user at site 2.

Query and Update decomposition
• For queries , query decomposition module must break up
or decompose a query into subqueries that can be executed at the
individual sites.
• A strategy must be there for combining the results of the
subqueries to form the query result.
• Particular replica related to the query must be selected.
• For making such decisions ,DDBMS must maintain catalog
consisting fragmentation, replication and distribution of
information.
• For vertical fragmentation the attribute list for each fragment is
stored.
• For horizontal fragmentation , Guard condition is stored.
• For mixed both are stored.

FIGURE 25.7
Guard conditions and attributes lists for fragments.
(a) Site 2 fragments.

Guard conditions and attributes lists for fragments.
(b) Site 3 fragments.

User wants to insert a new employee tuple
<‘Alex’, ‘B’, ‘coleman’, ‘345671239’, ‘22-
APR-04’, ‘3306 SAND-STONE’, ‘M’,
33000, ‘957654321’, 4>
Decomposed by the DDBMS into two insert
request.
– Insert the full tuple in employee at site 1.
– Insert the projected tuple <‘Alex’, ‘B’,
‘coleman’, ‘345671239’, 33000, ‘957654321’,
4> in the EMPD4 fragment at site 3.
Chapter 25-40

Retrieve the names and hours per week for
each employee who works on some project
controlled by department 5.
 Query is submitted at site 2
– SQL : SELECT fname,lname,hours
FROM employee, projects, works_on
WHERE dnum = 5 and pnumber = pno
and essn = ssn;
Chapter 25-41

 The DDBMS can determine from the guard condition on
projs5 and works_on5 that all tuples satisfying the condition (
dnum = 5 and dnumber = pno ) reside at site2.
 Decompose the query
– T1 ∏ESSN ( PROJS5PNUMBER=PNO WORKS_ON5)
– T2 ∏ESSN, FNAME, LNAME ( T1ESSN=SSN EMPLOYEE)
– RESULT ∏FNAME, LNAME, HOURS ( T2 * WORKS_ON5 )
 EXECUTE USING SEMIJOIN
– Execute T1 in site 2 & send ESSN to site 1
– Execute T2 in site 1 & send result to site 2
– Calculate result at site 2
Chapter 25-42

Two-phase commit protocol
 A global recovery manager (or coordinator) performs the
two-phase commit protocol:
1)When all participating databases signal the coordinator that
the part of the transaction involving each has concluded, the
coordinator sends a message “prepare for commit” to each
participant,
2)If all participating databases reply “OK”, the transaction is
successful and the coordinator sends a “commit” signal to the
participating databases; otherwise it sends a message “roll
back” or UNDO the local effect of the transaction to each
participating database.

Distributed Transactions
 The Two Phase Commit protocol (2PC)
 After prepare phase: participants are ready to commit or to abort; they still
hold locks
 If one of the participants does not reply or is not able to commit for some
reason, the global transaction has to be aborted.
 Problem: if coordinator is unavailable after the prepare phase, resources may
be locked for a long time.

Dealing with multiple copies of data items: The concurrency control must
maintain global consistency. Likewise the recovery mechanism must recover all
copies and maintain consistency after recovery.
Failure of individual sites: Database availability must not be affected due to the
failure of one or two sites and the recovery scheme must recover them before
they are available for use.
Communication link failure: This failure may create network partition which
would affect database availability even though all database sites may be running.
Distributed commit: A transaction may be fragmented and they may be
executed by a number of sites. This require a two-phase commit approach for
transaction commit.
Distributed deadlock: Since transactions are processed at multiple sites, two or
more sites may get involved in deadlock. This must be resolved in a distributed
manner.
Chapter 25-45
Distributed Databases encounter a number of concurrency control and
recovery problems which are not present in centralized databases. Some of
them are listed below.
Concurrency Control and Recovery

1. Distributed Concurrency control based on a distinguished copy of a data
item
Site 5
Site 1
Site 2
Site 4
Site 3
Primary site
 Primary site technique
– A single site is designated as a primary site which serves as a coordinator for
transaction management.
– All locks are kept at that site, and all requests for locking or unlocking are sent
here.

• Transaction management:
– Concurrency control and commit are managed by this site. In two phase
locking, this site manages locking and releasing data items. If all transactions
follow two-phase policy at all sites, then serializability is guaranteed.
• Advantages:
– An extension to the centralized two phase locking so implementation and
management is simple. Data items are locked only at one site but they can be
accessed at any site.
• Disadvantages:
– All transaction management activities go to primary site which is likely to
overload the site. If the primary site fails, the entire system is inaccessible.
• To aid recovery a backup site is designated which behaves as a
shadow of primary site. In case of primary site failure, backup
site can act as primary site.

 Primary Copy Technique
– In this approach, instead of a site, a data item partition is designated as
primary copy. To lock a data item just the primary copy of the data item
is locked.
• Advantages:
– Since primary copies are distributed at various sites, a single site is not
overloaded with locking and unlocking requests.
• Disadvantages:
– Identification of a primary copy is complex. A distributed directory
must be maintained, possibly at all sites.
 Recovery from a coordinator failure
• In both approaches a coordinator site or copy may become unavailable.
This will require the selection of a new coordinator.
• Primary site approach with no backup site: Aborts and restarts all active
transactions at all sites. Elects a new coordinator and initiates transaction
processing.
• Primary site approach with backup site: Suspends all active transactions,
designates the backup site as the primary site and identifies a new back up
site. Primary site receives all transaction management information to
resume processing.
• Primary and backup sites fail or no backup site: Use election process to
select a new coordinator site.

2. Distributed Concurrency control based on voting: There is no primary
copy of coordinator.
• No distinguished copy.
• Send lock request to sites that have data item.
• Each copy maintains it own lock and can grant or deny the
request.
• If majority of sites grant lock then the requesting transaction gets
the data item.
• Locking information (grant or denied) is sent to all these sites.
• To avoid unacceptably long wait, a time-out period is defined. If
the requesting transaction does not get any vote information then
the transaction is aborted.

Distributed recovery
 The recovery process in DDBMS is quite involved.
 It is quite difficult to determine whether a site is down without
exchanging numerous message with other sites.
 Example
– Site X sends a message to site Y and expects a response from Y but
does not receive it.
 The message was not delivered to Y because of communication failure.
 Site Y is down and could not respond.
 Site Y is running and sent a response, but the response was not delivered.
 Another problem with distributed recovery is distributed
commit.
– When a transaction is updating data at several sites, it cannot commit
until it is sure that the effects of the transaction on every site cannot be
lost.

DDBMS (1).pptanmhfvnmbjk bjhjjjhkjkhjkjk

Recommended

Recommended

More Related Content

Similar to DDBMS (1).pptanmhfvnmbjk bjhjjjhkjkhjkjk

Similar to DDBMS (1).pptanmhfvnmbjk bjhjjjhkjkhjkjk (20)

Recently uploaded

Recently uploaded (20)

DDBMS (1).pptanmhfvnmbjk bjhjjjhkjkhjkjk