2. Outline
1 Distributed Database Concepts
2 Data Fragmentation, Replication and Allocation
3 Types of Distributed Database Systems
4 Query Processing
5 Concurrency Control and Recovery
6 3-Tier Client-Server Architecture
3. Distributed Database Concepts
A distributed database system processes a unit of execution (a transaction) in
a distributed manner; that is, a transaction can be executed
by multiple networked computers in a unified manner.
It can be defined as follows:
A distributed database (DDB) is a collection of multiple
logically related databases distributed over a computer
network, and a distributed database management system
(DDBMS) is a software system that manages a distributed
database while making the distribution transparent to
the user.
7. Distributed Database System
Advantages
1. Management of distributed data with different levels
of transparency: the physical placement of data (files,
relations, etc.) is not known to the user (distribution
transparency).
[Figure: Sites 1-5 connected by a communications network]
8. Distributed Database System
Advantages
The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented
horizontally and stored with possible replication as shown below.
[Figure: fragment allocation over a communications network]
Chicago (headquarters): EMPLOYEES - All; PROJECTS - All; WORKS_ON - All
New York: EMPLOYEES - New York; PROJECTS - All; WORKS_ON - New York employees
San Francisco: EMPLOYEES - San Francisco and LA; PROJECTS - San Francisco; WORKS_ON - San Francisco employees
Los Angeles: EMPLOYEES - LA; PROJECTS - LA and San Francisco; WORKS_ON - LA employees
Atlanta: EMPLOYEES - Atlanta; PROJECTS - Atlanta; WORKS_ON - Atlanta employees
9. Distributed Database System
Advantages
• Distribution and network transparency: users do not have to
worry about the operational details of the network. Location
transparency refers to the freedom to issue a command from any
location without affecting its working. Naming transparency
allows access to any named object (files, relations, etc.) from
any location.
• Replication transparency: copies of a data item can be stored at
multiple sites, as shown in the diagram above. This is done to
minimize access time to the required data.
• Fragmentation transparency: a relation can be fragmented
horizontally (a subset of the tuples of a relation) or vertically
(a subset of the columns of a relation).
10. Distributed Database System
Advantages
2. Increased reliability and availability: reliability is the probability
that the system is up and running at a given moment, and availability is
the probability that the system is continuously available (usable or
accessible) during a time interval. A distributed database system has
multiple nodes (computers), so if one fails, others are available to do
the job.
3. Improved performance: a distributed DBMS fragments the
database to keep data closer to where it is needed most. This reduces
data management (access and modification) time significantly.
4. Easier expansion (scalability): new nodes (computers) can be
added at any time without changing the entire configuration.
11. Types of Distributed Database Systems
Homogeneous
All sites of the database system have an identical setup, i.e., the same
database system software. The underlying operating systems may differ. For
example, all sites run Oracle or DB2, or Sybase, or some other database
system, while the underlying operating systems can be a mixture of Linux,
Windows, Unix, etc. The clients then use identical client software.
[Figure: Sites 1-5, all running Oracle, on a mix of Linux, Windows, and Unix, connected by a communications network]
12. Types of Distributed Database Systems
Heterogeneous
Federated: each site may run a different database system, but data access
is managed through a single conceptual schema. This implies that the
degree of local autonomy is minimal: each site must adhere to a
centralized access policy. There may be a global schema.
Multidatabase: there is no single conceptual global schema. For data access,
a schema is constructed dynamically as needed by the application software.
[Figure: Sites 1-5 running different DBMSs (relational, object-oriented, hierarchical, network) on a mix of Linux, Unix, and Windows, connected by a communications network]
13. Types of Distributed Database Systems
Federated Database Management Systems: issues
• Differences in data models: relational, object-oriented,
hierarchical, network, etc.
• Differences in constraints: each site may have its own
data access and processing constraints.
• Differences in query language: some sites may use SQL,
some SQL-89, some SQL-92, and so on.
14. Data Fragmentation, Replication and
Allocation
Data Fragmentation
Which site should be used to store which portion of the database?
A relation is split into logically related and correct parts;
the simplest logical unit of fragmentation is the relation itself.
A relation can be fragmented in two ways:
15. Horizontal fragmentation
• Divides a relation "horizontally".
• A horizontal fragment is a subset of a relation containing
those tuples that satisfy a selection condition.
• Consider the Employee relation with the selection condition
(DNO = 5). All tuples satisfying this condition form a
subset that is a horizontal fragment of the Employee
relation (sketched below).
• A selection condition may be composed of several conditions
connected by AND or OR.
• Derived horizontal fragmentation: the partitioning of a
secondary relation according to the fragmentation of a primary
relation to which it is related by a foreign key.
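The selection view of horizontal fragmentation can be made concrete in a few lines of Python. This is a minimal sketch over invented toy tuples (the slides fix only the DNO = 5 condition); σC(R) becomes a list filter:

```python
# Horizontal fragmentation as selection: each fragment holds the tuples
# that satisfy one condition. Toy Employee tuples for illustration.
employees = [
    {"ENO": 1, "Name": "Smith",  "DNO": 5},
    {"ENO": 2, "Name": "Wong",   "DNO": 5},
    {"ENO": 3, "Name": "Zelaya", "DNO": 4},
]

def horizontal_fragment(relation, condition):
    """sigma_C(R): keep only the tuples for which the condition holds."""
    return [t for t in relation if condition(t)]

frag_d5   = horizontal_fragment(employees, lambda t: t["DNO"] == 5)
frag_rest = horizontal_fragment(employees, lambda t: t["DNO"] != 5)

# The two conditions are complete (every tuple satisfies one of them)
# and disjoint, so their UNION reconstructs the original relation.
assert sorted(frag_d5 + frag_rest, key=lambda t: t["ENO"]) == employees
```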
16. Vertical fragmentation
• Divides a relation vertically, by columns.
• A vertical fragment is a subset of a relation created from a
subset of its columns; it contains the values of the selected
columns only. No selection condition is used in vertical
fragmentation.
• Consider the Employee relation, with two vertical fragments
(sketched below):
1) Name, Bdate, Sex, and Address
2) ENO, Salary, SuperENO, and DNO
• Because there is no condition for creating a vertical fragment,
each fragment must include the primary key attribute (ENO) of the
parent relation Employee. In this way all vertical fragments
of a relation are connected.
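A matching sketch for vertical fragmentation, again over invented toy tuples: ΠL(R) becomes a per-tuple column projection, and both fragments carry the key ENO:

```python
# Vertical fragmentation as projection: each fragment keeps a subset of
# the columns, and every fragment carries the primary key ENO so the
# fragments stay connected. Toy tuples for illustration.
employees = [
    {"ENO": 1, "Name": "Smith", "Bdate": "1965-01-09", "Salary": 30000, "DNO": 5},
    {"ENO": 2, "Name": "Wong",  "Bdate": "1955-12-08", "Salary": 40000, "DNO": 5},
]

def vertical_fragment(relation, columns):
    """pi_L(R): project every tuple onto the attribute list L."""
    return [{c: t[c] for c in columns} for t in relation]

personal = vertical_fragment(employees, ["ENO", "Name", "Bdate"])
job      = vertical_fragment(employees, ["ENO", "Salary", "DNO"])
print(personal[0])  # {'ENO': 1, 'Name': 'Smith', 'Bdate': '1965-01-09'}
```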
17. Data Fragmentation, Replication and
Allocation
Representation
Horizontal fragmentation
Each horizontal fragment of a relation R can be specified
by a σCi(R) operation in the relational algebra.
Complete horizontal fragmentation
A set of horizontal fragments whose conditions C1, C2,
…, Cn include all the tuples in R; that is, every tuple in
R satisfies (C1 OR C2 OR … OR Cn).
Disjoint complete horizontal fragmentation: no tuple in
R satisfies (Ci AND Cj) for any i ≠ j.
To reconstruct R from its horizontal fragments, a UNION is
applied.
18. Data Fragmentation, Replication and
Allocation
Representation
Vertical fragmentation
A vertical fragment of a relation R can be specified by a
ΠLi(R) operation in the relational algebra.
Complete vertical fragmentation
A set of vertical fragments whose projection lists L1, L2,
…, Ln include all the attributes in R but share only the
primary key of R. In this case the projection lists satisfy
the following two conditions:
L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
Li ∩ Lj = PK(R) for any i ≠ j, where ATTRS(R) is the set
of attributes of R and PK(R) is the primary key of R.
To reconstruct R from complete vertical fragments, an
OUTER JOIN is applied, as sketched below.
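A hedged sketch of that reconstruction on toy fragments: because complete vertical fragments all carry PK(R), a plain key join suffices here, standing in for the OUTER JOIN of the general case:

```python
# Reconstructing R from complete vertical fragments by joining on the
# shared primary key PK(R) = ENO. Toy fragments for illustration.
personal = [{"ENO": 1, "Name": "Smith"}, {"ENO": 2, "Name": "Wong"}]
job      = [{"ENO": 1, "Salary": 30000}, {"ENO": 2, "Salary": 40000}]

def join_on_key(f1, f2, key):
    """Merge matching tuples of the two fragments on the key attribute."""
    index = {t[key]: t for t in f2}
    return [{**t, **index[t[key]]} for t in f1 if t[key] in index]

print(join_on_key(personal, job, "ENO"))
# [{'ENO': 1, 'Name': 'Smith', 'Salary': 30000},
#  {'ENO': 2, 'Name': 'Wong', 'Salary': 40000}]
```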
19. Data Fragmentation, Replication and
Allocation
Mixed (hybrid) fragmentation
A combination of vertical fragmentation and horizontal
fragmentation.
This is achieved by a SELECT-PROJECT operation,
represented by ΠLi(σCi(R)).
If C = True (select all tuples) and L ≠ ATTRS(R), we get a
vertical fragment; if C ≠ True and L = ATTRS(R), we get a
horizontal fragment; and if C ≠ True and L ≠ ATTRS(R), we
get a mixed fragment.
If C = True and L = ATTRS(R), then R itself can be considered
a fragment. A one-line sketch follows.
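As a one-line illustration (toy data, invented values), a mixed fragment is the filter and the projection composed:

```python
# A mixed fragment pi_L(sigma_C(R)): filter by C, then project onto L.
employees = [
    {"ENO": 1, "Name": "Smith",  "Salary": 30000, "DNO": 5},
    {"ENO": 3, "Name": "Zelaya", "Salary": 25000, "DNO": 4},
]
L = ["ENO", "Name"]                 # projection list (includes the key)
C = lambda t: t["DNO"] == 5         # selection condition
print([{a: t[a] for a in L} for t in employees if C(t)])
# [{'ENO': 1, 'Name': 'Smith'}]
```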
21. Data Fragmentation, Replication and
Allocation
Fragmentation schema
A definition of a set of fragments (horizontal, vertical,
or both) that includes all the attributes and tuples in the
database and satisfies the condition that the whole
database can be reconstructed from the fragments by
applying some sequence of OUTER UNION (or OUTER
JOIN) and UNION operations.
Allocation schema
Describes the distribution of fragments to the sites of the
distributed database. The allocation can be fully or
partially replicated, or partitioned.
22. Data Fragmentation, Replication and
Allocation
Data Replication
In full replication the entire database is replicated at all
sites; in partial replication some selected parts are
replicated at some of the sites. Data replication is achieved
through a replication schema.
Full replication improves availability, and hence
performance for retrieval queries, but slows down update
operations.
No replication (nonredundant allocation): each fragment is
stored at exactly one site.
23. Data Distribution (Data Allocation)
Each fragment, or each copy of a fragment, must be assigned to a
particular site.
This is relevant only in the case of partial replication or
partitioning: the selected portions of the database are
distributed to the database sites.
The choice of sites and the degree of replication depend on the
performance and availability goals of the system.
24. Architectural Models of Distributed Databases
Systems consisting of multiple DBMSs are characterized with respect to
autonomy, distribution, and heterogeneity.
Autonomy: refers to the distribution of control, not of data.
The dimensions of autonomy are
– Design autonomy
– Communication autonomy
– Execution autonomy
Distribution: refers to the distribution of data
– Client-server
– Peer-to-peer
Heterogeneity: ranges from hardware heterogeneity and networking
protocols to data models, query languages, and transaction management
protocols.
27. Types of client-server
architecture
There are two types of client-server architecture:
Multi-client, single server
Multi-client, multi-server:
– The client manages its own connections to the appropriate
servers. This architecture simplifies the server code but
loads client machines with additional responsibilities
("heavy client").
– The client knows only its home server, which then
communicates with other servers as required. This loads the
server more and is called the "light client" approach.
29. Peer-to-peer database
architecture
Global conceptual schema: gives the enterprise view of
data; the union of all local conceptual schemas.
Fragmentation and allocation schema: takes care of the
fragmentation and allocation of data.
Local conceptual schema: the logical organization of
data at each site.
External schemas: user access and user applications.
32. Components of DDBMS
(contd.)
User interface handler: interprets user commands as they come
in, and formats the result data as it is sent to the user.
Semantic data controller: uses the integrity constraints and
authorizations defined as part of the global conceptual
schema to check whether the user query can be processed.
Global query optimizer and decomposer: determines an execution
strategy that minimizes a cost function, and translates the global
queries into local ones using the global and local conceptual
schemas as well as the global directory.
Global execution monitor: coordinates the distributed
execution of the user request, communicating with the other
execution monitors.
33. Components of DDBMS
(contd.)
The data processor consists of:
Local query processor: acts as the access path selector,
responsible for choosing the best access path to any
data item.
Local recovery manager: responsible for making sure
that the local database remains consistent even when
failures occur.
Runtime support processor: the interface to the OS;
contains the database buffer manager, which is
responsible for maintaining the main-memory buffers
and managing data accesses.
34. Query Processing and
Optimization in
Distributed Databases
• Distributed Query Processing
• Data Transfer Costs of Distributed Query Processing
• Distributed Query Processing Using Semijoin
• Query and Update Decomposition
37. Query Processing in Distributed Databases
The result of this query will have 10,000 tuples, assuming that
every employee is related to a department.
Suppose each result tuple is 40 bytes long. The query is
submitted at site 3 and the result is sent to that site.
Problem: the Employee and Department relations are not present
at site 3.
Strategies:
1. Transfer Employee and Department to site 3. Total
transfer = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send
the result to site 3. Query result size = 40 * 10,000 =
400,000 bytes. Total transfer = 400,000 + 1,000,000 =
1,400,000 bytes.
38. Query Processing in Distributed Databases
Strategies (continued):
3. Transfer the Department relation to site 1, execute the join at
site 1, and send the result to site 3. Total bytes transferred
= 400,000 + 3,500 = 403,500 bytes.
Optimization criterion: minimize data transfer.
Preferred approach: strategy 3.
Now consider the query
Q': For each department, retrieve the department name and
the name of the department manager.
Relational algebra expression:
ΠFname,Lname,Dname (Employee ⋈Mgrssn=SSN Department)
39. Query Processing in Distributed Databases
The result of this query will have 100 tuples, assuming that
every department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and
perform the join at site 3. Total bytes transferred =
1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send
the result to site 3. Query result size = 40 * 100 = 4,000
bytes. Total transfer = 4,000 + 1,000,000 = 1,004,000
bytes.
3. Transfer the Department relation to site 1, execute the join at
site 1, and send the result to site 3. Total transfer = 4,000 +
3,500 = 7,500 bytes.
The costs for both queries can be tabulated mechanically, as in the
sketch below.
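The comparison is simple arithmetic over the byte counts quoted on these slides (1,000,000 bytes for Employee at site 1, 3,500 bytes for Department at site 2, result sizes 400,000 for Q and 4,000 for Q'). A small Python sketch:

```python
# Transfer costs of the three strategies, using the byte counts from
# the slides: Employee = 1,000,000 bytes (site 1), Department = 3,500
# bytes (site 2); result = 400,000 bytes for Q, 4,000 bytes for Q'.
EMP, DEPT = 1_000_000, 3_500

def strategy_costs(result_size):
    return {
        "1: ship both relations to site 3": EMP + DEPT,
        "2: ship Employee to site 2, result to site 3": EMP + result_size,
        "3: ship Department to site 1, result to site 3": DEPT + result_size,
    }

for query, result_size in [("Q", 400_000), ("Q'", 4_000)]:
    costs = strategy_costs(result_size)
    best = min(costs, key=costs.get)
    print(query, costs, "-> preferred:", best)
```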
40. Query Processing in Distributed Databases
Preferred strategy: choose strategy 3.
Now suppose the result site is site 2. Possible strategies:
1. Transfer the Employee relation to site 2, execute the query, and
present the result to the user at site 2. Total transfer size =
1,000,000 bytes for both queries Q and Q'.
2. Transfer the Department relation to site 1, execute the join at
site 1, and send the result back to site 2. Total transfer size for
Q = 400,000 + 3,500 = 403,500 bytes and for Q' = 4,000 +
3,500 = 7,500 bytes.
41. Query Processing in Distributed Databases
Semijoin: the objective is to reduce the number of tuples in a
relation before transferring it to another site, as in the sketch below.
Example execution of Q or Q':
1. Project the join attributes of Department at site 2, and
transfer them to site 1.
For Q, 4 * 100 = 400 bytes are transferred; for
Q', 9 * 100 = 900 bytes are transferred.
2. Join the transferred file with the Employee relation at site
1, and transfer the required attributes from the resulting file
to site 2.
For Q, 34 * 10,000 = 340,000 bytes are transferred; for
Q', 39 * 100 = 3,900 bytes are transferred.
3. Execute the query by joining the transferred file with
Department, and present the result to the user at site 2.
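A sketch of the three semijoin steps on invented toy tuples; the join attribute for Q is assumed to be the department number DNO, consistent with the slides' byte counts:

```python
# The three semijoin steps on toy tuples; only the join-attribute
# values and the reduced relation cross the network.
department = [{"DNO": 5, "Dname": "Research"}]          # at site 2
employee = [                                            # at site 1
    {"SSN": "111", "Fname": "John",   "Lname": "Smith",  "DNO": 5},
    {"SSN": "222", "Fname": "Alicia", "Lname": "Zelaya", "DNO": 4},
]

# Step 1 (site 2 -> site 1): ship only Department's join-attribute values.
dept_keys = {d["DNO"] for d in department}

# Step 2 (site 1 -> site 2): semijoin - keep only the Employee tuples
# that will join, projected onto the attributes the query needs.
reduced = [
    {"Fname": e["Fname"], "Lname": e["Lname"], "DNO": e["DNO"]}
    for e in employee if e["DNO"] in dept_keys
]

# Step 3 (site 2): complete the join locally and present the result.
print([
    {**e, "Dname": d["Dname"]}
    for e in reduced for d in department if e["DNO"] == d["DNO"]
])
```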
43. Replication
Synchronous: all copies of a modified relation must be updated
before the modifying transaction commits; before an update
transaction commits, it synchronizes all copies of the modified data.
Asynchronous: copies of a modified relation are updated only
periodically, so a transaction that reads different copies of the
same relation may see different values; this compromises distributed
data independence but can be implemented more efficiently.
44. Synchronous Replication
There are two basic techniques for ensuring that transactions see
the same value regardless of which copy of an object they access.
1) Voting: a transaction must write a majority of copies in order to
modify an object, and read at least enough copies to make sure that
one of the copies is current.
E.g., if there are 10 copies and 7 copies are written by update
transactions, then at least 4 copies must be read (see the sketch
below).
Each copy has a version number, and the copy with the highest
version number is current.
This technique is not attractive in most situations because reading
an object requires reading multiple copies; in most applications,
objects are read much more frequently than they are updated, and
efficient performance on reads is very important.
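The example's arithmetic generalizes: with N copies and at least W copies written per update, a reader must read N - W + 1 copies to be guaranteed to overlap the most recent write. A tiny sketch:

```python
# Voting: with N copies, if every update writes at least W copies, then
# reading any N - W + 1 copies is guaranteed to include a current copy
# (the one with the highest version number).
def min_reads(n_copies, writes):
    return n_copies - writes + 1

print(min_reads(10, 7))  # 4, matching the example above
```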
45. Synchronous Replication
2) Read-any write-all
To read an object, a transaction can read any one copy, but
to write an object, it must write all copies.
Reads are fast, especially if a local copy exists, but
writes are slower relative to the first technique.
This technique is attractive when reads are much more
frequent than writes, and it is usually adopted for
implementing synchronous replication.
46. Synchronous Replication
Synchronous replication comes at a significant cost:
1. Before an update transaction can commit, it must obtain
exclusive locks on all copies of the modified data (assuming
the read-any write-all technique is used).
2. The transaction may have to send lock requests to remote sites
and wait for the locks to be granted, and during this potentially
long period it continues to hold all its other locks.
3. If sites or communication links fail, the transaction cannot
commit until all sites at which it has modified data recover and
are reachable.
4. Finally, even if locks are obtained readily and there are no
failures, committing a transaction requires several additional
messages to be sent as part of a commit protocol.
47. Primary Site versus Peer-to-Peer
Replication
In primary site asynchronous replication, one copy of a
relation is designated as the primary or master copy.
Replicas of the entire relation or of fragments of the relation can
be created at other sites; these are secondary copies, and, unlike
the primary copy, they cannot be updated.
A common mechanism for setting up primary and secondary
copies is that users first register or publish the relation at the
primary site and subsequently subscribe to a fragment of a
registered relation from another (secondary) site.
48. Primary Site versus Peer-to-Peer
Replication
In peer-to-peer asynchronous replication, more than one copy
can be designated as updatable, that is, as a master copy.
In addition to propagating changes, a conflict resolution strategy
must be used to deal with conflicting changes made at different
sites.
For example, Joe's age may be changed to 35 at one site and to
38 at another. Which value is 'correct'?
Many more subtle kinds of conflicts can arise, and conflict
resolution tends to be ad hoc.
Peer-to-peer replication is best used in situations that do not
lead to conflicts.
49. Naming Transparency
Data items—such as relations, fragments, and replicas—
must have unique names.
This property is easy to ensure in a centralized database.
In a distributed database, however, we must take care to
ensure that two sites do not use the same name for distinct
data items.
50. Naming Transparency
1) Name server. The name server helps to ensure that the
same name does not get used for different data items.
We can also use the name server to locate a data item,
given the name of the item.
Disadvantages:
1. The name server may become a performance bottleneck,
resulting in poor performance.
2. If the name server crashes, it may not be possible for any
site in the distributed system to continue to run.
51. Naming Transparency
2) Site prefix
Each site prefixes its own site identifier to any name that it
generates.
This approach ensures that no two sites generate the same name,
and no central control is required.
However, it fails to achieve location transparency, since site
identifiers are attached to names: the account relation might be
referred to as site17.account, or account@site17, rather than
simply as account.
Many database systems use the Internet address of a site to
identify it.
52. Solution: the database system can create a set of alternative
names, or aliases, for data items.
A user may thus refer to data items by simple names that are
translated by the system to complete names. The mapping of
aliases to the real names can be stored at each site, as in the
sketch below.
With aliases, the user can be unaware of the physical location
of a data item. Furthermore, the user is unaffected if the
database administrator decides to move a data item from one
site to another.
Users should not have to refer to a specific replica of a data
item. Instead, the system should determine which replica to
reference on a read request, and should update all replicas on
a write request.
This can be ensured by maintaining a catalog table, which the
system uses to determine all replicas of a data item.
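A minimal sketch of the alias idea; the catalog contents are invented, apart from the site17.account example above:

```python
# Alias resolution via a catalog: users see simple names; the system
# maps them to complete, site-prefixed names.
catalog = {
    "account": "site17.account",
    "branch":  "site4.branch",     # hypothetical entry
}

def resolve(alias):
    """Translate a simple name to its complete name via the catalog."""
    return catalog.get(alias, alias)   # unknown names pass through

print(resolve("account"))              # site17.account
catalog["account"] = "site9.account"   # administrator moves the item
print(resolve("account"))              # site9.account; user code unchanged
```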
53. Concurrency Control and Recovery
Distributed databases encounter a number of concurrency
control and recovery problems that are not present in
centralized databases. Some of them are listed below.
Dealing with multiple copies of data items: concurrency
control must maintain global consistency; likewise, the
recovery mechanism must recover all copies and maintain
consistency after recovery.
Failure of individual sites: database availability must
not be affected by the failure of one or two sites, and
the recovery scheme must recover them before they are
made available for use.
54. Concurrency Control and Recovery
Communication link failure: this failure may partition
the network, which can affect database availability even
though all database sites may be running.
Distributed commit: a transaction may be fragmented into
subtransactions executed at a number of sites. This
requires a two-phase or three-phase commit approach for
transaction commit.
Distributed deadlock: since transactions are processed at
multiple sites, two or more sites may become involved in
a deadlock, which must be resolved in a distributed
manner.
55. Concurrency Control and Recovery
Distributed concurrency control based on a distinguished
copy of a data item
Primary site technique:
A single site is designated as the primary site, which serves as
the coordinator for transaction management.
[Figure: Sites 1-5 on a communications network, with one site designated the primary site]
56. Concurrency Control and Recovery
1. Transaction management: concurrency control and
commit are managed by this site. In two-phase locking,
this site manages the locking and releasing of data items. If
all transactions follow the two-phase policy at all sites, then
serializability is guaranteed.
2. Advantages: an extension of centralized two-phase
locking, so implementation and management are simple.
Data items are locked at only one site, but they can be
accessed at any site. Deadlock can be detected easily by
applying a directed wait-for graph at the primary site.
3. Disadvantages: all transaction management activities go
to the primary site, which is likely to overload it. If the
primary site fails, the entire system is inaccessible.
4. To aid recovery, a backup site can be designated that behaves
as a shadow of the primary site.
57. Concurrency Control and Recovery
Primary site with backup site:
In case of primary site failure, the backup site can act as
the primary site. All locking information is maintained at
both the primary and the backup site.
1. Advantage: simplifies the process of recovery.
2. Disadvantage: slows down the process of acquiring locks,
and the original overload problem is not solved.
58. Concurrency Control and Recovery
Primary copy technique:
In this approach, instead of a whole site, individual data item
copies (distributed across sites) are designated as primary
copies. To lock a data item, just the primary copy of that data
item is locked.
1. Advantages: since the primary copies are distributed over
various sites, a single site is not overloaded with locking
and unlocking requests.
2. Disadvantages: identification of a primary copy is
complex; a distributed directory must be maintained,
possibly at all sites.
59. Concurrency Control and Recovery
Recovery from a coordinator failure
In both approaches, a coordinator site or copy may become
unavailable. This requires the selection of a new
coordinator.
Primary site approach with no backup site: abort and
restart all active transactions at all sites, elect a new
coordinator, and initiate transaction processing.
Primary site approach with backup site: suspend all active
transactions, designate the backup site as the primary site, and
identify a new backup site. The new primary site receives all
transaction management information and resumes processing.
Primary and backup sites fail, or no backup site: use an
election process to select a new coordinator site.
60. Concurrency Control and Recovery
Concurrency control based on voting (majority protocol)
There is no primary copy or coordinator.
A lock request is sent to the sites that hold a copy of the data item.
If a majority of those sites grant the lock, then the requesting
transaction gets the data item.
The locking information (granted or denied) is sent to all these sites.
To avoid an unacceptably long wait, a time-out period is defined;
if the requesting transaction does not receive the vote information
in time, it is aborted.
61. Concurrency Control and Recovery
Biased protocol
The difference from the majority protocol is that requests
for shared locks are given more favorable treatment than
requests for exclusive locks.
Shared locks: when a transaction needs to lock data item
Q, it simply requests a lock on Q from the lock manager at
one site that contains a replica of Q.
Exclusive locks: when a transaction needs to lock data
item Q, it requests a lock on Q from the lock manager at
all sites that contain a replica of Q.
Advantage: less overhead on read operations.
Disadvantages: additional overhead on writes, and greater
complexity in handling deadlock.
62. Concurrency Control and Recovery
Quorum consensus protocol
A generalization of the majority protocol: each site is assigned
a nonnegative weight, and read and write operations on an item x
are assigned two integers, called the read quorum Qr and the
write quorum Qw. If S is the total weight of all sites at which
x resides, then the quorums must satisfy
Qr + Qw > S and 2 * Qw > S.
To execute a read operation, enough replicas must be read that
their total weight is ≥ Qr.
To execute a write operation, enough replicas must be written
that their total weight is ≥ Qw.
By appropriately defining the read and write quorums, the cost of
either reads or writes can be selectively reduced, as illustrated
below.
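A small sketch that checks candidate quorums against the two conditions; the site weights and quorum values are invented for illustration:

```python
# Check candidate read/write quorums against Qr + Qw > S and 2*Qw > S.
def valid_quorums(site_weights, qr, qw):
    s = sum(site_weights)
    return qr + qw > s and 2 * qw > s

weights = [1, 1, 1, 1, 1]            # five equally weighted sites, S = 5
print(valid_quorums(weights, 2, 4))  # True: cheap reads, costly writes
print(valid_quorums(weights, 3, 3))  # True: behaves like majority voting
print(valid_quorums(weights, 1, 4))  # False: Qr + Qw = 5 is not > S
```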
63. Concurrency Control and Recovery
Timestamping
Each transaction is given a unique timestamp that the system
uses in deciding the serialization order.
There are two primary methods for generating unique timestamps:
1. Centralized scheme: a single site distributes the timestamps,
using a logical counter or its own local clock.
2. Distributed scheme: each site generates a unique local
timestamp using either a logical counter or the local clock;
a unique global timestamp is obtained by concatenating the
unique local timestamp with the site identifier, which must
itself be unique (see the sketch below).
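A minimal sketch of the distributed scheme (the class and method names are my own): pairing a logical counter with the site identifier yields globally unique, ordered timestamps:

```python
# Distributed timestamp generation: (local logical counter, site id).
# Ordering the pairs lexicographically yields a serialization order in
# which no two sites ever produce the same timestamp.
import itertools

class SiteClock:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = itertools.count(1)   # local logical counter
    def timestamp(self):
        return (next(self.counter), self.site_id)

s1, s2 = SiteClock(1), SiteClock(2)
# Equal counter values at two sites still give distinct global stamps.
print(sorted([s2.timestamp(), s1.timestamp()]))  # [(1, 1), (1, 2)]
```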
65. System Structure
Each site has its own local transaction manager, whose function
is to ensure the ACID properties of those transactions that
execute at that site.
The various transaction managers cooperate to execute global
transactions.
Each site contains two subsystems:
1) The transaction manager manages the execution of those
transactions (or subtransactions) that access data stored in a
local site.
Each such transaction may be either a local transaction or part
of a global transaction
2) The transaction coordinator coordinates the execution of the
various transactions (both local and global) initiated at that site.
67. Each transaction manager is responsible for
• Maintaining a log for recovery purposes
• Participating in an appropriate concurrency-control scheme to
coordinate the concurrent execution of the transactions
executing at that site
The transaction coordinator is responsible for coordinating the
execution of all the transactions initiated at that site.
For each such transaction, the coordinator is responsible for
• Starting the execution of the transaction
• Breaking the transaction into a number of subtransactions and
distributing these subtransactions to the appropriate sites for
execution
• Coordinating the termination of the transaction, which may
result in the transaction being committed at all sites or aborted
at all sites.
68. Commit Protocols
If we are to ensure atomicity, all the sites in which a
transaction T executed must agree on the final outcome of
the execution. T must either commit at all sites, or it must
abort at all sites.
To ensure this property, the transaction coordinator of T
must execute a commit protocol.
Two-Phase Commit
Consider a transaction T initiated at site Si, where the
transaction coordinator is Ci.
When T completes its execution—that is, when all the sites
at which T has executed inform Ci that T has completed—
Ci starts the 2PC protocol.
69. 2 Phase Commit (2PC)
Phase 1. Ci adds the record <prepare T> to the log, and
forces the log onto stable storage.
It then sends a prepare T message to all sites at which T
executed.
On receiving such a message, the transaction manager at
that site determines whether it is willing to commit its
portion of T.
If the answer is no, it adds a record <no T> to the log, and
then responds by sending an abort T message to Ci.
If the answer is yes, it adds a record <ready T> to the log,
and forces the log onto stable storage.
The transaction manager then replies with a ready T
message to Ci.
70. 2 Phase Commit (2PC)
Phase 2. When Ci receives responses to the prepare T message from
all the sites, or when a prespecified interval of time has elapsed
since the prepare T message was sent out, Ci can determine
whether the transaction T can be committed or aborted.
Transaction T can be committed if Ci received a ready T message
from all the participating sites.
Otherwise, transaction T must be aborted.
Depending on the verdict, either a record <commit T> or a record
<abort T> is added to the log and the log is forced onto stable
storage.
At this point, the fate of the transaction has been sealed. Following
this point, the coordinator sends either a commit T or an abort T
message to all participating sites.
When a site receives that message, it records the message in the log.
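The coordinator's decision rule can be sketched compactly. This is an illustrative model only: the Participant class, method names, and in-memory logs are invented stand-ins, and a real system forces each log record to stable storage before sending the next message:

```python
# The coordinator's 2PC decision rule over in-memory stand-in logs.
class Participant:
    def __init__(self, willing):
        self.willing = willing
        self.log = []
    def prepare(self):                       # phase 1, participant side
        self.log.append("<ready T>" if self.willing else "<no T>")
        return "ready" if self.willing else "abort"
    def finish(self, verdict):               # phase 2, participant side
        self.log.append(f"<{verdict} T>")

def two_phase_commit(participants, log):
    log.append("<prepare T>")                # phase 1: solicit votes
    votes = [p.prepare() for p in participants]
    verdict = "commit" if all(v == "ready" for v in votes) else "abort"
    log.append(f"<{verdict} T>")             # the fate of T is sealed
    for p in participants:                   # phase 2: announce verdict
        p.finish(verdict)
    return verdict

print(two_phase_commit([Participant(True), Participant(True)], []))   # commit
print(two_phase_commit([Participant(True), Participant(False)], []))  # abort
```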
71. 2 Phase Commit (2PC)
A site at which T executed can unconditionally abort T at any
time before it sends the message ready T to the coordinator.
Once the message is sent, the transaction is said to be in the
ready state at that site.
The ready T message is, in effect, a promise by a site to follow
the coordinator's order to commit T or to abort T.
To make such a promise, the needed information must first be
stored in stable storage; otherwise, if the site crashes after
sending ready T, it may be unable to make good on its promise.
Further, locks acquired by the transaction must continue to be
held until the transaction completes.
72. 2 Phase Commit (2PC)
The fate of T is sealed as soon as at least one site responds
abort T.
Since the coordinator site Si is one of the sites at which T
executed, the coordinator can decide unilaterally to abort T.
The final verdict regarding T is determined at the time the
coordinator writes that verdict (commit or abort) to the
log and forces it to stable storage.
In some implementations of the 2PC protocol, a site sends an
acknowledge T message to the coordinator at the end of the
second phase of the protocol.
When the coordinator receives the acknowledge T message
from all the sites, it adds the record <complete T> to the log.
73. Handling of Failures
Failure of a participating site.
If the site fails before responding with a ready T message
to Ci, the coordinator assumes that it responded with an
abort T message.
If the site fails after the coordinator has received the ready
T message from the site, the coordinator executes the rest
of the commit protocol in the normal fashion, ignoring the
failure of the site.
74. Handling of Failures
Failure of a participating site (continued).
When a participating site Sk recovers from a failure, it must
examine its log to determine the fate of those transactions that
were in the midst of execution when the failure occurred. Let T
be one such transaction.
• The log contains a <commit T> record: the site executes redo(T).
• The log contains an <abort T> record: the site executes undo(T).
• The log contains no control records (abort, commit, ready)
concerning T: Sk failed before responding to the prepare T
message from Ci, so Ci must abort T; hence, Sk must execute
undo(T).
75. The log contains a <ready T> record.
The site must consult Ci to determine the fate of T.
If Ci is up, it notifies Sk regarding whether T committed or
aborted. So it executes either redo(T) or undo(T).
If Ci is down, Sk must try to find the fate of T from other sites.
It sends a querystatus T message to all the sites in the system.
On receiving such a message, a site must consult its log to
determine whether T has executed there, and if T has, whether
T committed or aborted.
It then notifies Sk about this outcome.
If no site has the appropriate information then Sk can neither
abort nor commit T. The decision concerning T is postponed
until Sk can obtain the needed information.
Thus, Sk must periodically resend the querystatus message to
the other sites.
76. Handling of Failures
Failure of the coordinator
If the coordinator fails in the midst of the execution of the
commit protocol for transaction T, then the participating
sites must decide the fate of T.
In certain cases, the participating sites cannot decide
whether to commit or abort T, and therefore must wait for
the recovery of the failed coordinator.
77. Handling of Failures
Failure of the coordinator (continued)
If an active site contains a <commit T> record in its log,
then T must be committed.
If an active site contains an <abort T> record in its log,
then T must be aborted.
If some active site does not contain a <ready T> record in
its log, then the failed coordinator Ci cannot have decided
to commit T; it may, however, have decided to abort T.
Rather than wait for Ci to recover, it is preferable to
abort T.
78. Handling of Failures
If none of the preceding cases holds, then all active sites must
have a <ready T> record in their logs, but no additional
control records (such as <abort T> or <commit T>).
Since the coordinator has failed, it is impossible to determine
whether a decision has been made, and if one has, what that
decision is, until the coordinator recovers.
Thus, the active sites must wait for Ci to recover. Since the
fate of T remains in doubt, T may continue to hold system
resources; this is the blocking problem.
79. Handling of Failures
Network partition
1. The coordinator and all its participants remain in one
partition. In this case, the failure has no effect on the
commit protocol.
2. The coordinator and its participants belong to several
partitions. From the viewpoint of the sites in one partition,
the sites in the other partitions appear to have failed.
• Sites that are not in the coordinator's partition treat the
situation as a failure of the coordinator.
• The coordinator and the sites in its partition follow the
usual commit protocol, assuming that the sites in the other
partitions have failed.
80. Recovery and
Concurrency Control
The recovery procedure must treat in-doubt transactions
specially.
The recovering site must determine the commit–abort status
of such transactions by contacting other sites.
Normal transaction processing at the site cannot begin until
all in-doubt transactions have been committed or rolled back.
Finding the status of in-doubt transactions can be slow.
Further, recovery potentially could become blocked if 2PC is
used.
As a result, the site performing restart recovery may remain
unusable for a long period.
81. Recovery and Concurrency Control
To circumvent this problem, recovery algorithms typically
provide support for noting lock information in the log:
instead of writing a <ready T> log record, the algorithm
writes a <ready T, L> log record, where L is the list of
write locks held by T.
At recovery time, after performing local recovery actions,
for every in-doubt transaction T, all the write locks noted in
the <ready T, L> log record are reacquired.
After this, transaction processing can start at the site, even
before the commit-abort status of the in-doubt transactions
is determined.
The commit or rollback of in-doubt transactions proceeds
concurrently with the execution of new transactions.
Thus, site recovery is faster and never gets blocked.
82. Three-Phase Commit
The three-phase commit (3PC) protocol is an extension of
the two-phase commit protocol that avoids the blocking
problem under certain assumptions.
It is assumed that no network partition occurs, and not more
than k sites fail, where k is some predetermined number.
The protocol avoids blocking by introducing an extra third
phase where multiple sites are involved in the decision to
commit.
Instead of directly noting the commit decision in its
persistent storage, the coordinator first ensures that at least k
other sites know that it intended to commit the transaction.
If the coordinator fails, the remaining sites first select a new
coordinator.
83. Three-Phase Commit
The new coordinator checks the status of the protocol with
the remaining sites; if the old coordinator had decided to
commit, at least one of the k other sites that it informed
will be up and will ensure that the commit decision is
respected.
The new coordinator restarts the third phase of the
protocol if some site knew that the old coordinator
intended to commit the transaction.
Otherwise, the new coordinator aborts the transaction.
84. Three-Phase Commit
While the 3PC protocol has the desirable property of not
blocking unless k sites fail, it has the drawback that a
partitioning of the network appears the same as more than
k sites failing, which leads to blocking.
The protocol also has to be carefully implemented to
ensure that network partitioning does not result in
inconsistencies, with a transaction committed in one
partition and aborted in another.
Because of its overhead, the 3PC protocol is not widely used.
85. Client-Server Database Architecture
It consists of clients running client software, a set of servers
that provide all database functionality, and a reliable
communication infrastructure.
[Figure: Clients 1..n connected to Servers 1..n through the communication infrastructure]
86. Client-Server Database Architecture
Clients contact a server for a desired service, but a server does
not contact clients.
The server software is responsible for local data management at a
site, much like centralized DBMS software.
The client software is responsible for most of the distribution
function.
The communication software manages communication among
clients and servers.
87. Client-Server Database Architecture
The processing of a SQL query goes as follows (see the sketch below):
1. The client parses the user query and decomposes it into a number
of independent subqueries; each subquery is sent to the
appropriate site for execution.
2. Each server processes its subquery and sends the result to the
client.
3. The client combines the results of the subqueries and produces
the final result.
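A schematic sketch of this flow; the sites, data, and query predicate are invented, and real servers would of course receive SQL rather than Python callables:

```python
# Schematic client-server query flow: the client decomposes the query,
# each server evaluates its subquery locally, and the client combines
# the partial results.
servers = {
    "site1": [{"ENO": 1, "DNO": 5}],
    "site2": [{"ENO": 2, "DNO": 5}, {"ENO": 3, "DNO": 4}],
}

def server_execute(site, condition):
    """Each server processes its subquery over its local fragment."""
    return [t for t in servers[site] if condition(t)]

def client_query(condition):
    partials = [server_execute(site, condition) for site in servers]
    return [t for rows in partials for t in rows]   # combine (union)

print(client_query(lambda t: t["DNO"] == 5))
# [{'ENO': 1, 'DNO': 5}, {'ENO': 2, 'DNO': 5}]
```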
Editor's Notes
Some different database system architectures: (a) a shared-nothing
architecture; (b) a networked architecture with a centralized database at
one of the sites; (c) a truly distributed database architecture.
We need to distinguish distributed databases from multiprocessor systems
that use shared storage (primary memory or disk). For a database to be
called distributed, the following minimum conditions should be satisfied:
• Connection of database nodes over a computer network. There are multiple
computers, called sites or nodes, which must be connected by an underlying
communication network to transmit data and commands among sites.
• Logical interrelation of the connected databases. It is essential that
the information in the databases be logically related.
• Absence of a homogeneity constraint among connected nodes. It is not
necessary that all nodes be identical in terms of data, hardware, and
software.
A deadlock is a situation in which two or more competing actions are each
waiting for the other to finish, and thus neither ever does.
When a statement modifies data, its transaction holds an exclusive lock on
that data, preventing other transactions from accessing it.