1. DISTRIBUTED DATABASE MANAGEMENT SYSTEM
LAXMI INSTITUTE OF COMMERCE AND COMPUTER APPLICATIONS, SARIGAM(BCA) 1
1
SEMINAR
On
“DISTRIBUTED DATABASE MANAGEMENT SYSTEM”
Submitted by
Name: - Patel Vinaykumar Dineshchandra
Class: - B.C.A (Sem-6)
Seat No: - 1732
Submitted to
LAXMI INSTITUTE OF COMMERCE & COMPUTER APPLICATIONS
SARIGAM (BCA)
Laxmi Institute of Commerce & Computer Applications (BCA) Sarigam
SEMINAR REPORT
As a Partial Requirement
For the Degree of
Bachelor of Computer Applications
(B.C.A)
Academic Year: 2015-16
Submitted by:
Patel Vinaykumar Dineshchandra
Guided by:
Internal: Miss Rucha Nage
PREFACE
It is an exciting moment for me to present this seminar report. Proper care was
taken while preparing the report so that it is easy to read and understand. During the
preparation of this seminar report, information technology concepts were applied.
This seminar is part of the third-year study, the final step towards the completion of
the BCA course.
This documentation describes the system's functions in an understandable manner.
The seminar report consists of different sections, such as the specification of the technology, a study of it,
its different functions, and its features, that will help the reader understand the technology
in brief.
ACKNOWLEDGEMENT
I, a student of Laxmi Institute of Commerce & Computer Applications, Sarigam
B.C.A, feel full satisfaction and pleasure in presenting the seminar on
“DISTRIBUTED DATABASE MANAGEMENT SYSTEM”
I have great pleasure in acknowledging the help given by various individuals
throughout the seminar work. This project is itself an acknowledgement of the inspiration,
drive, and technical assistance contributed by many individuals.
I express my sincere and heartfelt gratitude to Dr. Keyur Nayak, Director of the
Department of Computer Applications, for being helpful and co-operative during this period.
I also express my deep gratitude to the faculty member Miss Rucha Nage, Subject
Faculty and Internal Guide, for valuable guidance, good suggestions, and help in the
completion of this seminar.
I extend my sincere thanks to all the faculty members for providing useful and
necessary help. Without the support of every one of them this seminar could not have been completed.
Sincerely,
Patel Vinaykumar Dineshchandra
(T.Y.B.C.A)
INDEX
“DISTRIBUTED DATABASE MANAGEMENT SYSTEM”
SR.NO. DESCRIPTION
1. ABSTRACT
2. INTRODUCTION
3. DEFINITION
4. TYPES
5. FUNCTIONS
6. ADVANTAGES
7. DISADVANTAGES
ABSTRACT
The purpose of this paper is to present an introduction to distributed databases through
two main parts: in the first part, we present a study of the fundamentals of distributed
database systems (DDBS).
We discuss issues related to the motivations for distributed database systems, their architecture, design,
performance, concurrency control, etc.
The topics of this research include query optimization, distribution optimization,
fragmentation optimization, and join optimization on the Internet.
We include examples and results to demonstrate the topics we are presenting.
INTRODUCTION
In today’s world of universal dependence on information systems, all sorts of people
need access to companies’ databases. In addition to a company’s own employees,
these include the company’s customers, potential customers, suppliers, and vendors of
all types. It is possible for a company to have all of its databases concentrated at one
mainframe computer site with worldwide access to this site provided by
telecommunications networks, including the Internet.
Although the management of such a centralized system and its databases can be
controlled in a well-contained manner and this can be advantageous, it poses some
problems as well. For example, if the single site goes down, then everyone is blocked
from accessing the databases until the site comes back up again. Also, the
communication costs between the many remote PCs and terminals and the central site can be
high.
One solution to such problems, and an alternative design to the centralized database
concept, is known as ‘Distributed Database’. The idea is that instead of having one,
centralized database, we are going to spread the data out among the cities on the
distributed network, each of which has its own computer and data storage facilities.
All of this distributed data is still considered to be a single logical database.
When a person or process anywhere on the distributed network queries the database, it
is not necessary to know where on the network the data being sought is located. The
user just issues the query, and the result is returned. This feature is known as
‘Location Transparency’. This can become rather complex very quickly, and it must
be managed by sophisticated software known as a ‘Distributed Database Management
System’ or ‘Distributed DBMS’.
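Location transparency can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the `Site` and `DistributedDatabase` classes, the site names, and the table names are all hypothetical and do not come from any real DDBMS. The point it shows is that the caller names only the table, while a catalog held by the system decides which site answers.

```python
class Site:
    """One node on the network with its own local storage (a dict here)."""

    def __init__(self, name):
        self.name = name
        self.tables = {}  # table name -> list of rows

    def select(self, table):
        return list(self.tables.get(table, []))


class DistributedDatabase:
    """Presents several sites to the user as a single logical database."""

    def __init__(self, sites):
        self.sites = {site.name: site for site in sites}
        self.catalog = {}  # table name -> name of the site that stores it

    def place(self, table, site_name, rows):
        """Store a table at one site and record its location in the catalog."""
        self.sites[site_name].tables[table] = list(rows)
        self.catalog[table] = site_name

    def query(self, table):
        """The caller never says where the data lives; the catalog does."""
        site = self.sites[self.catalog[table]]
        return site.select(table)


# Hypothetical two-site network.
ddb = DistributedDatabase([Site("site_a"), Site("site_b")])
ddb.place("customers", "site_a", [{"id": 1, "name": "Asha"}])
ddb.place("orders", "site_b", [{"id": 10, "customer_id": 1}])
print(ddb.query("orders"))
```

Moving a table to another site would only change the catalog entry; code that calls `query` would stay the same.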
DEFINITION
A distributed database (DDB) is a collection of multiple, logically interrelated
databases distributed over a computer network.
A distributed database management system (DDBMS) is the software that manages
the DDB, and provides an access mechanism that makes this distribution transparent
to the user.
A distributed database system (DDBS) is the integration of a distributed DB and a
distributed DBMS.
This integration is achieved by merging database and networking
technologies.
Alternatively, it can be described as a system that runs on a collection of machines that do not
have shared memory, yet looks to the user like a single machine.
There are two types of distributed DBMS:
1. Homogeneous Distributed DBMS
2. Heterogeneous Distributed DBMS
1. Homogeneous Distributed DBMS: -
All sites of the database system have an identical setup, i.e., the same database
system software. The underlying operating systems may differ. For
example, all sites run Oracle, or all run DB2, or all run Sybase or some other database
system, while the underlying operating systems can be a mixture of Linux,
Windows, UNIX, etc. The clients thus have to use identical client software.
2. Heterogeneous Distributed DBMS: -
Federated: Each site may run a different database system, but data access
is managed through a single conceptual schema. This implies that the
degree of local autonomy is minimal: each site must adhere to a
centralized access policy. There may be a global schema.
Multi-database: There is no single conceptual global schema. For data
access, a schema is constructed dynamically as needed by the application
software.
[Figure: a heterogeneous distributed database. Sites 1 to 5 are connected through a communications network; the sites run different DBMS types (relational, object-oriented, network, hierarchical) on different operating systems (Linux, Unix, Windows).]
Architecture of a DDBMS
Each computer (site) in a distributed system may contain a Transaction Manager
(TM) and a Data Manager (DM); as we will see later, there is also a Transaction
Coordinator (TC). The TM is responsible for the transactions received by the
computer. The DM manages database access on the local computer.
When a transaction arrives at the TM, the TM divides it into
sub-transactions, which are transmitted to the DMs holding the data needed by
the transaction. (In some cases the TC is responsible for this.)
The TM collects the data returned in the sub-transactions'
responses and produces the final result.
Any TM can communicate with all DMs and vice versa.
NOTE: DMs cannot transmit data to other DMs, and the same applies to TMs,
except in certain cases where it is convenient to transfer the total responsibility for
a transaction from one TM to another (e.g. if the transaction can run as a local
transaction on another computer).
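The division of work between a TM and the DMs can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: a transaction is just a list of keys to read, the `location` mapping from key to site is hypothetical, and the Transaction Coordinator, writes, and failures are left out.

```python
from collections import defaultdict


class DataManager:
    """Manages access to the data stored at one site."""

    def __init__(self, site, data):
        self.site = site
        self.data = dict(data)

    def execute(self, keys):
        """Run one sub-transaction: read the requested keys locally."""
        return {key: self.data[key] for key in keys}


class TransactionManager:
    """Splits a transaction into sub-transactions and merges the responses."""

    def __init__(self, data_managers, location):
        self.dms = {dm.site: dm for dm in data_managers}
        self.location = location  # key -> site that holds it (assumed known)

    def run(self, keys):
        # Divide the transaction into one sub-transaction per site.
        sub_transactions = defaultdict(list)
        for key in keys:
            sub_transactions[self.location[key]].append(key)
        # Send each sub-transaction to its DM and collect the responses.
        result = {}
        for site, site_keys in sub_transactions.items():
            result.update(self.dms[site].execute(site_keys))
        return result


dm1 = DataManager("site1", {"a": 1, "b": 2})
dm2 = DataManager("site2", {"c": 3})
tm = TransactionManager([dm1, dm2], {"a": "site1", "b": "site1", "c": "site2"})
print(tm.run(["a", "c"]))
```

Note that the DMs never talk to each other here; all data flows through the TM, in line with the note above.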
CHARACTERISTICS OF DISTRIBUTED DBMS
A Distributed DBMS developed by a single vendor may contain:
1. Data Independence
2. Concurrency Control
3. Replication facilities
4. Recovery facilities
5. Co-ordinated Data Dictionary
These are discussed in detail below.
Data Independence: -
- A database system normally contains a lot of data in addition to users’ data. For
example, it stores data about data, known as metadata, to locate and retrieve data
easily. It is rather difficult to modify or update a set of metadata once it is stored
in the database. But as a DBMS expands, it needs to change over time to satisfy
the requirements of the users. If all the data were interdependent, every such change would become a
tedious and highly complex job.
- Metadata itself follows a layered architecture, so that when we change data at one
layer, it does not affect the data at another layer. The layers are independent of but
mapped to each other.
a. Logical Data Independence: - Logical data is data about the database; that is, it
stores information about how data is managed inside. An example is a table
(relation) stored in the database together with all the constraints applied to that relation.
Logical data independence is a mechanism that decouples this logical description
from the actual data stored on the disk. If we make changes to the table format, they
should not change the data residing on the disk.
b. Physical Data Independence: - All the schemas are logical, and the actual
data is stored in bit format on the disk. Physical data independence is the
power to change the physical data without impacting the schema or logical
data. For example, if we want to change or upgrade the storage system
itself (say, replace hard disks with SSDs), it should not have
any impact on the logical data or schemas.
Concurrency Control: -
- Concurrency control is a database management system (DBMS) concept that is
used to address conflicts arising from the simultaneous access to or alteration of data that
can occur in a multi-user system. It ensures that database transactions are
performed concurrently without violating the data integrity of the
respective databases. Thus concurrency control is an essential element for
correctness in any system where two or more database transactions, executed with
time overlap, can access the same data.
- Concurrency control protocols can be broadly divided into two categories:
a. Lock-based protocols
b. Timestamp-based protocols
a. Lock-based protocols: - A lock is a mechanism that tells the
DBMS whether a particular data item is being used by any transaction for
read/write purposes. Since there are two types of operations, i.e. read and
write, whose basic natures are different, the locks for read and write
operations may behave differently.
Read operations performed by different transactions on the same data
item do not interfere with one another. The value of the data item, if constant, can be read by any
number of transactions at any given time. If a transaction is reading the
content of a sharable data item, then any number of other transactions can
be allowed to read the content of the same data item.
A write operation is different. When a transaction writes a
value into a data item, the content of that data item remains in an
inconsistent state from the moment the write
begins until the moment it is over.
So if any transaction is writing into a sharable data item, no other
transaction is allowed to read or write that same data item.
Database systems equipped with lock-based protocols use a
mechanism by which no transaction can read or write data until it
acquires an appropriate lock on it.
Locks are of two kinds:
1) Binary Locks: - A lock on a data item can be in one of two states; it is
either locked or unlocked.
2) Shared/Exclusive Locks: -
i) Shared Lock: A transaction may acquire a shared lock on a data
item in order to read its content. The lock is shared in the sense
that any other transaction can acquire a shared lock on that
same data item for reading purposes.
ii) Exclusive Lock: A transaction may acquire an exclusive lock on a
data item in order to both read it and write into it. The lock is
exclusive in the sense that no other transaction can acquire any
kind of lock (either shared or exclusive) on that same data item.
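The compatibility rule for shared and exclusive locks can be sketched as a tiny lock table in Python. This is an illustrative model only: the `LockManager` class is hypothetical, a refused request simply returns False instead of queueing the transaction, and lock upgrades (shared to exclusive) are not handled.

```python
class LockManager:
    """Grants shared ('S') and exclusive ('X') locks on data items."""

    def __init__(self):
        self.locks = {}  # data item -> (mode, set of transactions holding it)

    def acquire(self, txn, item, mode):
        """Return True if the lock is granted, False if txn would have to wait."""
        if item not in self.locks:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = self.locks[item]
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # shared locks are compatible with each other
            return True
        return False  # any combination involving an exclusive lock conflicts

    def release(self, txn, item):
        entry = self.locks.get(item)
        if entry is None:
            return
        _, holders = entry
        holders.discard(txn)
        if not holders:
            del self.locks[item]  # last holder gone: the item is unlocked


manager = LockManager()
manager.acquire("T1", "x", "S")          # T1 reads x
manager.acquire("T2", "x", "S")          # T2 may read x at the same time
print(manager.acquire("T3", "x", "X"))   # T3 wants to write x while it is being read
```

Once T1 and T2 release their shared locks, the same exclusive request from T3 would be granted.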
There are four types of lock protocols available; two of them are described here:
1) Simplistic Lock Protocol: - Simplistic lock-based protocols
allow transactions to obtain a lock on every object before a
'write' operation is performed. Transactions may unlock the
data item after completing the ‘write’ operation.
2) Pre-claiming Lock Protocol: - A transaction under a pre-claiming protocol
evaluates its operations and creates a list of the data items on
which it needs locks. Before initiating execution, the
transaction requests all the locks it needs from the system
beforehand. If all the locks are granted, the transaction executes
and releases all the locks when all its operations are over. If not all
the locks are granted, the transaction rolls back and waits
until all the locks are granted.
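The all-or-nothing behaviour of pre-claiming can be sketched in Python as shown below. This is a simplified model under stated assumptions: the `PreclaimingScheduler` class is hypothetical, every lock is treated as exclusive, and a transaction that cannot get all its locks is simply told to try again later rather than being queued.

```python
class PreclaimingScheduler:
    """Grants a transaction all the locks it asks for, or none of them."""

    def __init__(self):
        self.owner = {}  # data item -> transaction currently holding its lock

    def try_start(self, txn, items):
        """Request every lock up front; return True only if all are free."""
        if any(item in self.owner for item in items):
            return False  # at least one lock is taken: txn must wait
        for item in items:
            self.owner[item] = txn
        return True

    def finish(self, txn):
        """Release every lock held by txn once all its operations are over."""
        for item in [i for i, holder in self.owner.items() if holder == txn]:
            del self.owner[item]


scheduler = PreclaimingScheduler()
scheduler.try_start("T1", {"a", "b"})          # T1 gets both locks and runs
print(scheduler.try_start("T2", {"b", "c"}))   # "b" is busy, so T2 gets nothing
scheduler.finish("T1")
print(scheduler.try_start("T2", {"b", "c"}))   # now both locks are free
```

Because T2 holds no locks while it waits, it can never block another transaction with a partially acquired set.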
b. Timestamp-based Protocols: - A timestamp is a tag that can be attached
to any transaction or any data item, denoting a specific time at which
the transaction or data item was activated in some way.
This protocol uses either the system time or a logical counter as a
timestamp. Every transaction has a timestamp associated with it, and
the ordering is determined by the age of the transaction.
The timestamp of a data item can be of the following two types:
(1) W-timestamp(Q): the latest time at which the data item
Q was written.
(2) R-timestamp(Q): the latest time at which the data item Q
was read.
How should timestamps be used?
For read operations: If a transaction T1 issues a read(X)
operation,
If TS(T1) < W-timestamp(X): operation rejected.
If TS(T1) >= W-timestamp(X): operation executed, and
the data item's timestamps are updated.
For write operations: If a transaction T1 issues a write(X)
operation,
If TS(T1) < R-timestamp(X): operation rejected.
If TS(T1) < W-timestamp(X): operation rejected and T1 rolled
back.
Otherwise: operation executed.
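The read and write rules above translate almost directly into code. The following Python sketch is one way to express them; the `DataItem` class and the `Rejected` exception are hypothetical names, timestamps are plain integers, and a rejected operation is signalled by raising the exception (a real scheduler would then restart the transaction with a new timestamp).

```python
class Rejected(Exception):
    """Raised when an operation arrives too late and must be rejected."""


class DataItem:
    def __init__(self, value=None):
        self.value = value
        self.r_ts = 0  # R-timestamp: latest timestamp that read the item
        self.w_ts = 0  # W-timestamp: latest timestamp that wrote the item


def read(item, ts):
    """read(X) issued by a transaction with timestamp ts."""
    if ts < item.w_ts:
        raise Rejected("a younger transaction has already written this item")
    item.r_ts = max(item.r_ts, ts)
    return item.value


def write(item, ts, value):
    """write(X) issued by a transaction with timestamp ts."""
    if ts < item.r_ts:
        raise Rejected("a younger transaction has already read this item")
    if ts < item.w_ts:
        raise Rejected("a younger transaction has already written this item")
    item.value = value
    item.w_ts = ts


x = DataItem(5)
read(x, 2)        # allowed; R-timestamp(X) becomes 2
write(x, 3, 7)    # allowed; W-timestamp(X) becomes 3
try:
    read(x, 1)    # TS = 1 < W-timestamp(X) = 3
except Rejected as reason:
    print("rejected:", reason)
```

The R-timestamp is advanced with `max` so that a late read by an older transaction never moves it backwards.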
Replication Facilities: -
- Replication is useful in improving the availability of data by copying data to
multiple sites.
- Either a relation or a fragment can be replicated at one or more sites.
- Fully redundant databases are those in which every site contains a copy of the
entire database.
- Depending on the availability and redundancy factor there are three types of
replications:
a. Full replication.
b. No replication.
c. Partial replication.
Full replication: -
The most extreme case is replication of the whole database at every site in the
distributed system.
This can improve availability remarkably because the system can continue to operate
as long as at least one site is up.
It also improves performance for retrieval of global queries as the result can be
obtained locally at any client.
Disadvantage: It slows down updates, since a single logical update must be applied to
every copy of the database to keep the copies consistent.
No replication: -
The other extreme from full replication involves having no replication—that is, each
fragment is stored at exactly one site.
In this case, all fragments must be disjoint, except for the repetition of primary keys
among vertical (or mixed) fragments.
This is also called ‘Non-redundant allocation.’
Partial Replication: -
Here some fragments of the database may be replicated whereas others may not.
The number of copies of each fragment can range from one up to the total number of
sites in the distributed system.
For example, mobile workers such as sales forces and financial planners carry partially
replicated databases on their laptops and synchronize them periodically with the server databases.
A description of the replication of fragments is sometimes called a replication
schema.
Each fragment—or each copy of a fragment—must be assigned to a particular site in
the distributed system. This process is called data distribution (or data allocation).
The choice of sites and the degree of replication depend on the performance and
availability goals of the system and on the types and frequencies of transactions
submitted at each site.
For example, if high availability is required, transactions can be submitted at any site,
and most transactions are retrieval only, a fully replicated database is a good choice.
However, if certain transactions that access particular parts of the database are mostly
submitted at a particular site, the corresponding set of fragments can be allocated at
that site only.
Data that is accessed at multiple sites can be replicated at those sites. If many updates
are performed, it may be useful to limit replication.
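A replication schema and the trade-off it encodes can be sketched in Python. The sketch assumes a schema is simply a mapping from each fragment to the set of sites holding a copy; the function names, fragment names, and site names are hypothetical.

```python
def classify(schema, sites):
    """Name the replication strategy that a replication schema describes."""
    copies = [len(holders) for holders in schema.values()]
    if all(count == len(sites) for count in copies):
        return "full replication"
    if all(count == 1 for count in copies):
        return "no replication"
    return "partial replication"


def copies_to_update(schema, fragment):
    """One logical update must be applied to every copy of the fragment."""
    return len(schema[fragment])


def is_available(schema, fragment, up_sites):
    """A fragment can be read as long as one site holding a copy is up."""
    return any(site in up_sites for site in schema[fragment])


sites = {"A", "B", "C"}
schema = {"employees": {"A", "B"}, "departments": {"C"}}
print(classify(schema, sites))
print(copies_to_update(schema, "employees"))
print(is_available(schema, "departments", {"A", "B"}))
```

More copies of a fragment make `is_available` more likely to succeed but raise `copies_to_update`, which is why heavily updated data is usually replicated less.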
Recovery Facilities: -
- Recovery protocols bring failed nodes back online.
- Effectiveness of recovery protocol affects availability of the database.
- The following recovery methods are available:
1 Salvation Program: - A post-crash process that tries to restore the DB to a
valid state. No recovery data is used.
2 Incremental Dumping: - Copies updated files to archival storage. Performed
either after transaction completion or at regular intervals.
3 Audit Trail: - Keeps track of a sequence of actions. Useful for restoring the DB
to its pre-crash state.
4 Differential Files: - A separate file records the updates requested for records in a
main file.
5 Backup/Current Version: - The current version of the DB is stored in the currently
existing files, which hold the present values.
6 Multiple Copies: - Multiple identical copies of the DB files are maintained.
7 Careful Replacement: - The update is performed on a copy, and the original is deleted
upon commit. The original copy remains available after a crash during the update.
- Goals when dealing with recovery: -
(1) Lower the time to recover.
(2) Reduce the amount of recovery data to be transferred from active nodes.
(3) Support log-based and version-based recovery.
(4) Support for the amnesia phenomenon.
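Of the methods listed above, careful replacement is easy to illustrate on an ordinary file. The Python sketch below is an assumption-laden analogy rather than how any particular DBMS does it: the `careful_replace` function is hypothetical, the "database" is a single text file, and the commit step relies on `os.replace`, which swaps the new file in for the original in one step.

```python
import os
import tempfile


def careful_replace(path, new_contents):
    """Write the update to a copy, then swap the copy in as the commit step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_contents)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the copy is on disk first
        os.replace(tmp_path, path)  # commit: the original is replaced
    except BaseException:
        # The update failed before commit: drop the copy, keep the original.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


workdir = tempfile.mkdtemp()
db_file = os.path.join(workdir, "accounts.txt")
with open(db_file, "w") as f:
    f.write("balance=100")
careful_replace(db_file, "balance=80")
with open(db_file) as f:
    print(f.read())
```

If the process crashes while the copy is being written, the original file is untouched, which is exactly the property the method is named for.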
Harbor
• Recovery technique for “updatable warehouse”-like systems.
• Queries active remote nodes.
• Timestamps determine which tuples to copy or update.
• Allows non-DBA transactions while recovering.
• Lower runtime overhead.
• Performance comparable to ARIES.
• Does not require stable log.
• Exploits replication to support recovery.
• Exploits historical queries.
• Supports recovery in warehouse-like systems that require fine-granularity insertions
and updates.
• Uses versioning and “time travel.”
• Replicas are kept consistent up to some historical point using checkpointing.
• Replication need not be physically identical, but must logically represent the same
data.
• Provides K-safety, i.e. tolerates K simultaneous site failures.
• Augments the tuples with Insert- and Delete-Time to provide versioning.
• Three-stage algorithm:
- Restore to last checkpoint
- Update with Historical Queries
- Update to current time
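The role of insert and delete times in this style of recovery can be sketched in Python. This is a loose, simplified illustration and not the actual Harbor implementation: the `Version` record, the `as_of` and `catch_up` functions, and the integer timestamps are all assumptions, and the third stage (catching up to the current time while new updates arrive) is omitted.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Version:
    key: str
    value: int
    insert_time: int
    delete_time: Optional[int] = None  # None: the tuple is still live


def as_of(rows, t):
    """Historical ("time travel") query: the tuples visible at time t."""
    return {r.key: r.value for r in rows
            if r.insert_time <= t
            and (r.delete_time is None or r.delete_time > t)}


def catch_up(local, replica, checkpoint):
    """Rebuild a failed site's rows from its checkpoint and a live replica."""
    # Stage 1: restore to the last checkpoint by discarding later changes.
    restored = []
    for r in local:
        if r.insert_time > checkpoint:
            continue
        keep = r.delete_time is not None and r.delete_time <= checkpoint
        restored.append(replace(r, delete_time=r.delete_time if keep else None))
    # Stage 2: ask the replica what was deleted or inserted after the checkpoint.
    deleted_later = {(r.key, r.insert_time): r.delete_time for r in replica
                     if r.insert_time <= checkpoint
                     and r.delete_time is not None
                     and r.delete_time > checkpoint}
    for r in restored:
        if (r.key, r.insert_time) in deleted_later:
            r.delete_time = deleted_later[(r.key, r.insert_time)]
    restored.extend(replace(r) for r in replica if r.insert_time > checkpoint)
    return restored


replica = [Version("a", 1, 1, 7), Version("b", 2, 3), Version("c", 3, 6)]
local = [Version("a", 1, 1), Version("b", 2, 3), Version("d", 9, 6)]
recovered = catch_up(local, replica, checkpoint=5)
print(as_of(recovered, 8) == as_of(replica, 8))
```

Because nothing is ever overwritten in place, the recovering site only needs timestamps, not a log, to work out which tuples to copy or mark as deleted.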
FUNCTIONS OF DISTRIBUTED DBMS
A DDBMS governs the storage and processing of logically related data over
interconnected computer systems in which both data and processing functions are
distributed among several sites. A DBMS must have at least the following functions
to be classified as distributed:
- Application interface to interact with the end user, application programs,
and other DBMSs within the distributed database.
- Validation to analyse data requests for syntax correctness.
- Transformation to decompose complex requests into atomic data request
components.
- Query optimization to find the best access strategy. (Which database
fragments must be accessed by the query, and how must data updates, if
any, be synchronized?)
- Mapping to determine the data location of local and remote fragments.
- I/O interface to read or write data from or to permanent local storage.
- Formatting to prepare the data for presentation to the end user or to an
application program.
- Security to provide data privacy at both local and remote databases.
- Backup and recovery to ensure the availability and recoverability of the
database in case of a failure.
- DB administration features for the database administrator.
- Concurrency control to manage simultaneous data access and to ensure
data consistency across database fragments in the DDBMS.
- Transaction management to ensure that the data moves from one consistent
state to another. This activity includes the synchronization of local and
remote transactions as well as transactions across multiple distributed
segments.
ADVANTAGES OF DISTRIBUTED DBMS
1. Data are located near the greatest demand site.
The data in a distributed database system are dispersed to match business
requirements, which reduces the cost of data access.
2. Faster data access.
End users often work with only a locally stored subset of the company’s data.
3. Faster data processing.
A distributed database system spreads out the system's workload by processing
data at several sites.
4. Growth facilitation.
New sites can be added to the network without affecting the operations of
other sites.
5. Improved communications.
Because local sites are smaller and located closer to customers, local sites
foster better communication among departments and between customers and
company staff.
6. Reduced operating costs.
It is more cost-effective to add workstations to a network than to update a
mainframe system. Development work is done more cheaply and more quickly
on low-cost PCs than on mainframes.
7. User-friendly interface.
PCs and workstations are usually equipped with an easy-to-use graphical user
interface (GUI). The GUI simplifies training and use for end users.
8. Less danger of a single-point failure.
When one of the computers fails, the workload is picked up by other
workstations. Data are also distributed at multiple sites.
9. Processor independence.
The end user is able to access any available copy of the data, and an end user's
request is processed by any processor at the data location.
DISADVANTAGES OF DISTRIBUTED DBMS
1. Complexity of management and control.
Applications must recognize data location, and they must be able to stitch
together data from various sites. Database administrators must have the ability
to coordinate database activities to prevent database degradation due to data
anomalies.
2. Technological difficulty.
Data integrity, transaction management, concurrency control, security, backup,
recovery, query optimization, access path selection, and so on, must all be
addressed and resolved.
3. Security.
The probability of security lapses increases when data are located at multiple
sites. The responsibility of data management will be shared by different people
at several sites.
4. Lack of standards.
There are no standard communication protocols at the database level.
(Although TCP/IP is the de facto standard at the network level, there is no
standard at the application level.) For example, different database vendors
employ different—and often incompatible—techniques to manage the
distribution of data and processing in a DDBMS environment.
5. Increased storage and infrastructure requirements.
Multiple copies of data are required at different sites, thus requiring additional
disk storage space.
6. Increased training cost.
Training costs are generally higher in a distributed model than they would be
in a centralized model, sometimes even to the extent of offsetting operational
and hardware savings.
7. Costs.
Distributed databases require duplicated infrastructure to operate (physical
location, environment, personnel, software, licensing, etc.)