1. Advanced Database Management Systems: 30
Data Placement
Prof Neeraj Bhargava
Vaibhav Khanna
Department of Computer Science
School of Engineering and Systems Sciences
Maharshi Dayanand Saraswati University Ajmer
2. Data Placement
• One of the most important decisions a distributed
database designer has to make is data placement.
Proper data placement is a crucial factor in determining
the success of a distributed database system.
• There are four basic alternatives: namely,
– centralized,
– replicated,
– partitioned, and
– hybrid.
• Some of these require additional analysis to fine-tune the
placement of data.
3. Locality of Data Reference
• In deciding among data placement alternatives, the
following factors need to be considered:
• Locality of Data Reference. The data should be placed
at the site where it is used most often. The designer
studies the applications to identify the sites where they
are performed, and attempts to place the data in such a
way that most accesses are local.
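
A minimal sketch of this idea, assuming the designer has already collected a log of which site accessed which data item. The log format, the site names, and the function `place_by_locality` are hypothetical and serve only as illustration:

```python
from collections import Counter, defaultdict

def place_by_locality(accesses):
    """Assign each data item to the site that accesses it most often.

    accesses: iterable of (site, item) pairs, one per observed access.
    Returns {item: (best_site, fraction_of_accesses_that_become_local)}.
    """
    counts = defaultdict(Counter)
    for site, item in accesses:
        counts[item][site] += 1

    placement = {}
    for item, per_site in counts.items():
        best_site, hits = per_site.most_common(1)[0]
        placement[item] = (best_site, hits / sum(per_site.values()))
    return placement

# Hypothetical access log gathered from studying the applications.
log = [
    ("site_A", "students"), ("site_A", "students"), ("site_B", "students"),
    ("site_B", "orders"), ("site_B", "orders"), ("site_B", "orders"),
    ("site_A", "orders"),
]
placement = place_by_locality(log)
for item, (site, local) in placement.items():
    print(f"{item}: place at {site} ({local:.0%} of accesses local)")
```

A real design study would also weigh the other factors on the following slides, so this ranking is only a starting point.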
4. Reliability of the Data
• Reliability of the Data. By storing multiple
copies of the data in geographically remote
sites, the designer maximizes the probability
that the data will be recoverable in case of
physical damage to any site.
• Data Availability. As with reliability, storing
multiple copies assures users that data items
will be available to them, even if the site from
which the items are normally accessed is
unavailable due to failure of the node or its
only link.
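
The effect of extra copies can be illustrated with a small calculation. This is a sketch under a simplifying assumption that is not in the slides: every site fails independently with the same probability, and a data item is available as long as at least one copy is reachable. The function name `availability` is hypothetical.

```python
def availability(site_failure_prob, copies):
    """Probability that at least one of `copies` replicas is reachable.

    Assumes independent site failures with identical probability,
    which is a simplification for illustration only.
    """
    if not 0.0 <= site_failure_prob <= 1.0:
        raise ValueError("site_failure_prob must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The item is lost only if every site holding a copy is down.
    return 1.0 - site_failure_prob ** copies

for k in (1, 2, 3):
    print(k, "copies:", round(availability(0.1, k), 4))
```

Under these assumptions each additional copy reduces the chance of the data being unreachable by a factor equal to the site failure probability.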
5. Storage Capacities and Costs
• Storage Capacities and Costs. Nodes can have
different storage capacities and storage costs that must
be considered in deciding where data should be kept.
Storage costs are minimized when a single copy of each
data item is kept, but the plunging costs of data storage
make this consideration less important.
6. Distribution of Processing Load
• Distribution of Processing Load. One of
the reasons for choosing a distributed
system is to distribute the workload so that
processing power will be used most
effectively.
• This objective must be balanced against
locality of data reference.
7. Communications Costs
• Communications Costs. The designer must
consider the cost of using the
communications network to retrieve data.
• Retrieval costs and retrieval time are
minimized when each site has its own copy of
all the data.
• However, when the data is updated, the
changes must then be sent to all sites.
• If the data is very volatile, this results in high
communications costs for update
synchronization.
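
The trade-off between retrieval and update traffic can be sketched with a rough message count. The model below is an assumption made for illustration: requests are spread evenly across sites, and every remote read or propagated update costs exactly one message. The function `network_messages` is hypothetical.

```python
def network_messages(n_sites, reads, updates, replicated):
    """Estimate the number of network messages for a workload.

    Assumes requests originate evenly from all sites and each remote
    read or propagated update costs one message (illustrative model).
    """
    if replicated:
        # Every read is local; each update is sent to the other n-1 copies.
        return updates * (n_sites - 1)
    # Single copy at one site: the fraction (n-1)/n of requests is remote.
    remote_fraction = (n_sites - 1) / n_sites
    return (reads + updates) * remote_fraction

# Mostly-read data: full replication needs far fewer messages.
print(network_messages(4, reads=1000, updates=10, replicated=True))
print(network_messages(4, reads=1000, updates=10, replicated=False))

# Volatile data: update synchronization dominates the cost.
print(network_messages(4, reads=1000, updates=1000, replicated=True))
print(network_messages(4, reads=1000, updates=1000, replicated=False))
```

In this simple model, replication wins when reads dominate and loses once updates become frequent, which matches the point made on the slide.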
8. Centralized
• Centralized. This alternative consists of a single database and DBMS
stored in one location, with users distributed. There is no need for a DDBMS
or global data dictionary, because there is no real distribution of data, only
of processing.
• Retrieval costs are high, because all users, except those at the central site,
use the network for all accesses.
• Storage costs are low, since only one copy of each item is kept.
• There is no need for update synchronization, and the standard concurrency
control mechanism is sufficient.
• Reliability is low and availability is poor, because a failure at the central
node results in the loss of the entire system.
• The workload can be distributed, but remote nodes need to access the
database to perform applications, so locality of data reference is low.
• This alternative is not a true distributed database system.
10. Replicated
• Replicated. With this alternative, a complete copy of the database is kept at
each node.
• Advantages are maximum locality of reference, reliability, data availability,
and processing load distribution.
• Storage costs are highest in this alternative.
• Communications costs for retrievals are low, but the cost of updates is high,
since every site must receive every update.
• If updates are very infrequent, this alternative is a good one.
11. Partitioned
• Partitioned. In this alternative there is only one copy of each data item,
but the data is distributed across nodes.
• To allow this, the database is split into disjoint fragments or parts. If the
database is a relational one, fragments can be vertical table subsets
(formed by projection) or horizontal subsets (formed by selection) of
global relations.
• In any horizontal fragmentation scheme, each tuple of every relation
must be assigned to one or more fragments such that taking the union of
the fragments results in the original relation; for the horizontally partitioned
case, a tuple is assigned to exactly one fragment.
• In a vertical fragmentation scheme, the projections must be lossless, so
that the original relations can be reconstructed by taking the join of the
fragments.
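
The two fragmentation styles and their reconstruction rules can be sketched as follows. This is an illustrative example only: the `employees` relation, its attributes, and the helper functions are hypothetical, and relations are modeled as plain lists of dictionaries rather than real DBMS tables.

```python
# Hypothetical global relation, modeled as a list of rows.
employees = [
    {"id": 1, "name": "Asha", "city": "Ajmer", "salary": 50000},
    {"id": 2, "name": "Ravi", "city": "Delhi", "salary": 62000},
    {"id": 3, "name": "Meena", "city": "Ajmer", "salary": 58000},
]

def horizontal_fragments(relation, predicates):
    """Selection: put each tuple in the first fragment whose predicate holds,
    so every tuple lands in exactly one fragment (the partitioned case)."""
    fragments = {name: [] for name in predicates}
    for row in relation:
        for name, pred in predicates.items():
            if pred(row):
                fragments[name].append(row)
                break
        else:
            raise ValueError(f"no fragment accepts {row}")
    return fragments

def vertical_fragment(relation, key, attributes):
    """Projection that always keeps the key, so the later join is lossless."""
    return [{a: row[a] for a in (key, *attributes)} for row in relation]

def join_on_key(left, right, key):
    """Rebuild rows by joining two vertical fragments on their shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**l, **right_by_key[l[key]]} for l in left]

# Horizontal: the union of the fragments gives back the original relation.
h = horizontal_fragments(employees, {
    "ajmer": lambda r: r["city"] == "Ajmer",
    "other": lambda r: r["city"] != "Ajmer",
})
union = sorted(h["ajmer"] + h["other"], key=lambda r: r["id"])

# Vertical: the join of the fragments gives back the original relation.
v1 = vertical_fragment(employees, "id", ["name", "city"])
v2 = vertical_fragment(employees, "id", ["salary"])
joined = join_on_key(v1, v2, "id")

print(union == employees, joined == employees)
```

Keeping the key in every vertical fragment is the design choice that makes the join lossless in this sketch.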
12. Hybrid
• Hybrid. In this alternative, different portions of the database are distributed
differently.
• For example, those records with high locality of reference are partitioned,
while those commonly used by all nodes are replicated, if updates are
infrequent.
• Those that are needed by all nodes, but updated so frequently that
synchronization would be a problem, might be centralized.
• This alternative is designed to optimize data placement, so that the
advantages of the other methods can be obtained while their disadvantages
are avoided.
• However, very careful analysis of data and processing is required with this
plan.
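
The hybrid guidelines above can be expressed as a simple decision rule. This is a sketch only: the function `choose_placement`, its inputs, and both threshold values are assumptions chosen for illustration, and a real design would need the careful analysis the slide calls for.

```python
def choose_placement(local_fraction, needed_by_all, updates_per_hour,
                     locality_threshold=0.8, update_threshold=100):
    """Suggest a placement for one portion of the database.

    local_fraction: share of accesses coming from a single site.
    needed_by_all: True if every node uses this data.
    updates_per_hour: rough measure of how volatile the data is.
    Both thresholds are hypothetical tuning values.
    """
    if local_fraction >= locality_threshold:
        return "partitioned"      # high locality of reference
    if needed_by_all and updates_per_hour <= update_threshold:
        return "replicated"       # shared by all nodes, rarely updated
    if needed_by_all:
        return "centralized"      # shared, but too volatile to synchronize
    return "needs further analysis"

print(choose_placement(0.95, False, 500))
print(choose_placement(0.30, True, 5))
print(choose_placement(0.30, True, 5000))
```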