3. DATA FRAGMENTATION, REPLICATION, AND ALLOCATION
TECHNIQUES FOR DISTRIBUTED DATABASE DESIGN
⚫ Fragmentation: Breaking up the database into logical units called
fragments, which may be assigned for storage at various sites.
⚫ Data replication: The process of storing copies of fragments at more than one
site.
⚫ Data Allocation: The process of assigning a particular fragment to a
particular site in a distributed system.
The information concerning the data fragmentation, allocation and
replication is stored in a global directory.
4. DATA FRAGMENTATION
➢ Breaking up a single object (a user’s database, a system database,
or a table) into two or more segments called fragments, which are
assigned for storage at various sites.
➢ Types of Fragmentation
➢ Horizontal Fragmentation
➢ Vertical Fragmentation
➢ Mixed (Hybrid) Fragmentation
➢ Fragmentation Schema
➢ Definition of a set of fragments that include all attributes and tuples in the
database
➢ The whole database can be reconstructed from the fragments
5. ⚫ Data can be fragmented according to the needs of the database and
its users. Whatever scheme is chosen, the fragmentation should
satisfy three correctness rules:
➢ Completeness: Every data item in the table must appear in at least one
fragment, so that no data is lost by fragmenting.
➢ Reconstruction: Combining all the fragments must give back the whole
table’s data. That is, the whole table can be reconstructed
from its fragments.
➢ Disjointness: The fragments should not contain overlapping data
(apart from the primary key in vertical fragments). Overlap makes it difficult to maintain the consistency of the data.
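The three rules can be checked mechanically for a horizontal fragmentation. The following is a minimal sketch, assuming a tiny in-memory EMPLOYEE relation held as a list of dicts; the sample rows and the helper names (`fragment`, `is_complete`, `is_disjoint`, `reconstructs`) are hypothetical and only illustrate the definitions.

```python
# Hypothetical sample relation: each tuple is a dict keyed by column name.
EMPLOYEE = [
    {"EMP_ID": 1, "EMP_LOCATION": "INDIA"},
    {"EMP_ID": 2, "EMP_LOCATION": "USA"},
    {"EMP_ID": 3, "EMP_LOCATION": "UK"},
]


def fragment(relation, predicates):
    """Build one horizontal fragment per selection predicate."""
    return [[row for row in relation if pred(row)] for pred in predicates]


def key_set(rows):
    """Primary-key values of a set of tuples."""
    return {row["EMP_ID"] for row in rows}


def is_complete(relation, fragments):
    """Every tuple of the relation appears in at least one fragment."""
    covered = set().union(*(key_set(f) for f in fragments))
    return key_set(relation) <= covered


def is_disjoint(fragments):
    """No tuple appears in more than one fragment."""
    total = sum(len(key_set(f)) for f in fragments)
    return total == len(set().union(*(key_set(f) for f in fragments)))


def reconstructs(relation, fragments):
    """The union of the fragments gives back exactly the original relation."""
    rebuilt = {tuple(sorted(row.items())) for f in fragments for row in f}
    original = {tuple(sorted(row.items())) for row in relation}
    return rebuilt == original


# One predicate per location, as in a location-based fragmentation.
by_location = [
    lambda row, loc=loc: row["EMP_LOCATION"] == loc
    for loc in ("INDIA", "USA", "UK")
]
good = fragment(EMPLOYEE, by_location)

# Dropping the UK predicate loses a tuple, so completeness fails.
incomplete = fragment(EMPLOYEE, by_location[:2])
```

Here `good` passes all three checks, while `incomplete` fails completeness and reconstruction because the UK tuple belongs to no fragment.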
6. HORIZONTAL FRAGMENTATION
➢ A horizontal fragment is a subset of the tuples in a relation
➢ The tuples are specified by a condition on one or more attributes of the relation
➢ Divides a relation horizontally by grouping rows to create subsets of tuples
7. Example
⚫ Consider the employees working at the different locations of an
organization, such as India, USA, and UK. The number of employees
at each of these locations is large, so the table can be split by location:
SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'INDIA';
SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'USA';
SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = 'UK';
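As a rough illustration, the location-based split can be tried out with Python's built-in sqlite3 module. The column list, the sample rows, and the fragment table names (`EMP_INDIA`, `EMP_USA`, `EMP_UK`) are assumptions made for this sketch; in a real distributed database each fragment would live at a different site rather than in one in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE ("
    "EMP_ID INTEGER PRIMARY KEY, EMP_NAME TEXT, EMP_LOCATION TEXT)"
)
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [(1, "Asha", "INDIA"), (2, "Bob", "USA"), (3, "Carol", "UK"), (4, "Dev", "INDIA")],
)

# One horizontal fragment per location; the locations are fixed constants,
# so building the statement text with an f-string is safe here.
for loc in ("INDIA", "USA", "UK"):
    conn.execute(
        f"CREATE TABLE EMP_{loc} AS "
        f"SELECT * FROM EMPLOYEE WHERE EMP_LOCATION = '{loc}'"
    )

# Reconstruction: the UNION of the horizontal fragments.
rebuilt = sorted(
    conn.execute(
        "SELECT * FROM EMP_INDIA "
        "UNION SELECT * FROM EMP_USA "
        "UNION SELECT * FROM EMP_UK"
    ).fetchall()
)
original = sorted(conn.execute("SELECT * FROM EMPLOYEE").fetchall())
```

Comparing `rebuilt` with `original` confirms that the union of the three fragments returns every row of EMPLOYEE.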
8. VERTICAL FRAGMENTATION
➢ A vertical fragment keeps only certain attributes of a relation
➢ Divides a relation vertically by columns
➢ Each vertical fragment must include the primary key.
➢ The full relation can be reconstructed from the fragments
⚫ Vertical fragmentation can be used to enforce privacy of data.
9. Example
⚫ Consider the EMPLOYEE table with ID, Name, Address,
Age, Location, DeptID, ProjID.
Vertical fragmentation divides this table into several
tables, each holding the key and one or more other columns of
EMPLOYEE.
SELECT EMP_ID, EMP_FIRST_NAME, EMP_LAST_NAME, AGE FROM EMPLOYEE;
SELECT EMP_ID, STREETNUM, TOWN, STATE, COUNTRY, PIN FROM EMPLOYEE;
SELECT EMP_ID, DEPTID FROM EMPLOYEE;
SELECT EMP_ID, PROJID FROM EMPLOYEE;
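A small sqlite3 sketch shows why the key is repeated in every vertical fragment: it is the only way to line the columns back up. The reduced column set, the sample rows, and the fragment names (`EMP_PERSONAL`, `EMP_DEPT`) are assumptions for illustration only.

```python
import sqlite3

vdb = sqlite3.connect(":memory:")
vdb.execute(
    "CREATE TABLE EMPLOYEE ("
    "EMP_ID INTEGER PRIMARY KEY, EMP_FIRST_NAME TEXT, "
    "EMP_LAST_NAME TEXT, AGE INTEGER, DEPTID INTEGER)"
)
vdb.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?)",
    [(1, "Asha", "Rao", 31, 10), (2, "Bob", "Lee", 45, 20)],
)

# Two vertical fragments; both keep the primary key EMP_ID.
vdb.execute(
    "CREATE TABLE EMP_PERSONAL AS "
    "SELECT EMP_ID, EMP_FIRST_NAME, EMP_LAST_NAME, AGE FROM EMPLOYEE"
)
vdb.execute("CREATE TABLE EMP_DEPT AS SELECT EMP_ID, DEPTID FROM EMPLOYEE")

# Reconstruction: join the fragments on the shared primary key.
v_rebuilt = vdb.execute(
    "SELECT p.EMP_ID, p.EMP_FIRST_NAME, p.EMP_LAST_NAME, p.AGE, d.DEPTID "
    "FROM EMP_PERSONAL p JOIN EMP_DEPT d ON d.EMP_ID = p.EMP_ID "
    "ORDER BY p.EMP_ID"
).fetchall()
v_original = vdb.execute("SELECT * FROM EMPLOYEE ORDER BY EMP_ID").fetchall()
```

Without EMP_ID in `EMP_DEPT` there would be no way to tell which department belongs to which employee.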
10. MIXED (Hybrid) FRAGMENTATION
⚫ This is the most flexible fragmentation technique since it generates
fragments with minimal extraneous information. However,
reconstruction of the original table is often an expensive task.
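⚫ Intermixes the two types of fragmentation.
⚫ Hybrid fragmentation can be done in two alternative ways: horizontal
fragmentation followed by vertical fragmentation of the resulting fragments,
or vertical fragmentation followed by horizontal fragmentation.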
11. Example
SELECT EMP_ID, EMP_FIRST_NAME, EMP_LAST_NAME, AGE FROM EMPLOYEE
WHERE EMP_LOCATION = 'INDIA';
SELECT EMP_ID, DEPTID FROM EMPLOYEE
WHERE EMP_LOCATION = 'INDIA';
SELECT EMP_ID, EMP_FIRST_NAME, EMP_LAST_NAME, AGE FROM EMPLOYEE
WHERE EMP_LOCATION = 'USA';
SELECT EMP_ID, PROJID FROM EMPLOYEE
WHERE EMP_LOCATION = 'USA';
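The sketch below applies horizontal and then vertical fragmentation with sqlite3 and rebuilds the table by joining the vertical pieces of each location and taking the union over locations. It is an assumption-laden illustration: the column set is reduced, the fragment names (`NAME_INDIA`, `DEPT_INDIA`, and so on) are hypothetical, and every location keeps the same vertical split so that the fragmentation is complete.

```python
import sqlite3

mdb = sqlite3.connect(":memory:")
mdb.execute(
    "CREATE TABLE EMPLOYEE ("
    "EMP_ID INTEGER PRIMARY KEY, EMP_FIRST_NAME TEXT, "
    "DEPTID INTEGER, EMP_LOCATION TEXT)"
)
mdb.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [(1, "Asha", 10, "INDIA"), (2, "Bob", 20, "USA"), (3, "Dev", 10, "INDIA")],
)

LOCATIONS = ("INDIA", "USA")  # fixed constants, safe to place in statement text

for loc in LOCATIONS:
    # Horizontal step: restrict to one location.
    # Vertical step: split the columns, keeping EMP_ID in both pieces.
    mdb.execute(
        f"CREATE TABLE NAME_{loc} AS "
        f"SELECT EMP_ID, EMP_FIRST_NAME, EMP_LOCATION FROM EMPLOYEE "
        f"WHERE EMP_LOCATION = '{loc}'"
    )
    mdb.execute(
        f"CREATE TABLE DEPT_{loc} AS "
        f"SELECT EMP_ID, DEPTID FROM EMPLOYEE WHERE EMP_LOCATION = '{loc}'"
    )

# Reconstruction: join per location, then UNION the per-location results.
per_location = " UNION ".join(
    f"SELECT n.EMP_ID, n.EMP_FIRST_NAME, d.DEPTID, n.EMP_LOCATION "
    f"FROM NAME_{loc} n JOIN DEPT_{loc} d ON d.EMP_ID = n.EMP_ID"
    for loc in LOCATIONS
)
m_rebuilt = sorted(mdb.execute(per_location).fetchall())
m_original = sorted(mdb.execute("SELECT * FROM EMPLOYEE").fetchall())
```

The reconstruction needs one join per location plus a union, which hints at why rebuilding a table from mixed fragments tends to be expensive.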
12. DATA FRAGMENTATION
➢ Complete Horizontal Fragmentation
➢ Set of horizontal fragments that include all the tuples in a relation
➢ To reconstruct a relation, apply the UNION to the horizontal fragments
➢ Complete Vertical Fragmentation
➢ Set of vertical fragments whose projection lists include all the attributes but
share only the primary key attribute
➢ To reconstruct a relation, apply the OUTER UNION (or a FULL OUTER JOIN on the primary key) to the vertical fragments
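OUTER UNION
combines relations that share only some attributes: tuples that agree on the
shared attributes (here, the primary key) are merged into one tuple, and
attributes missing from a tuple are filled with NULL.
UNION
produces all unique rows from both queries.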
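The relational-algebra OUTER UNION used to rebuild a relation from vertical fragments can be sketched in a few lines of Python. This is an illustrative approximation, not a DBMS implementation: fragments are lists of dicts, `None` stands in for NULL, and the function name `outer_union` and the sample rows are hypothetical.

```python
def outer_union(key, *fragments):
    """Merge rows that share the same key; pad missing attributes with None."""
    # Collect every attribute name, in first-seen order.
    columns = []
    for frag in fragments:
        for row in frag:
            for col in row:
                if col not in columns:
                    columns.append(col)

    merged = {}
    for frag in fragments:
        for row in frag:
            # Start from an all-None tuple, then fill in what this row knows.
            target = merged.setdefault(row[key], dict.fromkeys(columns))
            target.update(row)
    return list(merged.values())


personal = [{"EMP_ID": 1, "AGE": 30}, {"EMP_ID": 2, "AGE": 41}]
dept = [{"EMP_ID": 1, "DEPTID": 10}]

combined = outer_union("EMP_ID", personal, dept)
```

Employee 1 is merged into a single tuple with both AGE and DEPTID, while employee 2, who is missing from `dept`, keeps a `None` in DEPTID.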
13.
14. Exercise
Given relation EMP, let
p1: TITLE < “Programmer” and
p2: TITLE > “Programmer” be two simple predicates.
Assume that character strings are ordered
alphabetically.
(a) Perform a horizontal fragmentation of relation EMP with respect to p1 and p2.
(b) Does the resulting fragmentation (EMP1, EMP2) fulfill the correctness rules of
fragmentation? Explain why.
15. DATA REPLICATION
➢ Process of storing data in more than one site
➢ Replication Schema
➢ Description of the replication of fragments
Types of Replications:
1. Fully replicated distributed database: Replicating the whole
database at every site
➢ Improves availability
➢ Improves performance of retrieval
➢ Queries can be executed faster, since they can be answered locally.
➢ Concurrency control is difficult to achieve.
➢ Updates are slow, since every copy must be changed.
16. DATA REPLICATION (contd…)
2. No replication distributed database
➢ Each fragment is stored at exactly one site
➢ All fragments must be disjoint, except for primary keys repeated among vertical fragments
➢ Also called non-redundant allocation
3. Partial Replication
➢ Some fragments (only the modified data) may be replicated while
others may not
➢ The number of copies of each fragment can range from one to the total number of sites in the
distributed system
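The three types differ only in how many copies of each fragment exist. As a minimal sketch, assuming a replication schema is written as a mapping from fragment name to the set of sites holding a copy, the type can be read off directly. The function name `replication_kind` and the site and fragment names are hypothetical.

```python
def replication_kind(allocation, sites):
    """Classify a replication schema.

    allocation maps each fragment name to the set of sites holding a copy.
    """
    copies = [len(holders) for holders in allocation.values()]
    if all(n == len(sites) for n in copies):
        return "full replication"
    if all(n == 1 for n in copies):
        return "no replication"
    return "partial replication"


SITES = {"S1", "S2", "S3"}

full = {"EMP_INDIA": {"S1", "S2", "S3"}, "EMP_USA": {"S1", "S2", "S3"}}
none = {"EMP_INDIA": {"S1"}, "EMP_USA": {"S2"}}
partial = {"EMP_INDIA": {"S1", "S2"}, "EMP_USA": {"S3"}}
```

With a single site the first two cases coincide; the sketch reports that degenerate case as full replication.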
17. Advantages of Data Replication
⚫ Increases availability.
⚫ Increases reliability.
⚫ Increases performance.
⚫ Retrieval or modification of data becomes easier.
⚫ A consistent copy of the data is available at every node of the database.
⚫ Faster processing and execution time (fast response).
⚫ Less data movement over the network.
18. Disadvantages of Data Replication
⚫ Storage requirements:
More storage space is required, because every replica occupies space at
each site that holds it.
⚫ Complexity and cost of updating:
The cost of replication increases because every update must be applied at all sites that hold
a copy. The complexity of managing the data increases as well.
⚫ Hard to maintain the consistency of data.
19. Disadvantages of Data Replication (contd…)
Because of these disadvantages, data replication is favored where most
requests are read-only and where the data are relatively
static, as in catalogs, telephone directories, and train schedules.
Replication is not a favored approach for online applications such
as airline reservations, ATM transactions, and other financial
activities.
20. Factors that influence the decision to use data replication
⚫ Database size: The amount of data replicated will have an impact
on the storage requirements and the data transmission costs.
⚫ Usage frequency: The frequency of data usage determines how
frequently the data needs to be updated.
⚫ Costs: These include the costs of performance, software overhead, and
management associated with synchronizing transactions and their
components, weighed against the fault-tolerance benefits of
replicated data.
21. DATA ALLOCATION
➢ Each fragment, or each copy of a fragment, must be assigned to a
particular site (also called Data Distribution)
➢ The choice of sites and the degree of replication depend on
➢ Performance of the system
➢ Availability goals of the system
➢ Types of transactions: which attributes will be accessed by each of
those transactions.
➢ Frequencies of transactions submitted at any site
➢ Allocation Schema
➢ Describes the allocation of fragments to sites of the DDBs
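To make the role of transaction frequencies concrete, here is a deliberately simple sketch of building an allocation schema. It is a greedy heuristic assumed for illustration, not a method taken from these slides: each fragment goes to the site that reads it most often, and an optional threshold adds copies at other busy sites. The function name `allocate`, the parameter `min_reads_for_copy`, and the site names and frequencies are all hypothetical.

```python
def allocate(read_freq, min_reads_for_copy=None):
    """Return an allocation schema: fragment name -> set of sites.

    read_freq[fragment][site] is the number of reads issued at that site
    per unit of time (assumed to be known in advance).
    """
    schema = {}
    for frag, by_site in read_freq.items():
        # Home site: where the fragment is read most often.
        home = max(by_site, key=by_site.get)
        sites = {home}
        # Optional replication: copy to every site that reads it often enough.
        if min_reads_for_copy is not None:
            sites |= {s for s, n in by_site.items() if n >= min_reads_for_copy}
        schema[frag] = sites
    return schema


reads = {
    "EMP_INDIA": {"Delhi": 90, "NewYork": 5},
    "EMP_USA": {"Delhi": 40, "NewYork": 70},
}

non_redundant = allocate(reads)
with_copies = allocate(reads, min_reads_for_copy=40)
```

A realistic allocation would also weigh update frequencies, storage limits, and availability goals, which this sketch ignores.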
22. Data Delivery Alternatives
Data are “delivered” from the sites where they are stored to the site where the query is posed.
Delivery alternatives are characterized along three dimensions:
•Delivery modes
•Frequency
•Communication Methods
23. •Delivery modes
➡Pull-only:
• The transfer of data from servers to clients is initiated by a client pull.
• The arrival of new data items or updates to existing data items is carried out at a server without
notification to clients.
• Servers must be interrupted continuously to deal with requests from clients.
• Conventional DBMSs offer primarily pull-based data delivery.
➡Push-only:
• The transfer of data from servers to clients is initiated by a server push in the absence of any specific
request from clients.
• The main difficulty is in deciding which data would be of common interest, and when to send them to clients.
• The usefulness of server push depends heavily upon how accurately a server can predict the needs of
clients.
• Servers publish information to either an unbounded set of clients (random broadcast) or a selective set of
clients (multicast).
➡Hybrid: combines the client-pull and server-push mechanisms.
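The difference between the modes comes down to who initiates the transfer. The toy sketch below is an in-process illustration with no networking; the class name `Server`, its methods, and the callback-based subscription are assumptions chosen for brevity.

```python
class Server:
    """Toy data server that supports both pull and push delivery."""

    def __init__(self):
        self.data = {}
        self.subscribers = []  # client callbacks, used only for push

    def get(self, key):
        """Pull: the client asks and the server answers."""
        return self.data.get(key)

    def subscribe(self, callback):
        """Register a client to receive pushed updates."""
        self.subscribers.append(callback)

    def update(self, key, value, push=False):
        """Store a new value; notify clients only when pushing."""
        self.data[key] = value
        if push:
            for notify in self.subscribers:
                notify(key, value)


server = Server()
inbox = []  # what the client has received without asking
server.subscribe(lambda key, value: inbox.append((key, value)))

server.update("IBM", 100)            # pull-only: client is not notified
seen_before_pull = list(inbox)       # still empty
pulled = server.get("IBM")           # client must ask to learn the value

server.update("IBM", 101, push=True)  # push: server initiates the transfer
```

A hybrid system would use both paths: clients pull on demand and also receive selected updates by push.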
24. Data Delivery Alternatives (Cont.)
•Frequency
Used to classify the regularity of data delivery.
➡ Periodic: data are sent from the server to clients at regular intervals. These intervals
can be defined by system default or by clients using their profiles. Both pull and push can
be performed in periodic fashion. (Examples)
➡ Conditional: data are sent from servers whenever certain conditions installed by clients in their
profiles are satisfied. Mostly used in hybrid or push-only delivery systems. (Examples)
➡ Ad-hoc or irregular: delivery is performed mostly in a pure pull-based system.
Data are pulled from servers to clients in an ad-hoc fashion whenever clients request them.
25. Examples
Periodic delivery is carried out on a regular, pre-specified
repeating schedule. A client request for IBM’s stock price every
week is an example of a periodic pull.
An application that sends out a stock price listing on a regular
basis, say every morning, is an example of periodic push.
An application that sends out stock prices only when they
change is an example of conditional push.
Hybrid conditional push further assumes that missing
some update information is not important to the clients.
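The stock-price examples can be simulated over a short series of prices, one per time tick. This is a minimal sketch under assumed names (`periodic_push`, `conditional_push`) and made-up prices; it only shows which messages each policy would send.

```python
def periodic_push(prices, every):
    """Send the current price at every `every`-th tick, changed or not."""
    return [(tick, price) for tick, price in enumerate(prices) if tick % every == 0]


def conditional_push(prices):
    """Send a price only when it differs from the previously sent one."""
    sent, last = [], None
    for tick, price in enumerate(prices):
        if price != last:
            sent.append((tick, price))
            last = price
    return sent


prices = [100, 100, 102, 102, 102, 101]  # hypothetical price per tick

periodic = periodic_push(prices, every=2)
conditional = conditional_push(prices)
```

Periodic push repeats the unchanged price at tick 4 and misses the change at tick 5 until its next slot, while conditional push sends exactly one message per change.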
26. Data Delivery Alternatives (Cont.)
•Communication Methods
These methods determine the various ways in which servers and clients
communicate for delivering information to clients.
➡One-to-one (Unicast): the server sends data to one client using a particular delivery mode
with some frequency.
➡One-to-many: the server sends data to a number of clients. It may use a multicast or
broadcast protocol.