www.folio3.com
Distributed Databases
Name: Mobeen Ahmed
Designation: Lead Software Engineer
Agenda
• Introduction to distributed databases
• Distributed DBMS (DDBMS)
• Types of DDBMS
• Issues in Distributed Database Design
• Replication
• Types of Replication
• The Publisher/Subscriber Metaphor
• Publication Limitations
• Push Subscriptions
• Pull Subscriptions
• Hands On Lab
Why distributed databases?
• Some initial motivations:
– The development of computer networks promotes
decentralization.
– In a company, the database organization might reflect
the organizational structure, which is distributed into
units. Each unit maintains its own database.
• Sharing of data can be achieved by developing a
distributed database system which:
– makes data accessible by all units
– stores data close to where it is most frequently used
Distributed Database
• A logically interrelated collection of shared data
(and a description of this data), physically
distributed over a computer network.
Distributed DBMS (DDBMS)
• Software system that permits the management of
the distributed database and makes the
distribution transparent to users.
Advantages of DDBMSs
• Reflects Organizational Structure
• Improved Sharing and Local Autonomy
• Improved Availability
– A failure does not make the entire system inoperable
• Improved Reliability
– Data may be replicated
• Improved Performance
– Data are local to the site of “greatest demand”
• Economics
– Many small computers cost less than a big one!
• Modular Growth
– easy to add new modules
Disadvantages of DDBMSs
• Complexity
• Cost
– Especially in system management
• Security
– network must be made secure
• Integrity Control More Difficult
• Lack of Standards
• Lack of Experience
Types of DDBMS
• Homogeneous DDBMS
– All sites use the same DBMS product (e.g., SQL Server or
Oracle)
– Fairly easy to design and manage.
• Heterogeneous DDBMS
– Sites may run different DBMS products (e.g., Oracle and
Ingres)
– Possibly different underlying data models (e.g., relational
DB and OO database)
Issues in Distributed Database Design
Three key issues we have to consider:
• Data Allocation:
– Where are data placed? Data should be stored at the
sites that give an "optimal" distribution.
• Fragmentation:
– A relation may be divided into a number of sub-relations
(called fragments), which are stored at different sites.
• Replication:
– A copy of a fragment may be maintained at several sites.
Data Allocation
• Four strategies regarding placement of data:
– Centralized
• Consists of a single database stored at one site, with
users distributed across the network. (This is not a
DDB but distributed processing.)
– Partitioned (or Fragmented)
• Database partitioned into disjoint fragments, each
fragment assigned to one site.
– Complete Replication
• Consists of maintaining a complete copy of the database at
each site.
– Selective Replication
• Combination of partitioning, replication, and
centralization.
Fragmentation
• A relation R is divided into fragments r1, r2, …, rn,
which contain enough information to allow
reconstruction of R
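The reconstruction property can be sketched in a few lines of Python. This is only an illustration: the relation R, its column names, and the helper functions are assumptions, not part of the original slides. Horizontal fragments split R by rows and are rebuilt with a union; vertical fragments split R by columns (each keeping the primary key) and are rebuilt with a join on that key.

```python
# Hypothetical relation R; emp_id is assumed to be the primary key.
R = [
    {"emp_id": 1, "name": "A", "city": "Karachi", "salary": 100},
    {"emp_id": 2, "name": "B", "city": "Lahore", "salary": 200},
    {"emp_id": 3, "name": "C", "city": "Karachi", "salary": 300},
]

def horizontal_fragments(rows, column):
    """Split rows into disjoint fragments, one per value of `column`."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments

def reconstruct_horizontal(fragments, pk):
    """Rebuild the relation as the union of its horizontal fragments."""
    rows = [row for fragment in fragments.values() for row in fragment]
    return sorted(rows, key=lambda r: r[pk])

def vertical_fragments(rows, pk, column_groups):
    """Split columns into groups; every fragment keeps the primary key."""
    return [
        [{c: row[c] for c in (pk, *group)} for row in rows]
        for group in column_groups
    ]

def reconstruct_vertical(fragments, pk):
    """Rebuild the relation by joining vertical fragments on the primary key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[pk], {}).update(row)
    return sorted(merged.values(), key=lambda r: r[pk])

by_city = horizontal_fragments(R, "city")
by_columns = vertical_fragments(R, "emp_id", [("name", "city"), ("salary",)])
```

In this sketch each horizontal fragment could be allocated to the site for its city, while the vertical split keeps `salary` apart from the other columns.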
Replication
• Replication is the process of copying and
maintaining database objects in multiple
databases that make up a distributed database
system. 
• Replication uses a publishing industry metaphor to
represent the components in a replication
topology, which include Publisher, Distributor,
Subscribers, publications, articles, and
subscriptions.
• Replication can improve the performance and
protect the availability of applications because
alternate data access options exist.
 
Magazine Metaphor
• A magazine publisher produces one or more
publications
• A publication contains articles
• The publisher either distributes the magazine
directly or uses a distributor
• Subscribers receive publications to which they
have subscribed
Types of Replication
• Merge replication
• Snapshot replication
• Snapshot replication with updating subscribers
• Transactional replication
• Transactional replication with updating
subscribers
Merge Replication
Snapshot Replication
Snapshot Replication with Updating Subscribers
Transactional Replication
Transactional Replication with Updating
Subscribers
• Changes written at a subscriber are also written to the
publisher
• Guaranteed transactional consistency
• The change will then be converged with other
updating subscribers and then sent back out to all
the subscription databases
• Example: low-volume reservation systems.
– A subscriber can look through a schedule of availability
and then attempt to make a reservation. After the
reservation has been scheduled, it can be replicated
within a few minutes to all the other subscription
databases.
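The editor's notes for this slide say the subscriber's change is written to the publisher in a two-phase commit (2PC). A minimal in-memory sketch of that idea, using the reservation example, might look as follows. The `Site` class, its method names, and the time-slot strings are hypothetical, and a real 2PC coordinator also handles logging, timeouts, and crash recovery, which this sketch omits.

```python
class Site:
    """A database site holding a set of booked reservation slots."""

    def __init__(self, name, booked=()):
        self.name = name
        self.booked = set(booked)
        self.pending = None  # slot reserved during the prepare phase

    def prepare(self, slot):
        """Phase 1: vote yes only if the slot is free at this site."""
        if slot in self.booked or self.pending is not None:
            return False
        self.pending = slot
        return True

    def commit(self):
        """Phase 2: make the prepared change permanent."""
        self.booked.add(self.pending)
        self.pending = None

    def rollback(self):
        """Phase 2 (abort): discard the prepared change."""
        self.pending = None


def two_phase_commit(sites, slot):
    """Book `slot` at every site, or at none of them."""
    prepared = []
    for site in sites:
        if site.prepare(slot):
            prepared.append(site)
        else:
            for p in prepared:  # one "no" vote aborts everyone
                p.rollback()
            return False
    for site in sites:
        site.commit()
    return True


# The publisher already holds a booking that has not yet been
# replicated to this subscriber.
publisher = Site("publisher", booked={"10:00"})
subscriber = Site("subscriber")
first_attempt = two_phase_commit([subscriber, publisher], "10:00")
second_attempt = two_phase_commit([subscriber, publisher], "11:00")
```

Because the publisher votes in the prepare phase, the subscriber cannot commit a reservation that conflicts with a booking it has not yet seen.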
The Publisher/Subscriber Metaphor
• The publisher is the owner of the source database
information. The publisher will make data
available for replication and will send changes to
the published data to the distributor.
• The subscriber database receives copies of the
data (snapshot replication) or transactions held in
the distribution database.
• The distributor receives all changes made to
published data. It then stores the data and
forwards it to subscribers at the appropriate time.
A single distribution server can support multiple
publishers and multiple subscribers at the same
time.
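The store-and-forward role of the distributor can be sketched as below. This is an illustration only: the classes and method names are assumptions and do not correspond to SQL Server's replication agents or system procedures. The distributor keeps an ordered log of changes (standing in for the distribution database) and remembers how far each subscriber has read.

```python
class Subscriber:
    """Holds a replicated copy of the published data."""

    def __init__(self):
        self.data = {}

    def apply(self, change):
        key, value = change
        self.data[key] = value


class Distributor:
    """Stores changes and forwards them to subscribers on request."""

    def __init__(self):
        self.log = []        # stands in for the distribution database
        self.positions = {}  # subscriber -> index of next change to send

    def receive(self, change):
        self.log.append(change)

    def subscribe(self, subscriber):
        self.positions[subscriber] = 0

    def forward(self, subscriber):
        """Send every change the subscriber has not seen yet."""
        start = self.positions[subscriber]
        for change in self.log[start:]:
            subscriber.apply(change)
        self.positions[subscriber] = len(self.log)
        return len(self.log) - start


class Publisher:
    """Owns the source data and reports each change to the distributor."""

    def __init__(self, distributor):
        self.data = {}
        self.distributor = distributor

    def write(self, key, value):
        self.data[key] = value
        self.distributor.receive((key, value))
```

Because each subscriber has its own position in the log, subscribers can synchronize at different times and still end up with the same data.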
The Publisher/Subscriber Metaphor
• Article: an individual collection of replicated data,
usually associated with a table. Creating an
article from a table allows the administrator to
filter out columns or rows that they want to
exclude from the replication scenario.
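Row and column filtering on an article can be pictured as a selection plus a projection over the source table. The sketch below assumes a hypothetical `customers` table and a helper named `make_article`; neither comes from the slides, and in a real replication product the filters are declared when the article is defined rather than coded by hand.

```python
def make_article(table, columns=None, row_filter=None):
    """Return the subset of `table` that an article would publish.

    columns    -- names of the columns to keep (None keeps all columns)
    row_filter -- predicate choosing the rows to keep (None keeps all rows)
    """
    rows = table if row_filter is None else [r for r in table if row_filter(r)]
    if columns is None:
        return [dict(r) for r in rows]
    return [{c: r[c] for c in columns} for r in rows]


# Hypothetical source table.
customers = [
    {"id": 1, "name": "A", "region": "North", "credit_limit": 500},
    {"id": 2, "name": "B", "region": "South", "credit_limit": 900},
    {"id": 3, "name": "C", "region": "North", "credit_limit": 100},
]

# Publish only northern customers, and leave out the credit_limit column.
north_article = make_article(
    customers,
    columns=("id", "name", "region"),
    row_filter=lambda r: r["region"] == "North",
)
```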
The Publisher/Subscriber Metaphor
Publication Limitations
• Tables must have a primary key to ensure
integrity. (The exception is when you are using snapshot replication.)
• You cannot replicate the following databases:
– master
– model
– msdb
– tempdb
– distribution databases
• Publications cannot span multiple databases.
Each publication can contain articles from one
database only.
• IMAGE, TEXT, and NTEXT data have limited
support.
Push Subscriptions
• When you set up a subscription at the same time
that you create your publications, you are
essentially setting up a push subscription. This
helps to centralize subscription administration
because the subscription is defined at the
publisher along with the subscribers'
synchronization schedule. All the administration
of the subscription is handled from the publisher.
The data is "pushed" to the subscriber when the
publisher decides to send it.
Pull Subscriptions
• A pull subscription is set up from each individual
subscriber. The subscribers initiate the transfer of
information on their own schedule. This is useful for
applications that can allow for a lower level of
security. The publisher can allow certain
subscribers to pull information, or the publisher
can allow anonymous subscriptions. Pull
subscriptions are also useful in situations in which
there might be a large number of subscribers.
Internet-based solutions are good candidates for
pull subscriptions.
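The difference between push and pull comes down to which side holds the subscription and starts the transfer. The following sketch is an illustration under assumed names (`Publisher`, `Subscriber`, `run_push`, `pull`); it does not model SQL Server's agents, schedules, or security settings.

```python
class Subscriber:
    """Keeps the changes it has received, in order."""

    def __init__(self):
        self.rows = []

    @property
    def applied(self):
        """Number of changes already received."""
        return len(self.rows)

    def receive(self, changes):
        self.rows.extend(changes)

    def pull(self, publisher):
        """Pull subscription: the subscriber starts the transfer."""
        self.receive(publisher.changes_since(self.applied))


class Publisher:
    """Holds the published changes and the list of push subscribers."""

    def __init__(self):
        self.changes = []
        self.push_subscribers = []  # push subscriptions live at the publisher

    def add_push_subscriber(self, subscriber):
        self.push_subscribers.append(subscriber)

    def publish(self, change):
        self.changes.append(change)

    def changes_since(self, position):
        return self.changes[position:]

    def run_push(self):
        """Push subscription: the publisher decides when to send."""
        for subscriber in self.push_subscribers:
            subscriber.receive(self.changes_since(subscriber.applied))
```

With push, the publisher must know every subscriber in advance; with pull, it only answers requests, which is one reason pull tends to suit large or anonymous subscriber populations.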
What to Publish
• What am I going to publish?
• Do the subscribers receive all the data or just
subsets of my data?
• Should my data be partitioned by region values or
zip codes?
• Should I allow subscribers of my data to send me
updates?
• If I do allow updates, how should they be
implemented?
• Who can have access to my data?
• Are these users online or offline?
• Are they across the country and connected with
expensive phone lines?
• How often should I synchronize my data with the
subscribers?
• How often do they get changes sent to them?
Hands On Lab
End


Editor's Notes

  • #16 Summary: each site changes its local copy; changes can be copied to the main copy at any time; copies may be temporarily inconsistent; best used where the chance of conflict is minimal. Merge replication allows each site to make changes to its local copy of the replicated data. At some point in time, the changes from the site are sent up to the publishing database, where they are merged with changes from other sites. Sooner or later, all sites will receive the updates from all the other sites. This is known as data convergence. Transactional consistency is not guaranteed here because different sites might be updating data at different times. A particular site does not wait for its updates to be sent to every other site before continuing its work. In other words, every site is guaranteed to converge to the same result sets, but not necessarily at the same time. Who should use merge replication? Because of the potential conflicts that can occur, merge replication is better suited to environments in which the chances of these conflicts are minimized. For example, sites that tend to make changes to their own records only (indicated by a location ID in each record), but need the information from all the other locations, are good candidates for merge replication. For example, you might create a database that tracks the criminal history of individuals. A large metropolitan area like Karachi, where every little town might like to have a copy of this criminal history but can't afford to be in contact with the central database at all times, might be an excellent location to implement merge replication. Each town would be autonomous, and latency could be very high. The local police could add new criminal information to the database and then send it back to headquarters to be merged with data from many other towns. There might still be conflicts if a criminal is moving from town to town and causing problems, but these conflicts can be detected and the appropriate records can be updated (converged) and sent back to all the little towns.
  • #17 Summary: whole items are copied from the publishing server to the subscriber; easiest to set up; high site autonomy; guaranteed transactional consistency; subscribers should treat the data as read-only; OLAP (online analytical processing) servers are excellent candidates. In snapshot replication, an entire copy of the items to be replicated is copied from the publishing server to the subscribing database, as shown in Figure 17.3. This type of replication is the easiest to set up and maintain. Snapshot replication has a high level of site autonomy. It also guarantees transactional consistency because all transactions are applied at the publication server only. The site autonomy can be very useful for locations that need read-only versions of the data and don't mind a higher amount of latency. When you are using snapshot replication, the subscription database should consider the replicated data as read-only. This is because any changes made to the data will not be sent back to the publication database. In addition, all changes that might have been made to the data will be wiped out when the next snapshot is downloaded. OLAP servers are excellent candidates for snapshot replication. The ad hoc queries that management information systems (MIS) administrators apply to data are generally read-only, and data that is several hours or even several days old does not affect their queries. For example, a company MIS department might want to do some research on the demographics of items sold two months ago. Information from last week, or even today, won't make any difference in its queries. Furthermore, the department isn't planning to make changes to the data; it just needs the data warehouse. The site autonomy allows the MIS department to implement additional indexes on the data without affecting the OLTP publication database.
  • #18 With this methodology (snapshot replication with updating subscribers), you have a certain amount of autonomy because the subscription database does not have to be in contact with the publishing database at all times. The only time the subscriber is working with the publisher is when a snapshot is being downloaded or the subscriber is using two-phase commit (2PC) to update a transaction at both the local (subscription) location and the publishing database. The subscription server can immediately begin working with the changed data because it knows that it has successfully updated the publication server. The publication server will converge the information, and in time all servers involved in the replication will receive the changes.
  • #19 Summary: replication is one way; a subscriber can change data only by writing directly to the publishing database; medium autonomy; subscribers should treat replicated data as read-only; an order-processing/distribution system is a typical example. In transactional replication, the transactions are sent from the publisher to the subscribers. This type of replication is one way. The only way a subscriber can make changes to data is directly to the publishing database. The changes will then be replicated back down to the subscriber at the next synchronization. This type of replication allows for a medium amount of autonomy. The subscriber should treat the replicated data as read-only. This is important because changes made on the replicated data might not allow the future replicated transactions to be performed. There is generally a medium amount of latency involved in this type of replication as well. The subscriber does not have to be in touch with the publisher at all times, but regular synchronizations are useful, and the amount of data being moved is relatively small. Remember that snapshot replication must move all the published data from the publisher to the subscriber (whether or not it has been modified). In transactional replication, only the transactions that were performed are sent to the subscribers. Transactional replication is most useful in scenarios in which the subscribers can treat their data as read-only, but they need changes to the data with a minimal amount of latency. An excellent example of transactional replication is found in an order-processing/distribution system. In this type of scenario, you might have several different publishing sites taking orders for goods. These orders are then replicated to a central distribution warehouse where pick tickets are created and the orders are filled and shipped. The warehouse can treat the data as read-only, and needs new information in a timely manner.
  • #20 With transactional replication with updating subscribers you lose even more autonomy at the subscription sites, but you minimize latency. With this methodology, you use the transactional replication described in the last section with 2PC. When a subscription database attempts to make changes to data, the change is also written to the publishing database in a 2PC. This means that the change is written to both the subscriber and the publisher at the same time. Because of this, you have guaranteed transactional consistency. The change will then be converged with other updating subscribers and then sent back out to all the subscription databases. This has less latency than using snapshot replication with updating subscribers because the transactions being replicated are much smaller (and quicker to move) than synchronizing an entire snapshot of your data. Useful scenarios for this type of replication include low-volume reservation systems. In this type of system, a subscriber can look through a schedule of availability and then attempt to make a reservation. After the reservation has been scheduled, it can be replicated within a few minutes (or however long you determine) to all the other subscription databases. This updates all their schedules. You might be thinking to yourself, "Yeah that looks good, but what if I try to make a reservation that someone else already has booked, but that booking hasn't been replicated to this subscriber yet?" Remember that because this type of replication uses 2PC, you know that if your reservation is successfully committed, it was available and you didn't overwrite anyone else's reservations.
  • #25 You can set up multiple subscribers at the same time when you are working with push subscriptions. Push subscriptions are most useful when your subscribers need updates sent to them as soon as they occur. Push subscriptions also allow for a higher level of security because the publisher determines who is allowed to subscribe and when. Push subscriptions do add some overhead at the distribution database because it does the replication management.
  • #26 Only SQL Server subscribers can pull subscriptions. Other databases like Access, Oracle, and Sybase can use SQL Server 7.0 replication, but only in a push subscription scenario.
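
Notes #17 and #19 contrast how much data each approach moves: snapshot replication copies all published data whether or not it changed, while transactional replication sends only the transactions that were performed. A small sketch of that contrast follows; the function names, the `("upsert", key, value)` log format, and the sample table are assumptions made for illustration.

```python
def snapshot_sync(publisher_rows):
    """Snapshot replication: copy every published row, changed or not."""
    new_copy = dict(publisher_rows)
    return new_copy, len(new_copy)


def transactional_sync(subscriber_rows, log, applied):
    """Transactional replication: replay only log entries not yet applied."""
    rows = dict(subscriber_rows)
    pending = log[applied:]
    for op, key, value in pending:
        if op == "delete":
            rows.pop(key, None)
        else:  # "upsert"
            rows[key] = value
    return rows, len(pending)


# Hypothetical published table with 100 rows.
initial = {i: f"item-{i}" for i in range(100)}
publisher = dict(initial)
log = []

# One change is made at the publisher and recorded in the log.
publisher[7] = "item-7-updated"
log.append(("upsert", 7, "item-7-updated"))

# A snapshot subscriber that also edited its copy locally.
edited_subscriber = dict(initial)
edited_subscriber[99] = "local edit"

snap_copy, snap_moved = snapshot_sync(publisher)
txn_copy, txn_moved = transactional_sync(initial, log, applied=0)
```

Here the snapshot moves all 100 rows and replaces the subscriber's copy, so the local edit to row 99 does not survive, while the transactional sync moves a single log entry to reach the same state.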