
  1. CSCI 397C16: Object-Oriented Database Systems
     Project 1
     IBM DB2 Data Replication
     Fall 1999
     Michael C. Morrison
     IBM Santa Teresa Lab
  2. CSCI 397C16 - MC Morrison

     IBM DB2 Data Replication: The DB2 solution to the distributed data problem

     A distributed database is, typically, a database that is not stored in its entirety at a single physical location, but rather is spread across a network of computers that are geographically dispersed and connected via communication links. [Date, p. 28]

     Why distribute data? [Martin, pp. 136-9]

     • reduce cost: store data where it will be used
     • load: separate systems can serve different areas and thus spread the workload
     • localize management: allow a local group to manage and maintain its own data rather than rely on a centralized group (which could be across the country); that is, the local group keeps autonomy over its data
     • smaller, cheaper computers: do the work on PCs and workstations rather than on mainframes or other large centralized servers
     • different types of data within an organization: different organizations within a company have different database needs; let them have their own databases, yet keep them accessible across the enterprise
     • improve response time: allow local PCs to store data locally to improve interactions with the user
     • availability: if one of the distributed systems fails, users can route their requests to another system in the network, rather than having to wait for a centralized system to be brought back online
     • disaster recovery: like availability, but in the event of more serious outages

     A key objective for a distributed system is that it should look like a centralized system to the user. That is, the user should not normally need to know where any given piece of data is physically stored. [Date, p. 28]

     If a data model or group of data models is only weakly connected to other models, then the database may be kept organizationally and physically distributed. This can provide considerable advantages in terms of management and flexibility. The management of smaller, homogeneous information areas can be carried out within a department where expertise in the area is available. [Wiederhold, p. 362]

     The Problem

     In today’s enterprises, data is the key to the business, and there is a lot of it. Often, there is more data than a company knows what to do with. As separate organizations within an enterprise work together, they find that they have data they would like to share with each other, but they can’t (or won’t) give people in the other group access to their systems. Increasingly, data that the company would like to include in its enterprise systems is generated or gathered away from the traditional centralized database servers, and the company wants to include it without jeopardizing the autonomy of those who collect it.

     Today’s enterprises also have a work force that is increasingly mobile and diverse in its needs for data. Salespeople and delivery people must be able to access data while they are traveling. Other employees work from remote locations and must also be able to access data. Some organizations within the enterprise must be able to work with the data without corrupting the main database or placing too heavy a demand on the server’s resources. Is there a way to satisfy all of these disparate needs?

     Solutions

     The way toward solving these problems is to distribute the data. [Bray, pp. 75-87; Ozsu, pp. 78-93]

     One solution is to give everyone direct access to a central server. Companies traditionally know how to manage central servers, and probably already have one set up. A problem with this approach is that the central server is a single point of failure: if it goes down, all the data is unavailable or worse! Another
  3. problem with the centralized approach is that it consumes a large amount of network bandwidth to accommodate the users who must access the server, and as user traffic increases, the server itself must be very large.

     Another solution is to partition the data: carve it up so that it’s not all in one place, but is divided among several different servers. The partitioning can be by database, by table, by column, by row, or a combination of these. Because the data is no longer in one place, there is no single point of failure and the network traffic is distributed among the systems that have partitions. However, creating the partitions is not easy because the locations and types of partitions must match the users’ needs, which can and do change over time. Also, because the data is not in one place, a company can no longer centralize its management, but must have administration personnel in each location where the data is.

     Another solution is to copy, or replicate, the data. The data remains at the central server, but is also copied, wholly or in part, to remote systems. The data can be fully replicated, in which case every remote system has a copy of all the data, or the data can be partially replicated, in which case some remote systems might have all the data while others have a smaller subset of the data. You can combine replication with partitioning, and copy databases, tables, columns, rows, or a combination of any of these. Because the data is no longer in one place, there is no single point of failure and the network traffic is distributed among the systems that have copies. However, there is network traffic to copy the data between systems, and increased overhead to ensure the copies are synchronized with the central server. Also, because the data is not in one place, a company can no longer centralize its management, but must have administration personnel in each location where the data is.
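To make the difference between partitioning and replication concrete, here is a minimal Python sketch. It is not DB2 code: the function names, the `id` field, and the idea of modelling each site as a list of rows are all assumptions made for illustration. It shows which rows remain readable when a site fails under each placement strategy.

```python
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, object]
Placement = Dict[str, List[Row]]  # hypothetical model: site name -> rows stored there


def partition_by_row(rows: List[Row], site_of: Callable[[Row], str],
                     sites: List[str]) -> Placement:
    """Partitioning: each row is stored at exactly one site."""
    placement: Placement = {site: [] for site in sites}
    for row in rows:
        placement[site_of(row)].append(row)
    return placement


def replicate(rows: List[Row], sites: List[str],
              wanted: Optional[Dict[str, Callable[[Row], bool]]] = None) -> Placement:
    """Replication: copy rows to the sites.

    With wanted=None every site receives all rows (full replication).
    A site listed in `wanted` keeps only the rows that pass its filter
    (partial replication); sites not listed still receive everything.
    """
    filters = wanted or {}
    return {
        site: [row for row in rows if filters.get(site, lambda _row: True)(row)]
        for site in sites
    }


def reachable_ids(placement: Placement, failed: Set[str]) -> Set[object]:
    """Ids of the rows that can still be read while the `failed` sites are down."""
    return {
        row["id"]
        for site, stored in placement.items()
        if site not in failed
        for row in stored
    }
```

With rows tagged by region and two sites, losing one site under partitioning makes that site's rows unreachable, while under full replication every row is still available from the surviving site. The sketch ignores the synchronization traffic that the text identifies as the cost of replication.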
     Reasons for replicating data [Bontempo, p. 141]:

     • Improve response time for end users (who might otherwise be constrained by network traffic when accessing remote data)
     • Improve data availability (by minimizing reliance on a network and remote systems)
     • Create a standby database that can be used if the primary remote system crashes or must be shut down
     • Simplify system management issues

     DB2 Replication

     Replication is an integrated feature of DB2; that is, on most platforms you don’t have to buy extra software to implement replication. On AS/400 and OS/390, however, replication is a priced feature of DB2. If you want to replicate to a non-DB2 database, you must also use DB2 DataJoiner.

     DB2 can replicate data between DB2 databases on any of the following IBM operating systems: AIX, AS/400, OS/2, OS/390 (MVS), VM, and VSE. DB2 can also replicate data between DB2 databases running natively on the following operating systems: HP-UX, Linux, Microsoft Windows (95, 98, or NT), SCO UnixWare, and Sun Solaris.

     DB2 can replicate data between DB2 databases and any of the following non-DB2 databases (using DB2 DataJoiner as the intermediary): Informix, Microsoft SQL Server, Oracle, Sybase, and Sybase SQL Anywhere. DB2 can also replicate data between DB2 databases and IMS databases or VSAM files (using IMS DataPropagator as the interface to the nonrelational data). With the addition of Lotus NotesPump, you can also replicate data between DB2 databases and Lotus Notes databases.

     Finally, DB2 supports the occasionally-connected environment by allowing replication to home computers, laptops, and palmtop machines. Specifically, DB2 supports replication to Microsoft Jet databases, DB2 Satellite Edition, and soon DB2 Everywhere.

     Other companies also support replication for their databases, for example: Informix, Ingres, NonStop SQL, Oracle, Microsoft SQL Server, and Sybase.
  4. DB2 Replication Concepts

     DB2 Replication requires three logical servers (which can reside on one to three physical servers): a source server, a target server, and a control server. The source server has all the source tables, the target server has all the target tables, and the control server has control tables that keep track of and govern the current state of replication between sources and targets.

     To capture data from DB2 source tables, DB2 runs the Capture program, which reads the DB2 log and copies changed data to a staging table: the Change Data (CD) table. To capture data from non-DB2 source tables, DB2 defines Capture triggers for each of the non-DB2 databases. To copy the data from the source tables to the target tables, DB2 uses the Apply program. To ensure that DB2 replicates only transactionally consistent data, the Apply program joins the CD table and the Unit-of-Work (UOW) table. DB2 can save the results of this join for future use in a Consistent Change Data (CCD) table.

     [DB2 replication] captures data changes by reading log records (often when still in buffers) and recording relevant information in “staging tables” at the source site. The tables include a change data table to track changes to the source table and a Unit of Work table to record transaction boundaries (including commit points). Because [DB2 replication] reads data from the log as changes occur, it can capture uncommitted work. The product enables customers to either propagate uncommitted work (by reading information in the change data table only) or propagate only committed work (by joining data from the change data table with the Unit of Work table). [Bontempo, p. 204]

     The Apply program can run at the source server or the target server. When the Apply program runs at the source server or another server, it pushes data to the target tables. When it runs at the target server, it pulls data from the source tables. Pulling data is generally more efficient because DB2 can make better use of the available network bandwidth.

     Administration

     To set up and maintain a replication environment for DB2-to-DB2 replication, use the DB2 Control Center, the GUI that comes with DB2 for DB2 administration. Unfortunately, if you want to set up and maintain replication in a heterogeneous environment, you must use the DB2 DataJoiner Replication Administration (DJRA) tool. Although IBM strongly recommends not doing so, you could set up replication by manually or programmatically editing the replication control tables.

     IBM provides several other tools to administer a replication environment: tools for tracing the Capture program or the Apply program, for monitoring currently running replications, and for analyzing the replication setup to ensure it is defined correctly.

     Replication Sources

     A replication source can be one table, part of one table, a join of several tables, or a view of one or more tables. When you define a replication source, you must choose which columns to make available for replication. Because you later define a source as a member of a subscription set, wherein you can decide which columns (of the ones available) to replicate, IBM recommends that you include all columns in the replication source. You can also choose whether to replicate before images of any columns; these before images can be useful for auditing or security. Likewise, you can choose whether an SQL UPDATE should be treated as an UPDATE or instead as a DELETE followed by an INSERT.

     The default replication scenario includes a read-only target table; you can define replication to include a read-write target table: IBM calls this scenario Update Anywhere. Allowing the target table to be updated requires you to decide, while defining the replication source, whether you want DB2 to enforce conflict detection at the target table. You have three options: no conflict detection (useful if you can guarantee that you will have no conflicts, such as in an application/data partitioning scenario), row-level detection (the most popular option), or table-level detection (best, but might not be feasible because of performance requirements). [Stonebraker, pp. 261-266]
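The choice described under DB2 Replication Concepts, between propagating every captured change and propagating only committed work, comes down to whether the change data table is joined with the Unit of Work table. The sketch below illustrates that idea with Python's built-in `sqlite3` module. The table layouts and column names (`cd`, `uow`, `uow_id`, `change_seq`, `commit_seq`) are simplified assumptions for illustration and are not the actual DB2 control-table definitions.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create hypothetical, simplified staging tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cd (            -- change data: one row per captured change
            uow_id     INTEGER,      -- transaction that made the change
            change_seq INTEGER,      -- position of the change in the log
            operation  TEXT,         -- 'I', 'U' or 'D'
            account_id INTEGER,
            balance    REAL
        );
        CREATE TABLE uow (           -- unit of work: one row per committed transaction
            uow_id     INTEGER PRIMARY KEY,
            commit_seq INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO cd VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "I", 100, 50.0),
            (2, 2, "U", 100, 75.0),  # transaction 2 has no commit record
            (3, 3, "I", 200, 20.0),
        ],
    )
    conn.executemany("INSERT INTO uow VALUES (?, ?)", [(1, 10), (3, 11)])
    return conn


def all_changes(conn: sqlite3.Connection) -> list:
    """Read the change data table only: includes uncommitted work."""
    return conn.execute(
        "SELECT uow_id, operation, account_id, balance FROM cd ORDER BY change_seq"
    ).fetchall()


def committed_changes(conn: sqlite3.Connection) -> list:
    """Join change data with the unit-of-work table: committed work only."""
    return conn.execute(
        """
        SELECT cd.uow_id, cd.operation, cd.account_id, cd.balance
        FROM cd
        JOIN uow ON uow.uow_id = cd.uow_id
        ORDER BY uow.commit_seq, cd.change_seq
        """
    ).fetchall()
```

In this toy data set, `all_changes` returns three rows, while `committed_changes` drops the change made by transaction 2 because the `uow` table holds no commit record for it.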
  5. You can also decide whether to create the target tables or to use existing target tables. For new tables, you should enable full refresh, wherein the entire source table is copied to the target table. For existing tables, you should use differential refresh, wherein only changes to the source table are copied to the target table. In this case, you can also select a conflict-detection level even if the target table is read-only.

     It’s most desirable if the system sends only the changes (rather than a complete copy of the table) to the remote sites. This minimizes resource consumption and helps prevent potential performance problems. [Bontempo, p. 146]

     Subscription Sets

     To allow replication to occur for multiple tables in parallel, you group them into subscription sets. Essentially, the target tables subscribe to changes made to the source tables. For each subscription set, you can define when and how often you want the changes to be replicated. You can set the timing based on the clock or based on some event using a trigger.

     There are three main approaches to storing data in a distributed system [Silberschatz, p. 588]:

     • replication: copy data from one system to some others or all others in the network
     • fragmentation: partition the data into nonoverlapping fragments and store each fragment at a different site within the network
     • replication and fragmentation: partition the data and copy each fragment to some or all systems in the network

     For each member of the subscription set, you can define which columns and rows to replicate. If you include only a subset of columns, you create a vertical fragment; if you include a subset of the rows (by including a WHERE clause), you create a primary horizontal fragment. You can also combine these two approaches. If the source table is a join of other tables, and you include only a subset of the rows, you create a derived horizontal fragment. [Silberschatz, pp. 589-593; Ozsu, pp. 99-136]

     Because data is captured from the source tables before it is replicated, you have a chance to manipulate the data before the Apply program copies it to the target tables. One reason to do so is to convert currency from, say, dollars to yen, or to combine several values from a source table into one aggregate value at the target table.

     Use of staging tables at the source site enables customers to apply a variety of SQL functions, including aggregate functions, on the data before propagating it. Thus, if one site is interested in seeing only the average opening balance for new checking accounts, it can obtain only this information by writing an appropriate query [in the WHERE clause for the subscription set member]. It does not need to propagate individual rows about all new accounts and then calculate the average opening balance at the target site. [Bontempo, p. 205]

     Replication Configurations

     There are four popular replication configurations: data distribution, data consolidation, update anywhere, and occasionally-connected.

     In a data distribution scenario, changes made to a source table are replicated to read-only target tables. Typically, each target table is a full copy (not a fragment). This scenario is useful for distributed data sharing.

     In a data consolidation scenario, each read-only target table is a horizontal fragment. This scenario is useful for maintaining decision-support data, where each target table is needed by only a part of the organization.
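The fragment types and the aggregation described under Subscription Sets can be illustrated with ordinary SQL queries. The following sketch uses Python's built-in `sqlite3` module rather than DB2. The `account` table, its columns, and the function names are hypothetical, and the queries only show what each kind of subset selects, not how a subscription set member is actually defined.

```python
import sqlite3


def make_source() -> sqlite3.Connection:
    """Build a small hypothetical source table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, branch TEXT, owner TEXT, opening_balance REAL)"
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?, ?, ?)",
        [
            (1, "east", "Ann", 100.0),
            (2, "east", "Bob", 300.0),
            (3, "west", "Cy", 50.0),
        ],
    )
    return conn


def vertical_fragment(conn: sqlite3.Connection) -> list:
    """Subset of columns: every row, but only id and branch."""
    return conn.execute("SELECT id, branch FROM account ORDER BY id").fetchall()


def horizontal_fragment(conn: sqlite3.Connection, branch: str) -> list:
    """Subset of rows chosen by a WHERE clause: all columns for one branch."""
    return conn.execute(
        "SELECT id, branch, owner, opening_balance FROM account "
        "WHERE branch = ? ORDER BY id",
        (branch,),
    ).fetchall()


def average_opening_balance(conn: sqlite3.Connection) -> list:
    """Aggregate before propagating: one average per branch, no individual rows."""
    return conn.execute(
        "SELECT branch, AVG(opening_balance) FROM account "
        "GROUP BY branch ORDER BY branch"
    ).fetchall()
```

Combining a column list with a WHERE clause would give a fragment that is both vertical and horizontal. The aggregate query mirrors the average-opening-balance example quoted from Bontempo: the target receives one row per branch instead of every account row.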
  6. In the update-anywhere scenario, each target table is read-write. In the case of update conflicts, the source table always wins. You can combine data distribution or data consolidation with update anywhere.

     In the occasionally-connected scenario, the target servers are not always connected to the network. The Apply program replicates changed data when the target server connects to the source server. This is the typical scenario for home PCs (telecommuting), laptops, and other portable computers.

     Summary

     DB2 provides replication for data on many platforms in both homogeneous and heterogeneous data environments. You set up and maintain DB2 replication using one of two GUIs. Target tables subscribe to changes made to source tables, and can subset the source data into vertical or horizontal fragments. During replication, DB2 can use SQL functions to modify the data, so the target data needn’t match the source data in either format or content (for example, changing the format of a timestamp or converting a double-byte string to an integer). The target tables are often read-only, but can also be read-write. Because target tables are part of subscription sets, and because the Apply program can run as multiple instances, you can replicate large amounts of data in parallel.

     Bibliography

     -----. 1999. IBM DB2 Replication Guide and Reference. Version 6. IBM Corporation. (SC26-9642)
     -----. n.d. IBM DB2 Replication Web site: http://www.ibm.com/software/data/dpropr
     Bontempo, Charles J. and Cynthia Maro Saracco. 1995. Database Management Principles and Products. Upper Saddle River, NJ: Prentice Hall PTR.
     Bray, Olin H. 1982. Distributed Database Management Systems. Lexington, MA: D.C. Heath and Co.
     Cook, Jonathan and Robert Harbus. 1999. The DB2 Replication Certification Guide. Upper Saddle River, NJ: Prentice Hall PTR.
     Date, C. J. 1981. An Introduction to Database Systems: Volume 1. 3/e. Reading, MA: Addison-Wesley Publishing Co.
     Martin, James. 1976. Principles of Data-Base Management. Englewood Cliffs, NJ: Prentice-Hall, Inc.
     Ozsu, M. Tamer and Patrick Valduriez. 1991. Principles of Distributed Database Systems. Upper Saddle River, NJ: Prentice Hall.
     Silberschatz, Abraham; Henry F. Korth; S. Sudarshan. 1997. Database System Concepts. 3/e. Boston: WCB/McGraw-Hill.
     Stonebraker, Michael, ed. 1988. Readings in Database Systems. San Mateo, CA: Morgan Kaufmann Publishers, Inc.
     Wiederhold, Gio. 1977. Database Design. New York: McGraw-Hill Book Co.