CSCI 397C16: Object-Oriented Database Systems
IBM DB2 Data Replication
Michael C. Morrison
IBM Santa Teresa Lab
CSCI 397C16 - MC Morrison
IBM DB2 Data Replication: The DB2 solution to the distributed data problem
A distributed database is, typically, a database that is not stored in its entirety at a single physical
location, but rather is spread across a network of computers that are geographically dispersed and
connected via communication links. [Date, p 28]
Why distribute data? [Martin, pp 136-9]
• reduce cost: store data where it will be used
• load: separate systems can serve different areas and thus spread the workload
• localize management: allow a local group to manage and maintain their own data rather than rely on
a centralized group (who could be across the country); that is, they maintain local autonomy
• smaller, cheaper computers: do the work on PCs and workstations rather than on mainframes or
other large centralized servers
• different types of data within an organization: different organizations within a company have different
database needs; let them have their own databases, yet keep them accessible across the company
• improve response time: allow local PCs to store data locally to improve interactions with the user
• availability: if one of the distributed systems fails, users could route their requests to another system
in the network, rather than having to wait for a centralized system to be brought back online
• disaster recovery: like availability, but in the event of more serious outages
A key objective for a distributed system is that it should look like a centralized system to the user. That
is, the user should not normally need to know where any given piece of data is physically stored. [Date, p
If a data model or group of data models is only weakly connected to other models, then the database
may be kept organizationally and physically distributed. This can provide considerable advantages in
terms of management and flexibility. The management of smaller, homogeneous information areas can
be carried out within a department where expertise in the area is available. [Wiederhold, p362]
In today’s enterprises, data is the key to the business, and there is a lot of it. Often, there is more data
than a company knows what to do with. As separate organizations within an enterprise work together,
they find that they have data they would like to share with each other, but they can’t (or won’t) give
people in the other group access to their systems. Increasingly, data is also generated or gathered away
from the traditional centralized database servers; the company would like to include this data in its
enterprise systems without jeopardizing the autonomy of those who collect it.
Today’s enterprises also have a work force that is increasingly mobile and diverse in its needs for data.
There are sales people and delivery people who must be able to access data while they are traveling.
There are other workers who work from remote locations who must be able to access data. There are
organizations within the enterprise that must be able to work with the data without corrupting the main
database or placing too heavy a demand on the server’s resources.
Is there a way to satisfy all of these disparate needs?
The way toward solving these problems is to distribute the data. [Bray, pp.75-87; Ozsu, pp. 78-93]
One solution is to give everyone direct access to a central server. Companies traditionally know how to
manage central servers, and probably already have one set up. A problem with this approach is that the
central server is a single point-of-failure: if it goes down, all the data is unavailable or worse! Another
problem with the centralized approach is that it consumes a large amount of network bandwidth to
accommodate the users who must access the server, and as user traffic grows, the server itself
must be very large.
Another solution is to partition the data: carve it up so that it’s not all in one place, but is divided among
several different servers. The partitioning can be by database, by table, by column, by row, or a
combination of partitioning. Because the data is no longer in one place, there is no single point-of-failure
and the network traffic is distributed among the systems that have partitions. However, creating the
partitions is not easy because the locations and types of partitions must match the users’ needs, which
can and do change over time. Also, because the data is not in one place, a company can no longer
centralize its management, but must have administration personnel in each location where the data is.
Another solution is to copy, or replicate, the data. The data remains at the central server, but is also
copied, wholly or in part, to remote systems. The data can be fully replicated, in which case every
remote system has a copy of all the data, or the data can be partially replicated, in which case some
remote systems might have all the data while others have a smaller subset of the data. You can combine
replication with partitioning, and copy databases, tables, columns, rows, or a combination of any of these.
Because the data is no longer in one place, there is no single point-of-failure and the network traffic is
distributed among the systems that have copies. However, there is network traffic to copy the data
between systems, and increased overhead to ensure the copies are synchronized with the central server.
Also, because the data is not in one place, a company can no longer centralize its management, but must
have administration personnel in each location where the data is.
Reasons for replicating data [Bontempo, p. 141]:
• Improve response time for end users (who might otherwise be constrained by network traffic when
accessing remote data)
• Improve data availability (by minimizing reliance on a network and remote systems)
• Create a standby database that can be used if the primary remote system crashes or must be shut
down
• Simplify system management issues
Replication is an integrated feature of DB2; that is, on most platforms you don’t have to buy extra
software to implement replication. On AS/400 and OS/390, replication is a priced feature of DB2.
In addition, if you want to replicate to a non-DB2 database, you must use DB2 DataJoiner.
DB2 can replicate data between DB2 databases on any of the following IBM operating systems: AIX,
AS/400, OS/2, OS/390 (MVS), VM, and VSE. DB2 can also replicate data between DB2 databases running
natively on the following operating systems: HP-UX, Linux, Microsoft Windows (95, 98, or NT), SCO
UnixWare, and Sun Solaris.
DB2 can replicate data between DB2 databases and any of the following non-DB2 databases (using DB2
DataJoiner as the intermediary): Informix, Microsoft SQL Server, Oracle, Sybase, and Sybase SQL
Anywhere.
DB2 can also replicate data between DB2 databases and IMS databases or VSAM files (using IMS
DataPropagator as the interface to the nonrelational data). With the addition of Lotus NotesPump, you
can also replicate data between DB2 databases and Lotus Notes databases.
Finally, DB2 supports the occasionally-connected environment by allowing replication to home
computers, laptops, and palmtop machines. Specifically, DB2 supports replication to Microsoft Jet
databases, DB2 Satellite Edition, and soon DB2 Everywhere.
Other companies also support replication for their databases, for example: Informix, Ingres, NonStop
SQL, Oracle, Microsoft SQL Server, and Sybase.
DB2 Replication Concepts
DB2 Replication requires three logical servers (which can reside on one to three physical servers): a source
server, a target server, and a control server. The source server has all the source tables, the target
server has all the target tables, and the control server has control tables that keep track of and govern
the current state of replication between sources and targets.
To capture data from DB2 source tables, DB2 runs the Capture program, which reads the DB2 log and
copies data to a staging table: the Change Data (CD) table. To capture data from non-DB2 source
tables, DB2 defines Capture triggers for each of the non-DB2 databases.
To copy the data from the source tables to the target tables, DB2 uses the Apply program. To ensure
that DB2 only replicates transactionally-consistent data, the Apply program joins the CD table and the
Unit-of-Work (UOW) table. DB2 can save the results of this join for future use in a Consistent Change
Data (CCD) table.
[DB2 replication] captures data changes by reading log records (often when still in buffers) and
recording relevant information in “staging tables” at the source site. The tables include a change data
table to track changes to the source table and a Unit of Work table to record transaction boundaries
(including commit points). Because [DB2 replication] reads data from the log as changes occur, it
can capture uncommitted work. The product enables customers to either propagate uncommitted
work (by reading information in the change data table only) or propagate only committed work (by
joining data from the change data table with the Unit of Work table). [Bontempo, p. 204]
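The join of the change data table with the Unit of Work table can be sketched with an in-memory SQLite database standing in for DB2. This is only an illustration: the table and column names (cd_orders, uow, uow_id, commit_seq, change_seq) are assumptions made for the example and are not the actual DB2 staging-table layout.

```python
import sqlite3


def committed_changes(conn):
    """Return CD rows whose unit of work has a commit record in the UOW table."""
    return conn.execute(
        """
        SELECT cd.uow_id, cd.operation, cd.order_id, cd.amount
        FROM cd_orders AS cd
        JOIN uow ON uow.uow_id = cd.uow_id   -- inner join drops uncommitted work
        ORDER BY uow.commit_seq, cd.change_seq
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cd_orders (change_seq INTEGER, uow_id TEXT, operation TEXT,
                            order_id INTEGER, amount REAL);
    CREATE TABLE uow (uow_id TEXT, commit_seq INTEGER);
    -- T1 has committed; T2 is still in flight and has no UOW row yet.
    INSERT INTO cd_orders VALUES (1, 'T1', 'I', 100, 25.0);
    INSERT INTO cd_orders VALUES (2, 'T2', 'I', 101, 40.0);
    INSERT INTO cd_orders VALUES (3, 'T1', 'U', 100, 30.0);
    INSERT INTO uow VALUES ('T1', 1);
    """
)
rows = committed_changes(conn)
```

Reading cd_orders alone would return all three rows, including the uncommitted change from T2; the join returns only the two changes made by T1.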
The Apply program can run at the source server or the target server. When the Apply program runs at
the source server or other server, it pushes data to the target tables. When it runs at the target server, it
pulls data from the source tables. Pulling data is generally more efficient because DB2 can make better
use of the available network bandwidth.
To set up and maintain a replication environment for DB2-to-DB2 replication, use the DB2 Control
Center, the GUI that comes with DB2 for DB2 administration. Unfortunately, if you want to set up and
maintain replication in a heterogeneous environment, you must use the DB2 DataJoiner Replication
Administration (DJRA) tool. You could also set up replication by manually or programmatically editing the
replication control tables, although IBM strongly recommends against doing so.
IBM provides several other tools to administer a replication environment. These tools trace the Capture
program or the Apply program, monitor currently running replications, and analyze the replication setup to
ensure it is defined correctly.
A replication source can be one table, part of one table, a join of several tables, or a view of one or more
tables. When you define a replication source, you must choose which columns to make available for
replication. Because you later define a source as a member of a subscription set, wherein you can
decide which columns (of the ones available) to replicate, IBM recommends that you include all columns
in the replication source. You can also choose whether to replicate before images of any columns; these
before images can be useful for auditing or security. Likewise, you can choose whether an SQL UPDATE
should be captured as an UPDATE or instead as a DELETE followed by an INSERT.
The default replication scenario includes a read-only target table, but you can define replication to include a
read-write target table; IBM calls this scenario Update Anywhere. Allowing the target table to be updated
requires you to decide, while defining the replication source, whether you want DB2 to enforce conflict
detection at the target table. You have three options: no conflict detection (useful if you can guarantee
that you will have no conflicts, such as in an application/data partitioning scenario), row-level detection (the
most popular option), or table-level detection (best, but might not be feasible because of performance
requirements). [Stonebraker, pp. 261-266]
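Row-level detection can be pictured as comparing the keys changed at each site since the last replication cycle. The following is a minimal sketch under assumed data structures (dictionaries mapping a primary key to its new value), not DB2's actual mechanism; the function name is hypothetical. It resolves each conflict in favor of the source table, which is the rule this paper gives for the update-anywhere scenario.

```python
def resolve_row_conflicts(source_changes, target_changes):
    """Split pending changes by destination; the source wins on conflict.

    Both arguments map a primary-key value to the new row value changed at
    that site since the last replication cycle (a hypothetical representation).
    """
    # A conflict is a key that was changed at both sites in the same cycle.
    conflicts = sorted(set(source_changes) & set(target_changes))
    # Target changes on conflicting keys are rejected; the rest flow to the source.
    to_source = {k: v for k, v in target_changes.items() if k not in conflicts}
    # Every source change flows to the target, overwriting conflicting rows.
    to_target = dict(source_changes)
    return to_target, to_source, conflicts


to_target, to_source, conflicts = resolve_row_conflicts(
    {1: "alice@hq", 2: "bob@hq"},
    {2: "bob@branch", 3: "carol@branch"},
)
```

Here row 2 was changed at both sites, so the target's version is discarded, while the target's change to row 3 is still sent to the source.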
You can also decide whether to create the target tables or to use existing target tables. For new tables,
you should enable full refresh, wherein the entire source table is copied to the target table. For existing
tables, you should use differential refresh, wherein only changes to the source table are copied to the
target table. In this case, you can also select a conflict-detection level even if the target table is read-only.
It’s most desirable if the system sends only the changes (rather than a complete copy of the table) to the
remote sites. This minimizes resource consumption and helps prevent potential performance problems.
[Bontempo, p. 146]
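The difference between the two refresh styles can be illustrated with plain dictionaries standing in for tables. This is a sketch only: the function names and the (operation, key, value) change format are assumptions made for the example, not the format DB2 uses.

```python
def full_refresh(source_rows):
    """Rebuild the target as a complete copy of the source table."""
    return dict(source_rows)


def differential_refresh(target_rows, changes):
    """Apply only the captured changes, in order, to an existing target copy."""
    target = dict(target_rows)
    for operation, key, value in changes:
        if operation == "D":
            target.pop(key, None)
        else:  # "I" and "U" both leave the row at its new value
            target[key] = value
    return target


source = {1: "widget", 2: "gadget", 3: "gizmo"}
target = full_refresh(source)  # initial load: every row is copied

# Later cycles ship three change records instead of the whole table.
changes = [("U", 2, "gadget v2"), ("D", 3, None), ("I", 4, "doohickey")]
target = differential_refresh(target, changes)
```

The cost of a full refresh grows with the size of the table, while the cost of a differential refresh grows with the number of changes since the last cycle.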
To allow replication to occur for multiple tables in parallel, you group them into subscription sets.
Essentially, the target tables subscribe to changes made to the source tables. For each subscription
set, you can define when and how often you want the changes to be replicated. You can set the timing
based on the clock or based on some event using a trigger.
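One way to picture a subscription set and its timing is as a small record that is checked on every cycle. The class below is a hypothetical sketch; its field names (interval_minutes, event_name, last_run_minute) are assumptions for illustration and do not correspond to the DB2 control tables.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SubscriptionSet:
    name: str
    members: List[Tuple[str, str]]          # (source table, target table) pairs
    interval_minutes: Optional[int] = None  # clock-based timing
    event_name: Optional[str] = None        # event-based timing
    last_run_minute: int = 0

    def is_due(self, now_minute, posted_events=()):
        """Return True if the interval has elapsed or the named event was posted."""
        by_clock = (
            self.interval_minutes is not None
            and now_minute - self.last_run_minute >= self.interval_minutes
        )
        by_event = self.event_name is not None and self.event_name in posted_events
        return by_clock or by_event


hourly = SubscriptionSet(
    "SALES",
    [("ORDERS", "ORDERS_COPY"), ("ITEMS", "ITEMS_COPY")],
    interval_minutes=60,
)
on_close = SubscriptionSet(
    "LEDGER", [("LEDGER", "LEDGER_COPY")], event_name="END_OF_DAY"
)
```

All members of a set share one schedule, so the two tables in the SALES set are always replicated in the same cycle.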
There are three main approaches to storing data in a distributed system [Silberschatz, p.588]:
• replication: copy data from one system to some others or all others in the network
• fragmentation: partition the data into nonoverlapping fragments and store each fragment at a different
site within the network
• replication and fragmentation: partition the data and copy each fragment to some or all systems in
the network
For each member of the subscription set, you can define which columns and rows to replicate. If you
include only a subset of columns, you create a vertical fragment; if you include a subset of the rows (by
including a WHERE clause), you create a primary horizontal fragment. You can also combine these two
approaches. If the source table is a join of other tables, and you include only a subset of the rows, you
create a derived horizontal fragment. [Silberschatz, pp 589-593; Ozsu, pp. 99-136]
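Vertical and primary horizontal fragments correspond to a column list and a WHERE clause, respectively. The sketch below uses an in-memory SQLite table in place of a DB2 source table; the employee table, its columns, and the fragment helper are all hypothetical, and a derived horizontal fragment (over a join) is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (emp_id INTEGER, name TEXT, region TEXT, salary REAL);
    INSERT INTO employee VALUES (1, 'Ito', 'WEST', 52000);
    INSERT INTO employee VALUES (2, 'Ruiz', 'EAST', 61000);
    INSERT INTO employee VALUES (3, 'Shaw', 'WEST', 47000);
    """
)


def fragment(conn, columns, where=None):
    """Select a subset of columns (vertical) and, optionally, of rows (horizontal).

    The SQL is built by string formatting for brevity; this is acceptable only
    because the inputs here are fixed literals, not user input.
    """
    sql = f"SELECT {', '.join(columns)} FROM employee"
    if where:
        sql += f" WHERE {where}"
    return conn.execute(sql + " ORDER BY emp_id").fetchall()


all_columns = ["emp_id", "name", "region", "salary"]
vertical = fragment(conn, ["emp_id", "name"])                # fewer columns
horizontal = fragment(conn, all_columns, "region = 'WEST'")  # fewer rows
mixed = fragment(conn, ["emp_id", "name"], "region = 'WEST'")  # both at once
```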
Because data is captured from the source tables before it is replicated, you have a chance to manipulate
the data before the Apply program copies it to the target tables. One reason to do so is to convert
currency from, say, dollars to yen, or to combine several values from a source table into one aggregate
value at the target table.
Use of staging tables at the source site enables customers to apply a variety of SQL functions,
including aggregate functions, on the data before propagating it. Thus, if one site is interested in
seeing only the average opening balance for new checking accounts, it can obtain only this
information by writing an appropriate query [in the WHERE clause for the subscription set member].
It does not need to propagate individual rows about all new accounts and then calculate the average
opening balance at the target site. [Bontempo, p. 205]
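Bontempo's checking-account example can be sketched as a single aggregate query over a staging table. SQLite stands in for DB2 here, and the cd_accounts table and its columns are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cd_accounts (account_id INTEGER, account_type TEXT,
                              opening_balance REAL);
    INSERT INTO cd_accounts VALUES (1, 'CHECKING', 500.0);
    INSERT INTO cd_accounts VALUES (2, 'SAVINGS', 9000.0);
    INSERT INTO cd_accounts VALUES (3, 'CHECKING', 1500.0);
    """
)

# Compute the aggregate at the source so that one value, not every row,
# crosses the network to the target site.
(avg_opening,) = conn.execute(
    "SELECT AVG(opening_balance) FROM cd_accounts "
    "WHERE account_type = 'CHECKING'"
).fetchone()
```

Only the single averaged value would need to be propagated to the target, rather than the two individual checking-account rows.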
There are four popular replication configurations: data distribution, data consolidation, update anywhere,
and occasionally connected.
In a data distribution scenario, changes made to a source table are replicated to read-only target tables.
Typically, each target table is a full copy (not a fragment). This scenario is useful for distributed data
In a data consolidation scenario, each read-only target table is a horizontal fragment. This scenario is
useful for maintaining decision-support data, where each target table is needed by only a part of the
In the update-anywhere scenario, each target table is read-write. In the case of update conflicts, the
source table always wins. You can combine data distribution or data consolidation with update
anywhere.
In the occasionally-connected scenario, the target servers are not always connected to the network. The
Apply program replicates changed data when the target server connects to the source server. This is the
typical scenario for home PCs (telecommuting), laptops, and other portable computers.
DB2 provides replication for data on many platforms in both the homogeneous and heterogeneous data
environments. You set up and maintain DB2 replication using one of two GUIs. Target tables subscribe
to changes made to source tables, and can subset the source data into vertical or horizontal fragments.
During replication, DB2 can use SQL functions to modify the data so that it needn’t match the source data
in either format or content (for example, changing the format of a timestamp or converting a double-byte
string to an integer). The target tables are often read-only, but can also be read-write. Because target tables
are part of subscription sets, and because the Apply program can run as multiple instances, you can
replicate large amounts of data in parallel.
-----. 1999. IBM DB2 Replication Guide and Reference. Version 6. IBM Corporation. (SC26-9642)
-----. n.d. IBM DB2 Replication Web site: http://www.ibm.com/software/data/dpropr
Bontempo, Charles J. and Cynthia Maro Saracco. 1995. Database Management Principles and
Products. Upper Saddle River, NJ: Prentice Hall PTR.
Bray, Olin H. 1982. Distributed Database Management Systems. Lexington, MA: D.C. Heath and Co.
Cook, Jonathan and Robert Harbus. 1999. The DB2 Replication Certification Guide. Upper Saddle
River, NJ: Prentice Hall PTR.
Date, C. J. 1981. An Introduction to Database Systems: Volume 1. 3/e. Reading, MA: Addison-Wesley
Martin, James. 1976. Principles of Data-Base Management. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Ozsu, M. Tamer and Patrick Valduriez. 1991. Principles of Distributed Database Systems. Upper
Saddle River, NJ: Prentice Hall.
Silberschatz, Abraham; Henry F. Korth; S. Sudarshan. 1997. Database System Concepts. 3/e. Boston:
Stonebraker, Michael, ed. 1988. Readings in Database Systems. San Mateo, CA: Morgan Kaufmann.
Wiederhold, Gio. 1977. Database Design. New York: McGraw-Hill Book Co.