A Coordinated Recovery Process Obtaining Outage Free Recovery ...



  1. 1. A Coordinated Recovery Process: Obtaining Outage Free Recovery Points for IMS and DB2 Disaster Recovery Rick Weaver Product Manager Recovery Management for OS/390 BMC Software, Inc.
  2. 2. Contents
     Coordinated Recovery Management
     The Alternative – A Coordinated Recovery Process
     Overview of the Coordinated Recovery Process
     The Gory Details
     The REXX EXEC - CRPREXX
     Glossary
     USE OF THE ENCLOSED MATERIALS CONSTITUTES YOUR COMPANY’S ACCEPTANCE OF THE FOLLOWING TERMS. IF YOU DO NOT AGREE WITH THESE TERMS, PLEASE RETURN THE MATERIALS TO BMC SOFTWARE.
       • You may use these materials for your company’s own internal purposes.
       • BMC Software is providing these materials AS IS, WITHOUT ANY WARRANTIES.
       • BMC Software disclaims any and all warranties related to these materials, including the implied warranties of merchantability and fitness for a particular purpose.
       • BMC Software shall not be liable for any consequential, incidental or special damages arising out of or related to these materials.
  3. 3. Coordinated Recovery Management In many large data processing organizations, applications have evolved and established relationships between IMS and DB2 databases. Sometimes these relationships are known to the application (as in the case of a two-phase commit process), but frequently it’s a more casual relationship. A transaction in one database management system may spawn a subsequent transaction in a different database management system. The transactions are tied, but the log data isn’t coordinated. This situation requires a coordinated recovery between IMS and DB2 applications for both local recovery and disaster recovery. To ensure data integrity, transactions and batch jobs on both database management systems are drained or stopped to establish a system-wide point of consistency (SWPOC). This technique is preferable from a data integrity standpoint, but it causes an availability impact that is in conflict with a 24X7 environment. BMC Software has delivered a COORDINATED RECOVERY MANAGER (CRM) facility since 1996. This facility has been delivered at no additional charge with the RECOVERY MANAGER (RMGR) products for IMS, DB2, and OS/390. CRM is effectively shareware that allows you to associate groups from the RMGR products and conduct operations on the resulting larger CRM group. CRM is a useful way to gather together a large number of disparate database objects and consolidate them into a single manageable entity. The supported operations that can be performed on the CRM group include a Hold Point of Consistency (HPC) function, which establishes a SWPOC by causing and coordinating IMS /DBR and DB2 STOP commands along with VSAM file deallocations. The resulting recovery point is registered in the CRM repository and can be used by future recovery events to create a massive Point-In-Time recovery with assured data integrity. Unfortunately, obtaining a SWPOC requires a lengthy outage that most shops are unwilling to tolerate. 
These shops are looking for an alternative solution. Some shops have reviewed advanced-technology solutions such as disk mirroring or shadowing, log capture with forward recovery, and remote-site tape vaulting. All of these solutions carry a cost-and-risk burden that renders them ineffective in many shops. BMC Software now offers an alternative process for obtaining a recovery point. This process ensures data integrity after system-wide recovery, yet it requires no outage to establish a SWPOC. This unique process is built on the intelligence and innovation in BMC Software’s recovery products for IMS and DB2. This process, as described in this document, replaces the use of CRM for coordinated recovery. BMC Software intends to drop support for COORDINATED RECOVERY MANAGER by June 2001.
  4. 4. The Alternative – A Coordinated Recovery Process
     Using exclusive BMC Software technology, it’s possible to capture information from the database repositories and use it to derive a consistent and coordinated recovery point. The probable use for this process is in support of disaster recovery (DR), but some applications (for example, large ERP applications with correlated legacy applications) may make use of the procedure for local recoveries.
     The following sections explain how to use the coordinated recovery process (CRP). The overview provides a quick look at the process. The overview is followed by a detailed explanation and some REXX code to automate some of the operations.
     Overview of the Coordinated Recovery Process
     The coordinated recovery process consists of the following major steps:
       1. Use the BMC Software RECOVERY MANAGER (RMGR) for IMS Disaster Recovery RECON Cleanup (DRRCN) utility to report on the RECONs, looking for the latest, safest timestamp that can be used in a Point-In-Time (PIT) recovery. In an IMS data-sharing environment, this timestamp is the STOPTIME of the oldest SLDS (the one with the least recent timestamp). RMGR for IMS ensures that all open logs are properly handled before choosing the PIT timestamp.
       2. Pass the captured timestamp to the RMGR for DB2 Timestamp Insertion (ARMBTSI) batch program, which inserts the timestamp into the RMGR CRRDRPT repository table.
       3. Execute the RMGR for DB2 Coordinated Disaster Recovery (ARMBCRC) batch program, which formats the timestamp into a relative byte address (RBA) or, in a DB2 data-sharing environment, a log record sequence number (LRSN). The RBA or LRSN is used to generate a DB2 conditional restart control record (CRCR) for that RBA or LRSN.
       4. Execute the RMGR for DB2 System Resource Recovery (ARMBSRR) batch program, which generates a DB2 subsystem recovery and restart based on the CRCR RBA or LRSN that was built by the ARMBCRC program in the prior step. The DB2 restart process backs out any in-flight transactions at the RBA or LRSN.
       5. Pass the latest IMS timestamp to RMGR for IMS to generate one or more RECOVERY PLUS for IMS jobs that recover the databases to this timestamp. Transactions that were in-flight at the timestamp are not applied during the recovery.
     NOTE: This process assumes that the IMS and DB2 systems share a sysplex timer; otherwise, the timestamps will not coincide.
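The "latest, safest timestamp" rule in step 1 can be pictured with a small sketch. This is one reading of the sentence about the oldest SLDS, not the DRRCN utility's actual algorithm: if each data-sharing IMS system has closed its most recent SLDS at a different time, log data is complete on every system only up to the earliest of those stop times. The system names, the dictionary layout, and the function name `latest_safe_pit` are hypothetical.

```python
from datetime import datetime


def latest_safe_pit(slds_stop_times):
    """Return the earliest of the per-system SLDS stop times.

    slds_stop_times maps a (hypothetical) IMS system name to the STOPTIME
    of that system's most recently closed SLDS. Every system has complete
    log data only up to the earliest of these values, so that value is the
    latest point that is safe for a coordinated PIT recovery.
    """
    if not slds_stop_times:
        raise ValueError("no SLDS stop times supplied")
    return min(slds_stop_times.values())


# Hypothetical stop times for three data-sharing IMS systems.
stop_times = {
    "IMSA": datetime(2000, 5, 2, 3, 15, 0),
    "IMSB": datetime(2000, 5, 2, 3, 9, 30),
    "IMSC": datetime(2000, 5, 2, 3, 12, 45),
}

pit = latest_safe_pit(stop_times)
print("Candidate PIT timestamp:", pit.isoformat())
```

The same comparison only makes sense when all systems stamp their logs from a common clock, which is why the NOTE above requires a shared sysplex timer.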
  5. 5. The net effect to your application is just as if the power was dropped at the local site at the time represented by the PIT timestamp and you restarted all the IMS and DB2 subsystems. You’d lose the work that was in-flight at the time of the power failure. You’d be at the same position at the DR site. You can use the RMGR for IMS Log Analysis function to obtain a report of work that was in progress at the timestamp you used for recovery.
     The process flow includes application copies before the coordinated recovery process and dumps of the appropriate system files (catalogs, directories, libraries, and so on) after the coordinated recovery process.
     The Gory Details
     The following sections explain the coordinated recovery process in detail.
     Preparation Steps
     Perform the following preparation steps:
     Step | Description
     Prepare the DR site. | Ensure that valid copies of the IMS and DB2 system libraries and BMC Software RMGR product libraries are available at the DR site.
     Create and validate RMGR groups. | Create and populate RMGR groups to be used for possible DR scenarios. Use the Group Validation function regularly to ensure that groups are kept updated with changes in database allocations.
     Create an RMGR profile for DR. | Create an RMGR for IMS DR profile that is tailored for usage at the DR site (for example, turn on secondary image copies and secondary logs for use in recovery, check output devices that are used for change accumulation, review SORT parameters, and so on).
     Check recovery assets. | Periodically (perhaps monthly) verify the DR process by performing the Check Assets function (to ensure availability of recovery assets) and by using the Create Recovery JCL function (to create practice JCL).
  6. 6. At the Local Site
     On a periodic basis (perhaps nightly), perform the following preparation steps:
     Step | Description
     Copy objects. | Copy the IMS and DB2 application objects. (You can create clean or fuzzy copies; it doesn’t matter.)
     Capture allocation information. | Run the RMGR for IMS Automatic Delete/Define (DRAMS) utility Capture function to store allocation information about database data sets in the RMGR for IMS repository.
     Back up system objects. | Back up the IMS RECONs, DB2 catalog and directories, all BMC Software RMGR product repositories, and all IMS and DB2 system libraries and data sets. Prepare these backups for shipment to the remote site.
     Create system log data sets. | Switch the IMS online log data sets (OLDSs) to create system log data sets (SLDSs) on tape, and prepare them for shipment.
     Generate the DR timestamp. | Execute the RMGR for IMS RECON Cleanup (DRRCN) utility Check function to generate a report that contains the LATEST PIT RECOVERY TIME IS: value field. The value is the coordinated recovery disaster recovery timestamp (CR DR TS).
     Conduct date/time conversion. | Invoke the CRPREXX REXX EXEC program (provided in this document) to browse the DRRCN report output and conduct date/time conversion and rounding on the CR DR TS. Capture the timestamp value and ship it to the remote site with the other recovery assets.
     Pass the converted timestamp value to ARMBTSI. | Pass the converted CR DR TS value to the RMGR for DB2 Timestamp Insertion (ARMBTSI) batch program, which inserts a row into the CRRDRPT repository table.
     Translate the timestamp to an RBA/LRSN value. | Invoke the RMGR for DB2 Coordinated Disaster Recovery (ARMBCRC) batch program, which takes the timestamp value from the CRRDRPT table and translates it into the proper RBA/LRSN value. This value becomes the input to the Conditional Restart Control Record (CRCR) process that is used later by the RMGR for DB2 System Resource Recovery (ARMBSRR) batch program.
     Archive the active DB2 logs. | Archive the active DB2 logs through the RMGR for DB2 Archive Log Creation (ARMBLOG) batch program.
  7. 7.
     Step | Description
     Copy the archive logs. | Invoke the RMGR for DB2 Archive Log Data Sets (ARMBARC) batch program to copy the logs that are needed for recovery, and ship the copies to the remote site. ARMBARC also serves the very useful purpose of capturing the image copy information for the three ‘special’ DB2 Catalog/Directory objects (DSNDB01.DBD01, DSNDB01.SYSUTILX, and DSNDB06.SYSCOPY). These objects are fundamental to the success of DB2 Disaster Recovery, and their image copy information is not stored in the normal fashion in DSNDB06.SYSCOPY.
     Generate JCL to rebuild DB2 at the remote site. | Pass the CRCR RBA/LRSN value to the ARMBSRR batch program, which generates the 200+ job steps to rebuild DB2 at a remote site to the designated RBA/LRSN (the CRCR established by the ARMBCRC program). The ARMBSRR program also generates recoveries for the RMGR for DB2 repository tables. Ship the output of the ARMBSRR program to the remote site.
     Generate JCL to recover application data. | Invoke the RMGR for DB2 Backup and Recovery JCL (ARMBGEN) batch program to generate recovery JCL for the application data. Ship the output to the remote site.
     Back up the ICF catalog and TMS directory. | Back up the operating system catalog (ICF) and the Tape Management System (TMS) directory, and prepare the backups for shipment to the remote site.
     Figure 1 shows the process flow at the local site. Note that the process works only if the truck is purple ;-)
     Figure 1 – The Outage Free Coordinated Recovery Point Capture Process (Local Site)
  8. 8. At the Remote Site
     When you are performing disaster recovery at the remote site, the process is relatively straightforward. After the operating system has been restored, you simply restore the IMS RECONs and prepare them for a cold-start of IMS, release the jobs that are generated by the ARMBSRR and ARMBGEN programs, and generate IMS database recoveries to recover the data to the CR DR TS that was captured by the DRRCN utility and that was used by the CRPREXX program. Any in-flight IMS or DB2 transactions that were active at the time of the CR DR TS are not recovered.
     Perform the following steps:
     Step | Description
     Restore IMS system and product libraries. | Restore IMS RECON (RECON1) and system libraries, as well as the RMGR for IMS data sets.
     Clean up the RECON. | Execute the RMGR for IMS RECON Cleanup (DRRCN) utility Update function to prepare the IMS RECON data sets for startup. Make a note of the value that the utility reports in the LATEST PIT RECOVERY TIME IS: value field.
     Allocate and REPRO RECONs. | Allocate RECON2 and RECON3 data sets, and REPRO the cleaned RECON1 to RECON2.
     Perform IMS startup. | Cold-start IMS, and start RMGR for IMS.
     Perform IMS change accumulation (optional). | Perform IMS change accumulation if you want to do so. Refer to the RECOVERY PLUS for IMS documentation for caveats and recommendations for timestamp selection.
     Generate PIT recovery JCL for IMS objects. | Use the value from the LATEST PIT RECOVERY TIME IS: value field (from the execution of the RECON Cleanup utility at the local site); this value is also the CR DR TS that is used in the CRPREXX program. Enter this value in the RMGR for IMS ISPF dialog to generate the appropriate RECOVERY PLUS for IMS JCL to conduct a PIT recovery to that timestamp.
     Release PIT recovery JCL. | Release the generated IMS PIT recovery jobs (the PIT timestamp is the time that was captured in the RECON Cleanup utility previously in this procedure).
  9. 9.
     Step | Description
     Recover the DB2 system. | Execute the jobs that the ARMBSRR program generates to recover the DB2 system and RMGR for DB2 repository. (The two jobs are stored in one data set.) Remember to specify the DEFER ALL parameter in the ZPARMS for the remote site startup; if you do not, disk allocation failures occur as DB2 tries to mount volumes to back out in-flight transactions on data that doesn’t exist … yet.
     Recover DB2 applications. | Release the DB2 application recovery jobs that are generated by the ARMBGEN program, the remote site, or both.
     Figure 2 shows the processes at the remote site. After the operating system and libraries are restored, the IMS and DB2 activities can run in parallel as long as enough resources are available. CR DR PiT TS is the coordinated recovery point.
     Figure 2 – Coordinated Recovery Process at the Remote Site.
     The REXX EXEC - CRPREXX
     The following REXX program and process description are delivered as is, and no warranties, expressed or implied, are provided with this shareware solution. Use it at your own risk, and feel free to modify as you see fit. Refer to the BMC Software product documentation sets for the most complete details about product usage and other tips. For example, the RECOVERY MANAGER for IMS User Guide provides assistance in planning and implementing the IMS portion of a DR process.
     The REXX program captures the IMS recovery point timestamp and passes it to the ARMBTSI program. The high-level process flow is as follows:
       1. Browse the output of the RMGR for IMS DRRCN utility.
  10. 10.
       2. Obtain the timestamp value from the LATEST PIT RECOVERY TIME IS: value field.
       3. Convert and round the timestamp.
       4. Present the converted and rounded timestamp to the RMGR for DB2 ARMBTSI program, which inserts the value into a repository table (CRRDRPT).
     As written, the DRRCN report is expected to be in the BMCIRM.BB.DRRCN.REPORT data set, and the local offset from GMT is expected to be in the BMCIRM.BB.DRRCN.OFFSET data set. The offset should be in the first three columns: +05, -11, and so on. The JCL for ARMBTSI is in the BMCIRM.BB.DRRCN.ARMBTSI.CNTL data set, and the REXX routine updates the TIMESTAMP in the PARM string in place. You can modify the REXX EXEC as appropriate for your shop standards.
     Here’s the REXX EXEC in all its glory:

         /* REXX */
         /******************************************************************/
         /* GET RECOMMENDED PIT TIMESTAMP FROM DRRCN REPORT                */
         /******************************************************************/
         ADDRESS TSO
         "ALLOC DD(INDD) DSNAME('BMCIRM.BB.DRRCN.REPORT') SHR REUSE"
         IF RC ¬= 0 THEN DO
           SAY 'UNABLE TO ALLOCATE DRRCN REPORT - RC = 'RC
           EXIT
         END
         eof = 'NO'
         DO WHILE eof = 'NO'
           "EXECIO 1 DISKR INDD"
           IF RC = 0 THEN DO
             PULL record
             IF SUBSTR(record,3,30) = 'LATEST PIT RECOVERY TIME IS :' THEN DO
               yy    = SUBSTR(record,34,2)
               ddd   = SUBSTR(record,36,3)
               hour  = SUBSTR(record,40,2)
               min   = SUBSTR(record,43,2)
               sec   = SUBSTR(record,46,2)
               tenth = SUBSTR(record,49,1)
               imsdate = yy||ddd||' '||hour||':'||min||':'||sec||'.'tenth
             END
           END
           ELSE eof = 'YES'
         END
         "EXECIO 0 DISKR INDD (FINIS"
         "FREE DD(INDD)"
         SAY 'IMS GMT TIMESTAMP IS: ' imsdate
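To make the column arithmetic in this first part of the EXEC easier to follow, here is a minimal Python sketch of the same extraction. It is an illustration only and is not part of the delivered EXEC. The column positions are taken from the SUBSTR calls above; the sample report line is hypothetical, built to match those columns rather than copied from real DRRCN output, and the names `ImsTimestamp` and `parse_pit_line` are invented for this sketch.

```python
from typing import NamedTuple, Optional

LABEL = "LATEST PIT RECOVERY TIME IS :"


class ImsTimestamp(NamedTuple):
    yy: int
    ddd: int
    hour: int
    minute: int
    second: int
    tenth: int


def parse_pit_line(record: str) -> Optional[ImsTimestamp]:
    """Pull the timestamp fields from a report line, or return None."""

    def col(start: int, length: int) -> str:
        # REXX SUBSTR positions are 1-based; Python slices are 0-based.
        return record[start - 1:start - 1 + length]

    # The EXEC compares a 30-column field with a 29-character literal;
    # REXX's normal comparison ignores surrounding blanks, hence strip().
    if col(3, 30).strip() != LABEL:
        return None
    return ImsTimestamp(
        yy=int(col(34, 2)),
        ddd=int(col(36, 3)),
        hour=int(col(40, 2)),
        minute=int(col(43, 2)),
        second=int(col(46, 2)),
        tenth=int(col(49, 1)),
    )


# Hypothetical line: label in columns 3-32, timestamp starting in column 34.
sample = "  " + LABEL.ljust(30) + " " + "00123 14:05:59.7"
print(parse_pit_line(sample))
```

Lines that do not carry the label are simply skipped, which mirrors how the EXEC reads every record and keeps the values from the matching one.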
  11. 11.
         /******************************************************************/
         /* CHECK AND ADJUST FOR LEAP YEAR                                 */
         /******************************************************************/
         yeardays = 365
         pyeardays = 365
         chkdate = 2000 + yy
         IF TRUNC(chkdate/4) = chkdate/4 THEN yeardays = yeardays + 1
         IF TRUNC((chkdate-1)/4) = (chkdate-1)/4 THEN pyeardays = pyeardays + 1
         /******************************************************************/
         /* GET OFFSET & ADJUST GMT TIMESTAMP TO LOCAL TIME                */
         /******************************************************************/
         "ALLOC DD(IND2) DSNAME('BMCIRM.BB.DRRCN.OFFSET') SHR REUSE"
         IF RC ¬= 0 THEN DO
           SAY 'UNABLE TO ALLOCATE OFFSET DATA SET - RC = 'RC
           EXIT
         END
         eof = 'NO'
         DO WHILE eof = 'NO'
           "EXECIO 1 DISKR IND2"
           IF RC = 0 THEN DO
             PULL record
             offset = SUBSTR(record,2,2)
             IF SUBSTR(record,1,1) = '-' THEN DO
               hour = hour - offset
               IF hour < 0 THEN DO
                 hour = hour + 24
                 ddd = ddd - 1
               END
               IF ddd = 0 THEN DO
                 ddd = pyeardays
                 yy = yy - 1
               END
             END
             ELSE DO
               hour = hour + offset
               IF hour >= 24 THEN DO
                 hour = hour - 24
                 ddd = ddd + 1
               END
               IF ddd > yeardays THEN DO
                 ddd = 1
                 yy = yy + 1
               END
             END
           END
           ELSE eof = 'YES'
         END
         "EXECIO 0 DISKR IND2 (FINIS"
         "FREE DD(IND2)"
         SAY 'OFFSET FROM GM TIME IS: 'SUBSTR(record,1,1)||offset
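The hour, day, and year rollover handled by hand in this part of the EXEC can be cross-checked with a short Python sketch that lets the standard library do the calendar arithmetic. It is an illustration under stated assumptions, not a replacement for the EXEC: like the EXEC, it assumes a two-digit year in the range 2000-2099 and a signed whole-hour offset such as `-05` or `+05`. The function name `gmt_to_local` is invented for this sketch.

```python
from datetime import datetime, timedelta


def gmt_to_local(yy, ddd, hour, minute, second, offset):
    """Shift a GMT Julian-date timestamp by a signed whole-hour offset.

    yy is a two-digit year (assumed to mean 2000 + yy, as in the EXEC),
    ddd is the day of the year (1-366), and offset is a string such as
    '-05' or '+05' taken from the first three columns of the offset file.
    """
    gmt = datetime(2000 + yy, 1, 1, hour, minute, second) + timedelta(days=ddd - 1)
    return gmt + timedelta(hours=int(offset))


# 03:00 GMT on day 001 of 2001, five hours behind GMT: the result falls
# on the last day of the previous (leap) year.
local = gmt_to_local(1, 1, 3, 0, 0, "-05")
print(local.strftime("%y%j %H:%M:%S"))

# 22:30 GMT on day 366 of 2000, five hours ahead of GMT: the result
# rolls forward into day 001 of the next year.
local_east = gmt_to_local(0, 366, 22, 30, 0, "+05")
print(local_east.strftime("%y%j %H:%M:%S"))
```

Both cases exercise the branches that the EXEC covers with its `yeardays` and `pyeardays` variables: stepping back across a year boundary needs the length of the previous year, and stepping forward needs the length of the current one.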
  12. 12.
         /******************************************************************/
         /* BUILD DB2 TIMESTAMP                                            */
         /******************************************************************/
         yyddd = RIGHT(JUSTIFY(yy,LENGTH(yy)),2,'0')||,
                 RIGHT(JUSTIFY(ddd,LENGTH(ddd)),3,'0')
         hour = RIGHT(JUSTIFY(hour,LENGTH(hour)),2,'0')
         tmpdate = DATE('S',yyddd,'J')
         year = SUBSTR(tmpdate,1,4)
         month = SUBSTR(tmpdate,5,2)
         day = SUBSTR(tmpdate,7,2)
         db2date = year||'-'||month||'-'||day||'-'||hour||'.'||min||'.'||sec
         SAY 'DB2 LOCAL TIMESTAMP IS: ' db2date
         /******************************************************************/
         /* INSERT THE DB2 TIMESTAMP                                       */
         /******************************************************************/
         "ALLOC DD(IND4) DSNAME('BMCIRM.BB.DRRCN.ARMBTSI.CNTL') OLD REUSE"
         IF RC ¬= 0 THEN DO
           SAY 'UNABLE TO ALLOCATE ARMBTSI DATA SET - RC = 'RC
           EXIT
         END
         eof = 'NO'
         foundit = '0'
         searchstring = 'PGM=ARMBTSI'
         DO WHILE (eof = 'NO' & foundit = '0')
           "EXECIO 1 DISKRU IND4"
           IF RC = 0 THEN PULL record
           ELSE eof = 'YES'
           IF INDEX(record,searchstring) > 0 THEN DO
             IF searchstring = 'PARM=' THEN foundit = '1'
             ELSE searchstring = 'PARM='
             IF INDEX(record,searchstring) > 0 THEN foundit = '1'
           END
         END
         IF foundit = '1' THEN DO
           record = OVERLAY(db2date,record,INDEX(record,'PARM=')+11)
           QUEUE record
           "EXECIO 1 DISKW IND4"
         END
         ELSE SAY 'UNABLE TO INSERT ARMBTSI TIMESTAMP'
         "EXECIO 0 DISKR IND4 (FINIS"
         "FREE DD(IND4)"
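The final part of the EXEC formats the local time as a DB2-style timestamp and overlays it on the PARM string of the ARMBTSI JCL. The Python sketch below mirrors those two operations so the offsets are easier to see. It is an illustration only: the JCL line, the step name, and the `DB2P` value in the PARM are hypothetical placeholders, because this document does not show the real ARMBTSI PARM layout. The only detail taken from the EXEC is that the timestamp starts 11 characters after the `P` of `PARM=`. The function names are invented for this sketch.

```python
from datetime import datetime


def db2_timestamp(local: datetime) -> str:
    """Format a datetime the way the EXEC builds db2date (yyyy-mm-dd-hh.mm.ss)."""
    return local.strftime("%Y-%m-%d-%H.%M.%S")


def overlay_parm(record: str, stamp: str, gap: int = 11) -> str:
    """Overwrite part of a JCL record, like REXX OVERLAY in the EXEC.

    The stamp is written starting gap characters after the 'P' of 'PARM='.
    The offset is the same in 0-based and 1-based terms, because both the
    search result and the overlay position shift by one.
    """
    idx = record.find("PARM=")
    if idx < 0:
        raise ValueError("record has no PARM= keyword")
    start = idx + gap
    padded = record.ljust(start)  # OVERLAY pads a short target with blanks
    return padded[:start] + stamp + padded[start + len(stamp):]


# Hypothetical JCL line; the real ARMBTSI PARM layout may differ.
jcl = "//TSI      EXEC PGM=ARMBTSI,PARM='DB2P,YYYY-MM-DD-HH.MM.SS'"
stamp = db2_timestamp(datetime(2000, 12, 31, 22, 0, 0))
updated = overlay_parm(jcl, stamp)
print(updated)
```

Because the overlay is positional, the placeholder in the JCL must be exactly as long as the timestamp (19 characters) and must sit at the expected offset; if your PARM string is laid out differently, the `+11` in the EXEC is the value to adjust.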
  13. 13. Glossary
     Active Logs – The DB2 first-level logs, stored on disk. If DB2 can’t write to the active logs, it stops! Most shops have at least three active logs (typically between 300 and 540 cylinders), and many shops have dozens.
     Archive Logs – The DB2 second-level logs. When an active log fills, DB2 switches to another active log data set and copies the full active log to an archive log. The archive log is typically on tape but can be on disk. (In fact, it’s prudent to copy several archive logs to disk at the remote site during DR testing. It will dramatically improve recovery times for parallel recoveries on shared archive logs. This process can be automated as part of the ARMBSRR job definition. See the RECOVERY MANAGER for DB2 Users Guide for details about how to set up ARMBSRR for this process.)
     ARM – The BMC Software three-character product code for the RECOVERY MANAGER for DB2 product.
     BSDS – The DB2 BootStrap Data Set, which houses lots of critical information for DB2 restart and recovery, including the log ranges for the current active and archive logs (as well as the data set name and volume number if the data sets aren’t cataloged). The BSDS also stores the most recent conditional restart control record RBA/LRSN, which is only used by DB2 during disaster restart (specified by a parameter on the DSNZPARM).
     CRCR – The DB2 conditional restart control record that is created to indicate the RBA for DB2 recovery while in disaster recovery mode.
     CRM – The three-character product code for the BMC Software COORDINATED RECOVERY MANAGER facility. It has never been marketed or sold as a stand-alone product; historically, it has been delivered as a free addition to the RECOVERY MANAGER products for DB2, IMS, and OS/390.
     CRP – The Coordinated Recovery Process that allows you to capture or indicate a particular timestamp as the driver for a coordinated IMS/DB2 recovery process.
     DR – Disaster recovery. A program or test that is designed to test the ability of a company to recover from a total site outage. Although a real disaster recovery is very rare, it is the single most important driver for most backup and recovery operations decisions.
     DSNZPARM – The DB2 controlling parameter file that includes, among other things, the specifications under which DB2 operates in a DR situation (DEFER recovery for application objects, refer to remote site copies, and so on).
     IRM – The three-character product code for the BMC Software RECOVERY MANAGER for IMS product.
  14. 14.
     LRSN – The log record sequence number, which is the sequencing technique that is used by DB2 in a data-sharing environment.
     MRM – The three-character product code for the BMC Software RECOVERY MANAGER for OS/390 product. This product works with OS/390 data sets, including libraries and VSAM data sets. It can be used to generate backups and restores for the sequential files and libraries that are required by the applications, as well as the system data sets and libraries that are needed to rebuild IMS and DB2 at the disaster recovery site.
     OLDS – The online log data set, which is the IMS first-level log and is typically written to disk.
     PIT – Point in Time, which is a designation that is reserved for recovery operations back to a prior point in time (that is not the end of the current log). By definition, a PIT recovery results in data loss, as much of the log is bypassed during the operation. A native PIT recovery can only be to a “clean” copy or to a registered recovery point (specifically, an IMS database /DBR or DB2 STOP operation). By using BMC Software recovery technology, any point in time can be made into a valid PIT recovery point. In-flight transactions are bypassed during recovery, so database integrity is assured.
     PRILOGs – Primary log data sets, which are the first data sets that are used by the IMS OLDS. The OLDS and SLDS are typically mirrored in case a media or device failure occurs because the contents of the log data sets are absolutely required for most recovery processes.
     RBA – The Relative Byte Address is the sequencing technique used by the original DB2 logging environment. It’s still used today for non-data-sharing environments. It’s basically the offset from ‘0’, and can go to 2^48 (the RBA is a 6-byte value), which is really, really big.
     RECONs – The IMS recovery control data sets that house all of the information required by IMS for restart (normal and emergency, local and remote). The RECONs are a component of the IMS Data Base Recovery Control (DBRC) feature and are required components for the BMC Software RECOVERY MANAGER for IMS product.
     RPO – Recovery Point Objective, the point to which you want to recover. In some shops, it’s measured in days, and in others it’s measured in minutes or less.
     RTO – Recovery Time Objective, the amount of time you are willing to spend waiting for system restart after failure. Also measured in days in some shops, minutes in others.
     SECLOGs – The backup data sets to the IMS OLDS PRILOGs. In other words, if for some reason the primary copy of the disk-based IMS online logging facility doesn’t work, the SECLOGs would be used instead.
  15. 15.
     SLDS – System log data set, which is the archive file for a filled IMS OLDS. When the disk data set that houses the OLDS becomes full, it is archived to an SLDS, typically on tape. You can define a secondary SLDS, which might be handy to ship offsite for DR purposes.
     SWPOC – System Wide Point of Consistency, the (sometimes unachievable) process whereby update processing to a system-wide set of databases is suspended (drained or stopped) so that a valid, consistent recovery point can be obtained. This recovery point ensures data integrity for the databases in the SWPOC group, but it comes at a cost of availability. It can take many minutes to several hours to obtain a SWPOC in some shops (because of round-the-clock update activities).
  16. 16. For more information visit BMC Software on the Web at www.bmc.com
     BMC Software, the BMC Software logos and all other BMC Software product or service names are registered trademarks or trademarks of BMC Software, Inc. All other registered trademarks or trademarks belong to their respective companies. © 2000, BMC Software, Inc. All rights reserved. 100038338 5/01