IBM® DB2® Universal Database™ Enterprise - Extended Edition for AIX® and HACMP/ES (TR-74.174)

June 22, 2001

Gene Thomas, DB2 UDB System Verification Test
Andy Beaton, DB2 UDB System Verification Test
Enzo Cialini, DB2 UDB System Verification Test
Darrin Woodard, DB2 UDB System Verification Test
This document contains proprietary information of IBM. It is provided under a license agreement and is protected by copyright law. The information contained in this publication does not include any product warranties, and any statements provided in this document should not be interpreted as such.

© Copyright International Business Machines Corporation 2001. All rights reserved.

Note to U.S. Government Users – Documentation related to restricted rights – Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
Contents

Figures
Tables
Abstract
  ITIRC keywords
  About the authors
Chapter 1. Target configuration
Chapter 2. Disk and logical volume manager (LVM) setup
  2.1 Setting up the disk drives
Chapter 3. NFS configuration
  3.1 Set up TCP/IP
  3.2 NFS export the /homehalocal file system
  3.3 Mount the /homehalocal file system
Chapter 4. User setup and DB2 installation
Chapter 5. HACMP setup
  5.1 Define the cluster ID and name
  5.2 Define the cluster nodes
  5.3 Add the adapters
  5.4 Show cluster topology
  5.5 Synchronize cluster topology
  5.6 Add a resource group
  5.7 Add an application server
  5.8 Configure resources for the resource group
  5.9 Synchronize cluster resources
  5.10 Show resource information by resource group
  5.11 Verify cluster
Chapter 6. Troubleshooting
  6.1 SQL6048 on db2start command
  6.2 SQL6031 returned when issuing db2 "? SQL6031" command
  6.3 Ethernet IP label instead of the switch IP label in db2nodes.cfg file
  6.4 SQL1032 when using Autoloader after a failback
  6.5 SQL6072 when using the switch HACMP service IP label
  6.6 SQL6031 RC=12, not enough ports in /etc/services
  6.7 SQL6030 RC=15, no port 0 defined in db2nodes.cfg file
  6.8 HACMP returns "config too long", stopping the catalog node
  6.9 db2_all with the ";" option loops
Chapter 7. Testing
  7.1 Test environment and tools
  7.2 Points of failure for test consideration
Chapter 8. Additional information
Appendix A. Trademarks and service marks
Figures

1. HACMP running on the initial target configuration
2. HACMP after failure of node13
3. rc.db2pe modification
4. Sample script for testing
5. Sample clstat screen
Tables

1. Volume groups and filesystems
2. Volume group and filesystem relationship for nodes
3. Resource group to node relationship
4. Application server scripts
5. Resource group configuration information
Abstract

IBM® DB2® Universal Database™ (UDB) is the industry's first multimedia, Web-ready relational database management system, powerful enough to meet the demands of large corporations and flexible enough to serve medium-sized and small e-businesses. DB2 Universal Database combines integrated power for business intelligence, content management, and e-business with industry-leading performance and reliability. Coupling it with High Availability Cluster Multi-Processing (HACMP) strengthens the solution further: HACMP for AIX® provides a highly available computing environment that facilitates the automatic switching of users, applications, and data from one system to another in the cluster after a hardware or software failure.

A complete High Availability (HA) setup includes many parts, one of which is the HACMP software. Other parts of an HA solution come from AIX and the logical volume manager (LVM). As well as tangible items such as hardware and software, a good HA solution includes planning, design, customizing, and change control. An HA solution reduces the amount of time that an application is unavailable by removing single points of failure.

This document takes you through a target configuration setup using DB2 UDB Enterprise - Extended Edition (EEE) V7.2 and HACMP/ES 4.3.

ITIRC keywords
• HA
• DB2
• UDB
• AIX
• HACMP
• HACMP/ES
• Availability
About the authors

Andy Beaton has 14 years of database experience, and is a certified DB2 UDB Database Administrator and Advanced Technical Expert in both DB2 for Clusters and DB2 for DRDA. He works at the IBM SWS Toronto Laboratory in the DB2 UDB System Verification Test department. Andy is responsible for testing DB2 UDB in a variety of configurations, including AIX EE, AIX EEE and HACMP. Andy is living proof that a degree in Astronomy is no impediment to having an interesting job.

Enzo Cialini has been working with Database Technology at the IBM SWS Toronto Laboratory for over nine years, is certified in DB2 UDB Database Administration and DB2 UDB Database Application Development, is an Advanced Technical Expert in DB2 for DRDA, and an Advanced Technical Expert in DB2 for Clusters. He is currently responsible for managing the DB2 UDB System Verification Test Department, with a focus on High Availability, and has been involved with DB2 and HACMP for many years. His experience ranges from implementing and supporting numerous installations to consulting.

Gene Thomas has been with IBM for over 28 years. For the last six years, he has worked as an AIX Systems Programmer/Administrator supporting the DB2 UDB Function in the IBM SWS Toronto Laboratory. He was instrumental in bringing HACMP into the Laboratory. Gene has had formal training from the company that created and supports HACMP, Clam and Associates (now known as Availant), located in Cambridge, Massachusetts. His primary tasks in support of HACMP in the Laboratory have been: setting up the hardware configuration for the HACMP cluster; installation and setup of the AIX operating system; and installation and setup of the HACMP cluster for DB2 UDB testing. He has recently joined the UDB System Verification Test Department and is doing work with HACMP and DB2 UDB.

Darrin Woodard has been with IBM for over 10 years, is certified in DB2 UDB Database Administration and DB2 UDB Database Application Development, is an Advanced Technical Expert in DB2 for Clusters, an IBM Certified Specialist in AIX System Support, and an IBM Certified Specialist in AIX HACMP. For the last five years, he has worked in the DB2 UDB System Verification Test Department, where he has been responsible for testing DB2 UDB EE and DB2 UDB EEE in an HACMP environment. The tests have ranged from a 2-node cluster up to a 340-node DB2 UDB EEE database running on 340 physical SP nodes with over 160 clusters. During the previous five years, Darrin was an AIX Systems Administrator supporting the DB2 UDB Function in the IBM SWS Toronto Laboratory, where he was responsible for the installation and setup of the AIX operating system. This included the RS/6000 and the Scalable POWERparallel (SP) system.
Chapter 1. Target configuration

The HACMP configuration described in this document involves a six-partition DB2 database and two clusters, each with mutual takeover and cascading resource groups. It uses HACMP/ES 4.3 and DB2 UDB EEE V7.2 running on AIX 4.3.3.

The clusters being defined are named cl1314 and cl1516, with cluster IDs of 1314 and 1516 respectively. We arbitrarily selected these numbers because we are using SP nodes 13, 14, 15 and 16. The cl1314 cluster has two nodes (bf01n013 and bf01n014); two cluster node names (clnode13 and clnode14); two resource groups (rg1314 and rg1413); and two application servers (as1314 and as1413). The cl1516 cluster has two nodes (bf01n015 and bf01n016); two cluster node names (clnode15 and clnode16); two resource groups (rg1516 and rg1615); and two application servers (as1516 and as1615).

Each of these nodes will have one SP switch adapter and one ethernet adapter. The nodes within a cluster will have a shared external disk. The cl1314 cluster will have access to two volume groups, havg1314 and havg1413. The cl1516 cluster will have access to two volume groups, havg1516 and havg1615.

Table 1. Volume groups and filesystems

Cluster  Resource group  Volume group  File system
cl1314   rg1314          havg1314      /homehalocal
                                       /db1ha/svtha1/NODE0130
                                       /db1ha/svtha1/NODE0131
         rg1413          havg1413      /db1ha/svtha1/NODE0140
cl1516   rg1516          havg1516      /db1ha/svtha1/NODE0150
         rg1615          havg1615      /db1ha/svtha1/NODE0160
                                       /db1ha/svtha1/NODE0161

In the initial target configuration, db2nodes.cfg will have the following entries:

130 b_sw_013 0 b_sw_013
131 b_sw_013 1 b_sw_013
140 b_sw_014 0 b_sw_014
150 b_sw_015 0 b_sw_015
160 b_sw_016 0 b_sw_016
161 b_sw_016 1 b_sw_016

The initial target configuration is illustrated in Figure 1.

[Figure 1. HACMP running on the initial target configuration: clusters cl1314 (clnode13/clnode14) and cl1516 (clnode15/clnode16), each node with a switch (b_sw_0xx) and ethernet adapter, volume groups havg1314/havg1413/havg1516/havg1615 active on their primary nodes, and the NFS home server active on clnode13.]
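Because every entry in db2nodes.cfg must name a hostname that resolves and accepts a remote shell from the instance owner, a quick sanity check of the file can save troubleshooting later. The following is a minimal sketch, not part of the original report; it assumes the instance home is already mounted and uses only the standard AIX host and rsh commands:

#!/bin/ksh
# Sanity-check db2nodes.cfg: each partition's hostname should resolve
# and accept an rsh from the instance owner (run as svtha1).
CFG=$HOME/sqllib/db2nodes.cfg

while read partition hostname port netname; do
    [ -z "$partition" ] && continue        # skip blank lines
    host $hostname > /dev/null 2>&1 || echo "$hostname does not resolve"
    rsh $hostname date > /dev/null 2>&1 || echo "rsh to $hostname failed for partition $partition"
done < $CFG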
If one of the two nodes within a cluster (for example, cl1314) fails, the other node in the cluster acquires the resources that are defined in the resource group. The application server is then started on the node that has taken over the resource group. In our case, the application server that is started is DB2 UDB EEE V7.2 for the instance svtha1.

In our failover example, DB2 UDB EEE V7.2 is running on node clnode13; it has an NFS mounted home directory and a database located on the /db1ha/svtha1/NODE0130 and /db1ha/svtha1/NODE0131 file systems. These file systems are in a volume group called havg1314. The clnode14 node is currently running DB2 for partition 140 and is ready to take over from the clnode13 node, if necessary.

Suppose someone unplugs the clnode13 node. The clnode14 node detects this event and begins taking over resources from the clnode13 node. These resources include the havg1314 volume group, the file systems, and the hostname swserv13. Once the resources are available on the clnode14 node, the application server start script runs. The instance ID can log on to the clnode14 node (now with an additional hostname swserv13) and can connect to the database. Remote clients can also connect to the database, because the hostname swserv13 is now located on the clnode14 node. This example is illustrated in Figure 2.

[Figure 2. HACMP after failure of node13: the havg1314 volume group and the NFS home server are now active on CLNODE14 along with havg1413.]
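Because remote clients reach the database through the HACMP service address rather than the physical hostname, they should catalog the node against swserv13 (Chapter 5 repeats this advice for swserv15). The snippet below is a minimal sketch from the DB2 command line processor, not taken from the report: the node name ha13, the database name testdb, and the port number 50000 are illustrative assumptions; use whatever svcename/port your instance actually listens on.

# Run on a remote DB2 client (hypothetical node name, database name and port):
db2 catalog tcpip node ha13 remote swserv13 server 50000
db2 catalog database testdb at node ha13
db2 terminate

# After a failover, swserv13 moves to clnode14, so the same catalog
# entries keep working without any change on the client:
db2 connect to testdb user svtha1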
Chapter 2. Disk and logical volume manager (LVM) setup

Because shared disks are such an integral part of HACMP setup, this chapter details the steps needed to set up the disks, as well as the various parts of the LVM. All commands described in this chapter must be invoked by the root user.

VG                    Volume group
LV                    Logical volume
JFS/File system/FS    Journaled file system
hdisk#                A name for a disk drive or a RAID device
JFSlog                A log that maintains a consistent JFS
clnode13, clnode14,
clnode15, clnode16    Cluster node names for the four nodes

2.1 Setting up the disk drives

We are going to show how to set up the havg1516 volume group and its components. These steps must be repeated for the other three volume groups. Proceed through the following sections in order to set up the shared disk drives and the logical volume manager.

2.1.1 Set up the disk drives

To achieve consistent hdisk numbering and a more readable configuration, it is sometimes necessary to define an additional hdisk. If the number of hdisks is not the same on both nodes, define one or more "dummy" disks on the node that has fewer disks, until the number of hdisks is equal. Then attach and configure the external shared disk on both nodes. (A quick way to compare the disk counts on both nodes is sketched below.)

For example, on the bf01n015 node, there are only three internal disks (hdisk0, hdisk1, hdisk2) currently defined, whereas on the bf01n016 node, there are four (hdisk0 through hdisk3). To get the hdisk numbers to match for the external shared disk to be attached, a "dummy" disk must be defined on the bf01n015 node. The external shared disks will then be labelled hdisk4, hdisk5, and hdisk6. To define a "dummy" disk on the bf01n015 node, run the following command on that node:

# mkdev -c disk -t '400mb' -s 'scsi' -p 'scsi1' -w '0,6' -d

Note: To select a disk type, use the lsdev -Pc disk command to list the disk types that are in the Predefined Devices object class. In this example, 400mb was one of the listed types.
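The following is a minimal sketch, not from the original report, for comparing the hdisk counts on the two cluster nodes before deciding whether a dummy disk is needed. It assumes root rsh equivalence between the nodes, which this setup configures in /.rhosts:

#!/bin/ksh
# Compare the number of defined hdisks on both cluster nodes.
for node in bf01n015 bf01n016; do
    count=$(rsh $node lsdev -Cc disk | wc -l)
    echo "$node has $count hdisk entries"
done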
Once the "dummy" disk is defined and the external shared disks are configured, you can list the disks using the lsdev command.

From the bf01n015 node:

# lsdev -Cc disk
hdisk0 Available 00-07-00-0,0 4.5 GB 16 Bit SCSI Disk Drive | internal
hdisk1 Available 00-07-00-2,0 2.0 GB 16 Bit SCSI Disk Drive | internal
hdisk2 Available 00-08-00-2,0 2.0 GB SCSI Disk Drive        | internal
hdisk3 Defined   00-01-00-0,6 400 MB SCSI Disk Drive        | dummy
hdisk4 Available 00-01-00-1,0 7135 Disk Array Device        | external
hdisk5 Available 00-01-00-1,1 7135 Disk Array Device        | external
hdisk6 Available 00-01-00-1,2 7135 Disk Array Device        | external

From the bf01n016 node:

# lsdev -Cc disk
hdisk0 Available 00-08-00-0,0 4.5 GB 16 Bit SCSI Disk Drive | internal
hdisk1 Available 00-08-00-2,0 2.0 GB 16 Bit SCSI Disk Drive | internal
hdisk2 Available 00-08-00-3,0 4.5 GB 16 Bit SCSI Disk Drive | internal
hdisk3 Available 00-08-00-4,0 4.5 GB 16 Bit SCSI Disk Drive | internal
hdisk4 Available 00-01-00-1,0 7135 Disk Array Device        | external
hdisk5 Available 00-01-00-1,1 7135 Disk Array Device        | external
hdisk6 Available 00-01-00-1,2 7135 Disk Array Device        | external

2.1.2 Create the volume group (VG)

Create the VG on the bf01n015 node. The VG must have a name and major number that are not already in use on any node in the cluster.

Hint: Create all of the VGs and the corresponding LVs, JFSs, and JFSLogs on one of the two nodes, and import them to the second node. This will produce unique names for all.

Check the major numbers on the two nodes.

From the bf01n015 node:

# /usr/sbin/lvlstmajor
43...

From the bf01n016 node:

# /usr/sbin/lvlstmajor
45,...77,79...

After analyzing this output, you can safely pick a major number of 45 or greater, but not 78, because those major numbers are free on both nodes. For this setup, 67 will be used as the major number.

To add a volume group, issue the following command:

# smit vg > Add a Volume Group
and fill in the fields. Following is an example:

Add a Volume Group
  VOLUME GROUP name                      [havg1516]
  Physical partition SIZE in megabytes    32
  PHYSICAL VOLUME names                  [hdisk4 hdisk5 hdisk6]
  Activate volume group AUTOMATICALLY
    at system restart?                    no
  Volume group MAJOR NUMBER              [67]
  Create VG Concurrent Capable?           no
  Auto-varyon in Concurrent Mode?         no

Alternatively, issue the following command to add a volume group:

# mkvg -f -y'havg1516' -s'32' '-n' -V'67' hdisk4 hdisk5 hdisk6

Activate the newly created volume group by issuing the following command:

# varyonvg havg1516

2.1.3 Create a JFSLog

Create the JFSLog with a unique name on the new VG. When creating the first file system on a new VG, AIX will automatically create a JFSLog, with the name loglv00, loglv01, and so on for each new JFSLog on the machine. By default, AIX creates only one JFSLog per VG. Because a unique name is needed for the JFSLog, it is best to define the JFSLog with the mklv command before creating the first file system.

Running the following command on both nodes will list the LV names that are already in use:

# lsvg -l $(lsvg)
rootvg:
LV NAME   TYPE    LPs PPs PVs LV STATE     MOUNT POINT
hd5       boot    1   1   1   closed/syncd N/A
hd6       paging  65  65  1   open/syncd   N/A
hd8       jfslog  1   1   1   open/syncd   N/A
hd4       jfs     2   2   1   open/syncd   /
hd2       jfs     196 196 1   open/syncd   /usr
hd9var    jfs     10  10  1   open/syncd   /var
hd3       jfs     10  10  1   open/syncd   /tmp
lv03      jfs     3   3   1   open/syncd   /ryan
homevg:
LV NAME   TYPE    LPs PPs PVs LV STATE     MOUNT POINT
homelv    jfs     300 300 2   open/syncd   /home
paging00  paging  64  64  1   open/syncd   N/A
loglv00   jfslog  1   1   1   open/syncd   N/A
havg1516:
LV NAME   TYPE    LPs PPs PVs LV STATE     MOUNT POINT

To create a JFSLog with the name jfslog15 in the VG havg1516, issue the following command:

# mklv -t jfslog -y jfslog15 havg1516 1

To format the JFSLog, issue the following command, and select "y" when asked whether to destroy the LV:
# logform /dev/jfslog15
logform: destroy /dev/jfslog15 (y)? y

To verify that the JFSLog has been created, issue the following command:

# lsvg -l havg1516
havg1516:
LV NAME   TYPE    LPs PPs PVs LV STATE     MOUNT POINT
jfslog15  jfslog  1   1   1   closed/syncd N/A

2.1.4 Create the LVs and the JFS

Create any LVs and JFSs that are needed, and ensure that they have unique names and are not currently defined on any node. Set the file systems so that they are not mounted on restart. To verify the current LV and JFS names, run the following command on both nodes:

# lsvg -l $(lsvg)
rootvg:
LV NAME   TYPE    LPs PPs PVs LV STATE     MOUNT POINT
hd5       boot    1   1   1   closed/syncd N/A
hd6       paging  65  65  1   open/syncd   N/A
hd8       jfslog  1   1   1   open/syncd   N/A
hd4       jfs     2   2   1   open/syncd   /
hd2       jfs     196 196 1   open/syncd   /usr
hd9var    jfs     10  10  1   open/syncd   /var
hd3       jfs     10  10  1   open/syncd   /tmp
lv03      jfs     3   3   1   open/syncd   /ryan
homevg:
LV NAME   TYPE    LPs PPs PVs LV STATE     MOUNT POINT
homelv    jfs     300 300 2   open/syncd   /home
paging00  paging  64  64  1   open/syncd   N/A
loglv00   jfslog  1   1   1   open/syncd   N/A
havg1516:
LV NAME   TYPE    LPs PPs PVs LV STATE     MOUNT POINT
jfslog15  jfslog  1   1   1   closed/syncd N/A

By analyzing the LV NAME column, we can safely select halv150 as the new LV name, because it is not currently being used. Be sure to check the other node in the cluster (that is, clnode16). To create the LV for the /db1ha/svtha1/NODE0150 file system, issue the following command:

# smit lv > Add a Logical Volume > then select: VOLUME GROUP name [havg1516]

and fill in the fields. Following is an example. Some of the default entries from the window have been removed to show what parameters have been entered.

Add a Logical Volume
  Logical volume NAME          [halv150]
  VOLUME GROUP name             havg1516
  Number of LOGICAL PARTITIONS [1]
  PHYSICAL VOLUME names        [hdisk4 hdisk5 hdisk6]
  RANGE of physical volumes     maximum
Alternatively, issue the following command to create the LV for the /db1ha/svtha1/NODE0150 file system:

# mklv -y'halv150' -e'x' havg1516 1 hdisk4 hdisk5 hdisk6

In this setup, we are not going to mirror the logical volumes, because we are using RAID. If RAID is not being used, it is highly recommended that the LV be mirrored, and that the mirror be on separate physical volumes on a separate bus or path. This will remove the disk drive and the disk adapter as single points of failure.

Once the LV has been created, we can create a file system associated with this LV. In this example, we will create a Large File Enabled Journaled File System. We could also create a Standard Journaled File System or a Compressed File System. To add a journaled file system on a previously defined logical volume, issue the following command:

# smit jfs > Add a Journaled File System on a Previously Defined Logical Volume > Add a Large File Enabled Journaled File System

and fill in the fields. Following is an example:

Add a Large File Enabled Journaled File System
  LOGICAL VOLUME name                     halv150
  MOUNT POINT                            [/db1ha/svtha1/NODE0150]
  Mount AUTOMATICALLY at system restart?  no
  PERMISSIONS                             read/write
  Mount OPTIONS                          []
  Start Disk Accounting?                  no
  Number of bytes per inode               4096
  Allocation Group Size                   64

After creating any file system, be sure to increase its size to a level that is appropriate for the application. To increase the size of a file system, use the smit chfs command.

In this example, we need to repeat these steps to create the other volume groups, JFSLogs, logical volumes and file systems on the respective nodes, until we have the setup shown in Table 2.

Table 2. Volume group and filesystem relationship for nodes

Volume group  Logical volume and filesystem (mount point)  Cluster  Primary node  Sharing VG with
havg1314      hahomelv  /homehalocal                       cl1314   bf01n013      bf01n014
              halv130   /db1ha/svtha1/NODE0130
              halv131   /db1ha/svtha1/NODE0131
havg1413      halv140   /db1ha/svtha1/NODE0140             cl1314   bf01n014      bf01n013
havg1516      halv150   /db1ha/svtha1/NODE0150             cl1516   bf01n015      bf01n016
havg1615      halv160   /db1ha/svtha1/NODE0160             cl1516   bf01n016      bf01n015
              halv161   /db1ha/svtha1/NODE0161

After a file system is created, it is not automatically mounted. To mount the file system, enter the following mount command:

# mount /db1ha/svtha1/NODE0150

For the DB2 instance to be able to write to the file systems, we have to change the ownership. This must be done after the user IDs and groups are defined in Chapter 4, "User setup and DB2 installation". For example, use the following command:

# chown svtha1.dbadmin1 /db1ha/svtha1/NODE0150

2.1.5 Unmount all of the file systems and deactivate the VG

To do this, invoke the following commands:

# unmount /db1ha/svtha1/NODE0150
# varyoffvg havg1516

The volume group is deactivated on the bf01n015 node before it is activated on the bf01n016 node.

2.1.6 Import the VG to the secondary node

Import the VG on the bf01n016 node with the same major number, and change the VG so that it is not activated on restart. When the VG is imported on the bf01n016 node, the file systems and logical volumes will be defined there. Because the major number for the VG is the same on the bf01n016 node, the failover will work if we ever need to NFS export the file system. Because the VG is defined not to be activated automatically on reboot, it can be activated when HACMP starts.

To import a volume group, issue the following command:

# smit vg > Import a Volume Group

and fill in the fields. Following is an example:

Import a Volume Group
  VOLUME GROUP name                     [havg1516]
  PHYSICAL VOLUME name                  [hdisk4]
  Volume Group MAJOR NUMBER             [67]
  Make this VG Concurrent Capable?       no
  Make default varyon of VG Concurrent?  no
Note: You need only include one physical volume of a volume group.

Alternatively, issue the following command to import a volume group:

# importvg -y'havg1516' -V'67' hdisk4

then change the VG so that it is not activated on reboot:

# smit vg > Set Characteristics of a Volume Group > Change a Volume Group > Then select: VOLUME GROUP name [havg1516]

and fill in the fields. Following is an example:

Change a Volume Group
  VOLUME GROUP name                               havg1516
  Activate volume group AUTOMATICALLY
    at system restart?                            no
  A QUORUM of disks required to keep the volume
    group on-line?                                yes
  Convert this VG to Concurrent Capable?          no
  Autovaryon VG in Concurrent Mode?               no

Alternatively, issue the following command to change a volume group:

# chvg -a'n' -Q'y' -x'n' havg1516

2.1.7 Move the active VG back to the primary node

The VG is currently active on the bf01n016 node. To move the active VG to the bf01n015 node, run the following on the bf01n016 node:

# varyoffvg havg1516

and then run the following on the bf01n015 node:

# varyonvg havg1516
# mount /db1ha/svtha1/NODE0150

Now repeat these steps for the other three volume groups until you have the setup listed in Table 2.
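The varyoff/varyon/mount sequence above is also exactly what you will repeat by hand when testing a takeover later (see Chapter 4, step 6). The following is a minimal, hypothetical ksh sketch, not part of the original report, that wraps those steps for one volume group; it assumes root rsh equivalence between the nodes and that the file system matches the layout in Table 2:

#!/bin/ksh
# Manually move a shared VG and one of its file systems to the other node.
# Hypothetical usage: move_vg havg1516 /db1ha/svtha1/NODE0150 bf01n015 bf01n016
move_vg () {
    vg=$1; fs=$2; from=$3; to=$4
    rsh $from "unmount $fs && varyoffvg $vg"      # release on the source node
    rsh $to   "varyonvg $vg && mount $fs"         # acquire on the target node
}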
Chapter 3. NFS configuration

One of the things that differentiates an EEE setup from an EE setup is that in EEE the instance home directory is NFS mounted on all of the nodes, while in EE it is just a local file system. We need to configure NFS to make the home directory available to all nodes.

3.1 Set up TCP/IP

Before the HACMP cluster is defined, the network adapters must be defined and the AIX operating system files must be updated.

Update or create the /etc/netsvc.conf file to include the following:

hosts=local,bind

The local entry refers to using the local /etc/hosts file, and the bind entry refers to using the name server. This will force TCP/IP name resolution to check the local /etc/hosts file before going to the name server.

Update /etc/hosts with the hostnames and IP addresses for all service, boot, and standby adapters. In our example, the following entries were added to the /etc/hosts file:

9.21.72.13  bf01n013   bf01n013.torolab.ibm.com   # Ethernet
9.21.72.14  bf01n014   bf01n014.torolab.ibm.com   # Ethernet
9.21.72.15  bf01n015   bf01n015.torolab.ibm.com   # Ethernet
9.21.72.16  bf01n016   bf01n016.torolab.ibm.com   # Ethernet
9.21.77.13  b_sw_013   b_sw_013.torolab.ibm.com   # Base switch name
9.21.77.14  b_sw_014   b_sw_014.torolab.ibm.com   # Base switch name
9.21.77.15  b_sw_015   b_sw_015.torolab.ibm.com   # Base switch name
9.21.77.16  b_sw_016   b_sw_016.torolab.ibm.com   # Base switch name
9.21.77.213 sw_boot_13 sw_boot_13.torolab.ibm.com # switch boot
9.21.77.214 sw_boot_14 sw_boot_14.torolab.ibm.com # switch boot
9.21.77.215 sw_boot_15 sw_boot_15.torolab.ibm.com # switch boot
9.21.77.216 sw_boot_16 sw_boot_16.torolab.ibm.com # switch boot
9.21.77.223 swserv13   swserv13.torolab.ibm.com   # switch service
9.21.77.224 swserv14   swserv14.torolab.ibm.com   # switch service
9.21.77.225 swserv15   swserv15.torolab.ibm.com   # switch service
9.21.77.226 swserv16   swserv16.torolab.ibm.com   # switch service

Update /.rhosts to include the root user for the hostnames in the cluster:

# cat /.rhosts
bf01n013 root
bf01n014 root
bf01n015 root
bf01n016 root
b_sw_013 root
b_sw_014 root
b_sw_015 root
b_sw_016 root
sw_boot_13 root
sw_boot_14 root
sw_boot_15 root
sw_boot_16 root
swserv13 root
swserv14 root
swserv15 root
swserv16 root

Note: Permissions on ~/.rhosts must be no more liberal than -rw-r--r--. See "SQL6048 on db2start command" in Chapter 6 for additional information.

3.2 NFS export the /homehalocal file system

In the previous chapter we created a /homehalocal file system; before we can make it available to the other nodes we need to mount it locally. To mount it locally, use the mount command:

# mount /homehalocal

To make the file system available for the other nodes to NFS mount, we are required to export it. To export the file system use:

# smit nfs -> Network File System (NFS) -> Add a Directory to Exports List

Add a Directory to Exports List
  PATHNAME of Directory to Export          [/homehalocal]
  MODE to export directory                  read-write
  HOSTS & NETGROUPS allowed client access  [b_sw_013,swserv13,b_sw_014,swserv14,b_sw_015,swserv15,b_sw_016,swserv16]
  Anonymous UID                            [-2]
  HOSTS allowed root access                [b_sw_013,swserv13,b_sw_014,swserv14,b_sw_015,swserv15,b_sw_016,swserv16]
  HOSTNAME list. If exported read-mostly   []
  Use SECURE OPTION?                        no
  Public filesystem?                        no
  CHANGE export system restart or both      both
  PATHNAME of alternate Exports file       []

Alternatively, issue the exportfs command with the required parameters.

3.3 Mount the /homehalocal file system

Before we NFS mount the /homehalocal file system on the other nodes, we need to set up aliases for the switch adapter. Since the switch adapter is limited to one per machine, we need to use these aliases. For the bf01n013 node, run the following, and repeat this for the other nodes:

# ifconfig css0 inet 9.21.77.213 netmask 255.255.255.0 alias up
# ifconfig css0 inet 9.21.77.223 netmask 255.255.255.0 alias up

where the IP addresses are the switch boot and switch service addresses of the respective node. After we set up these aliases, they will be available and can be viewed using the netstat -i command:

# netstat -i
Name Mtu   Network  Address         Ipkts  Ierrs Opkts  Oerrs Coll
lo0  16896 link#1                   64126  0     64170  0     0
lo0  16896 127      loopback        64126  0     64170  0     0
lo0  16896 ::1                      64126  0     64170  0     0
en0  1500  link#2   0.4.ac.49.3a.b7 106008 0     76463  0     0
en0  1500  9.21.72  bf01n013        106008 0     76463  0     0
css0 65520 link#3                   84411  0     108278 0     0
css0 65520 9.21.77  b_sw_013        84411  0     108278 0     0
css0 65520 9.21.77  swserv13        84411  0     108278 0     0
css0 65520 9.21.77  sw_boot_13      84411  0     108278 0     0

We now need to log in to each of the nodes and NFS mount the /homehalocal file system from the swserv13 host address to a local mount point. This mount point must match the entry in /etc/passwd for the instance's home directory. In our case, we mount this at /homeha/svtha1. We even do this on the bf01n013 node. To set up the mounts use:

# smit nfs -> Network File System (NFS) -> Add a File System for Mounting

Add a File System for Mounting
Type or select values in entry fields. Press Enter AFTER making all desired changes.
  PATHNAME of mount point                          [/homeha/svtha1]
  PATHNAME of Remote Directory                     [/homehalocal]
  HOST where remote directory resides              [swserv13]
  Mount type NAME                                  []
  Use SECURE mount option?                          no
  Remount file system now, update /etc/filesystems
    or both?                                        both
  /etc/filesystems entry will mount the directory
    on system RESTART.                              no
  MODE for this NFS file system                     read-write
  ATTEMPT mount in background or foreground?        background
  NUMBER of times to attempt mount                 []
  Buffer SIZE for writes                           []
  ... (more fields, not changed) ...

At this point, all nodes should have their local JFS file systems mounted, plus the /homehalocal file system on the bf01n013 node NFS mounted as /homeha/svtha1 on all nodes, and the /db1ha/svtha1/NODE0*** file systems on each node. We are now ready to install DB2 UDB and create the instance.
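Since both DB2 and HACMP depend on consistent name resolution and rsh equivalence across every address in /etc/hosts and /.rhosts, it is worth verifying them all at once before moving on. This is a minimal sketch, not from the original report, using only the standard host and rsh commands; the hostname list matches the example /etc/hosts above, and the swserv addresses are only reachable once the css0 aliases are up:

#!/bin/ksh
# Verify name resolution and root rsh access for every cluster hostname.
for h in bf01n013 bf01n014 bf01n015 bf01n016 \
         b_sw_013 b_sw_014 b_sw_015 b_sw_016 \
         swserv13 swserv14 swserv15 swserv16; do
    host $h > /dev/null 2>&1       || echo "$h: name resolution failed"
    rsh  $h date > /dev/null 2>&1  || echo "$h: rsh as root failed"
done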
Chapter 4. User setup and DB2 installation

Now that the components of the LVM are set up, DB2 can be installed. The db2setup utility can be used to install and configure DB2. To better illustrate the configuration, we will define some of the components manually, and use the db2setup utility only to install the DB2 product and license. All commands described in this chapter must be invoked by the root user.

Although the steps used to install DB2 are outlined below, for complete details, refer to the IBM DB2 Universal Database Enterprise - Extended Edition for UNIX Quick Beginnings book, and to the IBM DB2 Universal Database and DB2 Connect Installation and Configuration Supplement book.

Before running db2icrt, make sure that the $HOME directory for the instance is available and that the svtha1 ID can write to the directory. Also make sure that a .profile file exists, as db2icrt will append to the file but will not create a new one. For this example we are using the svtha1 ID that already exists on the SP complex.

1. Mount the CD-ROM.

Use the crfs and mount commands to create and mount the CD-ROM file system:

# crfs -v cdrfs -p ro -d'cd0' -m'/cdrom'

An alternative is to use the smit fast path smit crfs. Then mount the CD-ROM using the mount command:

# mount /cdrom

If the SP node does not have a local CD-ROM, there are two options: the install image can be copied to disk for future installations, or the CD-ROM can be NFS exported from the control workstation and mounted on the nodes.

2. Install DB2 and set up the license key.

Once the CD-ROM is mounted, change to the corresponding directory and run ./db2setup on all four nodes. Follow the prompts to install DB2 UDB Enterprise - Extended Edition in the /usr/lpp/db2_07_01 directory. Do not use the db2setup utility to create any user IDs or instances.

# cd /cdrom
# ./db2setup

Select DB2 UDB Enterprise - Extended Edition and install.

3. Create the DB2 instance.

Run db2icrt to create the instance. This command only needs to be run on one of the four nodes, because the $HOME for the instance is NFS mounted from one machine to the others.

# cd /usr/lpp/db2_07_01/instance
# ./db2icrt -u svtha1 svtha1

Note: If the home file system, /homehalocal, is not NFS exported with root access, an error will occur.
4. Test db2start and the file system setup.

Since db2icrt only adds one line to the $HOME/sqllib/db2nodes.cfg file, we are required to update the file and add the other nodes, so that db2nodes.cfg looks like the following:

130 b_sw_013 0 b_sw_013
131 b_sw_013 1 b_sw_013
140 b_sw_014 0 b_sw_014
150 b_sw_015 0 b_sw_015
160 b_sw_016 0 b_sw_016
161 b_sw_016 1 b_sw_016

We must also create the $HOME/.rhosts file, as db2start and other DB2 programs require it to run remote shells from one node to another. In our example the .rhosts file looks like the following:

swserv13 svtha1
swserv14 svtha1
swserv15 svtha1
swserv16 svtha1
b_sw_013 svtha1
b_sw_014 svtha1
b_sw_015 svtha1
b_sw_016 svtha1
bf01n013 svtha1
bf01n014 svtha1
bf01n015 svtha1
bf01n016 svtha1

Note: Ensure the permissions on the $HOME/.rhosts file are correct. See "SQL6048 on db2start command" in Chapter 6 for additional information.

This is a good place to see if db2start will work. Log on as the svtha1 instance owner and run the db2start command. To test the file system setup on each node, try creating a database. Be sure to create it on /db1ha and not in $HOME, which is the default. Use the following command to create the database:

$ db2 create database testing on /db1ha

Ensure all errors are corrected, and be sure to stop DB2 using the db2stop command, before proceeding to the next step.

Note: If you get SQL code SQL6031, see "SQL6031 returned when issuing db2 "? SQL6031" command" in Chapter 6 for additional information.

5. Install the DB2 HACMP scripts.

Important: Review "HACMP ES Script Files" in Chapter 12 of the IBM DB2 Universal Database Administration Guide: Planning before attempting this section.

DB2 UDB EEE supplies sample scripts for failover and user-defined events. These files are located in the /usr/lpp/db2_07_01/samples/hacmp/es directory.
In our example, we copied this directory to a special directory on the control workstation of the SP complex, /spdata/sys1/hacmp. The db2_inst_ha script is the tool used for installing scripts and events on multiple nodes in an HACMP EEE environment. It was used in the following manner for our example:

# cd /spdata/sys1/hacmp
# db2_inst_ha svtha1 . 15-16 TESTDB

This installs the scripts into the /usr/bin directory on all of the nodes listed (in this case nodes 15 and 16) and prepares them to work with the database TESTDB. Note that the database name needs to be in upper case. The node selection can also be written in the form "15,16" if you want to copy the files to specific nodes.

When the application server is set up and the start and stop scripts are defined, they will call /usr/bin/rc.db2pe with a number of parameters.

Note: The start and stop scripts that are called from the application server must exist on both nodes and have the same name. They do not need to have the same content if, for example, some customizing is needed.

The db2_inst_ha script also copies over the HACMP/ES event stanzas. These events are defined in the db2_event_stanzas file. One example is the DB2_PROC_DOWN event, which will restart DB2 if it terminates for some reason.

Note: DB2 will also restart if terminated by the db2stop or db2stop force commands. To stop DB2 without triggering a failure event, use the ha_db2stop command.

For more information about HACMP/ES events, refer to "HACMP ES Event Monitoring and User-defined Events" in Chapter 33, High Availability Cluster Multi-Processing, Enhanced Scalability (HACMP ES) for AIX, of the DB2 UDB Administration Guide.

6. Test a failover of the resources on bf01n015 to bf01n016.

On the bf01n015 node:

# unmount /db1ha/svtha1/NODE0150
# varyoffvg havg1516

On the bf01n016 node:

# varyonvg havg1516
# mount /db1ha/svtha1/NODE0150

Note: These are the actual steps that the HACMP software takes during failover of the necessary file systems.

Once DB2 and HACMP are configured and set up, any changes made (for example, to the instance ID, groups, AIX system parameters, or the level of DB2 code) must be made on all nodes. Following are some examples:
- The HACMP cluster is active on the bf01n015 node, and the password is changed on that node. When failover happens to the bf01n016 node and the user tries to log on, the new password will not work. Therefore, the administrator must ensure that passwords are kept synchronized.

- If a ulimit parameter is changed on the bf01n015 node, it must also be changed on the bf01n016 node. For example, suppose the file size limit is set to unlimited on the bf01n015 node. When a failover happens to the bf01n016 node and the user tries to access a file that is greater than the default limit of 1 GB, an error is returned.

- If the AIX parameter maxuproc is changed on the bf01n015 node, it must also be changed on the bf01n016 node. When a failover occurs and DB2 begins running on the bf01n016 node, it may reach the maxuproc value and return errors.

- If non-DB2 software is installed on the bf01n015 node but not on the bf01n016 node, the software will not be available when a failover takes place.

- Suppose that the database manager configuration parameter svcename is used, and that /etc/services is updated on the bf01n015 node. If the bf01n016 node does not receive the same update and a failover occurs, the DB2 server will report warnings during db2start, will not start the TCP/IP communications listeners, and DB2 clients will report errors.

A simple cross-node comparison of these settings, such as the sketch after this list, can catch this kind of drift before a failover does.
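The following is a minimal, hypothetical sketch, not part of the original report, for spotting differences between the two nodes of a cluster. It assumes root rsh equivalence and uses only standard AIX commands (diff, lsattr, lsuser); extend the list of files and attributes to whatever your installation actually depends on.

#!/bin/ksh
# Compare a few failover-sensitive settings between the two cluster nodes.
PEER=bf01n016      # run this from bf01n015

# /etc/services should contain the same DB2 port entries on both nodes.
rsh $PEER cat /etc/services | diff /etc/services - > /dev/null || \
    echo "/etc/services differs between $(hostname) and $PEER"

# maxuproc should match, or DB2 may hit the limit after a takeover.
local_mp=$(lsattr -El sys0 -a maxuproc -F value)
peer_mp=$(rsh $PEER lsattr -El sys0 -a maxuproc -F value)
[ "$local_mp" = "$peer_mp" ] || echo "maxuproc differs: $local_mp vs $peer_mp"

# The instance owner's file size limit (ulimit) should also match.
local_fs=$(lsuser -a fsize svtha1)
peer_fs=$(rsh $PEER lsuser -a fsize svtha1)
[ "$local_fs" = "$peer_fs" ] || echo "fsize ulimit differs for svtha1"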
Chapter 5. HACMP setup

This chapter assumes that HACMP/ES 4.3 has been installed on the two cluster nodes but has not yet been configured. You should be familiar with the following terms:

HACMP cluster       A group of 2 to 32 IBM RS/6000 servers, configured to provide highly available services. If any resource fails, its function is taken over by another part of the cluster.

Client              Any system that utilizes the services provided by the cluster. Clients can be connected to the HACMP cluster by TCP/IP networks. The only requirement is that clients be able to access all of the nodes, so that in the event of a failure, the clients can access the nodes that have taken over.

Cluster node        Any IBM RS/6000 system that has been configured to function as a highly available server. In the event that a cluster node fails, its resources will be taken over by another cluster node. The nodes that participate in the takeover may mount the failed system's file systems, start up its applications, and even provide its IP and MAC address so that clients can reconnect to applications without reconfiguration.

Resource            An object that is protected by HACMP; it may be an IP address, a file system, a raw device, or a volume group. A resource group is a set of resources that are grouped together to support a particular application.

Application server  A name given to the stop and start scripts for the application. In this paper, the application server is the start and stop scripts for DB2.

Note: If HACMP is running and a user telnets to a node in the cluster, the connection may be to either one of the two machines that have been set up. There are two ways in which a cluster administrator can tell which machine is actually being used:

• Use uname -a and record the unique serial number for each physical machine.
• Use netstat -i to see which hostnames are defined on the cluster node.

When using the netstat -i command to check the addresses in use with the system running in the default configuration, results similar to the following are returned:

bf01n015:/homeha/svtha1 > netstat -i
Name Mtu   Network  Address         Ipkts  Ierrs Opkts  Oerrs Coll
lo0  16896 link#1                   8673   0     8932   0     0
lo0  16896 127      loopback        8673   0     8932   0     0
lo0  16896 ::1                      8673   0     8932   0     0
en0  1500  link#2   0.4.ac.49.35.b  613395 0     550200 0     0
en0  1500  9.21.72  bf01n015        613395 0     550200 0     0
css0 65520 link#3                   546472 0     545291 0     0
css0 65520 9.21.77  b_sw_015        546472 0     545291 0     0
css0 65520 9.21.77  swserv15        546472 0     545291 0     0

After failover, telnet to the service address, swserv15, and run netstat -i. Your results will be similar to the following:

bf01n016:/homeha/svtha1 > netstat -i
Name Mtu   Network  Address         Ipkts  Ierrs Opkts  Oerrs Coll
lo0  16896 link#1                   13625  0     13861  0     0
lo0  16896 127      loopback        13625  0     13861  0     0
lo0  16896 ::1                      13625  0     13861  0     0
en0  1500  link#2   0.4.ac.49.38.c3 607900 0     546129 0     0
en0  1500  9.21.72  bf01n016        607900 0     546129 0     0
css0 65520 link#3                   533248 0     706144 0     0
css0 65520 9.21.77  b_sw_016        533248 0     706144 0     0
css0 65520 9.21.77  swserv16        533248 0     706144 0     0
css0 65520 9.21.77  swserv15        533248 0     706144 0     0

Note that the ethernet hostname and boot address have changed because we are actually on a different host, bf01n016. The service address remains the same, but has been taken over by bf01n016. Using the netstat -i command is a good way to check which machine the service address is currently assigned to.

Two types of resource groups are used with HACMP: cascading and rotating resource groups. For more information on these resource groups, refer to the HACMP Concepts and Facilities guide. A cascading resource group is being used in this setup.

Note: It is recommended to install the AIX fileset bos.compat.links before running HACMP/ES with DB2 UDB, because the product uses symbolic links defined when this fileset is installed.

To set up HACMP, proceed through the following sections; we only need to define the HACMP cluster on one node in a cluster and then synchronize it to the other node.

Important: The following sections 5.1 to 5.11 are to be executed on nodes bf01n013 and bf01n015 only. Our examples only show information for bf01n015.

5.1 Define the cluster ID and name

Enter the cluster ID and cluster name to define a cluster:

# smit hacmp > Cluster Configuration > Cluster Topology > Configure Cluster > Add a Cluster Definition

Add a Cluster Definition
  * Cluster ID   [1516]
  * Cluster Name [cl1516]
Alternatively, enter the following command:

/usr/sbin/cluster/utilities/claddclstr -i 1516 -n cl1516

5.2 Define the cluster nodes

Enter the node names of the nodes forming the cluster:

# smit hacmp > Cluster Configuration > Cluster Topology > Configure Nodes > Add Cluster Nodes

Add Cluster Nodes
  * Node Names [bf01n015 bf01n016]

Alternatively, enter the following command:

/usr/sbin/cluster/utilities/clnodename -a bf01n015 bf01n016

5.3 Add the adapters

Enter the adapter attributes:

# smit hacmp > Cluster Configuration > Cluster Topology > Configure Adapters > Add an Adapter

  * Adapter IP Label          bf01n015
    New Adapter IP Label     []
    Network Type             [ether]
    Network Name             [e1]
    Network Attribute         public
    Adapter Function          service
    Adapter Identifier       [9.21.72.15]
    Adapter Hardware Address []
    Node Name                [bf01n015]

  * Adapter IP Label          sw_boot_15
    New Adapter IP Label     []
    Network Type             [hps]
    Network Name             [H1]
    Network Attribute         private
    Adapter Function          boot
    Adapter Identifier       [9.21.77.215]
    Adapter Hardware Address []
    Node Name                [bf01n015]

  * Adapter IP Label          swserv15
    New Adapter IP Label     []
    Network Type             [hps]
    Network Name             [H1]
    Network Attribute         private
    Adapter Function          service
    Adapter Identifier       [9.21.77.225]
    Adapter Hardware Address []
    Node Name                [bf01n015]
Repeat this for the three adapters of the bf01n016 node. This must be done from the bf01n015 node and will later be synchronized to the bf01n016 node.

When cataloging this DB2 node on a remote client, use the service address hostname, that is, swserv15. This is the address that moves to the node that has DB2 running.

5.4 Show cluster topology

Show cluster, node, network, and adapter topology:

# smit hacmp > Cluster Configuration > Cluster Topology > Show Cluster Topology > Show Cluster Topology

Command: OK            stdout: yes           stderr: no
Before command completion, additional instructions may appear below.

Cluster Description of Cluster cl1516
Cluster ID: 1516
Cluster Security Level Standard
There were 2 networks defined: H1, e1
There are 2 nodes in this cluster

NODE bf01n015:
  This node has 2 service interface(s):

  Service Interface swserv15:
    IP address:       9.21.77.225
    Hardware Address:
    Network:          H1
    Attribute:        private

  Service Interface swserv15 has a possible boot configuration:
    Boot (Alternate Service) Interface: sw_boot_15
    IP Address:       9.21.77.215
    Network:          H1
    Attribute:        private

  Service Interface swserv15 has no standby interfaces

  Service Interface bf01n015:
    IP address:       9.21.72.15
    Hardware Address:
    Network:          e1
    Attribute:        public

  Service Interface bf01n015 has no standby interfaces
NODE bf01n016:
  This node has 2 service interface(s):

  Service Interface swserv16:
    IP address:       9.21.77.226
    Hardware Address:
    Network:          H1
    Attribute:        private

  Service Interface swserv16 has a possible boot configuration:
    Boot (Alternate Service) Interface: sw_boot_16
    IP Address:       9.21.77.216
    Network:          H1
    Attribute:        private

  Service Interface swserv16 has no standby interfaces

  Service Interface bf01n016:
    IP address:       9.21.72.16
    Hardware Address:
    Network:          e1
    Attribute:        public

  Service Interface bf01n016 has no standby interfaces

Breakdown of network connections:

Connections to network H1
  Node bf01n015 is connected to network H1 by these interfaces: sw_boot_15 swserv15
  Node bf01n016 is connected to network H1 by these interfaces: sw_boot_16 swserv16

Connections to network e1
  Node bf01n015 is connected to network e1 by these interfaces: bf01n015
  Node bf01n016 is connected to network e1 by these interfaces: bf01n016

Alternatively, enter the following command:

# /usr/sbin/cluster/utilities/cllscf

5.5 Synchronize cluster topology

Synchronize cluster topology information on all cluster nodes defined in the local topology database:

# smit hacmp > Cluster Configuration > Cluster Topology > Synchronize Cluster Topology

Synchronize Cluster Topology
  Ignore Cluster Verification Errors? [No]
  * Emulate or Actual?                [Actual]

Alternatively, enter the following command:

# /usr/sbin/cluster/utilities/cldare -t
Emulate will check your defined parameters and give a result based on the correctness of those parameters. It will not physically test your setup, and is not a substitute for actually testing the system.

5.6 Add a resource group

The order of the nodes is important, because a cascading resource group will only be activated on the first node listed when HACMP is started.

Note: Listing the bf01n015 node first gives it a higher priority, and it will acquire resources when HACMP starts. The bf01n016 node will acquire resources only after the bf01n015 node fails.

Use the entries in Table 3 as a reference for adding the resource groups.

Table 3. Resource group to node relationship

Cluster  RG name  Relationship  Nodes
cl1314   rg1314   cascading     bf01n013, bf01n014
         rg1413   cascading     bf01n014, bf01n013
cl1516   rg1516   cascading     bf01n015, bf01n016
         rg1615   cascading     bf01n016, bf01n015

Add resource groups rg1314 and rg1413 to bf01n013, and resource groups rg1516 and rg1615 to bf01n015:

# smit hacmp > Cluster Configuration > Cluster Resources > Define Resource Groups > Add a Resource Group

Add a Resource Group
  * Resource Group Name       [rg1516]
  * Node Relationship          cascading
  * Participating Node Names  [bf01n015 bf01n016]

Alternatively, enter the following command:

/usr/sbin/cluster/utilities/claddgrp -g rg1516 -r 'cascading' -n bf01n015 bf01n016

5.7 Add an application server

With our configuration, we will be adding four application servers, two per cluster, as described in Table 4.
Table 4. Application server scripts

Cluster  Server name  Start script                 Stop script
cl1314   as1314       /usr/bin/rc.db2pe.13.start   /usr/bin/rc.db2pe.13.stop
         as1413       /usr/bin/rc.db2pe.14.start   /usr/bin/rc.db2pe.14.stop
cl1516   as1516       /usr/bin/rc.db2pe.15.start   /usr/bin/rc.db2pe.15.stop
         as1615       /usr/bin/rc.db2pe.16.start   /usr/bin/rc.db2pe.16.stop

The application server scripts must be accessible and executable from both nodes in the cluster. They are not required to have the same content. The scripts must be created by the user, as HACMP does not set them up.

# smit hacmp > Cluster Configuration > Cluster Resources > Define Application Servers > Add an Application Server

Add an Application Server
  Server Name  [as1516]
  Start Script [/usr/bin/rc.db2pe.15.start]
  Stop Script  [/usr/bin/rc.db2pe.15.stop]

When setting up the cl1314 cluster, we need to take into consideration that the $HOME for the instance is a resource in the cluster. With this in mind, we will set up the application servers as follows.

The contents of /usr/bin/rc.db2pe.13.start are:

/usr/bin/rc.db2pe svtha1 NFS SERVER start
/usr/bin/rc.db2pe svtha1 130,131 140 start

and the contents of /usr/bin/rc.db2pe.13.stop are:

/usr/bin/rc.db2pe svtha1 130,131 140 stop
/usr/bin/rc.db2pe svtha1 NFS SERVER stop

The contents of /usr/bin/rc.db2pe.14.start are:

/usr/bin/rc.db2pe svtha1 140 130,131 start

and the contents of /usr/bin/rc.db2pe.14.stop are:

/usr/bin/rc.db2pe svtha1 140 130,131 stop

The contents of /usr/bin/rc.db2pe.15.start are:

/usr/bin/rc.db2pe svtha1 150 160,161 start

and the contents of /usr/bin/rc.db2pe.15.stop are:

/usr/bin/rc.db2pe svtha1 150 160,161 stop

The contents of /usr/bin/rc.db2pe.16.start are:

/usr/bin/rc.db2pe svtha1 160,161 150 start

and the contents of /usr/bin/rc.db2pe.16.stop are:

/usr/bin/rc.db2pe svtha1 160,161 150 stop

The syntax of rc.db2pe for DB2 database partitions is:

rc.db2pe <instance> <partitions for the primary node> <partitions for the secondary node> <start | stop>

or for the NFS server node:

rc.db2pe <instance> NFS SERVER <start | stop>

The primary purpose of the rc.db2pe script is to construct and execute the db2start command with the correct restart parameters, to enable the DB2 database partitions to recover during a failover and failback. Refer to section 6.7, "SQL6030 RC=15, no port 0 defined in db2nodes.cfg file", for additional information on proper use of the db2start command with the restart option.
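For clarity, here is what one of those application server scripts might look like as an actual executable file. This is a hedged sketch, not taken from the report: the shebang line, logging, and exit handling are assumptions; the rc.db2pe invocation itself is the one listed above for node 15.

#!/bin/ksh
# /usr/bin/rc.db2pe.15.start -- hypothetical wrapper called by the as1516
# application server. The rc.db2pe call matches the contents given above;
# the logging is an added assumption.
LOG=/tmp/rc.db2pe.15.start.log

{
    echo "$(date): starting DB2 partitions for rg1516"
    /usr/bin/rc.db2pe svtha1 150 160,161 start
    echo "$(date): rc.db2pe returned $?"
} >> $LOG 2>&1

exit 0    # return success; adjust error handling to your site's needs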
    • The syntax of rc.db2pe for DB2 database partitions is: rc.db2pe <instance> <partition for the primary node> <partition for the secondary node> <start | stop> or for the NFS server nodes: rc.db2pe <instance> NFS SERVER < start | stop > The primary purpose of the rc.db2pe script is to construct and execute the db2start command with the correct restart parameters, to enable the DB2 database partitions to recover during a failover and failback. Refer to section 6.7, “SQL6030 RC=15, no port 0 defined in db2nodes.cfg file” on page 38 for additional information on proper use of the db2start command with the restart option. 5.8 Configure resources for the resource group Use Table 5 as a reference when configuring the resources. Table 5. Resource group configuration information Resource Service IP Filesystems Volume Application group group servers rg1314 swserv13 /db1ha/svtha1/NODE0130 havg1314 as1314 /db1ha/svtha1/NODE0131 /homehalocal rg1413 swserv14 /db1ha/svtha1/NODE0140 havg1413 as1413 rg1516 swserv15 /db1ha/svtha1/NODE0150 havg1516 as1516 rg1615 swserv16 /db1ha/svtha1/NODE0160 havg1615 as1615 /db1ha/svtha1/NODE0161 # smit hacmp > Cluster Configuration > Cluster Resources > Change/Show Resources for a Resource Group < select the Resource Group > Note: You only need to identify file systems to be taken over; logical volumes (raw devices) are taken over implicitly as part of the volume group. 5.9 Synchronize cluster resources Synchronize cluster resource information on all cluster nodes defined in the local topology database: > Cluster Configuration > Cluster Resources > Synchronize Cluster Resources 28 IBM® DB2® UDB EEE for AIX® and HACMP/ES (TR-74.174)
    • Configure a Resource Group Resource Group Name rg1516 Node Relationship cascading Participating Node Names bf01n015 bf01n016 Service IP Label [swserv15] + HTY Service IP Label [] Filesystems [/db1ha/svtha1/NODE0150] + Filesystems Consistency Check fsck + Filesystems Recovery Method sequential + Filesystems to Export [] + Filesystems to NFS Mount [] + Volume Groups [havg1516] + Concurrent Volume Groups [] + Raw Disk PVIDs [] + AIX Connections Services [] + Application Servers [as1516] + Miscellaneous Data [] Inactive Takeover Activated false + 9333 Disk Fencing Activated false + SSA Disk Fencing Activated false + Filesystems mounted before IP configured false + Synchronize Cluster Resources Ignore Cluster Verification Errors? [No] Un/Configure Cluster Resources? [Yes] Emulate or Actual? [Actual] Alternatively, enter the following command: /usr/sbin/cluster/utilities/cldare -r 5.10 Show resource information by resource group Show resource configuration associated with the group name: > Cluster Configuration > Cluster Resources > Show Cluster Resources > Show Resource Information by Resource Group Chapter 5. HACMP setup 29
    • < select the Resource Group > COMMAND STATUS Command: OK stdout: yes stderr: no Before command completion, additional instructions may appear below. Resource Group Name rg1516 Node Relationship cascading Participating Node Name(s) bf01n015 bf01n016 Service IP Label swserv15 HTY Service IP Label Filesystems /db1ha/svtha1/NODE0150 Filesystems Consistency Check fsck Filesystems Recovery Method sequential Filesystems to be exported Filesystems to be NFS mounted Volume Groups havg1516 Concurrent Volume Groups Disks AIX Connections Services Application Servers AS1516 Miscellaneous Data Inactive Takeover false 9333 Disk Fencing false SSA Disk Fencing false Filesystems mounted before IP configured false Run Time Parameters: Node Name bf01n015 Debug Level high Host uses NIS or Name Server false Node Name bf01n016 Debug Level high Host uses NIS or Name Server false Alternatively, enter the following command: /usr/sbin/cluster/utilities/clshowres -g rg1516 Note: This is done as a cross reference to ensure prior steps were completed correctly. 5.11 Verify cluster Verify cluster topology, resources, and custom-defined verification methods: # smit hacmp > Cluster Configuration > Cluster Verification > Verify Cluster Verify Cluster Base HACMP Verification Methods both (Cluster topology, resources, both, none) Custom Defined Verification Methods [All] Error Count [] Log File to store output [] Alternatively, enter the following command: 30 IBM® DB2® UDB EEE for AIX® and HACMP/ES (TR-74.174)
    • /usr/sbin/cluster/diag/clconfig -v ’-tr’ -m ’All’ Your DB2 UDB and HACMP/ES setup is complete. Note: Be sure to start and stop the cluster using the HACMP commands smit clstart and smit clstop respectively. Chapter 5. HACMP setup 31
• Chapter 6. Troubleshooting
  This chapter documents hints and tips for situations that may occur in an HACMP and DB2 UDB EEE V7.2 installation and configuration.

  6.1 SQL6048 on db2start command
  If you receive SQL6048 from db2start, you must not only make sure the $HOME/.rhosts file is created with the correct entries, but also ensure that it has the correct permissions. The following is taken from the AIX man page for rsh:

      If rsh is to consult an .rhosts file on the remote machine, the file must have UNIX protections no more liberal than -rw-r--r--. If .rhosts resides in a user home directory in AFS, the home directory must also grant the LOOKUP and READ rights to system:anyuser.

  To help narrow down the area that needs correcting, use the db2_all date command. If you get Permission denied messages for each of the partitions, try using the rsh date command. If you get a Permission denied message from AIX, then the .rhosts file is most likely the problem. For example, if the .rhosts file has the following permissions set:

      -rwxrwxrwx   1 svtha1   build        192 Feb 26 10:38 .rhosts

  when you issue the db2start command you will get the following output:

      05-23-2001 09:12:04   130   0   SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
      05-23-2001 09:12:05   131   0   SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
      05-23-2001 09:12:06   140   0   SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
      05-23-2001 09:12:07   150   0   SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
      05-23-2001 09:12:08   160   0   SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
      05-23-2001 09:12:10   161   0   SQL6048N A communication error occurred during START or STOP DATABASE MANAGER processing.
      SQL1032N No start database manager command was issued. SQLSTATE=57019

  If you issue the db2_all command, you will get a Permission denied message for each partition. Now try using the rsh command:

      rsh bf01n015 date

  You will get the same Permission denied message from the rsh command. Change the permissions on the .rhosts file with the chmod command:

      chmod 600 .rhosts

  Now when you issue the rsh command it should return the response to you, and the db2_all date command will return the responses for each of the active partitions. The db2start command should now work, assuming everything else is working correctly in the cluster.
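  When the partitions are spread across several physical nodes, it can help to check the permissions and remote shell access from a single place, logged in as the instance owner. The loop below is only a sketch: the four hostnames and the instance home directory under the NFS-shared /homehalocal file system are assumptions based on the configuration used in this report, so adjust them to match your own environment.

      #!/bin/ksh
      # Sketch: check .rhosts permissions and rsh access for the svtha1 instance on each node.
      # Hostnames and the instance home directory path are assumptions for this configuration.
      for host in bf01n013 bf01n014 bf01n015 bf01n016
      do
          echo "=== $host ==="
          # Permissions should be no more liberal than -rw-r--r-- (chmod 600 satisfies this)
          rsh $host ls -l /homehalocal/svtha1/.rhosts
          # Should print the date; "Permission denied" points back at the .rhosts file
          rsh $host date
      done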
• 6.2 SQL6031 returned when issuing db2 "? SQL6031" command
  This error could indicate a problem with the entries in the db2nodes.cfg file. Following is an example:

      $ db2start
      SQL6031N Error in the db2nodes.cfg file at line number "2". Reason code "9".
      $ db2 "? sql6031"
      SQL6031N Error in the db2nodes.cfg file at line number "2". Reason code "9".
      $ cat db2nodes.cfg
      130 b_sw_013 1 b_sw_013
      131 b_sw_013 1 b_sw_013
      140 b_sw_014 0 b_sw_014
      150 b_sw_015 0 b_sw_015
      160 b_sw_016 0 b_sw_016
      161 b_sw_016 1 b_sw_016

  The fix for db2nodes.cfg would be:

      130 b_sw_013 0 b_sw_013
      131 b_sw_013 1 b_sw_013
      140 b_sw_014 0 b_sw_014
      150 b_sw_015 0 b_sw_015
      160 b_sw_016 0 b_sw_016
      161 b_sw_016 1 b_sw_016

  Following is the entire output for the db2 "? SQL6031" command:

      $ db2 "? sql6031"
      SQL6031N Error in the db2nodes.cfg file at line number "<line>". Reason code "<reason code>".

      Explanation: The statement cannot be processed because of a problem with the db2nodes.cfg file, as indicated by the following reason codes:
      (1) Cannot access the sqllib directory of the instance.
      (2) The full path name added to the db2nodes.cfg filename is too long.
      (3) Cannot open the db2nodes.cfg file in the sqllib directory.
      (4) A syntax error exists at line "<line>" of the db2nodes.cfg file in the sqllib directory.
      (5) The nodenum value at line "<line>" of the db2nodes.cfg file in the sqllib directory is not valid.
      (6) The nodenum value at line "<line>" of the db2nodes.cfg file
    • in the sqllib directory is out of sequence. (7) The nodenum value at line “<line>” of the db2nodes.cfg file in the sqllib directory is not unique. (8) The port value at line “<line>” of the db2nodes.cfg file in the sqllib directory is not valid. (9) The hostname/port couple at line “<line>” of the db2nodes.cfg file in the sqllib directory is not unique. (10) The hostname at line “<line>” of the db2nodes.cfg file in the sqllib directory is not valid. (11) The port value at line “<line>” of the db2nodes.cfg file in the sqllib directory is not defined for your DB2 instance id in the services file (/etc/services on UNIX-based systems). (12) The port value at line “<line>” of the db2nodes.cfg file in the sqllib directory is not in the valid port range defined for your DB2 instance id in the services file (/etc/services on UNIX-based systems). (13) The hostname value at line “<line>” of the db2nodes.cfg file in the sqllib directory has no corresponding port 0. (14) A db2nodes.cfg file with more than one entry exists, but the database manager configuration is not MPP. (15) The netname at line “<line>” of the db2nodes.cfg file in the sqllib directory is not valid. User Response: The action corresponding to the reason code is: (1) Ensure that the $DB2INSTANCE userid has the required permissions to access the sqllib directory of the instance. (2) Make the instance home directory path name shorter. (3) Ensure that the db2nodes.cfg file exists in the sqllib directory and is not empty. (4) Ensure that at least 2 values are defined per line in the db2nodes.cfg file and that the file does not contain blank lines. (5) Ensure that the nodenum value defined in the db2nodes.cfg file is between 0 and 999. (6) Ensure that all the nodenum values defined in the db2nodes.cfg file are in ascending order. (7) Ensure that each nodenum value defined in the db2nodes.cfg file is unique. (8) Ensure that the port value is between 0 and 999. Chapter 6. Troubleshooting 35
    • (9) Ensure that the new couple hostname/port is not already defined in the db2nodes.cfg file. (10) Ensure the hostname value defined in db2nodes.cfg at line “<line>” is both defined on the system and operational. (11) Ensure that the services file (/etc/services on UNIX-based systems) contains an entry for your DB2 instance id. (12) Ensure that you only use port values that are specified in the services file (/etc/services file on UNIX-based systems) for your instance. (13) Ensure that the port value 0 has been defined for the corresponding hostname in the db2nodes.cfg file. (14) Perform one of the following actions: o Remove the db2nodes.cfg file. o Alter the db2nodes.cfg file to contain exactly one entry. o Install the DB2 Enterprise - Extended Edition server. (15) Ensure the netname value defined in db2nodes.cfg at line “<line>” is both defined on the system and operational. 6.3 Ethernet IP label instead of the switch IP label in db2nodes.cfg file If you want the second column of the db2nodes.cfg to be the ethernet name and not the switch name, you will have to change the rc.db2pe script. To be able to call rc.db2pe with a fifth parameter, ENET, you will have to change line 679 highlighted below: ################################################################################### #Main body of program. Argument count/check, time start registry, etc. ################################################################################### if [ $# -ne 4 -a $2 != "NFS" ] ; then echo "$0 ERROR:: rc.db2pe $*" echo "$0 SYNTAX:: rc.db2pe [DB2_USER] [INSTANCE_NUMBER] [TAKEOVER_INSTANCE_NUMBER] [ start | stop ]" exit You will need to change $# -ne 4 to $# -ne 5 to enable rc.db2pe to be called with the fifth parameter set to ENET. As an alternative, the update to the rc.db2pe script could be made immediately after line 815 by setting the $hnn variable to the ethernet IP label instead of the switch IP label. The $hnn variable is used in the db2start restart command at line 250 of rc.db2pe. This is illustrated below. 36 IBM® DB2® UDB EEE for AIX® and HACMP/ES (TR-74.174)
    • Line number 250: su - $DB2user -c $lnndir/sqllib/adm/db2start nodenum $It restart hostname $hnn port $pt netname $snn Make the changes to the rc.db2pe script by commenting out the lines highlighted in bold as indicated in Figure 3. Line number 815: ################################################################################# #Establish hostname for db2nodes.cfg as hostname of this physical node. #Establish netname for db2nodes.cfg as switch interface on this physical node. # # NOTE: The print field in snndot2 needs to be the base switch address for that node. # Whether this is the first or some other field will depend on configuration so # the user must adjust the statement accordingly. ################################################################################# hnn=‘/bin/hostname | cut -d’.’ -f1‘ snndot=‘/usr/sbin/ifconfig css0 | grep inet | awk ’{ print $2 }’‘ snndot2=‘echo $snndot | awk ’{ print $1 }’‘ snn="" retries=0 while [ -z "$snn" ] ; do snn=‘host $snndot2 | awk ’{ print $1 }’ | cut -d’.’ -f1‘ if [ -n "$snn" ] ; then break else sleep 5 retries=‘expr $retries + 1‘ echo "$PROGID - $HOST: Waiting to $NFS_RETRIES * 5 sec nameservice ($retries)" fi if [ $retries -gt $NFS_RETRIES ] ; then echo "$PROGID - $HOST: Cannot execute command host $snndot " echo "$PROGID - $HOST: Nameservice Problems???" echo "$PROGID - $HOST: Exiting With Error" /usr/bin/db2_update_events HAIND OFF /usr/bin/db2_update_events HA ON exit fi done #if [ "$5" = "enet" -o "$5" = "ENET" ] ; then hnn=‘/bin/hostname | cut -d’.’ -f1‘ echo "$HOST - $PROGID: Using SP ethernet as host in db2nodes.cfg" #else # hnn=$snn # echo "$PROGID - $HOST: Using SP switch as host in db2nodes.cfg" #fi Figure 3. rc.db2pe modification 6.4 SQL1032 when using Autoloader after a failback Since the autoloader, db2atld, does issue the SET CLIENT CONNECT NODE command on the physical node where the autoloader is being executed, a db2stop and db2start is required on that physical node. 6.5 SQL6072 when using the switch HACMP service IP label If you issue a db2start restart command and the db2nodes.cfg file is using the switch alias (HACMP service IP label) instead of the base switch name, the following error may occur: Chapter 6. Troubleshooting 37
• SQL6072N DB2START with the RESTART option cannot proceed because the specified node is already active.

  6.6 SQL6031 RC=12, not enough ports in /etc/services
  Ensure the DB2 port range defined in the /etc/services file is large enough to handle all possible partitions to be started. In this example, we only have two ports defined in /etc/services on CLNODE15, and when we try to fail over DB2 partition 160, we are not allowed to use a third port.

      db2start nodenum 160 restart hostname b_sw_015 port 2 netname b_sw_015
      05-07-2001 14:51:30   160   0   SQL6031N Error in the db2nodes.cfg file at line number "5". Reason code "12".
      SQL6031N Error in the db2nodes.cfg file at line number "5". Reason code "12".

      $ grep svtha1 /etc/services
      DB2_svtha1      15001/tcp
      DB2_svtha1_END  15002/tcp

      $ db2 "? SQL6031"
      SQL6031N Error in the db2nodes.cfg file at line number "<line>". Reason code "<reason code>".
      Explanation: The statement cannot be processed because of a problem with the db2nodes.cfg file, as indicated by the following reason codes:
      (12) The port value at line "<line>" of the db2nodes.cfg file in the sqllib directory is not in the valid port range defined for your DB2 instance id in the services file (/etc/services on UNIX-based systems).
      User Response: The action corresponding to the reason code is:
      (12) Ensure that you only use port values that are specified in the services file (/etc/services file on UNIX-based systems) for your instance.

  To correct the problem, increase the value for DB2_svtha1_END in /etc/services to at least 15003.

  6.7 SQL6030 RC=15, no port 0 defined in db2nodes.cfg file
  When using db2start with the restart option to update db2nodes.cfg, do not remove the node entry with port 0 until the entries with higher port numbers are removed. In the example below, we try to move DB2 partition 160 before we move DB2 partition 161; the partition that has port 0 must be the last one to move.

  Initial db2nodes.cfg:

      130 b_sw_013 0 b_sw_013
      131 b_sw_013 1 b_sw_013
      140 b_sw_014 0 b_sw_014
      150 b_sw_015 0 b_sw_015
      160 b_sw_016 0 b_sw_016
• 161 b_sw_016 1 b_sw_016

  Now issue the db2start restart command:

      $ db2start nodenum 160 restart hostname b_sw_015 port 1 netname b_sw_015
      SQL6030N START or STOP DATABASE MANAGER failed. Reason code "15".

  The above command would result in the following db2nodes.cfg file:

      130 b_sw_013 0 b_sw_013
      131 b_sw_013 1 b_sw_013
      140 b_sw_014 0 b_sw_014
      150 b_sw_015 0 b_sw_015
      160 b_sw_015 1 b_sw_015
      161 b_sw_016 1 b_sw_016

  You will notice that there is no partition defined with port 0 for node b_sw_016.

      $ db2 "? SQL6030"
      SQL6030N START or STOP DATABASE MANAGER failed. Reason code "<reason-code>".
      Explanation: The reason code indicates the error. The statement cannot be processed.
      (15) A hostname value has no corresponding port 0 defined in the db2nodes.cfg file in the sqllib directory.
      User Response: The action corresponding to the reason code is:
      (15) Ensure that all the hostname values have a port 0 defined in the db2nodes.cfg file in the sqllib directory, including the restart option parameters.

  The correct way to use db2start with the restart option is:

      $ db2start nodenum 161 restart hostname b_sw_015 port 1 netname b_sw_015
      05-07-2001 14:38:09   161   0   SQL1063N DB2START processing was successful.
      SQL1063N DB2START processing was successful.

  The resulting db2nodes.cfg file is:

      130 b_sw_013 0 b_sw_013
      131 b_sw_013 1 b_sw_013
      140 b_sw_014 0 b_sw_014
      150 b_sw_015 0 b_sw_015
      160 b_sw_016 0 b_sw_016
      161 b_sw_015 1 b_sw_015

      $ db2start nodenum 160 restart hostname b_sw_015 port 2 netname b_sw_015
      05-07-2001 14:38:32   160   0   SQL1063N DB2START processing was successful.
      SQL1063N DB2START processing was successful.

  The resulting db2nodes.cfg file is:

      130 b_sw_013 0 b_sw_013
      131 b_sw_013 1 b_sw_013
      140 b_sw_014 0 b_sw_014
      150 b_sw_015 0 b_sw_015
      160 b_sw_015 2 b_sw_015
      161 b_sw_015 1 b_sw_015
• 6.8 HACMP returns config_too_long when stopping the catalog node
  When stopping the catalog node, db2stop checks all of the other nodes for each of the databases defined. If the other nodes are already stopped, the db2stop command times out on the catalog node. To avoid this timeout, stop the catalog node before the other nodes. If this cannot be done, adjust each of the parameters listed below.

  Connection Elapse Time (conn_elapse)
      Configuration Type: Database manager
      Applies To: Partitioned database server with local and remote clients
      Default [Range]: 10 [0-100]
      Unit of Measure: Seconds
      Related Parameters: Node Connection Retries (max_connretries)
  This parameter specifies the number of seconds within which a TCP/IP connection is to be established between two database partition servers. If the attempt completes within the time specified by this parameter, communications are established. If it fails, another attempt is made to establish communications. If every attempt times out and the number of attempts specified by the max_connretries parameter has been reached, an error is issued.

  Node Connection Retries (max_connretries)
      Configuration Type: Database manager
      Applies To: Partitioned database server with local and remote clients
      Default [Range]: 5 [0-100]
      Related Parameters: Connection Elapse Time (conn_elapse)
  If the attempt to establish communication between two database partition servers fails (for example, the value specified by the conn_elapse parameter is reached), max_connretries specifies the number of connection retries that can be made to a database partition server. If the value specified for this parameter is exceeded, an error is returned.

  Start and Stop Timeout (start_stop_time)
      Configuration Type: Database manager
      Applies To: Partitioned database server with local and remote clients
      Default [Range]: 10 [1-1440]
      Unit of Measure: Minutes
  This parameter is applicable in a partitioned database environment only. It specifies the time, in minutes, within which all database partition servers must respond to a db2start or a db2stop command. It is also used as the timeout value during an addnode operation. Database partition servers that do not respond to a db2start command within the specified time send a message to the db2start error log in sqllib/log in the $HOME directory for the instance. Issue a db2stop on these nodes before restarting them. Database partition servers that do not respond to a db2stop command within the specified time send a message to the db2stop error log in sqllib/log in the $HOME directory for the instance. You can either issue a db2stop
    • command for each database partition server that does not respond, or you can issue one db2stop command for all of the partitions (those that are already stopped will return a message stating that they are stopped). 6.9 db2_all with the “;” option loops If you use a non-hostname adapter in the second column of db2nodes.cfg file, you only need an entry in the .rhosts for the b_sw_0XX IP label to enable db2start and db2_all to work. However, db2_all with the “;” option will sometimes go into a loop. A short command does not go into a loop, but a longer running command like backup database or restart database might go into a loop. The work-around is to update the .rhosts with the hostname entry. The .rhosts file has entries for the b_sw_0** IP labels but not for the bf01n0** real hostname. With entries in the .rhosts that match the 2nd column in db2nodes.cfg, we can run db2start and db2_all. Since we are missing the entries in .rhosts that match with the real hostname IP labels of the machine, the db2_all “; sleep 20 ; date” command will loop. $ cat $HOME/sqllib/db2nodes.cfg 130 b_sw_013 0 b_sw_013 131 b_sw_013 1 b_sw_013 140 b_sw_014 0 b_sw_014 150 b_sw_015 0 b_sw_015 160 b_sw_016 0 b_sw_016 161 b_sw_016 1 b_sw_016 $ db2_all "; date " b_sw_013: Fri Jan 12 09:47:34 EST 2001 b_sw_013: date completed ok b_sw_013: Permission denied. b_sw_013: Fri Jan 12 09:47:35 EST 2001 b_sw_013: date completed ok b_sw_013: Permission denied. b_sw_014: Fri Jan 12 09:47:35 EST 2001 b_sw_014: date completed ok b_sw_015: Fri Jan 12 09:47:36 EST 2001 b_sw_015: date completed ok b_sw_016: Fri Jan 12 09:47:37 EST 2001 b_sw_016: date completed ok b_sw_016: Fri Jan 12 09:47:38 EST 2001 b_sw_016: date completed ok If you issue the command db2_all “; sleep 20 ; date” you will get a loop: $ db2_all "; sleep 20 ; date " b_sw_013: Fri Jan 12 09:49:34 EST 2001 b_sw_013: sleep 20 completed ok Chapter 6. Troubleshooting 41
• b_sw_013: Permission denied.
  b_sw_013: Fri Jan 12 09:49:35 EST 2001
  b_sw_013: sleep 20 completed ok
  b_sw_013: Permission denied.
  b_sw_014: Fri Jan 12 09:49:35 EST 2001
  b_sw_014: sleep 20 completed ok
  b_sw_015: Fri Jan 12 09:49:35 EST 2001
  b_sw_015: sleep 20 completed ok
  b_sw_016: Fri Jan 12 09:49:35 EST 2001
  b_sw_016: sleep 20 completed ok
  b_sw_016: Fri Jan 12 09:49:35 EST 2001
  b_sw_016: sleep 20 completed ok
  rah: primary monitoring process for sleep is 18074
  rah: waiting for 49090, b_sw_013:sleep
  rah: waiting for 46984, b_sw_013:sleep
  rah: waiting for 49090, b_sw_013:sleep
  rah: waiting for 46984, b_sw_013:sleep
  ...
  ...
  ...

  The multiple rah: waiting messages indicate a loop.
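  The work-around, then, is to add the real hostname IP labels to the instance owner's .rhosts file alongside the switch IP labels. A sketch of the resulting $HOME/.rhosts is shown below; bf01n013 and bf01n014 are assumed here to be the hostnames of the other two physical nodes, matching the clstat example in the next chapter, so substitute the hostnames of your own machines. After editing the file, make sure its permissions are still no more liberal than -rw-r--r-- (see section 6.1).

      b_sw_013   svtha1
      b_sw_014   svtha1
      b_sw_015   svtha1
      b_sw_016   svtha1
      bf01n013   svtha1
      bf01n014   svtha1
      bf01n015   svtha1
      bf01n016   svtha1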
    • Chapter 7. Testing Implementing an HA cluster is a seven-stage process. The stages are as follows: 1. Carefully plan the cluster. 2. Plan it some more. 3. Keep planning it. 4. Implement the cluster. 5. Test the cluster. 6. Test it some more. 7. Test it again. If the test results are not satisfactory, return to step 1. Planning the proper setup of an HACMP cluster is not within the scope of this document. Please see the HACMP Administrator’s Guide for advice on the setup of the cluster. We are able to provide some testing guidelines, though. As important as setting up the HACMP cluster is actually testing it to make sure that it behaves as expected. Ideally, every single point of failure that has the potential to bring down a cluster should be tested. Granted, it may not be practical to shut down a building’s power supply in order to test backup power supplies, but a single source of electricity is a single point of failure, and until the system has been physically tested, it will never be entirely certain that the cluster will behave in practice as it does in theory. 7.1 Test environment and tools The best testing environment is on the HACMP cluster itself. Enough scheduled downtime should be set aside to thoroughly test the system before the system is put into production. A short, planned outage is preferable to a long, unplanned outage that reveals that an untested point of failure has left an application unavailable. The testing procedure itself is simple. Connect to the cluster from a client machine, cause one of the points of failure to fail, and watch to ensure that the failover takes place properly, and that the application is available and properly configured after failover. If the cluster is built using a cascading cluster configuration, check again after service has been restored to the original node. If the cluster is built using a rotating cluster configuration, bring up the original node again, then cause the second node to fail, which should restore the system to its original node. When testing the availability of the application, be sure that accounts and passwords work as expected, hostnames and IP addresses work as expected, the data is complete and up to date, and the changeover is essentially transparent to the user. Configure a remote machine to be able to connect to the highly available DB2 UDB database. A script can be easily written that will connect to our database, select some data from a table, record the results, and disconnect from the database. If these steps are set inside a loop that will run until interrupted by the operator, the procedure can be used to monitor the state of the cluster. Keep in mind that the script should continue even if the database cannot be contacted. This way, when the database restarts, it will provide a benchmark © Copyright IBM Corp. 2001 43
• for the length of time that failover is expected to take. Here is a brief sample script that may be useful for testing an HACMP cluster:

      while :
      do
          db2 connect to database
          db2 "select count(*) from syscat.tables"
          db2 connect reset
          sleep 60
      done

  Figure 4. Sample script for testing

  clstat is an excellent tool for testing and monitoring the status of an HACMP cluster and is included in the HACMP package. It also comes in an X Windows-compatible version, xclstat. From a remote client system (that is, a machine that is not part of the HACMP cluster), clstat will continuously monitor the status of the cluster and the individual nodes within the cluster. The display, or its sub-windows in the case of xclstat, will provide information as to whether the cluster is running, whether it is stable, the status of the individual nodes, and the IP address used to connect to the nodes. To use clstat (or xclstat), follow these steps:

  1. Using smit chinet, set the IP address on the client system to be in the same subnet as the cluster service addresses. The remote client machine used to monitor the cluster will need to be on the same subnet for the software to work properly. If you are using a token ring for your networking, it will need to be on the same ring. However, this machine cannot be part of the cluster.

      smit chinet
      > select the appropriate network interface
      Internet Address    [9.21.77.123]
      Network Mask        [255.255.255.0]

  2. Include the client IP address and IP label in the /etc/hosts file on the cluster nodes.

  3. Check to ensure that SNMP is running on the client system:

      # lssrc -s snmpd
      or
      # lssrc -g tcpip

     If it is not, start it by using smit startsrc (check the AIX administration manuals for details).

  4. Install on the client system all the filesets on the HACMP installation CD that contain the word client in their name.

      # smit install_latest

  5. Add the service IP labels for the cluster to the file /usr/sbin/cluster/etc/clhosts on the client system. For example:

      localhost 127.0.0.1
      bf01n015 9.21.77.225
    • 6. Reboot the client system. 7. Execute clstat. # /usr/sbin/cluster/clstat or # /usr/sbin/cluster/xclstat When the HACMP cluster is operating normally, the clstat screen will look something like this: Clstat - HACMP for AIX Cluster Status Monitor ---------------------------------------------------------- Cluster: cl1314 Date: May 28, 2001 (10:10 PM) State: UP Nodes: 2 Substate: Stable Node: bf01n013 State: UP Interface: b_sw_103 (0) Address: 9.21.77.223 State: UP Node: bf01n014 State: UP Interface: b_sw_014 (0) Address: 9.21.77.224 State: UP Figure 5. Sample clstat screen The second indispensable tool is a short script to provide a check that DB2 is running. Catalog the database on the client system, and at each point during the testing where the cluster ought to be available, run the script and make sure that the system responds as it is expected to. A sample script follows: db2 -v connect to dbname db2 -v "select * from syscat.tables" db2 -v "select count(*) from usertable1" db2 -v connect reset This can be modified to test for the cluster’s own special characteristics. 7.2 Points of failure for test consideration Points of failure that should be considered and tested are: 1. Correct software installation 2. Correct hardware configuration 3. Power failures 4. Network failures, both in hardware and software 5. Hardware failures in the CPU, the DASD, and any of the physical infrastructure 6. Careless operator behavior 7. Software failures in the operating system, HACMP, or the applications 8. All events that are monitored by HACMP This is not an exhaustive list. Cluster administrators are in the best position to know where points of failure are in their own clusters. An excellent procedure to follow for all tests is to log on as root on both nodes and execute on the server consoles the following command: Chapter 7. Testing 45
    • tail -f /tmp/hacmp.out This will provide a continuously flowing stream of information about the status of the HACMP cluster. Information that scrolls off the top of the screen can be retrieved by editing /tmp/hacmp.out. Another source of useful information is /var/adm/cluster.log. 1. Correct software installation. The first test for correctly installed software is starting the software. After installation, check the error log to ensure that no major errors occurred. This can be done with the AIX errpt command. HACMP software on a single node is most easily started using the following command: # smit clstart When the cluster is successfully started, messages will appear in /tmp/hacmp.out such as: Oct 12 08:28:32 EVENT COMPLETED: node_up_local_complete After the system has been successfully started, halt it and start it again. Halting the cluster is accomplished using the following command: # smit clstop There will be a choice of a graceful stop and a graceful stop with takeover. Testing both is a good idea. This will also be a good point to test that you can connect to the database locally as well as from a remote client machine. When the starting and stopping of individual cluster nodes is behaving in a satisfactory manner, stop and start the cluster using the Cluster Single Point of Control (C-SPOC) utility. Starting the cluster is managed using the following command: # smit cl_clstart.dialog Stopping the cluster is managed using the following command: # smit cl_clstop.dialog 2. Correct hardware configuration. There are a number of tests to execute to ensure that the hardware is correctly configured: - Use lsdev -C to check that devices are available. - Use the date command on all nodes to ensure that they are synchronized. - Use ifconfig <device> to check the network adapters. - Use netstat -i to check the network configuration. - Use no -a to check the ipforwarding and ipsendredirects settings. - Use lsvg -o to ensure that the volume groups are varied on only on the active node. 3. Power failures. Testing failover behavior in various types of power failure situations is worthwhile to ensure that the cluster is physically set up to behave in the manner desired. Testing for power failures in individual components can be 46 IBM® DB2® UDB EEE for AIX® and HACMP/ES (TR-74.174)
    • accomplished by simply pulling plugs out of sockets, or hitting power buttons. Larger power failures can be tested by throwing switches in the building’s electrical panel (make sure the person doing this knows what he or she is doing!). 4. Network failures. Testing the network starts by ensuring that the network behaves as it is expected to upon startup. From each node, ensure that the service, standby, and boot addresses on all other nodes in the cluster can be pinged. Note that the service and standby addresses will not both be active at the same time. Be sure that rlogin will connect to all service addresses from all cluster nodes. Network failures tend to be of two kinds: hardware and software. Hardware failures can be tested by physically unplugging network cables. One at a time, unplug every network cable entering the cluster machines. Plug each back in again before removing the next plug. Remember to check the RS-232 serial cable, or, if the cluster is also using SSA or SCSI devices as its serial connection , unplug them one at a time. This will simulate the physical failure of individual network controllers, or problems with the cables themselves. A properly designed and implemented system will be able to survive the loss of any one of these components. A network software failure can be simulated by killing network processes running on the primary server. The lssrc -a command will provide a list of running processes, including their group and PID. Selecting a likely looking process from the TCP/IP group and issuing a kill -9 <PID> command will simulate the failure of the network software. Killing processes such as /usr/sbin/inetd or /usr/sbin/portmap will be good tests. Networking software can also be halted using the following command: # smit communications > TCPIP > further configuration > stop TCPIP daemons Alternatively, enter the following command: # stopsrc -g tcpip 5. Hardware failures. Hardware failures can be simulated by a variety of brute force methods. CPU failures can be mimicked by pushing power or reset switches, by killing processes, or corrupting the memory. Issuing the following command from the root user account is a good way to bring down a machine: # echo “hello world” > /dev/kmem A good way to test for hardware failures in the DASD is to pull power cables, or physically pull disks out of their cabinets. 6. Careless operator behavior. Careless operator behavior can result in any of the situations outlined and tested in steps 4, 5, and 7. 7. Software failures. Software failures can be simulated by killing critical processes, which also works to simulate many operator errors. You can find the Process ID for the clsmuxpd and clstrmgr processes using the ps -ef command, and the Chapter 7. Testing 47
    • processes can then be killed using a kill -9 <PID> command. Doing the same with the db2sysc process will ensure that the scripts are properly configured to catch a failure in the critical application and will restart the application in the appropriate manner. 8. Monitored events. Make sure, when failovers are being tested, that the events scripted in your application scripts take place correctly. For instance, make sure that after every failover, you can connect to your DB2 database, and that the database is accessible to the users who will need it. Attempting to run simple scripts from a user account is a good way of testing this. 48 IBM® DB2® UDB EEE for AIX® and HACMP/ES (TR-74.174)
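  Building on the monitoring idea from section 7.1, the sketch below logs a timestamped line for every connection attempt, so the gap between the last failure and the first success after a failover gives a rough measure of how long the takeover actually took. It assumes the database has been cataloged on the client under the hypothetical alias HADB and that the DB2 environment (db2profile) has been sourced; usertable1 is the same sample table used in the earlier check script.

      #!/bin/ksh
      # Sketch: timestamped availability probe for measuring failover duration.
      # HADB is a hypothetical database alias; adjust the alias, table, and interval as needed.
      while :
      do
          if db2 connect to HADB > /dev/null 2>&1
          then
              echo "`date` connect OK"
              db2 "select count(*) from usertable1" > /dev/null 2>&1 || echo "`date` query FAILED"
              db2 connect reset > /dev/null 2>&1
          else
              echo "`date` connect FAILED"
          fi
          sleep 10
      done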
• Chapter 8. Additional information

  Availant (formerly known as CLAM Associates of Cambridge, Massachusetts):
      http://www.availant.com/services/cluster.html
      http://www.clam.com/services/cluster.html

  IBM:
  • IBM Websites:
      http://www.ibm.com/
      http://www.rs6000.ibm.com/software/Apps/hacmp/
      http://www.ibm.com/software/data/db2/udb/
  • IBM Redbooks:
      http://www.redbooks.ibm.com/
      Managing VLDB Using DB2 UDB EEE, SG24-5105
      IBM Certification Study Guide AIX HACMP, SG24-5131
  • DB2 documentation from the install image or the Web:
      http://www.ibm.com/cgi-bin/db2www/data/db2/udb/winos2unix/support/v7pubs.d2w/main
      DB2 UDB Administration Guide: Implementation, SC09-2944
      DB2 UDB Enterprise - Extended Edition for UNIX Quick Beginnings, GC09-2964
      DB2 UDB and DB2 Connect Installation and Configuration Supplement, GC09-2957
      DB2 Universal Database V7.1 for UNIX, Linux, Windows and OS/2 Database Certification Guide, ISBN 0-13-091366-9
  • AIX documentation:
      http://www.ibm.com/servers/aix/library/
  • HACMP documentation:
      HACMP Concepts and Facilities, SC23-4276
      HACMP Planning Guide, SC23-4277
      HACMP Installation Guide, SC23-4278
      HACMP Administration Guide, SC23-4279
      HACMP Troubleshooting Guide, SC23-4280
      HACMP Enhanced Scalability Installation and Admin Guide, SC23-4284
  • Technical reports:
      http://www.ibm.com/software/data/pubs/papers
      IBM DB2 Universal Database Enterprise Edition and HACMP/ES, TR-74.171
      IBM DB2 Universal Database Enterprise - Extended Edition for AIX and HACMP/ES, TR-74.174
• Appendix A. Trademarks and service marks

  The following terms are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both:
  • IBM
  • AIX
  • DB2
  • DB2 Universal Database
  • RS/6000

  Other company, product, and service names may be trademarks or service marks of others.