High Availability Scenarios with IBM Tivoli Workload Scheduler and IBM Tivoli Framework (SG24-6632)

Transcript

  • 1. Front cover: High Availability Scenarios with IBM Tivoli Workload Scheduler and IBM Tivoli Framework. Implementing high availability for ITWS and Tivoli Framework. Windows 2000 Cluster Service and HACMP scenarios. Best practices and tips. Vasfi Gucer, Satoko Egawa, David Oswald, Geoff Pusey, John Webb, Anthony Yen. ibm.com/redbooks
  • 2. International Technical Support Organization. High Availability Scenarios with IBM Tivoli Workload Scheduler and IBM Tivoli Framework. March 2004, SG24-6632-00
  • 3. Note: Before using this information and the product it supports, read the information in “Notices” on page vii. First Edition (March 2004). This edition applies to IBM Tivoli Workload Scheduler Version 8.2, IBM Tivoli Management Framework Version 4.1. © Copyright International Business Machines Corporation 2004. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
  • 4. Contents
Notices  vii
Trademarks  viii
Preface  ix
The team that wrote this redbook  ix
Become a published author  xi
Comments welcome  xi
Chapter 1. Introduction  1
1.1 IBM Tivoli Workload Scheduler architectural overview  2
1.2 IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework  4
1.3 High availability terminology used in this book  7
1.4 Overview of clustering technologies  8
1.4.1 High availability versus fault tolerance  8
1.4.2 Server versus job availability  10
1.4.3 Standby versus takeover configurations  12
1.4.4 IBM HACMP  16
1.4.5 Microsoft Cluster Service  21
1.5 When to implement IBM Tivoli Workload Scheduler high availability  24
1.5.1 High availability solutions versus Backup Domain Manager  24
1.5.2 Hardware failures to plan for  26
1.5.3 Summary  27
1.6 Material covered in this book  27
Chapter 2. High level design and architecture  31
2.1 Concepts of high availability clusters  32
2.1.1 A bird’s-eye view of high availability clusters  32
2.1.2 Software considerations  39
2.1.3 Hardware considerations  41
2.2 Hardware configurations  43
2.2.1 Types of hardware cluster  43
2.2.2 Hot standby system  46
2.3 Software configurations  46
2.3.1 Configurations for implementing IBM Tivoli Workload Scheduler in a cluster  46
2.3.2 Software availability within IBM Tivoli Workload Scheduler  57
2.3.3 Load balancing software  59
2.3.4 Job recovery  60
  • 5. Chapter 3. High availability cluster implementation  63
3.1 Our high availability cluster scenarios  64
3.1.1 Mutual takeover for IBM Tivoli Workload Scheduler  64
3.1.2 Hot standby for IBM Tivoli Management Framework  66
3.2 Implementing an HACMP cluster  67
3.2.1 HACMP hardware considerations  67
3.2.2 HACMP software considerations  67
3.2.3 Planning and designing an HACMP cluster  67
3.2.4 Installing HACMP 5.1 on AIX 5.2  92
3.3 Implementing a Microsoft Cluster  138
3.3.1 Microsoft Cluster hardware considerations  139
3.3.2 Planning and designing a Microsoft Cluster installation  139
3.3.3 Microsoft Cluster Service installation  141
Chapter 4. IBM Tivoli Workload Scheduler implementation in a cluster  183
4.1 Implementing IBM Tivoli Workload Scheduler in an HACMP cluster  184
4.1.1 IBM Tivoli Workload Scheduler implementation overview  184
4.1.2 Preparing to install  188
4.1.3 Installing the IBM Tivoli Workload Scheduler engine  191
4.1.4 Configuring the IBM Tivoli Workload Scheduler engine  192
4.1.5 Installing IBM Tivoli Workload Scheduler Connector  194
4.1.6 Setting the security  198
4.1.7 Add additional IBM Tivoli Workload Scheduler Connector instance  201
4.1.8 Verify IBM Tivoli Workload Scheduler behavior in HACMP cluster  202
4.1.9 Applying IBM Tivoli Workload Scheduler fix pack  204
4.1.10 Configure HACMP for IBM Tivoli Workload Scheduler  210
4.1.11 Add IBM Tivoli Management Framework  303
4.1.12 Production considerations  340
4.1.13 Just one IBM Tivoli Workload Scheduler instance  345
4.2 Implementing IBM Tivoli Workload Scheduler in a Microsoft Cluster  347
4.2.1 Single instance of IBM Tivoli Workload Scheduler  347
4.2.2 Configuring the cluster group  379
4.2.3 Two instances of IBM Tivoli Workload Scheduler in a cluster  383
4.2.4 Installation of the IBM Tivoli Management Framework  396
4.2.5 Installation of Job Scheduling Services  401
4.2.6 Installation of Job Scheduling Connector  402
4.2.7 Creating Connector instances  405
4.2.8 Interconnecting the two Tivoli Framework Servers  405
4.2.9 Installing the Job Scheduling Console  408
4.2.10 Scheduled outage configuration  410
Chapter 5. Implement IBM Tivoli Management Framework in a cluster  415
5.1 Implement IBM Tivoli Management Framework in an HACMP cluster  416
  • 6. 5.1.1 Inventory hardware  417
5.1.2 Planning the high availability design  418
5.1.3 Create the shared disk volume  420
5.1.4 Install IBM Tivoli Management Framework  453
5.1.5 Tivoli Web interfaces  464
5.1.6 Tivoli Managed Node  464
5.1.7 Tivoli Endpoints  466
5.1.8 Configure HACMP  480
5.2 Implementing Tivoli Framework in a Microsoft Cluster  503
5.2.1 TMR server  503
5.2.2 Tivoli Managed Node  536
5.2.3 Tivoli Endpoints  555
Appendix A. A real-life implementation  571
Rationale for IBM Tivoli Workload Scheduler and HACMP integration  572
Our environment  572
Installation roadmap  573
Software configuration  574
Hardware configuration  575
Installing the AIX operating system  576
Finishing the network configuration  577
Creating the TTY device within AIX  577
Testing the heartbeat interface  578
Configuring shared disk storage devices  579
Copying installation code to shared storage  580
Creating user accounts  581
Creating group accounts  581
Installing IBM Tivoli Workload Scheduler software  581
Installing HACMP software  582
Installing the Tivoli TMR software  583
Patching the Tivoli TMR software  583
TMR versus Managed Node installation  583
Configuring IBM Tivoli Workload Scheduler start and stop scripts  584
Configuring miscellaneous start and stop scripts  584
Creating and modifying various system files  585
Configuring the HACMP environment  585
Testing the failover procedure  585
HACMP Cluster topology  586
HACMP Cluster Resource Group topology  588
ifconfig -a  589
Skills required to implement IBM Tivoli Workload Scheduling/HACMP  590
Observations and questions  594
  • 7. Appendix B. TMR clustering for Tivoli Framework 3.7b on MSCS  601
Setup  602
Configure the wlocalhost  602
Install Framework on the primary node  602
Install Framework on the secondary node  603
Configure the TMR  603
Set the root administrators login  603
Force the oserv to bind to the virtual IP  603
Change the name of the DBDIR  604
Modify the setup_env.cmd and setup_env.sh  604
Configure the registry  604
Rename the Managed Node  604
Rename the TMR  605
Rename the top-level policy region  605
Rename the root administrator  605
Configure the ALIDB  606
Create the cluster resources  606
Create the oserv cluster resource  606
Create the trip cluster resource  606
Set up the resource dependencies  607
Validate and backup  607
Test failover  607
Back up the Tivoli databases  607
Abbreviations and acronyms  609
Related publications  611
IBM Redbooks  611
Other publications  611
Online resources  612
How to get IBM Redbooks  613
Index  615
  • 8. NoticesThis information was developed for products and services offered in the U.S.A.IBM may not offer the products, services, or features discussed in this document in other countries. Consultyour local IBM representative for information on the products and services currently available in your area.Any reference to an IBM product, program, or service is not intended to state or imply that only that IBMproduct, program, or service may be used. Any functionally equivalent product, program, or service thatdoes not infringe any IBM intellectual property right may be used instead. However, it is the usersresponsibility to evaluate and verify the operation of any non-IBM product, program, or service.IBM may have patents or pending patent applications covering subject matter described in this document.The furnishing of this document does not give you any license to these patents. You can send licenseinquiries, in writing, to:IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.The following paragraph does not apply to the United Kingdom or any other country where such provisionsare inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDESTHIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED,INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimerof express or implied warranties in certain transactions, therefore, this statement may not apply to you.This information could include technical inaccuracies or typographical errors. Changes are periodically madeto the information herein; these changes will be incorporated in new editions of the publication. IBM maymake improvements and/or changes in the product(s) and/or the program(s) described in this publication atany time without notice.Any references in this information to non-IBM Web sites are provided for convenience only and do not in anymanner serve as an endorsement of those Web sites. The materials at those Web sites are not part of thematerials for this IBM product and use of those Web sites is at your own risk.IBM may use or distribute any of the information you supply in any way it believes appropriate withoutincurring any obligation to you.Information concerning non-IBM products was obtained from the suppliers of those products, their publishedannouncements or other publicly available sources. IBM has not tested those products and cannot confirmthe accuracy of performance, compatibility or any other claims related to non-IBM products. Questions onthe capabilities of non-IBM products should be addressed to the suppliers of those products.This information contains examples of data and reports used in daily business operations. To illustrate themas completely as possible, the examples include the names of individuals, companies, brands, and products.All of these names are fictitious and any similarity to the names and addresses used by an actual businessenterprise is entirely coincidental.COPYRIGHT LICENSE:This information contains sample application programs in source language, which illustrates programmingtechniques on various operating platforms. 
You may copy, modify, and distribute these sample programs inany form without payment to IBM, for the purposes of developing, using, marketing or distributing applicationprograms conforming to the application programming interface for the operating platform for which thesample programs are written. These examples have not been thoroughly tested under all conditions. IBM,therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy,modify, and distribute these sample programs in any form without payment to IBM for the purposes ofdeveloping, using, marketing, or distributing application programs conforming to IBMs applicationprogramming interfaces.© Copyright IBM Corp. 2004. All rights reserved. vii
  • 9. TrademarksThe following terms are trademarks of the International Business Machines Corporation in the United States,other countries, or both: AFS® Maestro™ SAA® AIX® NetView® Tivoli Enterprise™ Balance® Planet Tivoli® Tivoli® DB2® PowerPC® TotalStorage® DFS™ pSeries® WebSphere® Enterprise Storage Server® Redbooks™ ^™ IBM® Redbooks (logo) ™ z/OS® LoadLeveler® RS/6000®The following terms are trademarks of other companies:Intel, Intel Inside (logos), and Pentium are trademarks of Intel Corporation in the United States, othercountries, or both.Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States,other countries, or both.Java and all Java-based trademarks and logos are trademarks or registered trademarks of SunMicrosystems, Inc. in the United States, other countries, or both.UNIX is a registered trademark of The Open Group in the United States and other countries.Other company, product, and service names may be trademarks or service marks of others.viii High Availability Scenarios with IBM Tivoli Workload Scheduler and IBM Tivoli Framework
  • 10. Preface
This IBM® Redbook is intended to be used as a major reference for designing and creating highly available IBM Tivoli® Workload Scheduler and Tivoli Framework environments. IBM Tivoli Workload Scheduler Version 8.2 is the IBM strategic scheduling product that runs on many different platforms, including the mainframe. Here, we describe how to install ITWS Version 8.2 in a high availability (HA) environment and configure it to meet high availability requirements. The focus is on the IBM Tivoli Workload Scheduler Version 8.2 Distributed product, although some issues specific to Version 8.1 and IBM Tivoli Workload Scheduler for z/OS® are also briefly covered.
When implementing a highly available IBM Tivoli Workload Scheduler environment, you have to consider high availability for both the IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework environments, because IBM Tivoli Workload Scheduler uses IBM Tivoli Management Framework’s services for authentication. Therefore, we discuss techniques you can use to successfully implement IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework (TMR server, Managed Nodes and Endpoints), and we present two major case studies: High-Availability Cluster Multiprocessing (HACMP) for AIX®, and Microsoft® Windows® Cluster Service.
The implementation of IBM Tivoli Workload Scheduler within a high availability environment will vary from platform to platform and from customer to customer, based on the needs of the installation. Here, we cover the most common scenarios and share practical implementation tips. We also make recommendations for other high availability platforms; although there are many different clustering technologies in the market today, they are similar enough to allow us to offer useful advice regarding the implementation of a highly available scheduling system. Finally, although we primarily cover highly available scheduling systems, we also offer a section for customers who want to implement a highly available IBM Tivoli Management Framework environment, but who are not currently using IBM Tivoli Workload Scheduler.
The team that wrote this redbook
This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, Austin Center.
  • 11. Vasfi Gucer is an IBM Certified Consultant IT Specialist at the ITSO Austin Center. He has been with IBM Turkey for 10 years, and has worked at the ITSO since January 1999. He has more than 10 years of experience in systems management, networking hardware, and distributed platform software. He has worked on various Tivoli customer projects as a Systems Architect and Consultant in Turkey and in the United States, and is also a Certified Tivoli Consultant. Satoko Egawa is an I/T Specialist with IBM Japan. She has five years of experience in systems management solutions. Her area of expertise is job scheduling solutions using Tivoli Workload Scheduler. She is also a Tivoli Certified Consultant, and in the past has worked closely with the Tivoli Rome Lab. David Oswald is a Certified IBM Tivoli Services Specialist in New Jersey, United States, who works on IBM Tivoli Workload Scheduling and Tivoli storage architectures/deployments (TSRM, TSM,TSANM) for IBM customers located in the United States, Europe, and Latin America. He has been involved in disaster recovery, UNIX administration, shell scripting and automation for 17 years, and has worked with TWS Versions 5.x, 6.x, 7.x, and 8.x. While primarily a Tivoli services consultant, he is also involved in Tivoli course development, Tivoli certification exams, and Tivoli training efforts. Geoff Pusey is a Senior I/T Specialist in the IBM Tivoli Services EMEA region. He is a Certified IBM Tivoli Workload Scheduler Consultant and has been with Tivoli/IBM since January 1998, when Unison Software was acquired by Tivoli Systems. He has worked with the IBM Tivoli Workload Scheduling product for the last 10 years as a consultant, performing customer training, implementing and customizing IBM Tivoli Workload Scheduler, creating customized scripts to generate specific reports, and enhancing IBM Tivoli Workload Scheduler with new functions. John Webb is a Senior Consultant for Tivoli Services Latin America. He has been with IBM since 1998. Since joining IBM, John has made valuable contributions to the company through his knowledge and expertise in enterprise systems management. He has deployed and designed systems for numerous customers, and his areas of expertise include the Tivoli Framework and Tivoli PACO products. Anthony Yen is a Senior IT Consultant with IBM Business Partner Automatic IT Corporation, <http://www.AutomaticIT.com>, in Austin, Texas, United States. He has delivered 19 projects involving 11 different IBM Tivoli products over the past six years. His areas of expertise include Enterprise Console, Monitoring, Workload Scheduler, Configuration Manager, Remote Control, and NetView®. He has given talks at Planet Tivoli® and Automated Systems And Planning OPC and TWS Users Conference (ASAP), and has taught courses on IBM Tivolix High Availability Scenarios with IBM Tivoli Workload Scheduler and IBM Tivoli Framework
  • 12. Workload Scheduler. Before that, he worked in the IT industry for 10 years as a UNIX and Windows system administrator. He has been an IBM Certified Tivoli Consultant since 1998. Thanks to the following people for their contributions to this project: Octavian Lascu, Dino Quintero International Technical Support Organization, Poughkeepsie Center Jackie Biggs, Warren Gill, Elaine Krakower, Tina Lamacchia, Grant McLaughlin, Nick Lopez IBM USA Antonio Gallotti IBM ItalyBecome a published author Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. Youll team with IBM technical professionals, Business Partners and/or customers. Your efforts will help increase product acceptance and customer satisfaction. As a bonus, youll develop a network of contacts in IBM development labs, and increase your productivity and marketability. Find out more about the residency program, browse the residency index, and apply online at: ibm.com/Redbooks/residencies.htmlComments welcome Your comments are important to us! We want our Redbooks™ to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways: Use the online Contact us review Redbook form found at: ibm.com/Redbooks Send your comments in an Internet note to: Redbook@us.ibm.com Preface xi
  • 13. Mail your comments to: IBM Corporation, International Technical Support Organization Dept. JN9B Building 003 Internal Zip 2834 11400 Burnet Road Austin, Texas 78758-3493xii High Availability Scenarios with IBM Tivoli Workload Scheduler and IBM Tivoli Framework
  • 14. Chapter 1. Introduction
In this chapter, we introduce the IBM Tivoli Workload Scheduler suite and identify the need for high availability by IBM Tivoli Workload Scheduler users. Important ancillary concepts in IBM Tivoli Management Framework (also referred to as Tivoli Framework, or TMF) and clustering technologies are introduced for new users as well.
The following topics are covered in this chapter:
- “IBM Tivoli Workload Scheduler architectural overview” on page 2
- “IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework” on page 4
- “High availability terminology used in this book” on page 7
- “Overview of clustering technologies” on page 8
- “When to implement IBM Tivoli Workload Scheduler high availability” on page 24
- “Material covered in this book” on page 27
  • 15. 1.1 IBM Tivoli Workload Scheduler architectural overview
IBM Tivoli Workload Scheduler Version 8.2 is the IBM strategic scheduling product that runs on many different platforms, including the mainframe. This redbook covers installing ITWS Version 8.2 in a high availability (HA) environment and configuring it to meet high availability requirements. The focus is on the IBM Tivoli Workload Scheduler Version 8.2 Distributed product, although some issues specific to Version 8.1 and IBM Tivoli Workload Scheduler for z/OS are also briefly covered.
Understanding specific aspects of IBM Tivoli Workload Scheduler’s architecture is key to a successful high availability implementation. In-depth knowledge of the architecture is necessary for resolving some problems that might present themselves during the deployment of IBM Tivoli Workload Scheduler in an HA environment. We identify only those aspects of the architecture that are directly involved in a high availability deployment. For a detailed discussion of IBM Tivoli Workload Scheduler’s architecture, refer to Chapter 2, “Overview”, in IBM Tivoli Workload Scheduling Suite Version 8.2, General Information, SC32-1256.
IBM Tivoli Workload Scheduler uses the TCP/IP-based network connecting an enterprise’s servers to accomplish its mission of scheduling jobs. A job is an executable file, program, or command that is scheduled and launched by IBM Tivoli Workload Scheduler. All servers that run jobs using IBM Tivoli Workload Scheduler make up the scheduling network. A scheduling network contains at least one domain, the master domain, in which a server designated as the Master Domain Manager (MDM) is the management hub. This server contains the definitions of all scheduling objects that define the batch schedule, stored in a database. Additional domains can be used to divide a widely distributed network into smaller, locally managed groups. The management hubs for these additional domains are called Domain Manager servers.
Each server in the scheduling network is called a workstation, or by the interchangeable term CPU. There are different types of workstations that serve different roles. For the purposes of this publication, it is sufficient to understand that a workstation can be one of the following types. You have already been introduced to one of them, the Master Domain Manager. The other types of workstations are Domain Manager (DM) and Fault Tolerant Agent (FTA). Figure 1-1 on page 3 shows the relationship between these architectural elements in a sample scheduling network.
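To make these workstation types concrete before turning to Figure 1-1, the following is a minimal sketch of how a Master Domain Manager and a Fault Tolerant Agent might be defined through the composer command line. It is illustrative only: the host names, TCP port, domain names, and file path are assumptions, and the attributes you actually use should follow the product documentation for your environment.

# Hypothetical workstation definitions for a small scheduling network
# (host names, port, domain names, and file path are examples only;
#  domains such as DOMAINA are defined separately)
cat > /tmp/cpu_defs.txt <<'EOF'
CPUNAME MASTER
  DESCRIPTION "Master Domain Manager"
  OS UNIX
  NODE mdm.example.com TCPADDR 31111
  DOMAIN MASTERDM
  FOR MAESTRO
    TYPE MANAGER
    AUTOLINK ON
    FULLSTATUS ON
END

CPUNAME FTA1
  DESCRIPTION "Fault Tolerant Agent in DomainA"
  OS UNIX
  NODE fta1.example.com TCPADDR 31111
  DOMAIN DOMAINA
  FOR MAESTRO
    TYPE FTA
    AUTOLINK ON
END
EOF

# Load the definitions into the scheduling object database
# (run as the TWS user on the Master Domain Manager)
composer add /tmp/cpu_defs.txt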
  • 16. Figure 1-1 Main architectural elements of IBM Tivoli Workload Scheduler relevant to high availability
The lines between the workstations show how IBM Tivoli Workload Scheduler communicates between them. For example, if the MDM needs to send a command to FTA2, it would pass the command via DM_A. In this example scheduling network, the Master Domain Manager is the management hub for two Domain Managers, DM_A and DM_B. Each Domain Manager in turn is the management hub for two Fault Tolerant Agents. DM_A is the hub for FTA1 and FTA2, and DM_B is the hub for FTA3 and FTA4.
IBM Tivoli Workload Scheduler operations revolve around a production day, a 24-hour cycle initiated by a job called Jnextday that runs on the Master Domain Manager.
  • 17. Interrupting or delaying this process has serious ramifications for the proper functioning of the scheduling network.
Based upon this architecture, we determined that making IBM Tivoli Workload Scheduler highly available requires configuring at least the Master Domain Manager server for high availability. This delivers high availability of the scheduling object definitions. In some sites, even the Domain Manager and Fault Tolerant Agent servers are configured for high availability, depending upon specific business requirements.
1.2 IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework
IBM Tivoli Workload Scheduler provides out-of-the-box integration with up to six other IBM products:
- IBM Tivoli Management Framework
- IBM Tivoli Business Systems Manager
- IBM Tivoli Enterprise Console
- IBM Tivoli NetView
- IBM Tivoli Distributed Monitoring (Classic Edition)
- IBM Tivoli Enterprise Data Warehouse
Other IBM Tivoli products, such as IBM Tivoli Configuration Manager, can also be integrated with IBM Tivoli Workload Scheduler but require further configuration not provided out of the box. Best practices call for implementing IBM Tivoli Management Framework on the same Master Domain Manager server used by IBM Tivoli Workload Scheduler. Figure 1-2 on page 5 shows a typical configuration of all six products, hosted on five servers (IBM Tivoli Business Systems Manager is often hosted on two separate servers).
  • 18. Figure 1-2 Typical site configuration of all Tivoli products that can be integrated with IBM Tivoli Workload Scheduler out of the box
In this redbook, we show how to configure IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework for high availability, corresponding to the upper left server in the preceding example site configuration. Sites that want to implement other products on an IBM Tivoli Workload Scheduler Master Domain Manager server for high availability should consult their IBM service provider.
IBM Tivoli Workload Scheduler uses IBM Tivoli Management Framework to deliver authentication services for the Job Scheduling Console GUI client, and to communicate with the Job Scheduling Console in general. Two components are used within IBM Tivoli Management Framework to accomplish these responsibilities: the Connector, and Job Scheduling Services (JSS). These components are only required on the Master Domain Manager server. For the purposes of this redbook, be aware that high availability of IBM Tivoli Workload Scheduler requires proper configuration of IBM Tivoli Management Framework, all Connector instances, and the Job Scheduling Services component. Figure 1-3 on page 6 shows the relationships between IBM Tivoli Management Framework, the Job Scheduling Services component, the IBM Tivoli Workload Scheduler job scheduling engine, and the Job Scheduling Console.
  • 19. Figure 1-3 Relationship between major components of IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework
In this example, Job Scheduling Console instances on three laptops are connected to a single instance of IBM Tivoli Management Framework. This instance of IBM Tivoli Management Framework serves two different scheduling networks called Production_A and Production_B via two Connectors called Connector_A and Connector_B. Note that there is only ever one instance of the Job Scheduling Services component, no matter how many instances of the Connector and Job Scheduling Console exist in the environment.
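On the Master Domain Manager you can see how these pieces are registered with the Framework from the command line. This is a hedged sketch: the environment script path is the usual UNIX default, and the listing assumes the Connector is already installed.

# Source the Tivoli Framework environment (default location on UNIX)
. /etc/Tivoli/setup_env.sh

# List the Connector instances registered with the Framework; each entry
# corresponds to a scheduling engine such as Production_A or Production_B
wlookup -ar MaestroEngine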
  • 20. It is possible to install IBM Tivoli Workload Scheduler without using the Connector and Job Scheduling Services components. However, without these components the benefits of the Job Scheduling Console cannot be realized. This is only an option if a customer is willing to perform all operations solely from the command line interface. In high availability contexts, both IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework are typically deployed in a high availability environment. In this redbook, we show how to deploy IBM Tivoli Workload Scheduler both with and without IBM Tivoli Management Framework.
1.3 High availability terminology used in this book
It helps to share a common terminology for concepts used in this redbook. The high availability field often uses multiple terms for the same concept, but in this redbook, we adhere to conventions set by International Business Machines Corporation whenever possible.
- Cluster: A group of servers configured for high availability of one or more applications.
- Node: A single server in a cluster.
- Primary: A node that initially runs an application when a cluster is started.
- Backup: One or more nodes that are designated as the servers an application will be migrated to if the application’s primary node fails.
- Joining: The process of a node announcing its availability to the cluster.
- Fallover: The process of a backup node taking over an application from a failed primary node.
- Reintegration: The process of a failed primary node that was repaired rejoining a cluster. Note that the primary node’s application does not necessarily have to migrate back to the primary node; see fallback.
- Fallback: The process of migrating an application from a backup node to a primary node. Note that the primary node does not have to be the original primary node (for example, it can be a new node that joins the cluster).
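In terms of this vocabulary, after a fallover of a scheduling node an operator typically confirms that the engine is running again and that the agents have re-linked before looking at individual jobs. A hedged sketch of such a check, run as the TWS user on the node that now hosts the workload (output and workstation names depend on your plan):

conman "sc @!@"    # show every workstation in every domain, including link state
conman "ss @#@"    # show the state of every job stream in the current plan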
  • 21. For more terms commonly used when configuring high availability, refer to High Availability Cluster Multi-Processing for AIX Master Glossary, Version 5.1, SC23-4867.
1.4 Overview of clustering technologies
In this section we give an overview of clustering technologies with respect to high availability. A cluster is a group of loosely coupled machines networked together, sharing disk resources. While clusters can be used for more than just their high availability benefits (such as cluster multi-processing), in this document we are only concerned with illustrating the high availability benefits; consult your IBM service provider for information about how to take advantage of the other benefits of clusters for IBM Tivoli Workload Scheduler.
Clusters provide a highly available environment for mission-critical applications. For example, a cluster could run a database server program that services client applications on other systems. Clients send queries to the server program, which responds to their requests by accessing a database stored on a shared external disk. A cluster takes measures to ensure that the applications remain available to client processes even if a component in a cluster fails. To ensure availability, in case of a component failure, a cluster moves the application (along with resources that ensure access to the application) to another node in the cluster.
1.4.1 High availability versus fault tolerance
It is important to understand that we are detailing how to install IBM Tivoli Workload Scheduler in a highly available, but not a fault-tolerant, configuration. Fault tolerance relies on specialized hardware to detect a hardware fault and instantaneously switch to a redundant hardware component (whether the failed component is a processor, memory board, power supply, I/O subsystem, or storage subsystem). Although this cut-over is apparently seamless and offers non-stop service, a high premium is paid in both hardware cost and performance because the redundant components do no processing. More importantly, the fault-tolerant model does not address software failures, by far the most common reason for downtime.
High availability views availability not as a series of replicated physical components, but rather as a set of system-wide, shared resources that cooperate to guarantee essential services. High availability combines software with industry-standard hardware to minimize downtime by quickly restoring essential services when a system, component, or application fails. While not instantaneous, services are restored rapidly, often in less than a minute.
  • 22. The difference between fault tolerance and high availability, then, is this: a fault-tolerant environment has no service interruption, while a highly available environment has a minimal service interruption. Many sites are willing to absorb a small amount of downtime with high availability rather than pay the much higher cost of providing fault tolerance. Additionally, in most highly available configurations, the backup processors are available for use during normal operation.
High availability systems are an excellent solution for applications that can withstand a short interruption should a failure occur, but which must be restored quickly. Some industries have applications so time-critical that they cannot withstand even a few seconds of downtime. Many other industries, however, can withstand small periods of time when their database is unavailable. For those industries, HACMP can provide the necessary continuity of service without total redundancy. Figure 1-4 shows the costs and benefits of availability technologies.
Figure 1-4 Cost and benefits of availability technologies
As you can see, availability is not an all-or-nothing proposition. Think of availability as a continuum. Reliable hardware and software provide the base level of availability. Advanced features such as RAID devices provide an enhanced level of availability. High availability software provides near-continuous access to data and applications.
  • 23. Fault-tolerant systems ensure the constant availability of the entire system, but at a higher cost.
1.4.2 Server versus job availability
You should also be aware of the difference between availability of the server and availability of the jobs the server runs. This redbook shows how to implement a highly available server. Ensuring the availability of the jobs is addressed on a job-by-job basis.
For example, Figure 1-5 shows a production day with four job streams, labeled A, B, C and D. In this example, a failure incident occurs between job streams B and D, during a period of the production day when no other job streams are running.
Figure 1-5 Example disaster recovery incident where no job recovery is required
Because no jobs or job streams are running at the moment of the failure, making IBM Tivoli Workload Scheduler itself highly available is sufficient to bring back scheduling services. No recovery of interrupted jobs is required.
Now suppose that job streams B and D must complete before a database change is committed. If the failure happened during job stream D, as in Figure 1-6 on page 11, then before IBM Tivoli Workload Scheduler is restarted on a new server, the database needs to be rolled back so that when job stream B is restarted, it will not corrupt the database.
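Recovery behavior of this kind can often be declared at the job level. The following is a hedged sketch of job definitions in which a failed database job triggers a rollback job and is then rerun; the workstation, job, script, and user names are hypothetical, and whether this pattern fits depends on the business logic of your production plan.

# Hypothetical job definitions illustrating job-level recovery options
cat > /tmp/job_defs.txt <<'EOF'
$JOBS
MASTER#DB_UPDATE
  SCRIPTNAME "/opt/app/bin/db_update.sh"
  STREAMLOGON appuser
  DESCRIPTION "Commits the database change produced by job streams B and D"
  RECOVERY RERUN AFTER MASTER#DB_ROLLBACK

MASTER#DB_ROLLBACK
  SCRIPTNAME "/opt/app/bin/db_rollback.sh"
  STREAMLOGON appuser
  DESCRIPTION "Rolls the database back before DB_UPDATE is rerun"
  RECOVERY STOP
EOF

# Load the definitions into the scheduling object database
# (run as the TWS user on the Master Domain Manager)
composer add /tmp/job_defs.txt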
  • 24. Figure 1-6 Example disaster recovery incident where job recovery not related to IBM Tivoli Workload Scheduler is required
This points out some important observations about high availability with IBM Tivoli Workload Scheduler. It is your responsibility to ensure that the application-specific business logic of your application is preserved across a disaster incident. For example, IBM Tivoli Workload Scheduler cannot know that a database needs to be rolled back before a job stream is restarted as part of a high availability recovery. Knowing what job streams and jobs to restart after IBM Tivoli Workload Scheduler falls over to a backup server depends upon the specific business logic of your production plan. In fact, it is critical to the success of a recovery effort that the precise state of the production day at the moment of failure is communicated to the team performing the recovery.
Let’s look at Figure 1-7 on page 12, which illustrates an even more complex situation: multiple job streams are interrupted, each requiring its own, separate recovery activity.
  • 25. Figure 1-7 Example disaster recovery incident requiring multiple, different job recovery actions
The recovery actions for job stream A in this example are different from the recovery actions for job stream B. In fact, depending upon the specifics of what your jobs and job streams run, the recovery actions that are required for a job stream after a disaster incident could differ depending upon which jobs in the job stream finished before the failure.
The scenario this redbook is most directly applicable to is restarting an IBM Tivoli Workload Scheduler Master Domain Manager server on a highly available cluster where no job streams other than FINAL are executed. The contents of this redbook can also be applied to Master Domain Manager, Domain Manager, and Fault Tolerant Agent servers that run job streams requiring specific recovery actions as part of a high availability recovery. But implementing these scenarios requires simultaneous implementation of high availability for the individual jobs. The exact details of such implementations are specific to your jobs, and cannot be generalized in a “cookbook” manner. If high availability at the job level is an important criterion, your IBM service provider can help you to implement it.
1.4.3 Standby versus takeover configurations
There are two basic types of cluster configurations:
Standby: This is the traditional redundant hardware configuration. One or more standby nodes are set aside idling, waiting for a primary server in the cluster to fail. This is also known as hot standby.
  • 26. Takeover: In this configuration, all cluster nodes process part of the cluster’s workload. No nodes are set aside as standby nodes. When a primary node fails, one of the other nodes assumes the workload of the failed node in addition to its existing primary workload. This is also known as mutual takeover.
Typically, implementations of both configurations will involve shared resources. Disks or mass storage such as a Storage Area Network (SAN) are most frequently configured as a shared resource.
Figure 1-8 shows a standby configuration in normal operation, where Node A is the primary node, and Node B is the standby node and currently idling. While Node B has a connection to the shared mass storage resource, it is not active during normal operation.
Figure 1-8 Standby configuration in normal operation
After Node A falls over to Node B, the connection to the mass storage resource from Node B will be activated, and because Node A is unavailable, its connection to the mass storage resource is inactive. This is shown in Figure 1-9 on page 14.
  • 27. Figure 1-9 Standby configuration in fallover operation
By contrast, in a takeover configuration of this environment, both nodes access the shared disk resource at the same time. For IBM Tivoli Workload Scheduler high availability configurations, this usually means that the shared disk resource has separate, logical filesystem volumes, each accessed by a different node. This is illustrated by Figure 1-10 on page 15.
  • 28. Figure 1-10 Takeover configuration in normal operation
During normal operation of this two-node highly available cluster in a takeover configuration, the filesystem Node A FS is accessed by App 1 on Node A, while the filesystem Node B FS is accessed by App 2 on Node B. If either node fails, the other node will take on the workload of the failed node. For example, if Node A fails, App 1 is restarted on Node B, and Node B opens a connection to filesystem Node A FS. This fallover scenario is illustrated by Figure 1-11 on page 16.
  • 29. Figure 1-11 Takeover configuration in fallover operation
Takeover configurations are more efficient with hardware resources than standby configurations because there are no idle nodes. Performance can degrade after a node failure, however, because the overall load on the remaining nodes increases. In this redbook we will be showing how to configure IBM Tivoli Workload Scheduler for takeover high availability.
1.4.4 IBM HACMP
The IBM tool for building UNIX-based, mission-critical computing platforms is the HACMP software. The HACMP software ensures that critical resources, such as applications, are available for processing. HACMP has two major components: high availability (HA) and cluster multi-processing (CMP). In this document we focus upon the HA component.
The primary reason to create HACMP Clusters is to provide a highly available environment for mission-critical applications. For example, an HACMP Cluster could run a database server program that services client applications. The clients send queries to the server program, which responds to their requests by accessing a database stored on a shared external disk.
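Later chapters put IBM Tivoli Workload Scheduler itself under HACMP control in exactly this way: HACMP starts and stops the application through administrator-supplied scripts. The following is only a minimal sketch of what such scripts might contain, assuming a hypothetical TWS user tws82 whose home directory resides on the shared disk; the scripts used in the scenarios later in this book are more elaborate.

#!/bin/ksh
# start_tws - hypothetical HACMP application server start script
# (assumes the shared volume group and filesystem are already online)
su - tws82 -c "./StartUp"        # start the netman listener from the TWS home directory
su - tws82 -c "conman start"     # start batchman and the rest of the engine
exit 0

#!/bin/ksh
# stop_tws - hypothetical HACMP application server stop script
su - tws82 -c 'conman "unlink @;noask"'   # unlink remote workstations
su - tws82 -c 'conman "stop;wait"'        # stop the engine processes
su - tws82 -c 'conman "shut;wait"'        # stop netman itself
exit 0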
  • 30. In an HACMP Cluster, to ensure the availability of these applications, the applications are put under HACMP control. HACMP takes measures to ensure that the applications remain available to client processes even if a component in a cluster fails. To ensure availability, in case of a component failure, HACMP moves the application (along with resources that ensure access to the application) to another node in the cluster.
Benefits
HACMP helps you with each of the following:
- The HACMP planning process and documentation include tips and advice on the best practices for installing and maintaining a highly available HACMP Cluster.
- Once the cluster is operational, HACMP provides the automated monitoring and recovery for all the resources on which the application depends.
- HACMP provides a full set of tools for maintaining the cluster, while keeping the application available to clients.
HACMP lets you:
- Set up an HACMP environment using online planning worksheets to simplify initial planning and setup.
- Ensure high availability of applications by eliminating single points of failure in an HACMP environment.
- Leverage high availability features available in AIX.
- Manage how a cluster handles component failures.
- Secure cluster communications.
- Set up fast disk takeover for volume groups managed by the Logical Volume Manager (LVM).
- Manage event processing for an HACMP environment.
- Monitor HACMP components and diagnose problems that may occur.
For a general overview of all HACMP features, see the IBM Web site:
http://www-1.ibm.com/servers/aix/products/ibmsw/high_avail_network/hacmp.html
Enhancing availability with the AIX software
HACMP takes advantage of the features in AIX, which is the high-performance UNIX operating system. AIX Version 5.1 adds new functionality to further improve security and system availability. This includes improved availability of mirrored data and enhancements to Workload Manager that help solve problems of mixed workloads by dynamically providing resource availability to critical applications.
  • 31. Used with IBM eServer™ pSeries®, HACMP can provide both horizontal and vertical scalability, without downtime. The AIX operating system provides numerous features designed to increase system availability by lessening the impact of both planned (data backup, system administration) and unplanned (hardware or software failure) downtime. These features include:
- Journaled File System and Enhanced Journaled File System
- Disk mirroring
- Process control
- Error notification
The IBM HACMP software provides a low-cost commercial computing environment that ensures that mission-critical applications can recover quickly from hardware and software failures. The HACMP software is a high availability system that ensures that critical resources are available for processing. High availability combines custom software with industry-standard hardware to minimize downtime by quickly restoring services when a system, component, or application fails. While not instantaneous, the restoration of service is rapid, usually 30 to 300 seconds.
Physical components of an HACMP Cluster
HACMP provides a highly available environment by identifying a set of resources essential to uninterrupted processing, and by defining a protocol that nodes use to collaborate to ensure that these resources are available. HACMP extends the clustering model by defining relationships among cooperating processors where one processor provides the service offered by a peer, should the peer be unable to do so.
An HACMP Cluster is made up of the following physical components:
- Nodes
- Shared external disk devices
- Networks
- Network interfaces
- Clients
The HACMP software allows you to combine physical components into a wide range of cluster configurations, providing you with flexibility in building a cluster that meets your processing requirements. Figure 1-12 on page 19 shows one example of an HACMP Cluster.
  • 32. Other HACMP Clusters could look very different, depending on the number of processors, the choice of networking and disk technologies, and so on.
Figure 1-12 Example HACMP Cluster
Nodes
Nodes form the core of an HACMP Cluster. A node is a processor that runs both AIX and the HACMP software. The HACMP software supports pSeries uniprocessor and symmetric multiprocessor (SMP) systems, and the Scalable POWERparallel (SP) systems as cluster nodes. To the HACMP software, an SMP system looks just like a uniprocessor. SMP systems provide a cost-effective way to increase cluster throughput. Each node in the cluster can be a large SMP machine, extending an HACMP Cluster far beyond the limits of a single system and allowing thousands of clients to connect to a single database.
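Once such a cluster is running, you can confirm from any node that cluster services are active and view node status. This is a hedged sketch using commands shipped with AIX and HACMP; exact output and subsystem names vary by HACMP level.

# Check that the HACMP cluster services are active on this node
lssrc -g cluster

# Display cluster, node, and network interface status via the Clinfo-based clstat utility
/usr/es/sbin/cluster/clstat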
  • 33. In an HACMP Cluster, up to 32 RS/6000® or pSeries stand-alone systems, pSeries systems divided into LPARs, SP nodes, or a combination of these cooperate to provide a set of services or resources to other entities. Clustering these servers to back up critical applications is a cost-effective high availability option. A business can use more of its computing power, while ensuring that its critical applications resume running after a short interruption caused by a hardware or software failure.
In an HACMP Cluster, each node is identified by a unique name. A node may own a set of resources (disks, volume groups, filesystems, networks, network addresses, and applications). Typically, a node runs a server or a “back-end” application that accesses data on the shared external disks. The HACMP software supports from 2 to 32 nodes in a cluster, depending on the disk technology used for the shared external disks. A node in an HACMP Cluster has several layers of software components.
Shared external disk devices
Each node must have access to one or more shared external disk devices. A shared external disk device is a disk physically connected to multiple nodes. The shared disk stores mission-critical data, typically mirrored or RAID-configured for data redundancy. A node in an HACMP Cluster must also have internal disks that store the operating system and application binaries, but these disks are not shared.
Depending on the type of disk used, the HACMP software supports two types of access to shared external disk devices: non-concurrent access, and concurrent access. In non-concurrent access environments, only one connection is active at any given time, and the node with the active connection owns the disk. When a node fails, disk takeover occurs when the node that currently owns the disk leaves the cluster and a surviving node assumes ownership of the shared disk. This is what we show in this redbook. In concurrent access environments, the shared disks are actively connected to more than one node simultaneously. Therefore, when a node fails, disk takeover is not required. We do not show this here because concurrent access does not support the use of the Journaled File System (JFS), and JFS is required to use either IBM Tivoli Workload Scheduler or IBM Tivoli Management Framework.
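In the non-concurrent model, “disk takeover” means that the surviving node brings the shared volume group and its journaled filesystems online before the application is restarted. HACMP drives this automatically through its event processing; the sketch below only illustrates the effect, and the volume group and mount point names are hypothetical.

# Effect of disk takeover on the surviving node (normally performed by HACMP itself)
varyonvg tws_vg        # activate the hypothetical shared volume group
mount /usr/maestro     # mount its JFS at the hypothetical mount point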
exchange heartbeat messages and, in concurrent access environments, serialize access to data. The HACMP software has been tested with Ethernet, Token-Ring, ATM, and other networks.

The HACMP software defines two types of communication networks, characterized by whether these networks use communication interfaces based on the TCP/IP subsystem (TCP/IP-based), or communication devices based on non-TCP/IP subsystems (device-based).

Clients

A client is a processor that can access the nodes in a cluster over a local area network. Clients each run a front-end or client application that queries the server application running on the cluster node.

The HACMP software provides a highly available environment for critical data and applications on cluster nodes. Note that the HACMP software does not make the clients themselves highly available. AIX clients can use the Client Information (Clinfo) services to receive notice of cluster events. Clinfo provides an API that displays cluster status information. The /usr/es/sbin/cluster/clstat utility, a Clinfo client shipped with the HACMP software, provides information about all cluster service interfaces.

The clients for IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework are the Job Scheduling Console and the Tivoli Desktop applications, respectively. These clients do not support the Clinfo API, but feedback that the cluster server is not available is immediately provided within these clients.

1.4.5 Microsoft Cluster Service

Microsoft Cluster Service (MSCS) provides three primary services:

- Availability: Continue providing a service even during hardware or software failure. This redbook focuses upon leveraging this feature of MSCS.
- Scalability: Enable additional components to be configured as system load increases.
- Simplification: Manage groups of systems and their applications as a single system.

MSCS is a built-in feature of Windows NT/2000 Server Enterprise Edition. It is software that supports the connection of two servers into a cluster for higher availability and easier manageability of data and applications. MSCS can automatically detect and recover from server or application failures. It can be used to move server workload to balance utilization and to provide for planned maintenance without downtime.
MSCS uses software heartbeats to detect failed applications or servers. In the event of a server failure, it employs a shared nothing clustering architecture that automatically transfers ownership of resources (such as disk drives and IP addresses) from a failed server to a surviving server. It then restarts the failed server's workload on the surviving server. All of this, from detection to restart, typically takes under a minute.

If an individual application fails (but the server does not), MSCS will try to restart the application on the same server. If that fails, it moves the application's resources and restarts it on the other server.

MSCS does not require any special software on client computers, so the user experience during failover depends on the nature of the client side of the client-server application. Client reconnection is often transparent because MSCS restarts the application using the same IP address. If a client is using stateless connections (such as a browser connection), it would be unaware of a failover that occurred between server requests. If a failure occurs while a client is connected to the failed resources, the client receives whatever standard notification is provided by the client side of the application in use. For a client-side application that has stateful connections to the server, a new logon is typically required following a server failure.

No manual intervention is required when a server comes back online following a failure. As an example, when a server that is running Microsoft Cluster Server (server A) boots, it starts the MSCS service automatically. MSCS in turn checks the interconnects to find the other server in its cluster (server B). If server A finds server B, then server A rejoins the cluster and server B updates it with current cluster information. Server A can then initiate a failback, moving failed-over workload back from server B to server A.

Microsoft Cluster Service concepts

Microsoft provides an overview of MSCS in a white paper that is available at:

http://www.microsoft.com/ntserver/ProductInfo/Enterprise/clustering/ClustArchit.asp

The key concepts of MSCS are covered in this section.

Shared nothing

Microsoft Cluster employs a shared nothing architecture in which each server owns its own disk resources (that is, they share nothing at any point in time). In the event of a server failure, a shared nothing cluster has software that can transfer ownership of a disk from one server to another.
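In practice, the transfer of group ownership described above can also be observed and driven manually with the cluster.exe command-line utility that ships with MSCS, in addition to the Cluster Administrator GUI. The following is a minimal, hedged sketch rather than a definitive procedure; the group name TWS_Group and node name serverB are illustrative only, and the exact option syntax may vary slightly between Windows releases:

   rem List all resource groups, their current owner node, and their state
   cluster group

   rem Show the state of a single group (here assumed to be named "TWS_Group")
   cluster group "TWS_Group" /status

   rem Manually move the group to the other node, for example before planned maintenance
   cluster group "TWS_Group" /moveto:serverB

A manual move like this exercises the same mechanism that MSCS uses during an automatic failover, so it is a convenient way to validate a cluster configuration before relying on it.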
Cluster Services

Cluster Services is the collection of software on each node that manages all cluster-specific activity.

Resource

A resource is the canonical item managed by the Cluster Service. A resource may include physical hardware devices (such as disk drives and network cards), or logical items (such as logical disk volumes, TCP/IP addresses, entire applications, and databases).

Group

A group is a collection of resources to be managed as a single unit. A group contains all of the elements needed to run a specific application, and for client systems to connect to the service provided by the application. Groups allow an administrator to combine resources into larger logical units and manage them as a unit. Operations performed on a group affect all resources within that group.

Fallback

Fallback (also referred to as failback) is the ability to automatically rebalance the workload in a cluster when a failed server comes back online. This is a standard feature of MSCS. For example, say server A has crashed, and its workload failed over to server B. When server A reboots, it finds server B and rejoins the cluster. It then checks to see whether any of the cluster groups running on server B would prefer to be running on server A. If so, it automatically moves those groups from server B to server A. Fallback properties include information such as which groups can fall back, which server is preferred, and during what hours a fallback is allowed. These properties can all be set from the cluster administration console; a command-line sketch is shown after the Quorum Disk discussion below.

Quorum Disk

A Quorum Disk is a disk spindle that MSCS uses to determine whether another server is up or down.

When a cluster member is booted, it checks whether the cluster software is already running in the network:

- If it is running, the cluster member joins the cluster.
- If it is not running, the booting member establishes the cluster in the network.

A problem may occur if two cluster members are restarting at the same time, each trying to form its own cluster. This potential problem is solved by the Quorum Disk concept. This is a resource that can be owned by one server at a time, and for which servers negotiate for ownership. The member that owns the Quorum Disk creates the cluster. If the member that owns the Quorum Disk fails, the resource is reallocated to another member, which in turn creates the cluster. Negotiating for the quorum drive allows MSCS to avoid split-brain situations where both servers are active and think the other server is down.
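As referenced in the Fallback discussion above, failback behavior can also be scripted instead of being set through the cluster administration console. The following hedged sketch uses cluster.exe group properties; the group name GRP_A is illustrative, and the property names and value ranges shown reflect our understanding of the common MSCS group properties, so verify them against the MSCS documentation for your Windows release:

   rem Allow the group to fail back automatically to its preferred owner node
   cluster group "GRP_A" /prop AutoFailbackType=1

   rem Restrict automatic failback to the window between 01:00 and 04:00
   cluster group "GRP_A" /prop FailbackWindowStart=1
   cluster group "GRP_A" /prop FailbackWindowEnd=4

   rem Display the resulting group properties
   cluster group "GRP_A" /prop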
Load balancing

Load balancing is the ability to move work from a very busy server to a less busy server.

Virtual server

A virtual server is the logical equivalent of a file or application server. There is no physical component in MSCS that is a virtual server; rather, resources are associated with a virtual server. At any point in time, different virtual servers can be owned by different cluster members. The virtual server entity can also be moved from one cluster member to another in the event of a system failure.

1.5 When to implement IBM Tivoli Workload Scheduler high availability

Specifying the appropriate level of high availability for IBM Tivoli Workload Scheduler often depends upon how much reliability needs to be built into the environment, balanced against the cost of the solution. High availability is a spectrum of options, driven by what kinds of failures you want IBM Tivoli Workload Scheduler to survive. These options lead to innumerable permutations of high availability configurations and scenarios. Our goal in this redbook is to demonstrate enough of the principles of configuring IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework to be highly available in a specific, non-trivial scenario that you can use the same principles to implement other configurations.

1.5.1 High availability solutions versus Backup Domain Manager

IBM Tivoli Workload Scheduler provides a degree of high availability through its Backup Domain Manager feature, which can also be implemented as a Backup Master Domain Manager. This works by duplicating the changes to the production plan from a Domain Manager to a Backup Domain Manager. When a failure is detected, a switchmgr command is issued to all workstations in the Domain Manager server's domain, causing these workstations to recognize the Backup Domain Manager.

However, properly implementing a Backup Domain Manager is difficult. Custom scripts have to be developed to implement sensing a failure, transferring the scheduling objects database, and starting the switchmgr command. The code for sensing a failure is by itself a significant effort. Possible failures to code for
include network adapter failure, disk I/O adapter failure, network communications failure, and so on.

If any jobs are run on the Domain Manager, the difficulty of implementing a Backup Domain Manager becomes even more obvious. In this case, the custom scripts also have to convert the jobs to run on the Backup Domain Manager, for instance by changing all references to the workstation name of the Domain Manager to the workstation name of the Backup Domain Manager, and changing references to the hostname of the Domain Manager to the hostname of the Backup Domain Manager.

Then even more custom scripts have to be developed to migrate scheduling object definitions back to the Domain Manager, because once the failure has been addressed, the entire process has to be reversed. The effort required can be more than the cost of acquiring a high availability product, which addresses many of the coding issues that surround detecting hardware failures. The Total Cost of Ownership of maintaining the custom scripts also has to be taken into account, especially if jobs are run on the Domain Manager. All the nuances of ensuring that the same resources that jobs expect on the Domain Manager are met on the Backup Domain Manager have to be coded into the scripts, then documented and maintained over time, presenting a constant drain on internal programming resources.

High availability products like IBM HACMP and Microsoft Cluster Service provide a well-documented, widely supported means of expressing the required resources for jobs that run on a Domain Manager. This makes it easy to add computational resources (for example, disk volumes) that jobs require into the high availability infrastructure, and to keep them easily identified and documented.

Software failures, like a critical IBM Tivoli Workload Scheduler process crashing, are addressed both by the Backup Domain Manager feature and by IBM Tivoli Workload Scheduler configured for high availability. In both configurations, recovery at the job level is often necessary to resume the production day.

Implementing high availability for Fault Tolerant Agents cannot be accomplished using the Backup Domain Manager feature. Providing hardware high availability for a Fault Tolerant Agent server can be accomplished through custom scripting, but using a high availability solution is strongly recommended.
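For reference, the switch itself is performed with the conman switchmgr command; this is the command that the custom failover scripts described above would ultimately invoke, whether driven by hand or by a high availability product. A minimal, hedged sketch follows; the domain name MASTERDM and the workstation names BACKUP and MASTER are illustrative only:

   # Switch the MASTERDM domain to the workstation acting as Backup (Master) Domain Manager
   conman "switchmgr MASTERDM;BACKUP"

   # After the original manager has been repaired, switch the domain back
   conman "switchmgr MASTERDM;MASTER"

Note that switching the manager does not by itself address the scheduling objects database, the job environment, or any jobs that were running on the failed Domain Manager; those are exactly the gaps that the custom scripts or the high availability products discussed in this section must fill.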
Table 1-1 on page 26 illustrates the comparative advantages of using a high availability solution versus the Backup Domain Manager feature to deliver a highly available IBM Tivoli Workload Scheduler configuration.

Table 1-1 Comparative advantages of using a high availability solution

Solution   Hardware   Software   FTA    Cost
HA         Yes        Yes        Yes    TCO: $$
BMDM       -          Yes        -      Initially: $; TCO: $$

1.5.2 Hardware failures to plan for

When identifying the level of high availability for IBM Tivoli Workload Scheduler, the potential hardware failures you want to plan for can affect the kind of hardware used for the high availability solution. In this section, we address some of the hardware failures you may want to consider when planning for high availability for IBM Tivoli Workload Scheduler.

Site failure occurs when an entire computer room or data center becomes unavailable. Mitigating this failure involves geographically separate nodes in a high availability cluster. Products like the IBM High Availability Geographic Cluster system (HAGEO) deliver a solution for geographic high availability. Consult your IBM service provider for help with implementing geographic high availability.

Server failure occurs when a node in a high availability cluster fails. The minimum response to mitigate this failure mode is to make a backup node available. However, you might also want to consider providing more than one backup node if the workstation you are making highly available is important enough to warrant redundant backup nodes. In this redbook we show how to implement a two-node cluster, but additional nodes are an extension of a two-node configuration. Consult your IBM service provider for help with implementing multiple-node configurations.

Network failures occur when either the network itself (through a component like a router or switch) or the network adapters on the server fail. This type of failure is often addressed with redundant network paths in the former case, and redundant network adapters in the latter case.

Disk failure occurs when a shared disk in a high availability cluster fails. Mitigating this failure mode often involves a Redundant Array of Independent Disks (RAID) array. However, even a RAID array can fail catastrophically if two or more disk drives fail at the same time, if a power supply fails, or if a backup power supply fails at the same time as the primary power supply. Planning for these catastrophic failures usually involves creating one or more mirrors of the RAID array, sometimes even on separate array hardware. Products like the IBM TotalStorage® Enterprise Storage Server® (ESS) and TotalStorage 7133 Serial Disk System can address these kinds of advanced disk availability requirements.
These are only the most common hardware failures to plan for. Other failures may also be considered while planning for high availability.

1.5.3 Summary

In summary, for all but the simplest configuration of IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework, using a high availability solution to deliver high availability services is the recommended approach to satisfy high availability requirements. Identifying the kinds of hardware and software failures you want your IBM Tivoli Workload Scheduler installation to address with high availability is a key part of creating an appropriate high availability solution.

1.6 Material covered in this book

In the remainder of this redbook, we focus upon the applicable high availability concepts for IBM Tivoli Workload Scheduler, and two detailed implementations of high availability for IBM Tivoli Workload Scheduler, one using IBM HACMP and the other using Microsoft Cluster Service. In particular, we show you:

- Key architectural design issues and concepts to consider when designing highly available clusters for IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework; refer to Chapter 2, "High level design and architecture" on page 31.
- How to implement an AIX HACMP and Microsoft Cluster Service cluster; refer to Chapter 3, "High availability cluster implementation" on page 63.
- How to implement a highly available installation of IBM Tivoli Workload Scheduler, and a highly available IBM Tivoli Workload Scheduler with IBM Tivoli Management Framework, on AIX HACMP and Microsoft Cluster Service; refer to Chapter 4, "IBM Tivoli Workload Scheduler implementation in a cluster" on page 183.
- How to implement a highly available installation of IBM Tivoli Management Framework on AIX HACMP and Microsoft Cluster Service; refer to Chapter 5, "Implement IBM Tivoli Management Framework in a cluster" on page 415.

The chapters are generally organized around the products we cover in this redbook: AIX HACMP, Microsoft Cluster Service, IBM Tivoli Workload Scheduler, and IBM Tivoli Management Framework. The nature of high availability design and implementation requires that some products and the high availability tool be considered simultaneously, especially during the planning
stage. This tends to lead to a haphazard sequence when the material is organized along any single theme, short of a straight cookbook recipe approach. We believe the best results are obtained when we present enough of the theory and practice of implementing highly available IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework installations that you can apply the illustrated principles to your own requirements. This rules out a cookbook recipe approach in the presentation, but readers who want a "recipe" will still find value in this redbook.

If you are particularly interested in following a specific configuration we show in this redbook from beginning to end, the following chapter road maps give the order in which you should read the material. If you are not familiar with high availability in general, and AIX HACMP or Microsoft Cluster Service in particular, we strongly recommend that you use the introductory road map shown in Figure 1-13.

Chapter 1 → Chapter 2

Figure 1-13 Introductory high availability road map

If you want an installation of IBM Tivoli Workload Scheduler in a highly available configuration by itself, without IBM Tivoli Management Framework, the road map shown in Figure 1-14 on page 29 gives the sequence of chapters to read. This would be appropriate, for example, for implementing a highly available Fault Tolerant Agent.
Chapter 3 → Chapter 4 (except for Framework sections)

Figure 1-14 Road map for implementing highly available IBM Tivoli Workload Scheduler (no IBM Tivoli Management Framework, no Job Scheduling Console access through cluster nodes)

If you want to implement an installation of IBM Tivoli Workload Scheduler with IBM Tivoli Management Framework, use the road map shown in Figure 1-15.

Chapter 3 → Chapter 4

Figure 1-15 Road map for implementing IBM Tivoli Workload Scheduler in a highly available configuration, with IBM Tivoli Management Framework

If you want to implement an installation of IBM Tivoli Management Framework in a highly available configuration by itself, without IBM Tivoli Workload Scheduler, the road map shown in Figure 1-16 on page 30 should be used. This would be appropriate, for example, for implementing a stand-alone IBM Tivoli Management Framework server as a prelude to installing and configuring other IBM Tivoli products.
Chapter 3 → Chapter 5

Figure 1-16 Road map for implementing IBM Tivoli Management Framework by itself

High availability design is a very broad subject. In this redbook, we provide representative scenarios meant to demonstrate to you the issues that must be considered during implementation. Many ancillary issues are briefly mentioned but not explored in depth here. For further information, we encourage you to read the material presented in "Related publications" on page 611.
Chapter 2. High level design and architecture

Implementing a high availability cluster is an essential task for most mission-critical systems. In this chapter, we present a high level overview of HA clusters. We cover the following topics:

- "Concepts of high availability clusters" on page 32
- "Hardware configurations" on page 43
- "Software configurations" on page 46
2.1 Concepts of high availability clusters

Today, as more and more business and non-business organizations rely on their computer systems to carry out their operations, ensuring high availability (HA) of those computer systems has become a key issue. A failure of a single system component could result in an extended denial of service. To avoid or minimize the risk of denial of service, many sites consider an HA cluster to be a high availability solution.

In this section we describe what an HA cluster is normally comprised of, then discuss software and hardware considerations and introduce possible ways of configuring an HA cluster.

2.1.1 A bird's-eye view of high availability clusters

We start by defining the components of a high availability cluster.

Basic elements of a high availability cluster

A typical HA cluster, as introduced in Chapter 1, "Introduction" on page 1, is a group of machines networked together and sharing external disk resources. The ultimate purpose of setting up an HA cluster is to eliminate any possible single points of failure. By eliminating single points of failure, the system can continue to run, or recover in an acceptable period of time, with minimal impact to the end users.

Two major elements make a cluster highly available:

- A set of redundant system components
- Cluster software that monitors and controls these components in case of a failure

Redundant system components provide backup in case of a single component failure. In an HA cluster, one or more additional servers are added to provide server-level backup in case of a server failure. Components within a server, such as network adapters, disk adapters, disks, and power supplies, are also duplicated to eliminate single points of failure. However, simply duplicating system components does not provide high availability; cluster software is usually employed to control them.

Cluster software is the core element in HA clusters. It is what ties system components into clusters and takes control of those clusters. Typical cluster software provides a facility to configure clusters and to predefine actions to be taken in case of a component failure. The basic function of cluster software in general is to detect component failure and control the redundant components to restore service after a failure. In the event of a component failure, cluster software quickly transfers whatever service
the failed component provided to a backup component, thus ensuring minimum downtime. There are several cluster software products on the market today; Table 2-1 lists common cluster software for each platform.

Table 2-1 Commonly used cluster software - by platform

Platform type        Cluster software
AIX                  HACMP
HP-UX                MC/Service Guard
Solaris              Sun Cluster, Veritas Cluster Service
Linux                SCYLD Beowulf, Open Source Cluster Application Resources (OSCAR), IBM Tivoli System Automation
Microsoft Windows    Microsoft Cluster Service

Each cluster software product has its own unique benefits, and the terminologies and technologies may differ from product to product. However, the basic concepts and functions that most cluster software provides have much in common. In the following sections we describe how an HA cluster is typically configured and how it works, using simplified examples.

Typical high availability cluster configuration

Most cluster software offers various options to configure an HA cluster. Configurations depend on the system's high availability requirements and the cluster software used. Though there are several variations, the two configuration types most often discussed are idle (or hot) standby, and mutual takeover.

Basically, a hot standby configuration assumes a second physical node capable of taking over for the first node. The second node sits idle except in the case of a fallover. Meanwhile, a mutual takeover configuration consists of two nodes, each with its own set of applications, that can take on the function of the other in case of a node failure. In this configuration, each node should have sufficient machine power to run the jobs of both nodes in the event of a node failure. Otherwise, the applications of both nodes will run in a degraded mode after a fallover, since one node is doing the job previously done by two. Mutual takeover is usually considered to be a more cost-effective choice since it avoids having a system installed just for hot standby.

Figure 2-1 on page 34 shows a typical mutual takeover configuration. Using this figure as an example, we will describe what comprises an HA cluster. Keep in mind that this is just an example of an HA cluster configuration. Mutual takeover is a popular configuration; however, it may or may not be the best high
availability solution for you. For a configuration that best matches your requirements, consult your service provider.

Figure 2-1 A typical HA cluster configuration

As you can see in Figure 2-1, Cluster_A has Node_A and Node_B. Each node is running an application. The two nodes are set up so that each node is able to provide the function of both nodes in case a node, or a system component on a node, fails. In normal production, Node_A runs App_A and owns Disk_A, while Node_B runs App_B and owns Disk_B. When one of the nodes fails, the other node will acquire ownership of both disks and run both applications.

Redundant hardware components are the bottom-line requirement to enable a high availability scenario. In the scenario shown here, notice that most hardware components are duplicated. The two nodes are each connected to two physical TCP/IP networks, subnet1 and subnet2, providing an alternate network connection in case of a network component failure. They share the same set of external disks, Disk_A and Disk_B, each mirrored to prevent the loss of data in case of a disk failure. Both nodes have a path to connect to the external disks. This enables one node to acquire ownership of an external disk owned by
another node in case of a node failure. For example, if Node_A fails, Node_B can acquire ownership of Disk_A and resume whatever service requires Disk_A. Disk adapters connecting the nodes and the external disks are duplicated to provide backup in the event of a disk adapter failure.

In some cluster configurations, there may be an additional non-TCP/IP network that directly connects the two nodes, used for heartbeats. This is shown in the figure as net_hb. To detect failures such as network and node failures, most cluster software uses a heartbeat mechanism. Each node in the cluster sends heartbeat packets to its peer nodes over the TCP/IP network and/or the non-TCP/IP network. If heartbeat packets are not received from a peer node for a predefined amount of time, the cluster software interprets this as a node failure.

When using only TCP/IP networks to send heartbeats, it is difficult to differentiate node failures from network failures. Because of this, most cluster software recommends (or requires) a dedicated point-to-point network for sending heartbeat packets. Used together with the TCP/IP networks, the point-to-point network prevents the cluster software from misinterpreting a network component failure as a node failure. The network type for this point-to-point network may vary depending on the types of network the cluster software supports. RS-232C, Target Mode SCSI, and Target Mode SSA are supported for point-to-point networks in some cluster software.

Managing system components

Cluster software is responsible for managing system components in a cluster. It is typically installed on the local disk of each cluster node. There is usually a set of processes or services that runs constantly on the cluster nodes, monitoring system components and taking control of those resources when required. These processes or services are often referred to as the cluster manager.

On a node, applications and other system components that are required by those applications are bundled into a group. Here, we refer to each application and system component as a resource, and to a group of these resources as a resource group. A resource group is generally comprised of one or more applications, one or more units of logical storage residing on an external disk, and an IP address that is not bound to a node. There may be more or fewer resources in the group, depending on application requirements and how much the cluster software is able to support.
A resource group is associated with two or more nodes in the cluster. A resource group is the unit that a cluster manager uses to move resources from one node to another. In normal production it resides on its primary node; in the event of a node or component failure on the primary node, the cluster manager will move the group to another node. Figure 2-2 shows an example of resources and resource groups in a cluster.

Figure 2-2 Resource groups in a cluster

In Figure 2-2, a resource group called GRP_1 is comprised of an application called APP1 and external disks DISK1 and DISK2. IP address 192.168.1.101 is associated with GRP_1. The primary node for GRP_1 is Node_A, and the secondary node is Node_B. GRP_2 is comprised of application APP2, disks DISK3 and DISK4, and IP address 192.168.1.102. For GRP_2, Node_B is the primary node and Node_A is the secondary node.
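To make the example more concrete, the following is a minimal sketch of how an administrator might check, from an AIX node, whether that node currently holds the resources of GRP_1. The volume group name grp1vg and mount point /grp1_data are hypothetical names invented for this sketch (the figure only names the disks DISK1 and DISK2); the service IP address is the one given for GRP_1 in Figure 2-2. Only standard AIX commands are used; most cluster software also provides its own status utilities for the same purpose.

   #!/usr/bin/ksh
   # Hypothetical names for the resources of GRP_1
   VG=grp1vg                  # volume group built on DISK1 and DISK2
   MOUNTPOINT=/grp1_data      # filesystem used by APP1
   SERVICE_IP=192.168.1.101   # service IP address of GRP_1

   # Is the volume group varied on (active) on this node?
   if lsvg -o | grep -qw "$VG"; then
       echo "$VG is active on $(hostname)"
   else
       echo "$VG is not active on $(hostname)"
   fi

   # Is the filesystem mounted on this node?
   mount | grep -w "$MOUNTPOINT"

   # Is the service IP address configured on one of this node's interfaces?
   netstat -in | grep -w "$SERVICE_IP"

After a fallover of GRP_1 to Node_B (described next), the same checks would succeed on Node_B and fail on Node_A.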
Fallover and fallback of a resource group

In normal production, cluster software constantly monitors the cluster resources for any sign of failure. As soon as a cluster manager running on a node detects a node or component failure, it will quickly acquire ownership of the resource group and restart the application.

In our example, assume that Node_A crashes. Through heartbeats, Node_B detects Node_A's failure. Because Node_B is configured as the secondary node for resource group GRP_1, Node_B's cluster manager acquires ownership of resource group GRP_1. As a result, DISK1 and DISK2 are mounted on Node_B, and the IP address associated with GRP_1 is moved to Node_B. Using these resources, Node_B will restart APP1 and resume application processing. Because these operations are initiated automatically, based on predefined actions, it is a matter of minutes before processing of APP1 is restored. This is called a fallover. Figure 2-3 on page 38 shows an image of the cluster after fallover.
Figure 2-3 The cluster after fallover