© 2015 IBM Corporation
Planning for Catastrophe with
IBM WebSphere Application Server &
IBM Business Process Manager
Tom Alcott STSM
Chris Richardson STSM
This Session
• This session will focus on the architectural and operational issues that need to
be considered when planning and implementing a Disaster Recovery plan with
WebSphere Application Server and IBM BPM. Topics will include use of
multiple data centers, geographic separation constraints, supporting software
components, disaster recovery and other common deployment issues. Though
focused primarily on WebSphere Application Server and IBM BPM this session
also applies to IBM Middleware that is deployed on WebSphere Application
Server as well as Pure Application System.
• While not a prerequisite, attendees should be familiar with the material
covered in “Preparing to Fail, Practical WebSphere Application Server High
Availability”
Introduction
• Why Are We Here?
• To Avoid This
Agenda
• Concepts
• Disaster Recovery
• Multiple Cells and Data Centers
• WebSphere Application Server Recovery
• IBM BPM Recovery
• Final Thoughts
Definitions
• Redundancy
• The provision of additional or duplicate systems, equipment, etc., that
function in case an operating part or system fails, as in a spacecraft.
• Isolated
• Separated from other persons or things; alone; solitary
• Independent
• Not dependent; not depending or contingent upon something else for
existence, operation, etc.
• All of the Above are Fundamental for Effective High Availability and
Disaster Recovery
Definitions
High Availability (HA)
• Ensuring that the system can continue to process work within one
location after routine single component failures
• Usually we assume a single failure
• Usually the goal is very brief disruptions for only some users for
unplanned events
Continuous Operations
• Ensuring that the system is never unavailable during planned
activities
• E.g., if the application is upgraded to a new version, we do it in a
way that avoids all downtime
Continuous Availability (CA)
• High Availability coupled with Continuous Operations
• No tolerance for planned downtime
• As little unplanned downtime as possible
• Very expensive
• Note that while achieving CA almost always requires an
aggressive DR plan, they are not the same thing
Definitions
Background: High Availability in one picture
[Diagram: IP sprayer; web servers (IHS); a WAS-ND cell with DMgr and node agents on Node1 and Node2 hosting clusters “A”, “B”, and “C”; user registry; shared filesystem holding the WAS transaction logs; database on SAN storage]
• Clustered IP Sprayer and Firewalls (not
depicted)
• Clustered HTTP Servers
• WAS-ND Cell with Clustered Application
Servers
• User Registry (LDAP) Hardware Clustered with
Shared Disk
• Database Hardware Clustered With Shared
Disk
• JMS Provider (Not Depicted)
• WAS Messaging Engine with Shared Disks/DB
• External JMS with Hardware Cluster and
Shared Disk
• Transaction Logs on Shared File System
• Clusters of “2” Provide High Availability, but Don’t Forget the “Rule of 3”
• With a Cluster of 2, an Outage (Planned or Unplanned)
• Reduces Capacity by 50%
• Leaves a Cluster That Is No Longer Fault Tolerant
Disaster Recovery (DR)
• Ensuring that the system can be reconstituted and/or activated at another location and
can process work after an unexpected catastrophic failure at one location
• Often multiple single failures (each of which is normally handled by high availability techniques)
are together considered catastrophic
• There may or may not be significant downtime as part of a disaster recovery
• This environment may be substantially smaller than the entire production environment, as
only a subset of production applications demand DR
• Normally based on justifiable business need.
• Recovery Time Objective (RTO)
• Service Recovery with little to no interruption
• Recovery Point Objective (RPO)
• Data Recovery and acceptable data loss
Definitions
• Service Levels (SLAs) cover many things; our focus is the availability aspects
• You need a clear set of requirements that define precisely the availability
requirements of the system, taking into account
• Components of the system
– A system has many pieces and business aspects; how do their requirements differ?
• Responsiveness and throughput requirements
– 100% of requests aren't going to work perfectly 100% of the time
• Degraded services requirements
– Does everything have to meet the responsiveness requirements ALL the time?
• Dependent system requirements
– What are the implications if a system on which you depend is down?
• Data Loss
• Application Data
• Application State (Is this Critical in a Disaster?)
• Maintenance
– Change occurs; how does that affect availability?
• Disaster Recovery
– The unimaginable happens; then what?
Definitions
HA Service Level Example
                                      SLA External Commitment     SLA Internal Target
Service Timeframe                     7 x 24                      7 x 24
Application Processing Availability   99.5% per month             99.7% per month
Recovery Time Objective               4 Hours                     1 Hour
Maintenance Window                    Tue-Thurs 3:00 - 6:00 am    Tue-Thurs 3:00 - 6:00 am
99.5% = 3.60 Hours Downtime/Month
99.7% = 2.16 Hours Downtime/Month
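The downtime figures in the HA example follow from simple arithmetic on the availability percentage. The sketch below reproduces them under the assumption of a 30-day (720-hour) month, which is what the slide's numbers appear to use; the function and constant names are illustrative only.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month, consistent with the slide's figures


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_MONTH):
    """Return the downtime budget, in hours, implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100 - availability_percent) / 100 * period_hours


if __name__ == "__main__":
    for target in (99.5, 99.7):
        print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/month")
```

Note how small the budget is: a single 4-hour recovery already exceeds the external 99.5% commitment for that month.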
DR Service Level Example
                            SLA External Commitment   SLA Internal Target
Recovery Time Objective     16 Hours                  4 Hours
Recovery Point Objective    ~ 0 (No Data Loss)        ~ 0 (No Data Loss)
Note: This is Recovery of an Entire Data Center with Hundreds of Servers:
Application, Database, Messaging, etc.
Stage 0 DR – a sound HA strategy
• HA is Cheaper and Less Complex Than DR
• A Robust HA Solution prevents small failures from
becoming disasters
• Don’t let a (relatively) minor failure become a
catastrophe
• Eliminate all single points of failure in your primary
datacenter
• Spread Workloads Across Multiple Servers (and
Hypervisors!)
• Add 2nd Production WAS-ND Cell
• Consider DB Replication, DB2 HADR, Oracle RAC in
conjunction with hardware clustering
• LDAP/Registry Replication
• Otherwise, an HA event could force you to enact
your DR procedure
• Database is only replicated HA in Different Data
Center
[Diagram: the same single-data-center HA topology shown in “Background: High Availability in one picture”]
Multiple Data Center Options (1/4)
• Classic DR
• Active/Passive
• Two Data Centers, one Serving Requests the other Idling
• Independent Cells
• Easier Than Active/Active
• User and Application State Synchronization are Less Critical
• Asynchronous Replication Is Likely Sufficient
• Lower Cost for Network and Hardware Capacity
• From a Capacity perspective One Data Center is Being Underutilized.
• Typically Does Not Incur S/W License Charges When Idle
• If You Don’t Pay for S/W Licenses, Are Cost and Underutilization Still a Concern?
• WebSphere License Provides for
o Hot – Processing Requests, License Required
o Warm – Started But Not Processing Requests, License Not Required
o Cold – Installed But Not Started, License Not Required
• DB2 and MQ Require < 100% of Hot Licenses for Replication
Multiple Data Center Options (2/4)
• “Active/Active” with Single Set of Active Databases
• Two Data Centers
• Independent Application Cells
• Serving Requests for Same Applications
• Database(s) Only Active in One Data Center
– Additional Latency for Application Data Requests from Remote Data Center
– Request Processing Interruption When Data Replica is Promoted to Primary
Multiple Data Center Options (3/4)
• Classic “Active/Active”
• Two Data Centers
• Independent Cells and Synchronized Resource Managers (DBs)
• Serving Requests for Same Applications
• Requires Shared Application Data
o Application Data Consistency is prerequisite to any other planning
o Simultaneous Reads/Writes = Geographic Synchronous Disk Replication
• Additional Hardware and Disk Capacity Required
• e.g. IBM High Availability Geographic Cluster (HAGEO), Sun Cluster Geographic Edition
• Expectation of Continuous Availability and Transparent Failover
o Requires Sharing Application State
• Expectation Seldom Realized
• Outage of One Data Center, Stops Disk Writes in Both, No Longer “Transparent”
• Synchronous Disk Replication Limits Geographic Separation
• Hardest and Costliest to Achieve
Note: Disk Replication is employed only for Application Data and Application State. WAS-ND cell configuration, software
updates, and application maintenance should be maintained independently in order to ensure isolation (and availability)
Multiple Data Center Options (4/4)
• Hybrid “Active/Active” (Partitioned by Applications)
• Two Data Centers
• Independent Cells with replicated Resource Managers
• Both DC’s Serving Requests, Both DC’s Configured for All Applications
• Running Different Applications (With Different Application Data)
– New Application Tests
– One DC Performing Updates, One DC Performing Inquiry Only (e.g. data
warehouse)
• No Shared Application State, No Shared Application Data
– Asynchronous Replication Sufficient
• Global Network Switch Used to Partition/Distribute Traffic
• In the Event of a Disaster
• Users failover from one DC to the other
• Likely Some Interruption
– As Data Replica is Promoted to Primary
– During Failover Workload Startup
• Provides Most of the Benefits of “Classic Active/Active” without the Cost and
Complexity
The CAP Theorem
• In a distributed environment, especially spanning data centers
across LANs and WANs there are three core requirements for a
service:
• Consistency
– Either the service works or fails
– Traditional ACID of databases provides consistency and isolation
• Availability
– Extremely important in web business model
– In a large distributed system, one may have to compromise with
consistency for the sake of availability
• Partition Tolerance
– Network partition will happen when not all machines are connected
– “No set of failures less than the total network failure is allowed to
cause the system to respond incorrectly” – Gilbert and Lynch
– Quorum is used to guard against split brain syndrome
• Brewer’s CAP conjecture states that
• One can achieve only two, not all three, of the above mentioned
requirements
http://en.wikipedia.org/wiki/CAP_theorem
Multiple Active Data Centers and the CAP Theorem
• Active/Active requires you to sacrifice either consistency, availability or
partition tolerance.
• All three aren’t possible
• If you choose full availability, then you are going to lose guaranteed
consistency.
• So you need to design with this in mind, and build in mechanisms
(typically involving queuing technologies) that enable your system to
“tend towards” consistency.
• Your data is going to be in two places, either partitioned or replicated.
• If the former, what happens when one site is down?
• If the latter, what happens when users hitting each site see slightly
different versions of the current state?
• These are very complex problems.
• Which is why I try to steer customers away from active/active and into
an active/passive model with DR from active to passive.
• But they always feel like they are wasting hardware!
Data Center Utilization Urban Legends
• Legend
• Active/Active Improves Utilization
• Reality
• An Active/Active Topology at 40-50% Utilization in Each DC Is
Equivalent to an Active/Passive Datacenter Deployment with One
Active at 80% to 90% Utilization and the Other Passive
• Running Active/Active at Greater Than 50% of Total (Both
Datacenters) Capacity Can Often Result in a Complete Loss of Service
When a Data Center Outage Occurs
o Insufficient Capacity in the Remaining Data Center, Which Is Asked to Handle > 100% of Its Capacity,
Results in
• Poor Response Time (at best)
• Network and Server Overload, Resulting in a Complete Crash
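The capacity argument above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes identical data centers and a perfectly even redistribution of load after a site is lost; the function name is hypothetical.

```python
def utilization_after_site_loss(per_site_utilization, sites=2):
    """Utilization of each surviving site after one site fails.

    Assumes identical sites and that the lost site's load is spread
    evenly across the survivors (a simplification).
    """
    if sites < 2:
        raise ValueError("need at least two sites to survive the loss of one")
    return per_site_utilization * sites / (sites - 1)


if __name__ == "__main__":
    for utilization in (0.45, 0.60):
        after = utilization_after_site_loss(utilization)
        status = "overloaded" if after > 1.0 else "within capacity"
        print(f"{utilization:.0%} per site -> {after:.0%} on the survivor ({status})")
```

With two sites, anything above 50% per site leaves the survivor above 100%, which is the point of the "urban legend" slide: active/active does not buy extra usable capacity if a site outage must be survivable.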
Active/Active - What’s Wrong With This Picture ?
A former employer of mine had two data centers, running active/active
at two facilities approximately 2.6 miles (or 4.2 km) apart.
• Close Proximity Addressed Data Consistency Concerns... But
What Happens When?
• There’s an earthquake
• There’s a Civil Insurrection
• A Hazardous Chemical Spill Occurs
• And The Wind Is Blowing the Chemical Cloud from West to East (or vice versa)
• Your DC May Not Be Located in a Locale Prone to Earthquakes
• But what about the other catastrophes?
• They can, *and* will, happen!
• There’s No Substitute For Isolation Between Data Centers
• Data Centers Should Be Sufficiently Distant So That a Single Event Doesn’t
Impact Both!
• This Likely Mandates Asynchronous Replication
• Active/Active No Longer Practical
Network Latency and Application Data Consistency – A 3rd
Party Perspective
• Since the latency or round trip time for a network is usually correlated to the
length of the network, or the physical distance between the two end points (in
this case the primary and standby), Maximum Protection and Maximum
Availability modes are not recommended for Data Guard deployments over a
Wide Area Network (WAN). Note that this recommendation is driven by the
laws of physics (speed of light limitation) - the greater the distance of a
network, the longer it will take for data packets to traverse the network, and
hence the longer it will take for primary database transactions to commit.
• http://www.oracle.com/technology/deploy/availability/htdocs/dataguardnetwork.htm
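The "laws of physics" point in the quotation can be made concrete. The sketch below estimates a lower bound on network round-trip time from distance alone, assuming light travels through fiber at roughly 200,000 km/s (about two thirds of its speed in a vacuum). Real links add switching, routing, and protocol overhead, so actual latency is higher; the names used here are illustrative.

```python
FIBER_KM_PER_SECOND = 200_000  # assumption: propagation speed in fiber, roughly 2/3 of c


def min_round_trip_ms(distance_km, speed_km_per_s=FIBER_KM_PER_SECOND):
    """Lower bound on round-trip time in milliseconds for a given one-way distance."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    return 2 * distance_km / speed_km_per_s * 1000


if __name__ == "__main__":
    for km in (4.2, 100, 1000):
        print(f"{km} km apart: at least {min_round_trip_ms(km):.3f} ms per round trip")
```

A synchronously replicated commit needs at least one such round trip, so every transaction pays this cost. That is why sufficient geographic separation tends to force asynchronous replication.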
Multiple Cells and Data Centers
• Your Network Team Assures You That They Can Construct (or Have Constructed) a Network Link
Between Data Centers
• For Argument’s Sake, We’ll Agree That It Is Possible to Construct a Network So That Latency Is
NOT an Issue Under Normal Conditions
• Even So, WANs Are Less Reliable Than LANs
o And Much Harder to Fix!
• But You’re Missing the Point!
• Network Interdependency Between Data Centers Means That the Data Centers Are Not
Independent
• Question
• Do You Want to Have to Explain to Your CIO Why a Problem in One Data
Center Impacted the Other and Resulted in an Outage Because You Didn’t
Have Cells Aligned to Data Center Boundaries?
How Do I Recover WebSphere Application Server ?
• File System or OS Backup and Recovery
• Disk or Tape
• WAS backupConfig/restoreConfig
– WAS_PROFILE/properties, ../etc, WAS_ROOT/java/jre/lib/*properties,
WAS_ROOT/java/jre/lib/security,
• Build from Scratch
o Only a Realistic Option with Complete Set of Scripts and Rigorous Change Control
• Best Options
• File System/OS Backup & Recovery
• backupConfig/restoreConfig for Deployment Manager
– WAS V8.0 and above, addNode -asExistingNode can Reconfigure Each Node After
restoreConfig of Deployment Manager Configuration
• Both From Last Known Working Production Configuration
• Otherwise No Assurance Recovery Will Succeed
• Same Concern with “Build From Scratch”
‒ If Using Virtualization Consider VM Cloning for Install and Configuration
‒ Consider Smart Cloud Orchestrator for Automated Install and Configuration
o Provision Both Primary Site and DR Site in a Consistent Manner with SCO
o Note: Don’t Deploy to Backup to DR Site over a WAN !
• May Need to Change Cell and Host Names
• Will the Original Data Center Be Restored, Or Is it Gone (for Good)?
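The backupConfig, restoreConfig, and addNode -asExistingNode sequence can be captured in a script so that it is repeatable. The sketch below only builds the command lines and does not run them. The profile paths, backup file, host name, and the SOAP port 8879 are assumptions to replace with site-specific values, and the helper name is hypothetical.

```python
from pathlib import PurePosixPath


def dmgr_recovery_commands(dmgr_profile, node_profile, backup_file,
                           dmgr_host, dmgr_soap_port=8879):
    """Build (but do not run) the command lines for a DMgr backup/restore cycle.

    All paths, the host name, and the port are placeholders (assumptions).
    """
    dmgr_bin = PurePosixPath(dmgr_profile) / "bin"
    node_bin = PurePosixPath(node_profile) / "bin"
    return {
        # Taken at the primary site from the last known working configuration.
        "backup": [str(dmgr_bin / "backupConfig.sh"), backup_file, "-nostop"],
        # Run on the replacement Deployment Manager at the recovery site.
        "restore": [str(dmgr_bin / "restoreConfig.sh"), backup_file],
        # Run on each node (WAS V8.0 and above) after the DMgr is restored.
        "rejoin_node": [str(node_bin / "addNode.sh"), dmgr_host,
                        str(dmgr_soap_port), "-asExistingNode"],
    }


if __name__ == "__main__":
    commands = dmgr_recovery_commands(
        "/opt/IBM/WebSphere/AppServer/profiles/Dmgr01",    # hypothetical path
        "/opt/IBM/WebSphere/AppServer/profiles/AppSrv01",  # hypothetical path
        "/backups/dmgr-config.zip",                        # hypothetical file
        "dr-dmgr.example.com",                             # hypothetical host
    )
    for step, argv in commands.items():
        print(f"{step}: {' '.join(argv)}")
```

Keeping the commands in a reviewed script, rather than typing them during an outage, supports the point that recovery should start from a known working configuration under change control.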
WAS Full Profile DR Recovery
• Transaction Recovery on Separate (Physical) Server
• Access the Transaction Logs
– Move/Mount the Transaction Logs to Physical Server Hosting Application Server with
Access to Same Resources (e.g. JDBC, JMS)
– V8.x Optional Use of DB for Transaction logs
• If Recovery Occurs in Different Cell use wsadmin to Configure the Same JAAS
Alias for Accessing XA Resources
– With adminconsole the node name gets prefixed to the alias.
• No Longer Required to Have Same Hostname and IP Address
– Different IPs with Multiple Host Aliases Are Typical
http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tjta_mvelog.html
• WAS Messaging Engine
• recoverMEConfig AdminTask Command Retrieves MEUUID From Persistent
Message Store and Updates Message Engine Configuration
• Allows Recovery of Stranded Messages After Catastrophic ME Failure in WAS
V8.5.0 and above
• http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rjk_recoverme_config.html
• Some Small Capacity in DR Site Needs to be set Aside for Recovery
• In Addition to Production Workload
• Or Production Workload Not Processed Until Recovery is Complete
WAS Liberty Profile DR Recovery
• Transaction Recovery on Separate (Physical) Server
• Access the Transaction Logs
– Move/Mount the Transaction Logs to Physical Server Hosting Application Server with
Access to Same Resources (e.g. JDBC, JMS)
– V8.x Optional Use of DB for Transaction logs (Not supported for Production)
• Restore server.xml Backup (e.g. zip)
• No Longer Required to Have Same Hostname and IP Address
– Different IPs with Multiple Host Aliases Are Typical
• WAS Messaging Engine
• Restore server.xml Backup (e.g. zip)
• Point to Copy of Messages on File System (Could also employ zip/unzip)
• Some Small Capacity in DR Site Needs to be set Aside for Recovery
• In Addition to Production Workload
• Or Production Workload Not Processed Until Recovery is Complete
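Because a Liberty server's configuration is a handful of files, the zip-based backup and restore mentioned above is easy to script. This is a minimal sketch using only the Python standard library. The list of files captured is an assumption (server.xml plus a few common optional files), and the function names are hypothetical.

```python
import zipfile
from pathlib import Path

# Assumption: the configuration files worth capturing for a given server.
CONFIG_FILES = ("server.xml", "bootstrap.properties", "jvm.options", "server.env")


def backup_server_config(server_dir, archive_path):
    """Zip whichever of CONFIG_FILES exist in server_dir; return the names saved."""
    server_dir = Path(server_dir)
    saved = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in CONFIG_FILES:
            source = server_dir / name
            if source.is_file():
                archive.write(source, arcname=name)
                saved.append(name)
    return saved


def restore_server_config(archive_path, server_dir):
    """Unzip a configuration archive into server_dir; return the names restored."""
    server_dir = Path(server_dir)
    server_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(server_dir)
        return archive.namelist()
```

The archive would normally be copied to the DR site as part of regular change control, so that the restored server.xml matches the last known working production configuration.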
Classic DR for Stateful Applications: full cell replication
[Diagram: Primary Datacenter and Secondary Datacenter, each with IP sprayer, web servers (IHS), DMgr, Node1 and Node2 with node agents hosting the AppTarget, Support, and Messaging cluster members (messaging members active/passive), user registry, NFS filesystem holding the WAS transaction logs, and database on SAN storage. SAN replication of a consistency group carries application data; file copy carries install and config data.]
DR via Stray Nodes & Database Managed replication
[Diagram: Primary Datacenter with Node1 and Node2, Secondary Datacenter with Node3 and Node4; each site has an IP sprayer, web servers (IHS), node agents, user registry, WAS transaction logs, and a database. The AppTarget, Support, and Messaging clusters have members mem1 to mem4 across the four nodes (messaging members active/passive). DB-managed replication carries application data between the two databases.]
IBM BPM: Licensing Guidance for HA/DR Configurations
Configuration: “Classic” Disaster Recovery (DR): SAN-based replication
What’s active? DB2: off, WAS ND: off, BPM: off
What licenses are needed for the backup nodes?
• Files in the backup data center are being synchronized automatically by a
SAN. But there is no DB2, WAS ND, or BPM program active.
• No extra DB2, WAS ND, or BPM licenses needed for backup nodes.
Configuration: DR configuration with OS replication for config data & DB2 replication for runtime data
What’s active? DB2: ON, WAS ND: off, BPM: off
What licenses are needed for the backup nodes?
• Active DB2 HADR Standby setup considered warm standby – licenses for
100 DB2 PVUs required to cover warm standby servers
• BPM and WAS ND are inactive – no extra WAS ND or BPM licenses
needed for backup nodes.
Configuration: DR configuration with WAS ND replication for config data & DB2 replication for runtime data
What’s active? DB2: ON, WAS ND: ON, BPM: off
What licenses are needed for the backup nodes?
• Active DB2 HADR Standby setup considered warm standby – licenses for
100 DB2 PVUs required to cover warm standby servers
• WebSphere process in the backup nodes used for synchronization – WAS
ND licenses are required.
• BPM is inactive* – no extra BPM licenses needed for backup nodes.
Configuration: High Availability (HA)
What’s active? DB2: ON, WAS ND: ON, BPM: ON
What licenses are needed for the backup nodes?
• IBM BPM is active in all nodes – full BPM licenses are required.
• WAS ND and DB2 licensing based on Supporting Programs terms of BPM
Note: For any HA or DR configuration, any node(s) running DMGR only will also require a WAS ND license.
(*Inactive: BPM server JVMs in the remote datacenter are not started)
REFERENCES
• IBM Program License Agreement licensing for Backup Use: This document explains that extra licenses are not required for cold or warm backup
nodes. However, if there is a program actively “doing work” to keep the backup node synchronized with the primary site, then that program must
be licensed. E.g., when the DB2 and WAS ND node agents are actively “doing work” (replication), they must be licensed, but if IBM BPM
services are not active, no BPM licenses are required in backup nodes.
• DeveloperWorks article on “Stray Node” DR Configuration: This article describes a “better” Stray Node DR configuration. It is different from a
Classic DR configuration in that it keeps the WAS ND environment up-to-date as well as the DB2 environment, in order to reduce the server
recovery time after a disaster. In this “better” Stray Node DR configuration, WAS ND node agents are active, but IBM BPM is not active.
• DeveloperWorks article on licensing DB2 10.1 servers in a HA environment
What Your Mother Didn’t Tell You
About Disaster Recovery
• Transaction Log replication & Network configuration requirements
• Historically, the WebSphere transaction service required IP Address & Hostname at the target server match the
source
• This is because, for some types of transactions, the server network information is written into the logs
themselves
• As of WAS v8 servers, this requirement is relaxed a bit – IP addresses no longer need to match
• High Availability and Disaster Recovery for the Deployment Manager
• Techniques are available that leverage hardware clustering or replication of the DMgr’s cell configuration to an
alternate server. These are described in detail at:
https://www.ibm.com/developerworks/websphere/techjournal/1001_webcon/1001_webcon.html
• Beginning in version 8.5.5 WebSphere supports High Availability features for the Deployment Manager, using a
shared filesystem. WAS installations (including BPM) running on applicable versions can leverage this feature:
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/twve_xdsoconfig.html
• Because these HA techniques rely on DMgr replication, they apply unchanged to DR scenarios
– Generally, we recommend recovering application servers before bringing up a replacement DMgr
• A note about logical corruption – data integrity problems that get replicated to the DR environment
• We recommend using storage system tooling (for example, FlashCopy) to periodically copy the system state
• This can be done at the replica, to avoid interfering with normal operations
• If the Primary data and its replica are both corrupted, then state can be restored to a copy made before the
corruption
• For DR purposes, why can’t I just make one WebSphere ND cell with members running in both of my datacenters? That
way, if one datacenter is lost, another can carry the load
• See ‘Active/Active Antipattern’ discussion on the following slides
Active/Active anti-pattern (Cells Spanning Data Centers)
[Diagram: a single cell spanning the Primary Datacenter (Node1, Node2) and Secondary Datacenter (Node3, Node4). Each site has an IP sprayer, web servers (IHS), node agents, user registry, NFS filesystem, and database on SAN storage. The AppTarget, Support, and Messaging clusters have members mem1 to mem4 across both sites (one messaging member active, three passive). SAN replication of a consistency group carries application data between the sites.]
Why is this type of topology considered an Anti-Pattern?
• Active/Active approaches introduce new complexities that undermine the stability of the system
• Issues/Problems Can Propagate From One DC to the Other
• This Compromises Redundancy and Resiliency
• Worst Case, an Outage Cascades Across Both Data Centers
• Frequently these negate the advantages that led the customer to consider the approach in the first place
• Increased risk of network instability can lead to partitioned network (‘split brain’)
– Independent Transaction “Recovery” in Both Data Centers By HA Manager
– The two data centers could move to inconsistent transactional states!!
• Increased network latency can limit system performance during normal operations
– Latency between the Application Server and its databases
– Latency among cluster members communicating via the WebSphere HA Manager
component
• Desire to automate failover increases risk of false failover & rapid cycling
• A system more than 50% utilized introduces the risk that losing a single component will
compromise the entire system, turning what could have been a (simple) HA event into a true
disaster
• In practice, many Active/Active topologies do not deliver Disaster Recovery capability at all:
• Attempts to limit latency lead to datacenters physically near each other, increasing the risk that
a single disaster will eliminate the entire system
• Many disasters arise from human error and data corruption. Tight coupling between DR
resources does not provide protection from this type of failure at all
• A WAS-ND Cell Spanning Data Centers will actually interfere with Zero RTO
• Refer to
• http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d
• http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html
What are the recommended alternatives?
• Properly plan a High Availability solution distinct from Disaster Recovery
• Eliminate single points of failure through redundancy in network and software components
• HA features allow rapid and automatic recovery from loss of a single component. Utilize them!
• Improve RTO by reducing complexity, scripting operational procedures, and running drills
• Automate Processes for Repeatability and Consistency
– Scripting
– “Point and Click” is Not Repeatable
• Discipline and Practice are Essential
• Well Defined Procedures for Every Contingency
– You Do Not Want to Learn During an Outage
– Practice Those Procedures
– So You Won’t Make Mistakes in a Crisis
– Validates that Procedures Actually Work
– Practice Backup and Recovery, System Failures, Disaster Recovery, etc.
• Goal: Make Daily Operations Boring
• Improve electricity distribution via Uninterruptible Power Supply
• Utilize application design patterns like loose coupling in order to improve application flexibility
• In cases where RTO between 1 and 4 hours is necessary, without the requirement to process new work,
consider the Stray Node pattern
Is This Different for the Liberty Profile and Liberty Collectives?
• No
• Same Fundamentals for Effective Redundancy and the
Requirements for Isolation and Independence Apply
• Though Liberty May Make It Easier to Ignore or Believe That
the Fundamentals Don’t Apply
Disaster Recovery
• Develop a Disaster Recovery Plan
• Group Business Needs and Associated Applications into Tiers
• Group into tiers based on the hard/soft dollar impact on the
organization
• Categorize by RPO and RTO.
• The top tier likely includes zero data loss and either no downtime or
perhaps just a few minutes of down time
• Subsequent tiers have an RTO of 24 hours, then 48 to 72 hours,
then perhaps 72 to 96 hours
• Essential Part of Any Plan
• Who approves DR move/recovery ?
• Automated site failover is a bad idea
o Typically triggering DR is very expensive
o You do not want to trigger a DR by accident because of some transient issue – just makes
the situation worse
Disaster Recovery Objectives
• Recovery Time Objective
• How quickly the system will be able to accept traffic after the disaster
• Shorter times require progressively more expensive techniques
o e.g., a tape backup and restore is relatively inexpensive
o e.g., a fully redundant fully operational data center is very expensive
• One challenge is detection time
• It takes time to determine you are in a disaster state and trigger
disaster procedures
o While you are deciding if you are down, you are probably missing your SLA.
o Does the RTO include detection time?
Disaster Recovery Objectives
• Recovery Point Objective
• How much data you are willing to lose when there is a disaster
• Limiting data loss raises costs
o e.g., restoring from tape is relatively inexpensive but you'll lose everything
since the last backup
o e.g., asynchronous replication of data and system state requires significant
network bandwidth to prevent falling far behind
o e.g., synchronous replication to the backup data center guarantees no data
loss but requires VERY fast and reliable network and will significantly harm
performance
• Warning: results in increased latency which means capacity must be increased at
all layers
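The bandwidth remark about asynchronous replication can be illustrated with a rough model. If data is written faster than the inter-site link can ship it, the backlog of unreplicated data grows, and with it the amount of data at risk in a disaster. The sketch below assumes constant write and link rates, which real workloads will not have; the function names and example figures are hypothetical.

```python
def replication_backlog_mb(write_rate_mb_s, link_rate_mb_s, duration_s,
                           initial_backlog_mb=0.0):
    """Unreplicated data (MB) after duration_s, assuming constant rates."""
    growth = (write_rate_mb_s - link_rate_mb_s) * duration_s
    return max(0.0, initial_backlog_mb + growth)


def exposure_seconds(backlog_mb, write_rate_mb_s):
    """Roughly how many seconds of recent writes would be lost in a disaster."""
    if write_rate_mb_s <= 0:
        return 0.0
    return backlog_mb / write_rate_mb_s


if __name__ == "__main__":
    # Hypothetical peak hour: 50 MB/s of writes over a link sustaining 40 MB/s.
    backlog = replication_backlog_mb(50, 40, 3600)
    print(f"backlog after one hour: {backlog:.0f} MB")
    print(f"data at risk: about {exposure_seconds(backlog, 50):.0f} s of writes")
```

Comparing the exposure figure with the agreed RPO shows whether the link is sized adequately for peak load, not just for the average.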
Disaster Recovery Objectives
• Most RTO and RPO goals will deeply impact application and infrastructure
architecture and can't be done “after the fact”
• e.g., if data is shared across data centers, your database and application
design will have to be careful to avoid conflicting database updates and/or
tolerate them
• e.g., application upgrades have to account for multiple versions of the
application running at once which can affect user interface design, database
layout, etc
• Extreme RTO and RPO goals tend to conflict
• e.g., using synchronous disk replication of data gives you a zero RPO but that
means the second system can't be operational, which raises RTO
• A Zero RTO *and* a Zero RPO Are Mutually Exclusive Goals
Disaster Recovery Testing
• The DR hardware Should Be Put Into Actual Production Usage
• Otherwise How Can You Be Sure It Will Work When You REALLY Need It?
• A Corollary of Murphy’s Law
• The larger the numbers, the less likely all Tier 1 machines can be successfully
restored to operations.
• DR Testing Options
• “Saturday Afternoon Surprise”
– Unannounced DR Test
– Only If You Can Tolerate an Outage
• Progressively More Realistic and Complex Tests
– Startup of Remote Infrastructure
– Remote Startup with Simulated Workload
– Remote Startup with Production Workload Shift
• Other Issues In a Real Disaster
• Will your key staff want to travel?
• Will they be able to travel?
Example (and Very High Level) DR Plan
• Executive/Management Approval for Activation of DR
• Isolate Data Centers
• Halt Incoming Network Traffic
– Static “Temporarily Unavailable Web Page”
• Break Disk Synchronization
• Sever Network Links Between Data Centers
• Start and Recover the Surviving Data Center
• Restore/Recover Hardware and Middleware
• Start DB, Messaging and Application Servers
• Examine DB and Message provider logs for pending transactions
• Recover Pending transactions and messages
• Start Accepting New Work in Surviving Data Center
• Enable Network
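A plan like the one above is easier to drill when it is written down as an ordered, scripted runbook. The sketch below shows one possible shape for that: it refuses to start without a named approver and stops at the first failed step so an operator can intervene. The step actions are placeholders (assumptions) standing in for site-specific scripts, and the function name is hypothetical.

```python
def run_dr_plan(steps, approved_by):
    """Run DR steps in order; return (completed step names, failed step or None).

    Refuses to start without a named approver and halts at the first failure.
    """
    if not approved_by:
        raise PermissionError("DR activation requires explicit management approval")
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder actions: each lambda stands in for a real, site-specific script.
PLAN = [
    ("halt incoming network traffic", lambda: True),
    ("break disk synchronization", lambda: True),
    ("sever network links between data centers", lambda: True),
    ("start DB, messaging and application servers", lambda: True),
    ("recover pending transactions and messages", lambda: True),
    ("enable network and accept new work", lambda: True),
]

if __name__ == "__main__":
    done, failed = run_dr_plan(PLAN, approved_by="operations director")
    print("completed:", done)
    print("failed at:", failed)
```

The approval check reflects the advice that triggering DR should be a deliberate human decision rather than an automated reaction to a transient problem.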
Other Aspects to Consider
• An HA, CA or DR Deployment Architecture is Not a Product Feature.
• WAS and the WebSphere portfolio products
– Provide HA Features and Function
– Can Be Employed in an HA Architecture
– The Appropriate Environment Varies by Customer
– One Size Does Not Fit All !
• Optimizing WebSphere HA Capabilities into a Robust Deployment
• Requires In-depth Understanding
– Of Environment
– Of Applications
– Of Operational Requirements (Service Levels)
• Architectural Advice May Require ISSW Assistance
Learn from Your Mistakes
• Mistakes and failures will occur, learn from them
• What separates mediocre organizations from the good and great isn't so
much perfection as it is the constant striving to get better – to not repeat
mistakes
• After every outage perform
• Root cause analysis
– Capture diagnostic information
– Meet as a team including all key players to discuss
– Determine precisely what went wrong
• Wrong doesn't mean “Bob made an error.”
• Find the process flaw that led to the problem
– Determine a corrective action that will prevent this from happening again
• If you can't, determine what diagnostic information is needed next time this happens and ensure it is collected
– Implement that corrective action
• All too often this last step isn't done
• Verify that action corrected problem
• A senior manager must own this process
Indispensable When Planning for Catastrophe
• Think !
Questions?
Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or
transmitted in any form without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been
reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM
shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY,
EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF
THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT
OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the
agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without
notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are
presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual
performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products,
programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not
necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither
intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal
counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s
business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or
represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and Disclaimers (cont’d)
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products in connection with this
publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any
IBM patents, copyrights, trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document
Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand,
ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™,
PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®,
urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and
service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on
the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Thank You
Your Feedback is Important!
Access the InterConnect 2015 Conference CONNECT Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.
Backup Slides
Shameless Self Promotion
IBM WebSphere Deployment and Advanced Configuration
By Roland Barcia, Bill Hines, Tom Alcott and Keys Botzum
ISBN: 0131468626
Another Recommended Book
IBM WebSphere v5.0 System Administration
By Leigh Williamson, Lavena Chan, Roger Cundiff, Shawn Lauzon and Christopher C. Mitchell
ISBN: 0131446045
Licensing Servers as Back Up Servers
From IBM Contracts and Practices Database
• The policy is to charge for HOT, and not for WARM or COLD backups. The following are definitions of what constitutes HOT-WARM-COLD backups:
• All programs running in backup mode must be under the customer's control, even if running at another enterprise's location.
• COLD - a copy of the program may be stored on a machine for backup purposes as long as the program has not been started.
• There is no charge for this copy.
• WARM - a copy of the program may reside for backup purposes on a machine and is started, but is "idling", and is not doing any
work of any kind.
• There is no charge for this copy.
• HOT - a copy of the program may reside for backup purposes on a machine, is started and is doing work. However, this
program must be ordered.
• There is a charge for this copy.
• "Doing Work", includes, for example, production, development, program maintenance, and testing. It also could include other
activities such as mirroring of transactions, updating of files, synchronization of programs, data or other resources (e.g. active
linking with another machine, program, data base or other resource, etc.) or any activity or configurability that would allow an
active hot-switch or other synchronized switch-over between programs, data bases, or other resources to occur.
Refer to http://www-03.ibm.com/software/sla/sladb.nsf/pdf/policies/$file/Feb-2003-IPLA-backup.pdf for more information
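The HOT/WARM/COLD definitions above reduce to two questions: has the program been started, and is it doing work? The sketch below encodes only that simplified reading. The names `BackupMode`, `classify`, and `license_required` are hypothetical, and the policy document itself, not this code, determines what counts as "doing work" in a given configuration.

```python
from enum import Enum


class BackupMode(Enum):
    COLD = "cold"   # installed or stored, not started
    WARM = "warm"   # started but idling, doing no work of any kind
    HOT = "hot"     # started and doing work


def classify(started: bool, doing_work: bool) -> BackupMode:
    """Map the two facts from the slide's definitions to a backup mode."""
    if not started:
        if doing_work:
            # A program that was never started cannot be doing work.
            raise ValueError("inconsistent input: doing work but not started")
        return BackupMode.COLD
    return BackupMode.HOT if doing_work else BackupMode.WARM


def license_required(mode: BackupMode) -> bool:
    # Per the slide: charge for HOT, not for WARM or COLD.
    return mode is BackupMode.HOT


if __name__ == "__main__":
    for started, working in [(False, False), (True, False), (True, True)]:
        mode = classify(started, working)
        print(mode.value, "license required:", license_required(mode))
```

Note that replication or synchronization activity counts as "doing work" under the definition above, so a standby that is actively mirroring data would be classified with `doing_work=True`.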

Planning For Catastrophe with IBM WAS and IBM BPM

  • 1.
    © 2015 IBMCorporation Planning for Catastrophe with IBM WebSphere Application Server & IBM Business Process Manager Tom Alcott STSM Chris Richardson STSM
  • 2.
    This Session • Thissession will focus on the architectural and operational issues that need to be considered when planning and implementing a Disaster Recovery plan with WebSphere Application Server and IBM BPM. Topics will include use of multiple data centers, geographic separation constraints, supporting software components, disaster recovery and other common deployment issues. Though focused primarily on WebSphere Application Server and IBM BPM this session also applies to IBM Middleware that is deployed on WebSphere Application Server as well as Pure Application System. • While not a prerequisite, attendees should be familiar with the material covered in " Preparing to Fail, Practical WebSphere Application Server High Availability”
  • 3.
    Introduction • Why AreWe Here? • To Avoid This
  • 4.
    Agenda • Concepts • DisasterRecovery • Multiple Cells and Data Centers • WebSphere Application Server Recovery • IBM BPM Recovery • Final Thoughts
  • 5.
    Definitions • Redundancy • Theprovision of additional or duplicate systems, equipment, etc., that function in case an operating part or system fails, as in a spacecraft. • Isolated • Separated from other persons or things; alone; solitary • Independent • Not dependent; not depending or contingent upon something else for existence, operation, etc. • All of the Above are Fundamental for Effective High Availability and Disaster Recovery
  • 6.
    Definitions High Availability (HA) •Ensuring that the system can continue to process work within one location after routine single component failures • Usually we assume a single failure • Usually the goal is very brief disruptions for only some users for unplanned events Continuous Operations • Ensuring that the system is never unavailable during planned activities • E.g., if the application is upgraded to a new version, we do it in a way that avoids all downtime
  • 7.
    Continuous Availability (CA) •High Availability coupled with Continuous Operations • No tolerance for planned downtime • Little unplanned downtime as possible • Very expensive • Note that while achieving CA almost always requires an aggressive DR plan, they are not the same thing Definitions
  • 8.
    Background: High Availabilityin one picture 7 IP Sprayer Node Agt Node1 Server 1Cluster “A” Web Server User Registry DMgr IHS IHS Server 1 Server 2 Node Agt Node2 Server 2 Server 2 Shared Filesystem WAS Txn Logs Database Storage (SAN) Server 2 Cluster “B” Cluster “C” • Clustered IP Sprayer and Firewalls (not depicted) • Clustered HTTP Servers • WAS-ND Cell with Clustered Application Servers • User Registry (LDAP) Hardware Clustered with Shared Disk • Database Hardware Clustered With Shared Disk • JMS Provider (Not Depicted) • WAS Messaging Engine with Shared Disks/DB • External JMS with Hardware Cluster and Shared Disk • Transaction Logs on Shared File System • Clusters of “2” Provide High Availability, • Don’t Forget “Rule of 3”, • When Using With Clusters of 2 • An Outage (Planned or Unplanned) Reduces Capacity by 50% • Is No Longer Fault Tolerant
  • 9.
    Disaster Recovery (DR) •Ensuring that the system can be reconstituted and/or activated at another location and can process work after an unexpected catastrophic failure at one location • Often multiple single failures (which are normally handled by high availability techniques) is considered catastrophic • There may or may not be significant downtime as part of a disaster recovery • This environment may be substantially smaller than the entire production environment, as only a subset of production applications demand DR • Normally based on justifiable business need. • Recovery Time Objective (RTO) • Service Recovery with little to no interruption • Recovery Point Objective (RPO) • Data Recovery and acceptable data loss Definitions
  • 10.
    • Service Levels(SLAs ) cover many things, our focus is availability aspects • You need a clear set of requirements that define precisely the availability requirements of the system, taking into account • Components of the system  A system has many pieces and business aspects, how do their requirements differ? ‒ Responsiveness and throughput requirements  100% of requests aren't going to work perfectly 100% of the time • Degraded services requirements  Does everything have to meet the responsiveness requirements ALL the time? • Dependent system requirements  What are the implications if a system on which you depend is down? • Data Loss • Application Data • Application State (Is this Critical in a Disaster?) • Maintenance  Change occurs, how does that affect availability? • Disaster Recovery  The unimaginable happens, then what? Definitions
  • 11.
    HA Service LevelExample SLA External Commitment SLA Internal Target Service Timeframe 7 x 24 7 x 24 Application Processing Availability 99.5% per month 99.7% per month Recovery Time Objective 4 Hours 1 Hour Maintenance Window Tue-Thurs 3:00 - 6:00 am Tue-Thurs 3:00 - 6:00 am 99.5% = 3.60 Hours Downtime/Month 99.7% = 2.16 Hours Downtime/Month
  • 12.
    DR Service LevelExample SLA External Commitment SLA Internal Target Recovery Time Objective 16 Hours 4 Hours Recovery Point Objective ~ 0 (No Data Loss) ~ 0 (No Data Loss) Note: This is Recovery of an Entire Data Center with 100’s of Servers, Application, Database, Messaging, etc
  • 13.
    Agenda • Concepts • DisasterRecovery • Multiple Cells and Data Centers • WebSphere Application Server Recovery • IBM BPM Recovery • Final Thoughts
  • 14.
    Stage 0 DR– a sound HA strategy • HA is cheaper and Less Complex Than DR . • A Robust HA Solution prevents small failures from becoming disasters • Don’t let a (relatively) minor failure become a catastrophe • Eliminate all single points of failure in your primary datacenter • Spread Workloads Across Multiple Servers (and Hypervisors ! ) • Add 2nd Production WAS-ND Cell • Consider DB Replication, DB2 HADR, Oracle RAC in conjunction with hardware clustering • LDAP/Registry Replication • Otherwise, an HA event could force you to enact your DR procedure • Database is only replicated HA in Different Data Center 13 IP Sprayer Node Agt Node1 Server 1Cluster “A” Web Server User Registry DMgr IHS IHS Server 1 Server 2 Node Agt Node2 Server 2 Server 2 Shared Filesystem WAS Txn Logs Database Storage (SAN) Server 2 Cluster “B” Cluster “C”
  • 15.
    Multiple Data CenterOptions (1/4) • Classic DR • Active/Passive • Two Data Centers, one Serving Requests the other Idling • Independent Cells • Easier Than Active/Active • User and Application State Synchronization are Less Critical • Asynchronous Replication Is Likely Sufficient • Lower Cost for Network and Hardware Capacity • From a Capacity perspective One Data Center is Being Underutilized. • Typically Does Not Incur S/W License Charges When Idle • If You Don’t Pay for S/W Licenses Is Cost and Underutilization Still a Concern? • WebSphere License Provides for o Hot – Processing Requests License Required o Warm – Started But Not Processing Requests, License Not Required o Cold – Installed, But Not Started, License Not Required • DB2 and MQ Require < 100 % of Hot Licenses for Replication
  • 16.
    Multiple Data CenterOptions (2/4) • “Active/Active” with Single Set of Active Databases • Two Data Centers • Independent Application Cells • Serving Requests for Same Applications • Database(s) Only Active in One Data Center – Additional Latency for Application Data Requests from Remote Data Center – Request Processing Interruption When Data Replica is Promoted to Primary
  • 17.
    Multiple Data CenterOptions (3/4) • Classic “Active/Active” • Two Data Centers • Independent Cells and Synchronized Resource Managers (DBs) • Serving Requests for Same Applications • Requires Shared Application Data o Application Data Consistency is prerequisite to any other planning o Simultaneous Reads/Writes = Geographic Synchronous Disk Replication • Additional Hardware and Disk Capacity Required • e.g. IBM High Availability Geographic Cluster (HAGEO), Sun Cluster Geographic Edition • Expectation of Continuous Availability and Transparent Failover o Requires Sharing Application State • Expectation Seldom Realized • Outage of One Data Center, Stops Disk Writes in Both, No Longer “Transparent” • Synchronous Disk Replication Limits Geographic Separation • Hardest and Costliest to Achieve Note: Disk Replication only employed for Application Data and Application State, WAS-ND cell configuration, software updates, and application maintenance should maintained independently in order to insure isolation (and availability)
  • 18.
    Multiple Data CenterOptions (4/4) • Hybrid “Active/Active” (Partitioned by Applications) • Two Data Centers • Independent Cells with replicated Resource Managers • Both DC’s Serving Requests, Both DC’s Configured for All Applications • Running Different Applications (With Different Application Data) – New Application Tests – One DC Performing Updates, One DC Performing Inquiry Only (e.g. data warehouse) • No Shared Application State, No Shared Application Data – Asynchronous Replication Sufficient • Global Network Switch Used to Partition/Distribute Traffic • In the Event of a Disaster • Users failover from one DC to the other • Likely Some Interruption – As Data Replica is Promoted to Primary – During Failover Workload Startup • Provides Most of the Benefits of “Classic Active/Active” without the Cost and Complexity
  • 19.
    The CAP Theorem •In a distributed environment, especially spanning data centers across LANs and WANs there are three core requirements for a service: • Consistency – Either the service works or fails – Traditional ACID of databases provides consistency and isolation • Availability – Extremely important in web business model – In a large distributed system, one may have to compromise with consistency for the sake of availability • Partition Tolerance – Network partition will happen when not all machines are connected – “No set of failures less than the total network failure is allowed to cause the system to respond incorrectly” – Seth and Lynch – Quorum is used to guard against split brain syndrome • Brewer’s CAP conjecture states that • One can achieve only two not all three of the above mentioned requirements http://en.wikipedia.org/wiki/CAP_theorem 18
  • 20.
    Multiple Active DataCenters and the CAP Theorem • Active/Active requires you to sacrifice either consistency, availability or partition tolerance. • All three aren’t possible • If you choose full availability, then you are going to lose guaranteed consistency. • So you need to design with this in mind, and build in mechanisms (typically involving queuing technologies) that enable your system to "tend towards“ consistency. • Your data is going to be in two places, either partitioned or replicated. • If the former, what happens when one site is down? • If the latter, what happens when users hitting each site see slightly different versions of the current state? • These are very complex problems. • Which is why I try to steer customers away from active/active and into an active/passive model with DR from active to passive. • But they always feel like they are wasting hardware……………! 19
  • 21.
    Data Center UtilizationUrban Legends • Legend • Active/Active Improves Utilization • Reality • An Active/Active Topology at 40-50% Utilization in Each DC Is Equivalent to An Active/Passive Datacenter Deployment with One Active at 80% to 90 % Utilization and the Other Passive • Running Active/Active at Greater Than 50% Of Total (both Datacenters) Capacity Can Often Result in a Complete Loss of Service When a Data Center Outage Occurs o Insufficient Capacity in Remaining Data Center to Handle > 100% Capacity Results in • Poor Response Time (at best) • Network and Server Overload, Resulting in a Complete Crash
  • 22.
    Active/Active - What’sWrong With This Picture ? A former employer of mine had two data centers, running active/active at two facilities approximately 2.6 miles (or 4.2 KM) apart. • Close Proximity Addressed Data Consistency Concerns ……..But………
  • 23.
    What Happens When? • There’s an earthquake • There’s a Civil Insurrection • A Hazardous Chemical Spill Occurs • And The Wind Is Blowing the Chemical Cloud from West to East (or vice versa) • Your DC May Not Be Located in a Locale Prone to Earthquakes • But what about the other catastrophes ??? • They can, *and* will happen !! • There’s No Substitute For Isolation Between Data Centers • Data Centers Should Be Sufficiently Distant So That a Single Event Doesn’t Impact Both !! • This Likely Mandates Asynchronous Replication • Active/Active No Longer Practical
  • 24.
    Network Latency andApplication Data Consistency – A 3rd Party Perspective • Since the latency or round trip time for a network is usually correlated to the length of the network, or the physical distance between the two end points (in this case the primary and standby), Maximum Protection and Maximum Availability modes are not recommended for Data Guard deployments over a Wide Area Network (WAN). Note that this recommendation is driven by the laws of physics (speed of light limitation) - the greater the distance of a network, the longer it will take for data packets to traverse the network, and hence the longer it will take for primary database transactions to commit. • http://www.oracle.com/technology/deploy/availability/htdocs/dataguardnetwork.htm
  • 25.
    Multiple Cells andData Centers • Your Network Team Assures You That Can (or Have) Constructed a Network Link Between Data Centers • For Arguments Sake, We’ll I Agree, It Is possible to construct a network so that latency is NOT an issue Under Normal Conditions • Even so, WANs are Less Reliable than LANs. o And Much Harder To Fix ! • But You’re Missing The Point ! • Network Interdependency Between Data Centers Means That the Data Centers are Not Independent • Question • Do You Want to Have to Explain to Your CIO Why A Problem In One Data Center Impacted The Other and Resulted in a Outage Because You Didn’t Have Cells Aligned to Data Center Boundaries ?
  • 26.
    How Do IRecover WebSphere Application Server ? • File System or OS Backup and Recovery • Disk or Tape • WAS backUpConfig/restoreConfig – WAS_PROFILE/properties, ../etc, WAS_ROOT/java/jre/lib/*properties, WAS_ROOT/java/jre/lib/security, • Build from Scratch o Only a Realistic Option with Complete Set of Scripts and Rigorous Change Control • Best Options • File System/OS Backup & Recovery • backupConfig/restoreConfig for Deployment Manager – WAS V8.0 and above, addNode –asExistingNode can Reconfigure Each Note After restoreConfig of Deployment Manager Configuration • Both From Last Know Working Production Configuration • Otherwise No Assurance Recovery Will Succeed • Same Concern with “Build From Scratch” ‒ If Using Virtualization Consider VM Cloning for Install and Configuration ‒ Consider Smart Cloud Orchestrator for Automated Install and Configuration o Provision Both Primary Site and DR Site in a Consistent Manner with SCO o Note: Don’t Deploy to Backup to DR Site over a WAN ! • May Need to Change Cell and Host Names • Will the Original Data Center Be Restored, Or Is it Gone (for Good)?
  • 27.
    WAS Full ProfileDR Recovery • Transaction Recovery on Separate (Physical) Server • Access the Transaction Logs – Move/Mount the Transaction Logs to Physical Server Hosting Application Server with Access to Same Resources (e.g. JDBC, JMS) – V8.x Optional Use of DB for Transaction logs • If Recovery Occurs in Different Cell use wsadmin to Configure the Same JAAS Alias for Accessing XA Resources – With adminconsole the node name gets prefixed to the alias. • No Longer Required to Have Same Hostname and IP Address – Different IP’s with Multiple Host Alias’s Typical http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tjta_mvelog.html • WAS Messaging Engine • recoverMEConfig AdminTask Command Retrieves MEUUID From Persistent Message Store and Updates Message Engine Configuration • Allows Recovery of Stranded Messages After Catastrophic ME Failure in WAS V8.5.0 and above • http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rjk_recoverme_config.html • Some Small Capacity in DR Site Needs to be set Aside for Recovery • In Addition to Production Workload • Or Production Workload Not Processed Until Recovery is Complete
  • 28.
    WAS Liberty ProfileDR Recovery • Transaction Recovery on Separate (Physical) Server • Access the Transaction Logs – Move/Mount the Transaction Logs to Physical Server Hosting Application Server with Access to Same Resources (e.g. JDBC, JMS) – V8.x Optional Use of DB for Transaction logs (Not supported for Production) • Restore server.xml Backup (e.g. zip) • No Longer Required to Have Same Hostname and IP Address – Different IP’s with Multiple Host Alias’s Typical • WAS Messaging Engine • Restore server.xml Backup (e.g zip) • Point to Copy of Messages on File System (Could also employ zip/unzip) • Some Small Capacity in DR Site Needs to be set Aside for Recovery • In Addition to Production Workload • Or Production Workload Not Processed Until Recovery is Complete
  • 29.
    Classic DR forStateful Applications: full cell replication 28 28 IP Sprayer Node Agt Node1 Msg.mem1Messaging Web Server User Registry DMgr AppTarget Support DMgr IHS IHS App.mem1 Sup.mem1 Node Agt Node2 App.mem2 Sup.mem2 Filesystem (NFS) Node Agt Node1 Msg.mem1 DMgr IHS IHS App.mem1 Sup.mem1 Node Agt Node2 App.mem2 Sup.mem2 WAS Txn Logs WAS Txn Logs Consistency Group SAN Replication for Application Data File Copy for Install & Config Data Database Storage (SAN) Primary Datacenter Secondary Datacenter Msg.mem2 Msg.mem2 A P A P Filesystem (NFS) User Registry Web Server Database Storage (SAN) Consistency Group IP Sprayer Messaging AppTarget Support DMgr
  • 30.
    DR via StrayNodes & Database Managed replication 29 29 IP Sprayer Node Agt Node1 Messaging Web Server DMgr AppTarget Support IHS IHS App.mem1 Sup.mem1 Node Agt Node2 App.mem2 Sup.mem2 IP Sprayer Node3 DMgr IHS IHS App.mem3 Sup.mem3 Node4 App.mem4 Sup.mem4 WAS Txn Logs WAS Txn Logs User Registry Primary Datacenter Secondary Datacenter Database Database DB-managed Replication for Application Data Msg.mem1 Msg.mem2 A P Msg.mem3 Msg.mem4 A P Node AgtNode Agt User Registry Web Server
  • 31.
    IBM BPM: LicensingGuidance for HA/DR Configurations Configuration What’s active? What licenses are needed for the backup nodes? DB2 WAS ND BPM “Classic” Disaster Recovery (DR): SAN-based replication off off off • Files in the backup data center are being synchronized automatically by a SAN. But there is no DB2, WAS ND, or BPM program active. • No extra DB2, WAS ND, or BPM licenses needed for backup nodes. DR configuration with OS replication for config data & DB2 replication for runtime data ON off off • Active DB2 HADR Standby setup considered warm standby – licenses for 100 DB2 PVUs required to cover warm standby servers • BPM and WAS ND are inactive – no extra WAS ND or BPM licenses needed for backup nodes. DR configuration with WAS ND replication for config data & DB2 replication for runtime data ON ON off • Active DB2 HADR Standby setup considered warm standby – licenses for 100 DB2 PVUs required to cover warm standby servers • WebSphere process in the backup nodes used for synchronization – WAS ND licenses are required. • BPM is inactive* – no extra BPM licenses needed for backup nodes. High Availability (HA) ON ON ON • IBM BPM is active in all nodes – full BPM licenses are required. • WAS ND and DB2 licensing based on Supporting Programs terms of BPM Note: For any HA or DR configuration, any node(s) running DMGR only will also require a WAS ND license. (*Inactive: BPM server JVMs in the remote datacenter are not started) REFERENCES • IBM Program License Agreement licensing for Backup Use: This document explains that extra licenses are not required for cold or warm backup nodes. However, if there is a program actively “doing work” to keep the backup node synchronized with the primary site, then that program must be licensed. E.g., when the DB2 and WAS ND node agents are actively “doing work” (replication), they must be licensed, but if IBM BPM services are not active, no BPM licenses are required in backup nodes. 
• DeveloperWorks article on “Stray Node” DR Configuration: This article describes a “better” Stray Node DR configuration. It is different from a Classic DR configuration in that it keeps the WAS ND environment up-to-date as well as the DB2 environment, in order to reduce the server recovery time after a disaster. In this “better” Stray Node DR configuration, WAS ND node agents are active, but IBM BPM is not active. •DeveloperWorks article on licensing DB2 10.1 servers in a HA environment
  • 32.
    What Your MotherDidn’t Tell You About Disaster Recovery • Transaction Log replication & Network configuration requirements • Historically, the WebSphere transaction service required IP Address & Hostname at the target server match the source • This is because, for some types of transactions, the server network information is written into the logs themselves • As of WAS v8 servers, this requirement is relaxed a bit – IP addresses no longer need to match • High Availability and Disaster Recovery for the Deployment Manager • Techniques are available that leverage hardware clustering or replication of the DMgr’s cell configuration to an alternate server. These are described in detail at: https://www.ibm.com/developerworks/websphere/techjournal/1001_webcon/1001_webcon.html • Beginning in version 8.5.5 WebSphere supports High Availability features for the Deployment Manager, using a shared filesystem. WAS installations (including BPM) running on applicable versions can leverage this feature: http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/twve_xdsoconfig.html • Because these HA techniques rely on DMgr replication, they apply unchanged to DR scenarios – Generally, we recommend recovering application servers before bringing up a replacement DMgr • A note about logical corruption – data integrity problems that get replicated to the DR environment • We recommend using storage system tooling (for example, FlashCopy) to periodically copy the system state • This can be done at the replica, to avoid interfering with normal operations • If the Primary data and its replica are both corrupted, then state can be restored to a copy made before the corruption • For DR purposes, why can’t I just make one WebSphere ND cell with members running in both of my datacenters? That way, if one datacenter is lost, another can carry the load • See ‘Active/Active Antipattern’ discussion on the following slides 31
  • 33.
    Active/Active anti-pattern (CellsSpanning Data Centers) Secondary DatacenterPrimary Datacenter 32 IP Sprayer Node Agt Node1 Msg.mem1Messaging Web Server User Registry DMgr AppTarget Support IHS IHS App.mem1 Sup.mem1 Node Agt Node2 App.mem2 Sup.mem2 Node Agt Node3 DMgr App.mem3 Sup.mem3 Node Agt Node4 App.mem4 Sup.mem4 Msg.mem2 Msg.mem3 Msg.mem4 A P P P IP Sprayer Web Server IHS IHS Filesystem (NFS) WAS Logs Consistency Group Consistency Group SAN Replication for Application Data Database Storage (SAN) Storage (SAN) User Registry Filesystem (NFS) Database
  • 34.
    Why is thistype of topology considered an Anti-Pattern? • Active/Active approaches introduce new complexities that undermine the stability of the system • Issues/Problems Can Propagate From One DC to the Other • This Compromises Redundancy and Resiliency • Worst Case a Outage Cascades Across Both Data Centers • Frequently these negate the advantages that led the customer to consider the approach in the first place • Increased risk of network instability can lead to partitioned network (‘split brain’) – Independent Transaction “Recovery” in Both Data Centers By HA Manager – The two data centers could move to inconsistent transactional states!! • Increased network latency can limit system performance during normal operations – Latency between the Application Server and its databases – Latency among cluster members communicating via the WebSphere HA Manager component • Desire to automate failover increases risk of false failover & rapid cycling • A system more than 50% utilized introduces the risk that losing a single component will compromise the entire system, turning what could have been a (simple) HA event into a true disaster • In practice, many Active/Active topologies do not deliver Disaster Recovery capability at all: • Attempts to limit latency lead to datacenters physically near each other, increasing the risk that a single disaster will eliminate the entire system • Many disasters arise from human error and data corruption. Tight coupling between DR resources does not provide protection from this type of failure at all • A WAS-ND Cell Spanning Data Centers will actually interfere with Zero RTO • Refer to • http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d • http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html 33
  • 35.
    What are therecommended alternatives? • Properly plan a High Availability solution distinct from Disaster Recovery • Eliminate single points of failure through redundancy in network and software components • HA features allow rapid and automatic recovery from loss of a single component. Utilize them! • Improve RTO by reducing complexity, scripting operational procedures and drill • Automate Processes for Repeatability and Consistency – Scripting – Point and Click” is Not Repeatable • Discipline and Practice are Essential • Well Defined Procedures for Every Contingency – You Do Not Want to Learn During an Outage – Practice Those Procedures – Won’t Make Mistakes in Crisis – Validates that Procedures Actually Work – Practice Backup and Recovery, System Failures, Disaster Recovery, etc. • Goal: Make Daily Operations Boring • Improve electricity distribution via Uninterruptable Power Supply • Utilize application design patterns like loose coupling in order to improve application flexibility • In cases where RTO between 1 and 4 hours is necessary, without the requirement to process new work, consider the Stray Node pattern 34
  • 36.
    Is This Different for the Liberty Profile and Liberty Collectives?
    • No
    • Same Fundamentals for Effective Redundancy and the Requirements for Isolation and Independence Apply
    • Though Liberty May Make It Easier to Ignore or Believe That the Fundamentals Don’t Apply
  • 37.
    Agenda
    • Concepts
    • Disaster Recovery
    • Multiple Cells and Data Centers
    • WebSphere Application Server Recovery
    • IBM BPM Recovery
    • Final Thoughts
  • 38.
    Disaster Recovery
    • Develop a Disaster Recovery Plan
    • Group Business Needs and Associated Applications into Tiers
    • Group into tiers based on the hard/soft dollar impact on the organization
    • Categorize by RPO and RTO
    • The top tier likely includes zero data loss and either no downtime or perhaps just a few minutes of downtime
    • Subsequent tiers have an RTO of 24 hours, then 48 to 72 hours, then perhaps 72 to 96 hours
    • Essential Part of Any Plan
    • Who approves DR move/recovery?
    • Automated site failover is a bad idea
      o Typically triggering DR is very expensive
      o You do not want to trigger a DR by accident because of some transient issue; that just makes the situation worse
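The tiering described above can be sketched as a lookup on RTO. This is only an illustration: the slide gives tier RTOs of "a few minutes", 24 hours, 48 to 72 hours and 72 to 96 hours, while the exact upper bounds below (for example 15 minutes for the top tier) and the name `dr_tier` are assumptions chosen to make the sketch runnable.

```python
# (tier number, assumed upper bound on RTO in hours); the bounds are illustrative.
TIER_UPPER_BOUNDS_HOURS = [
    (1, 0.25),  # top tier: no downtime or just a few minutes (15 min assumed)
    (2, 24.0),  # RTO of 24 hours
    (3, 72.0),  # RTO of 48 to 72 hours
    (4, 96.0),  # RTO of 72 to 96 hours
]


def dr_tier(rto_hours: float) -> int:
    """Return the DR tier for an application given its required RTO in hours."""
    if rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    for tier, upper_bound in TIER_UPPER_BOUNDS_HOURS:
        if rto_hours <= upper_bound:
            return tier
    # Anything slower than the last bound is treated as the lowest tier.
    return TIER_UPPER_BOUNDS_HOURS[-1][0]


if __name__ == "__main__":
    for name, rto in [("payments", 0.0), ("order entry", 24), ("reporting", 72)]:
        print(f"{name}: tier {dr_tier(rto)}")
```

A real plan would also factor in RPO and the dollar impact per tier; those inputs are left out here to keep the sketch short.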
  • 39.
    Disaster Recovery Objectives
    • Recovery Time Objective
    • How quickly the system will be able to accept traffic after the disaster
    • Shorter times require progressively more expensive techniques
      o e.g., a tape backup and restore is relatively inexpensive
      o e.g., a fully redundant fully operational data center is very expensive
    • One challenge is detection time
    • It takes time to determine you are in a disaster state and trigger disaster procedures
      o While you are deciding if you are down, you are probably missing your SLA.
      o Does the RTO include detection time?
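The detection-time question can be made concrete by adding up the phases of an outage. This is a minimal sketch; splitting the outage into detection, decision and recovery phases is an assumption made for illustration, and the function names are hypothetical.

```python
from datetime import timedelta


def total_outage(detection: timedelta, decision: timedelta, recovery: timedelta) -> timedelta:
    """Elapsed time from the disaster until the system accepts traffic again."""
    return detection + decision + recovery


def meets_rto(rto: timedelta, detection: timedelta, decision: timedelta,
              recovery: timedelta) -> bool:
    """True only if detection and decision time are counted and the total still fits."""
    return total_outage(detection, decision, recovery) <= rto


if __name__ == "__main__":
    rto = timedelta(hours=4)
    recovery = timedelta(hours=3)
    # Recovery alone fits inside the RTO ...
    print(meets_rto(rto, timedelta(0), timedelta(0), recovery))
    # ... but 45 minutes to detect plus 30 minutes to approve pushes it to 4h15m.
    print(meets_rto(rto, timedelta(minutes=45), timedelta(minutes=30), recovery))
```

A recovery procedure that fits the RTO on paper can still miss it once the time spent noticing and approving the disaster is included.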
  • 40.
    Disaster Recovery Objectives
    • Recovery Point Objective
    • How much data you are willing to lose when there is a disaster
    • Limiting data loss raises costs
      o e.g., restoring from tape is relatively inexpensive but you'll lose everything since the last backup
      o e.g., asynchronous replication of data and system state requires significant network bandwidth to prevent falling far behind
      o e.g., synchronous replication to the backup data center guarantees no data loss but requires a VERY fast and reliable network and will significantly harm performance
    • Warning: synchronous replication results in increased latency, which means capacity must be increased at all layers
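The replication trade-offs above can be roughed out numerically. The sketch below assumes light in optical fiber travels about 200 km per millisecond, so the latency figure is a physical lower bound that ignores switching and storage overhead; the function names and sample numbers are illustrative only.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def min_sync_round_trip_ms(distance_km: float) -> float:
    """Lower bound on the latency added to each synchronously replicated write."""
    return 2 * distance_km / FIBER_KM_PER_MS


def async_backlog_mb(write_mb_per_s: float, link_mb_per_s: float, seconds: float) -> float:
    """Unreplicated data (MB) that accumulates when writes outpace the link."""
    return max(0.0, (write_mb_per_s - link_mb_per_s) * seconds)


if __name__ == "__main__":
    # Synchronous replication over 1000 km adds at least 10 ms to every commit.
    print(min_sync_round_trip_ms(1000))
    # Asynchronous replication with a link 10 MB/s too slow falls 36000 MB
    # behind after one hour; that backlog is the data at risk (the RPO exposure).
    print(async_backlog_mb(50, 40, 3600))
```

Both numbers grow with distance or load, which is why tighter RPO targets cost more in network capacity and in application latency.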
  • 41.
    Disaster Recovery Objectives
    • Most RTO and RPO goals will deeply impact application and infrastructure architecture and can't be done “after the fact”
    • e.g., if data is shared across data centers, your database and application design will have to be careful to avoid conflicting database updates and/or tolerate them
    • e.g., application upgrades have to account for multiple versions of the application running at once, which can affect user interface design, database layout, etc.
    • Extreme RTO and RPO goals tend to conflict
    • e.g., using synchronous disk replication of data gives you a zero RPO but that means the second system can't be operational, which raises RTO
    • A Zero RTO *and* a Zero RPO are Mutually Exclusive Goals
  • 42.
    Disaster Recovery Testing
    • The DR Hardware Should Be Put Into Actual Production Usage
    • Otherwise How Can You Be Sure It Will Work When You REALLY Need It?
    • A Corollary of Murphy’s Law
    • The larger the numbers, the less likely all Tier 1 machines can be successfully restored to operations.
    • DR Testing Options
    • “Saturday Afternoon Surprise”
      – Unannounced DR Test
      – Only If You Can Tolerate an Outage
    • Progressively More Realistic and Complex Tests
      – Startup of Remote Infrastructure
      – Remote Startup with Simulated Workload
      – Remote Startup with Production Workload Shift
    • Other Issues In a Real Disaster
    • Will your key staff want to travel?
    • Will they be able to travel?
  • 43.
    Example (and Very High Level) DR Plan
    • Executive/Management Approval for Activation of DR
    • Isolate Data Centers
    • Halt Incoming Network Traffic
      – Static “Temporarily Unavailable” Web Page
    • Break Disk Synchronization
    • Sever Network Links Between Data Centers
    • Start and Recovery of Surviving Data Center
    • Restore/Recover Hardware and Middleware
    • Start DB, Messaging and Application Servers
    • Examine DB and Message provider logs for pending transactions
    • Recover Pending transactions and messages
    • Start Accepting New Work in Surviving Data Center
    • Enable Network
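A plan like the one above is easier to drill when it is scripted. The following is a minimal sketch of an ordered runbook with an approval gate that stops at the first failed step. Every action is a placeholder that simply returns True; a real runbook would call site-specific scripts, and the names `run_dr_plan` and `DR_PLAN` are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def placeholder() -> bool:
    """Stand-in for a real, site-specific automation script (assumption)."""
    return True


# Steps in the order given on the slide.
DR_PLAN: List[Step] = [
    ("halt incoming network traffic", placeholder),
    ("break disk synchronization", placeholder),
    ("sever network links between data centers", placeholder),
    ("restore/recover hardware and middleware", placeholder),
    ("start DB, messaging and application servers", placeholder),
    ("examine DB and message provider logs", placeholder),
    ("recover pending transactions and messages", placeholder),
    ("enable network", placeholder),
]


def run_dr_plan(approved: bool, steps: List[Step]) -> List[str]:
    """Run the steps in order and return the names of the completed steps."""
    if not approved:
        # DR is never triggered automatically; a person must approve it.
        raise PermissionError("DR activation requires management approval")
    completed: List[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"step failed: {name}; completed so far: {completed}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    for done in run_dr_plan(approved=True, steps=DR_PLAN):
        print("done:", done)
```

Keeping the approval flag as an explicit argument reflects the advice that site failover should be a deliberate human decision rather than an automated reaction.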
  • 44.
    Other Aspects to Consider
    • An HA, CA or DR Deployment Architecture is Not a Product Feature.
    • WAS and the WebSphere portfolio products
      – Provide HA Features and Function
      – Can Be Employed in an HA Architecture
      – The Appropriate Environment Varies by Customer
      – One Size Does Not Fit All!
    • Optimizing WebSphere HA Capabilities into a Robust Deployment
    • Requires In-depth Understanding
      – Of Environment
      – Of Applications
      – Of Operational Requirements (Service Levels)
    • Architectural Advice May Require ISSW Assistance
  • 45.
    Learn from Your Mistakes
    • Mistakes and failures will occur; learn from them
    • What separates mediocre organizations from the good and great isn't so much perfection as it is the constant striving to get better and not repeat mistakes
    • After every outage, perform a root cause analysis
      – Capture diagnostic information
      – Meet as a team including all key players to discuss
      – Determine precisely what went wrong
        • Wrong doesn't mean “Bob made an error.”
        • Find the process flaw that led to the problem
      – Determine a corrective action that will prevent this from happening again
        • If you can't, determine what diagnostic information is needed next time this happens and ensure it is collected
      – Implement that corrective action
        • All too often this last step isn't done
      – Verify that the action corrected the problem
    • A senior manager must own this process
  • 46.
    Indispensable When Planning for Catastrophe
    • Think!
  • 47.
  • 48.
    Notices and Disclaimers
    Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM.
All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
  • 49.
    Notices and Disclaimers (cont’d)
    Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.
    • IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™, Global Business Services ®, Global Technology Services ®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
  • 50.
    Thank You
    Your Feedback is Important!
    Access the InterConnect 2015 Conference CONNECT Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.
  • 51.
  • 52.
    Shameless Self Promotion
    IBM WebSphere Deployment and Advanced Configuration
    By Roland Barcia, Bill Hines, Tom Alcott and Keys Botzum
    ISBN: 0131468626
  • 53.
    Another Recommended Book
    IBM WebSphere v5.0 System Administration
    By Leigh Williamson, Lavena Chan, Roger Cundiff, Shawn Lauzon and Christopher C. Mitchell
    ISBN: 0131446045
  • 54.
    Licensing Servers as Backup Servers
    From IBM Contracts and Practices Database
    • The policy is to charge for HOT, and not for WARM or COLD backups. The following are definitions of what constitutes HOT-WARM-COLD backups:
    • All programs running in backup mode must be under the customer's control, even if running at another enterprise's location.
    • COLD - a copy of the program may be stored for backup purposes on a machine as long as the program has not been started.
      • There is no charge for this copy.
    • WARM - a copy of the program may reside for backup purposes on a machine and is started, but is "idling", and is not doing any work of any kind.
      • There is no charge for this copy.
    • HOT - a copy of the program may reside for backup purposes on a machine, is started and is doing work. However, this program must be ordered.
      • There is a charge for this copy.
    • "Doing Work" includes, for example, production, development, program maintenance, and testing. It also could include other activities such as mirroring of transactions, updating of files, synchronization of programs, data or other resources (e.g. active linking with another machine, program, data base or other resource, etc.) or any activity or configurability that would allow an active hot-switch or other synchronized switch-over between programs, data bases, or other resources to occur.
    • Refer to http://www-03.ibm.com/software/sla/sladb.nsf/pdf/policies/$file/Feb-2003-IPLA-backup.pdf for more information
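The HOT/WARM/COLD definitions above reduce to two questions: has the program been started, and is it doing work? The sketch below encodes only that reading of the slide. It is an illustration of the decision logic, not a statement of current IBM licensing terms, and the names `BackupMode`, `classify_backup` and `is_chargeable` are hypothetical.

```python
from enum import Enum


class BackupMode(Enum):
    COLD = "cold"  # installed but never started
    WARM = "warm"  # started but idling, doing no work of any kind
    HOT = "hot"    # started and doing work


def classify_backup(started: bool, doing_work: bool) -> BackupMode:
    """Classify a backup copy using the two criteria given on the slide."""
    if doing_work and not started:
        raise ValueError("a program that has not been started cannot be doing work")
    if not started:
        return BackupMode.COLD
    return BackupMode.HOT if doing_work else BackupMode.WARM


def is_chargeable(mode: BackupMode) -> bool:
    """Per the slide, only HOT backups carry a charge."""
    return mode is BackupMode.HOT


if __name__ == "__main__":
    # A standby that mirrors transactions counts as doing work, so it is HOT.
    standby = classify_backup(started=True, doing_work=True)
    print(standby.name, is_chargeable(standby))
```

Note that the slide's definition of "doing work" is broad, so a standby that synchronizes data or supports a hot switch-over would be passed in with `doing_work=True`.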