1. High Availability and Disaster Recovery in IBM PureApplication System
Scott Moonen <smoonen@us.ibm.com>
2. Agenda
• Principles and definitions
• HA and DR tools in PureApplication System
• Composing tools to meet your requirements
• Caveats
• Resources
3. Principles and definitions
4. Principles and definitions: HA and DR
• Business continuity
The ability to recover business operations within specified parameters in case of specified disasters.
• Continuous availability
Operation of a system where unplanned outages prevent operation for at most 5 minutes per year ("five nines" or 99.999% availability).
• High availability
Operation of a system where unplanned outages prevent operation for at most a few seconds or minutes while failover occurs. Often used as an umbrella term that includes continuous availability.
• Disaster recovery
Operation of a system with a plan and process for reconstructing or recovering operations in a separate location in case of disaster.
5. Principles and definitions: Active, Passive, etc.
• Active–Active
A system where continuous or high availability is achieved by having active operation in multiple locations.
• Active–Standby (or "warm standby")
A system where high availability is achieved by having active operation in one location, with another location or locations able to become active within seconds or minutes, without a "failover" of responsibility.
• Active–Passive (or "cold standby")
A system where high availability or disaster recovery is achieved by having active operation in one location, with another location or locations able to become active within minutes or hours after a "failover" of responsibility.
6. Principles and definitions: RTO and RPO
• RTO: recovery time objective
How long it takes for an HA or DR procedure to bring a system back into operation.
• RPO: recovery point objective
How much data (measured in elapsed time) might be lost in the event of a disaster.
(Diagram: RPO scale from zero through seconds, minutes, and hours to days; mirrored file systems sit nearest zero, then replicated file systems, then backup and restore.)
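To make RPO concrete, here is a minimal sketch (plain Python; the write rate and RPO values are hypothetical examples, not tied to any PureApplication feature) that estimates worst-case data loss for each point on the scale above:

```python
# Illustrative only: data written during the RPO window is what a disaster could lose.
def worst_case_loss_mb(write_rate_mb_per_s: float, rpo_seconds: float) -> float:
    return write_rate_mb_per_s * rpo_seconds

write_rate = 5.0  # MB/s of persistent writes (hypothetical)

for approach, rpo_s in [
    ("mirrored file system (near-zero RPO)", 0.0),
    ("asynchronous replication (~1 s RPO)", 1.0),
    ("nightly backup and restore (up to 24 h RPO)", 24 * 3600.0),
]:
    print(f"{approach}: up to {worst_case_loss_mb(write_rate, rpo_s):,.0f} MB at risk")
```

RTO is measured separately: it is the time to complete the recovery procedure, regardless of how much data the RPO window puts at risk.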
7. Principles and definitions: Scenarios
• Metropolitan distance: multiple data centers within 100–300km
–High availability is achievable using Active–Active or Active–Standby solutions that involve active mirroring of data between sites.
–Disaster recovery with zero RPO is achievable using Active–Passive solutions that involve replication of data between sites.
• Regional to global distance: multiple data centers beyond 200–300km
–Disaster recovery with nonzero RPO is achievable using Active–Passive solutions that involve replication of data between sites.
8. Principles and definitions: Personas
• Application architect
Responsible for planning the application design in such a way that high availability or disaster recovery is achievable (e.g., separating application from data).
• Infrastructure administrator
Responsible for configuring and managing infrastructure in such a way as to achieve the ability to implement high availability or disaster recovery (e.g., configuring and managing disk mirroring or replication).
• Application administrator
Responsible for deploying and managing the components of an application in such a way as to achieve high availability or disaster recovery (e.g., deploying the application in duplicate between two sites and orchestrating the failover of the application and its disks together with the infrastructure administrator).
9. Principles: Automation and repeatability
• Automate all aspects of your application's deployment and configuration
–Using PureApplication patterns, pattern components, script packages, customized images
–Using external application lifecycle tooling such as IBM UrbanCode Deploy
• Why? This achieves rapid and confident repeatability of your application deployment, allowing:
–Quality and control: lower risk and chance of error
–Agility and simplicity:
• Quickly recover the application if you need to redeploy it
• Quickly deploy your application at separate sites for HA or DR purposes
• Quickly deploy new versions of the application for test or upgrade purposes
• Create a continuous integration lifecycle for faster and more frequent application deployment and testing
–Portability: deploy to other cloud environments (e.g., PureApplication Service)
10. Principles: Separation of application and data
• Ensure that all persistent data (transaction logs, database, etc.) is stored on separate disks from the application or database application itself
• Why? This multiplies your recovery options because it decouples your strategy for application and data recovery, which often must be addressed in different ways:
–Application recovery may involve backup & restore, re–deployment, or multiple deployment. Often the application cannot be replicated due to infrastructure entanglement.
–Data recovery may involve backup & restore, replication, or mirroring
• This also allows additional flexibility for development and test cycles, for example:
–Deploy new versions of the application or database server and connect to original data
–Deploy test instances of the application using copies of the production data
11. Principles: Transaction consistency
If your application stores data in multiple locations (e.g., transaction logs on a file server and transactions in a database), then you must ensure that either:
• The "lower" statements of record are replicated with total consistency together with the "higher" statements of record, or else
• The "lower" statements of record are at all times replicated in advance of the "higher" statements of record.
This ensures that you do not replicate inconsistent data (e.g., the transaction log indicates a transaction is committed but the transaction is not present in the database). So, for example, either:
• Your database and file server disks are replicated together with strict consistency, or instead
• Your database is replicated synchronously (zero RPO) but your file server asynchronously (nonzero RPO).
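As a small illustration of the inconsistency this slide warns about (hypothetical transaction IDs; this is not a PureApplication feature), the check below flags transactions that the replicated log claims are committed but that the replicated database does not contain. Keeping the database replica at least as current as the log replica, for example synchronous database replication with asynchronous file-server replication, guarantees this set stays empty.

```python
# Illustrative consistency check over hypothetical replicated state.
def find_inconsistencies(log_commits: set, db_transactions: set) -> set:
    """Transactions the replicated log marks committed but the replicated DB lacks."""
    return log_commits - db_transactions

replicated_log_commits = {"tx-101", "tx-102", "tx-103"}  # hypothetical
replicated_db_rows = {"tx-101", "tx-102"}                # hypothetical

missing = find_inconsistencies(replicated_log_commits, replicated_db_rows)
if missing:
    print("Inconsistent replica: committed in log but missing from database:", missing)
```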
12. HA and DR tools in PureApplication System
13. Tools: Compute node availability
• PureApplication System offers two options for planning for failure of compute nodes:
–Cloud group HA, if enabled, will reserve 1/n of the CPU and memory as overhead on each compute node in a cloud group containing n compute nodes. If one compute node fails, all VMs will be recovered into this reserved space on the remaining nodes.
–System HA allows you to designate one or more compute nodes as spares for all cloud groups that are enabled for system HA. This allows you both to (1) allocate more than one spare, and (2) share a spare between multiple cloud groups.
• If neither cloud group HA nor system HA is enabled and a compute node fails, the system will attempt to recover as many VMs as possible on the remaining nodes in the cloud group, in priority order.
• VMs being recovered will experience an outage equivalent to being rebooted.
• Recommendation: always enable cloud group HA or system HA
–This ensures your workload capacity is restored quickly after a compute node failure
–This also ensures that workload does not need to be stopped for planned compute node maintenance
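A rough illustration of the capacity math behind these two options (simple arithmetic only; the node and core counts below are hypothetical):

```python
# Illustrative capacity comparison, assuming identical compute nodes.
def cloud_group_ha_usable(nodes: int, cores_per_node: int) -> float:
    # Cloud group HA reserves 1/n of each node in an n-node cloud group,
    # leaving the equivalent of n - 1 nodes usable for workload.
    return nodes * cores_per_node * (1 - 1 / nodes)

def system_ha_usable(nodes: int, cores_per_node: int, spares: int) -> int:
    # System HA dedicates whole compute nodes as spares, which can be
    # shared across all cloud groups enabled for system HA.
    return (nodes - spares) * cores_per_node

cores = 32  # hypothetical cores per compute node
for n in (3, 6):
    print(f"{n} nodes: cloud group HA usable = {cloud_group_ha_usable(n, cores):.0f} cores; "
          f"system HA with 1 spare usable = {system_ha_usable(n, cores, 1)} cores")
```

Both approaches set aside the equivalent of one node in this example; the practical difference is that a system HA spare (or several) can cover multiple cloud groups, rather than each cloud group reserving its own headroom.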
14. Tools: Block storage
Block storage volumes in PureApplication System:
• May be up to 8TB in size
• Are allocated and managed independently of VM storage, and can be attached and detached
• Are not included in VM snapshots
• Can be cloned (copied)
• Can be exported to and imported from external scp servers
• Can be organized into groups of volumes for time–consistent cloning or export of multiple volumes
15. Tools: Shared block storage
• Block storage volumes may be shared (simultaneously attached) by virtual machines
–On the same system. Note: this is supported on Intel, and on Power beginning with V2.2.
–Between systems. Notes:
• This is supported only for external block storage that resides outside of the system (see later slide).
• This is supported on Intel. Support on Power is forthcoming.
• This allows for creation of highly available clusters (GPFS, GFS, DB2 pureScale, Windows cluster)
–A clustering protocol is necessary for sharing of the disk
–The IBM GPFS pattern (see later slide) supports GPFS clusters on a single rack using shared block storage, but does not support cross–system clusters using shared external block storage
• Restrictions
–Storage volumes must be specifically created as "shared" volumes
–Special placement techniques are required in the pattern to ensure anti–collocation of VMs
–The IBM GPFS pattern supports clustering (see below)
16. Tools: Block storage replication
Two PureApplication Systems can be connected for replication of block storage:
• Connectivity options
–Fibre Channel connectivity is supported beginning in V2.0
–TCP/IP connectivity is supported beginning in V2.2
• Volumes are selected for replication individually
–Replicate in either direction
–Replicate synchronously up to 3ms latency (~300km), or asynchronously up to 80ms latency (~8000km). RPO for asynchronous replication is up to 1 second.
• All volumes are replicated together with strict consistency
• The target volume must not be attached while replication is taking place
• Replication may be terminated (unplanned failover) or reversed in place (planned failover). Reversing in place requires the volume to be unattached on both sides.
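The latency figures above imply a simple mode selection; here is a minimal sketch (plain Python; the thresholds come from this slide and the measured latencies are hypothetical):

```python
# Illustrative choice of replication mode from measured inter-system latency,
# using the thresholds quoted on this slide.
def replication_mode(latency_ms: float) -> str:
    if latency_ms <= 3:
        return "synchronous replication (zero RPO)"
    if latency_ms <= 80:
        return "asynchronous replication (RPO up to ~1 second)"
    return "beyond the supported range for block storage replication"

for measured_ms in (1.2, 12.0, 95.0):  # hypothetical measurements
    print(f"{measured_ms} ms -> {replication_mode(measured_ms)}")
```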
17. Tools: External block storage
• PureApplication System can connect to external SVC, V7000, and V9000 devices:
–Allows block and "shared" block volumes to be accessed by VMs on PureApplication System. Base VM disks cannot reside on external storage.
–Depending on extent size, allows for volumes larger than 8TB
–Requires both TCP/IP and Fibre Channel connectivity to the external device
• All volume management is performed outside of the system
–Volumes are allocated and deleted by an administrator on the external device
–Alternate storage providers, RAID configurations, or combinations of HDD and SSD may be used
–Volumes may be mirrored externally (e.g., SVC–managed mirroring across multiple devices)
–Volumes may be replicated externally (e.g., SVC–to–SVC replication between data centers)
• Advanced scenarios: sharing access to the same SVC cluster or V7000, or to replicated ones
–Two systems sharing access to the cluster or to replicated volumes
–PureApplication System and PureApplication Software sharing access to the cluster or to replicated volumes
18. Tools: IBM GPFS (General Parallel File System) / Spectrum Scale
• GPFS is:
–A shared filesystem (like NFS)
–Optionally: a clustered filesystem (unlike NFS) providing HA and high performance. Note: clustering is supported on Power Systems beginning with V2.2.
–Optionally: mirrored between cloud groups or systems
• A tiebreaker (on a third rack or external system) is required for quorum
• Mirroring is not recommended above 1–3ms (~100–300km) latency
–Optionally: replicated between systems (using block storage or external storage replication)
(Diagram: shared, clustered, mirrored, and replicated GPFS topologies, each showing GPFS server and client VMs with data volumes; the mirrored topology includes a tiebreaker.)
19. Tools: Multi–system deployment
• Connect systems in a "deployment subdomain" for cross–system pattern deployment
–Virtual machines for individual vsys.next or vapp deployments may be distributed across systems
–Allows for easier deployment and management of highly available applications using a single pattern
–Systems may be located in the same or different data centers
• Notes and restrictions
–Up to four systems may be connected (the limit is two systems prior to V2.2)
–Inter–system network latencies must be less than 3ms (~300km)
–An external 1GB iSCSI tiebreaker target must be configured for quorum purposes
–Special network configuration is required for inter–system management communications
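These restrictions read naturally as a pre-deployment checklist. The sketch below encodes them as simple checks (illustrative only; the inputs are hypothetical measurements and inventory, and this is not a PureApplication tool):

```python
# Illustrative pre-check for a deployment subdomain, based on the restrictions above.
def check_deployment_subdomain(num_systems: int, worst_latency_ms: float,
                               tiebreaker_configured: bool, version: tuple) -> list:
    problems = []
    max_systems = 4 if version >= (2, 2) else 2
    if num_systems > max_systems:
        problems.append(f"{num_systems} systems exceeds the limit of {max_systems} for this version")
    if worst_latency_ms >= 3:
        problems.append(f"inter-system latency of {worst_latency_ms} ms is not below 3 ms (~300 km)")
    if not tiebreaker_configured:
        problems.append("no external iSCSI tiebreaker target configured for quorum")
    return problems

# Hypothetical environment: three V2.2 systems, 2.1 ms worst-case latency, tiebreaker in place.
print(check_deployment_subdomain(3, 2.1, True, (2, 2)) or "looks OK")
```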
20. Composing tools to meet your requirements
21. Scenario: Test application, middleware, or schema update
Copy block storage from the production application for use in testing.
(Diagram: a production database, its data volume, and application; the data volume is copied and attached to a test database and test application.)
22. Scenario: Update application or middleware
When both the current and new application and middleware can share the same database without conflict (e.g., no changes to the database schema), you can run the newer version of the application or middleware side by side for testing, and then eventually direct clients to the new version and retire the old version.
(Diagram: a single database and data volume shared by the existing application and the new App V2.)
23. Scenario: Backward–incompatible updates to database or schema
In some cases, a new version of an application, database server, or database schema may be unable to coexist with the existing application. In this case, you can use the "copy" strategy from a previous slide to test the upgrade of your application. When you are ready to promote the new version to production, you can detach the block storage from the existing deployment and attach it to the upgraded deployment.
(Diagram: the data volume is detached from the existing database and application and attached to DB V2 and App V2.)
24. Scenario: HA planning for compute node failure
Principles:
• Deploy multiple instances of each service so that each service continues if one instance is lost
• Enable cloud group or system HA so that failed instances can be recovered quickly
(Diagram: a load balancer in front of two application instances, a database primary and secondary connected by HADR with their data volumes, and a clustered GPFS file system with shared data.)
25. Scenario: Recovery planning for VM failure or corruption
Three scenarios:
• Backup and restore of the VM itself is feasible if it can be recovered in place
• If the VM cannot be recovered:
–If the VM is part of a horizontally scalable cluster, you can scale in to remove the failed VM and scale out to create a new VM
–If the VM is not horizontally scalable, you must plan to re–deploy it:
• You can deploy the entire pattern again and recover the data to it
• You may be able to deploy a new pattern that recreates only the failed VM, and use manual or scripted configuration to reconnect it to your existing deployment
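The three cases above reduce to a small decision tree; a minimal illustrative sketch (hypothetical inputs, no real API calls):

```python
# Illustrative decision tree for recovering a failed or corrupted VM,
# following the cases on this slide.
def vm_recovery_plan(recoverable_in_place: bool, horizontally_scalable: bool) -> str:
    if recoverable_in_place:
        return "restore the VM from backup in place"
    if horizontally_scalable:
        return "scale in to remove the failed VM, then scale out to create a new one"
    return ("re-deploy: either deploy the entire pattern again and recover the data onto it, "
            "or deploy a pattern that recreates only the failed VM and reconnect it by "
            "manual or scripted configuration")

print(vm_recovery_plan(recoverable_in_place=False, horizontally_scalable=True))
```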
26. Scenario: Recovery planning for database corruption
You may use your database's own capabilities for backup and restore or import and export. Alternatively, you may use block storage copies (and optionally export and import) to back up your database. Attach the backup copy (importing it beforehand if necessary) to restore.
(Diagram: the database's data volume is copied; the copy can be exported and imported, and is detached and attached to restore.)
27. Scenario: HA planning for system or site failure
• As with planning for compute node failure, deploy multiple instances: now across systems.
• You may deploy separately on each system, or use multi–system deployment across systems.
• The distance at which HA is possible is limited.
• GPFS clustering is optional. It can provide additional throughput and also additional availability on a single system.
(Diagram: a load balancer in front of applications on System A and System B; a database primary on one system and secondary on the other connected by HADR; GPFS clusters on each system, mirrored between systems with a tiebreaker.)
28. Scenario: Two–tier HA planning for system or site failure
• Compared to the previous slide, if you desire HA both within a site and also between sites, you must duplicate your application, database, and filesystem both within and between sites.
• Native database replication between sites must be synchronous, or may be asynchronous if you have no need of GPFS (see slide 11).
(Diagram: a load balancer or DNS in front of Site A and Site B; Site A has an application, a database primary and secondary connected by HADR, and a GPFS cluster; Site B has an application that may be standby, a further HADR database secondary, and a GPFS cluster; GPFS is mirrored between sites with a tiebreaker.)
29. Scenario: DR planning for rack or site failure
• You should expect nonzero RPO if the sites are too far apart to allow synchronous replication
• Applications must be quiesced at the recovery site because the replicated disks are inaccessible
• The database is replicated here using disk replication for transaction consistency. You can use native database replication (as on slide 28) only if it is synchronous, or asynchronous replication only if you have no need of GPFS (see slide 11).
(Diagram: a load balancer or DNS in front of System A and System B; each system has an application, a database primary and secondary connected by HADR, and a GPFS cluster; System A's volumes are copied to System B by disk replication, and System B's application is quiesced.)
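One way to picture the failover sequence this scenario implies is the hedged sketch below. Every function here is hypothetical, a placeholder for the infrastructure and application administrators' own procedures; the ordering simply follows slides 16 and 29: stop or reverse replication, attach the now-writable volumes, then start the database and application and redirect clients.

```python
# Hypothetical failover orchestration sketch for this DR scenario.
# None of these functions exist in PureApplication; they stand in for manual
# or scripted steps performed by the administrators.
def fail_over_to_recovery_site(planned: bool) -> None:
    if planned:
        reverse_replication_in_place()   # requires volumes unattached on both sides
    else:
        terminate_replication()          # unplanned failover; accept the nonzero RPO
    attach_replicated_volumes()          # volumes become usable at the recovery site
    start_database()                     # bring the database back up on the replica
    start_application()                  # previously quiesced while disks were inaccessible
    redirect_clients()                   # load balancer or DNS switch

# Placeholder implementations so the sketch runs:
def reverse_replication_in_place(): print("reverse replication in place")
def terminate_replication():        print("terminate replication")
def attach_replicated_volumes():    print("attach replicated volumes")
def start_database():               print("start database")
def start_application():            print("start application")
def redirect_clients():             print("redirect clients")

fail_over_to_recovery_site(planned=False)
```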
30. Scenario: Horizontal scaling and bursting
• Use of the base scaling policy allows you to horizontally scale new instances of a virtual machine with clustered software components, manually or in some cases automatically.
• When using multi–system deployment, horizontally scaled virtual machines will be distributed as much as possible across the systems referenced in your environment profile.
• An alternate approach, especially in heterogeneous environments such as PureApplication System and PureApplication Service, is to deploy new pattern instances for scaling or bursting and federate them together.
31. Caveats
32. Caveats: Networking considerations
• Some middleware (e.g., WAS) is sensitive to IP addresses and hostnames, and for DR purposes you may need to plan to duplicate either IP addresses or hostnames in your backup data center.
• Both HA architectures and zero–RPO DR architectures are sensitive to latency. If latency is too high you can experience poor write throughput or even mirroring or replication failure. For these cases you should ideally plan for less than 1ms (~100km) of latency between sites.
• You must also plan for adequate network throughput between sites when mirroring or replicating.
• HA architectures require the use of a tiebreaker to govern quorum–leader determination in case of a network split. In a multi–site HA design, you should plan to locate the tiebreaker at a third location with equally low latency.
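The latency-to-distance figures used throughout this deck (roughly 1ms per 100km, 3ms per 300km, 80ms per 8000km) follow from the propagation speed of light in optical fibre. A quick sanity check (illustrative arithmetic only; the distances are hypothetical):

```python
# Best-case round-trip latency from site separation. Light in optical fibre
# covers roughly 200 km per millisecond one way; real networks add switching
# and protocol overhead on top of this floor.
FIBRE_KM_PER_MS = 200.0

def min_round_trip_ms(distance_km: float) -> float:
    return 2 * distance_km / FIBRE_KM_PER_MS

for km in (100, 300, 8000):  # hypothetical site separations
    print(f"{km} km apart: at least {min_round_trip_ms(km):.0f} ms round trip")
```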
33. Caveats: Middleware–specific considerations
• Combining both mirroring and replication (Active–Active–Passive–Passive)
–The IBM GPFS pattern does not support combining both mirroring and replication
–This combination is possible for other middleware (e.g., DB2 as on slide 29), but you must manually determine and designate which instance is Primary or Secondary at the time of recovery
• Carefully read your middleware's recommendations for configuring HA. For example:
–IBM WebSphere recommends against cross–site cells
–The IBM DB2 HADR pattern preconfigures a reservationless IP–based tiebreaker, which is not recommended
–IBM DB2 HADR provides a variety of synchronization modes with different RPO characteristics
• Ensure your middleware tolerates attaching existing storage if you replicate or copy volumes
–The IBM DB2 HADR pattern requires an empty disk when first deploying. You can attach a new disk or replicate into this disk only after deployment.
–The IBM GPFS pattern does not support attaching existing GPFS disks
34. Caveats: Virtual machine backup and restore
The power and flexibility of PureApplication patterns means that your PureApplication VMs are tightly integrated both within a single deployment and with the system on which they are deployed. Because of this tight integration, you cannot use backup and restore techniques to recover your PureApplication VMs unless you are recovering to the exact same virtual machine that was previously backed up. Your cloud strategy for recovering corrupted deployments should build on the efficiency and repeatability of patterns so that you are able to re–deploy in the event of extreme failure scenarios such as accidental virtual machine deletion or total system failure.
35. Caveats: Practice, practice, practice
Because of the complexity of HA and DR implementation, and especially because of some of the caveats we have noted and which you may encounter in your unique situation, it is vital to practice all aspects of your HA or DR implementation and lifecycle before you roll it out into production. This includes testing network bandwidth and latency to their expected limits. It also includes simulating failures and verifying and perfecting your procedures for recovery and for failback.
36. Resources
37. Resources
• Implementing High Availability and Disaster Recovery in IBM PureApplication Systems V2
http://www.redbooks.ibm.com/abstracts/sg248246.html
• "Implement multisystem management and deployment with IBM PureApplication System"
http://www.ibm.com/developerworks/websphere/techjournal/1506_vanrun/1506_vanrun-trs.html
• "Demystifying virtual machine placement in IBM PureApplication System"
http://www.ibm.com/developerworks/websphere/library/techarticles/1605_moonen-trs/1605_moonen.html
38. Resources, continued
• "High availability (again) versus continuous availability"
http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html
• "Can I run a WebSphere Application Server cell over multiple data centers?"
http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d
• "Increase DB2 availability"
http://www.ibm.com/developerworks/data/library/techarticle/dm-1406db2avail/index.html
• "HADR sync mode"
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/DB2HADR/page/HADR%20sync%20mode
