Metro Cluster High Availability or SRM Disaster Recovery?

This presentation explains the difference between multisite high availability (aka metro cluster) and disaster recovery. The general concepts apply to any product, but the presentation is tailored to VMware technologies.

Published in: Technology

  1. 1. Metro Cluster High Availability or SRM Disaster Recovery? Demystifying myths. David Pasek, VMware PSO, TAM, VCDX #200; Stanislav Jurena, VMware PSO, TAM, VCAP-DCD/DCA. VMUG Prague, 2016 Dec 6. © 2014 VMware Inc. All rights reserved.
  2. 2. Agenda: 1. Business Continuity 2. High Availability 3. Disaster Recovery 4. Disaster Avoidance 5. Multisite High Availability or Disaster Recovery? 6. Q & A
  3. 3. Business Continuity
  4. 4. Business Continuity - Definition • Business continuity encompasses planning and preparation to ensure that an organization can continue to operate in case of serious incidents or disasters and is able to recover to an operational state within a reasonably short period. As such, business continuity includes three key elements: – Resilience (High Availability): critical business functions and the supporting infrastructure must be designed in such a way that they are materially unaffected by relevant disruptions, for example through the use of redundancy and spare capacity. – Recovery (Disaster Recovery): arrangements have to be made to recover or restore critical and less critical business functions that fail for some reason. – Contingency: the organization establishes a generalized capability and readiness to cope effectively with whatever major incidents and disasters occur, including those that were not, and perhaps could not have been, foreseen. Contingency preparations constitute a last-resort response if resilience and recovery arrangements should prove inadequate in practice. Source: https://en.wikipedia.org/wiki/Business_continuity • Not mentioned in Wikipedia: – Mitigation (Disaster Avoidance): the organization can improve contingency planning with mitigation planning, that is, proactively take action to avoid unexpected disasters.
  5. 5. Business Continuity - Terminology • General concepts and terminology – Business Continuity must be based on a BIA (Business Impact Analysis) • RPO (Recovery Point Objective) and RTO (Recovery Time Objective): infrastructure level • WRT (Work Recovery Time): application level • MTD (Maximum Tolerable Downtime) = RTO + WRT: business level – High Availability, Disaster Recovery, Disaster Avoidance – Availability Zones, Regions (Figure labels: less than ~60 km, more than ~60 km, High Availability)
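The relation MTD = RTO + WRT from the terminology slide can be checked with a few lines of Python. This is a minimal sketch: the function names and the example durations are assumptions made for illustration and do not come from the presentation.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD = RTO + WRT, as stated on the terminology slide."""
    return rto + wrt


def fits_business_target(rto: timedelta, wrt: timedelta, mtd_target: timedelta) -> bool:
    """True when infrastructure recovery plus application recovery fits the business limit."""
    return maximum_tolerable_downtime(rto, wrt) <= mtd_target


# Hypothetical figures, for illustration only.
rto = timedelta(minutes=30)  # infrastructure level: VMs are running again
wrt = timedelta(hours=2)     # application level: services verified and usable
mtd = maximum_tolerable_downtime(rto, wrt)

print(mtd)
print(fits_business_target(rto, wrt, mtd_target=timedelta(hours=4)))
```

The point of the split is that a low RTO alone says little about the business outcome: a long or unpredictable WRT can still push the total past the MTD agreed in the BIA.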
  6. 6. Business Continuity / High Availability
  7. 7. Business Continuity / High Availability • High Availability technologies – Self-initiated failover without human intervention – A master node or software arbiter is required – A multisite HA solution requires a third site for the arbiter • VMware HA Cluster Solutions – Local vSphere High Availability Cluster (vSphere HA) – Multisite vSphere Metro Storage Cluster (vMSC)
  8. 8. Single site vSphere HA Cluster. We all know that, right? (Figure labels: Single Site; Shared Storage: FC, iSCSI, NFS, VSAN.) Local vSphere HA Cluster (single clustered system in a single availability zone) • Protection against • Physical server failure (ESXi host monitoring) • OS failure on top of ESXi (guest OS monitoring) • App failure on top of ESXi (app monitoring) • System Requirements • Shared local storage (Fibre Channel, SAS, iSCSI, NFS, SDS like VSAN) • Flat L2 networks for VMs • Software arbiter: the master node of the HA cluster
  9. 9. Multisite vSphere Metro Storage Cluster. Not so common in the field, but a very popular topic. (Figure labels: Multisite; Shared Storage: FC, iSCSI, NFS, VSAN; Distributed LUN across two storage systems; Storage System A; Storage System B; Storage Witness.) Multisite vSphere Metro Storage Cluster (single clustered system over two availability zones) • Protection against • Various storage array failures • Failure of the whole storage array at a single site • Complete site failure • Anticipated disaster (Disaster Avoidance) • System Requirements • Shared stretched storage volumes / LUNs distributed across two storage arrays and visible/mounted to ESXi • A third zone is required to host the arbiter • Flat L2 networks for VMs
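Why the arbiter needs its own third zone can be illustrated with a simple majority vote. The sketch below is a generic quorum model, not the mechanism of any particular storage vendor or of vSphere HA; the voter names and the `may_stay_active` function are hypothetical.

```python
from typing import FrozenSet, Set

# Hypothetical voters: two storage sites plus a witness in a third zone.
VOTERS: FrozenSet[str] = frozenset({"site_a", "site_b", "witness"})


def may_stay_active(site: str, reachable: Set[str],
                    voters: FrozenSet[str] = VOTERS) -> bool:
    """A site keeps serving I/O only if it sees a strict majority of voters."""
    visible = ({site} | reachable) & voters
    return 2 * len(visible) > len(voters)


# The inter-site link fails; the witness can still reach site_a only.
print(may_stay_active("site_a", {"witness"}))
print(may_stay_active("site_b", set()))

# Same partition without a witness: neither site can prove it holds a majority.
two_sites = frozenset({"site_a", "site_b"})
print(may_stay_active("site_a", set(), two_sites))
print(may_stay_active("site_b", set(), two_sites))
```

With three voters exactly one side of a partition can win, so the surviving site continues and the isolated one stops. With only two voters, a partition leaves both sides without a majority: either both stop, or both continue and risk split brain.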
  10. 10. Business Continuity / High Availability - HA2 Metro Storage Cluster (vMSC) – Advantages • Positive impact on RTO during a single storage or site failure – Faster disaster recovery because VMs are automatically restarted without human interaction • Higher protection (redundancy) against specific infrastructure failures – Protection against single storage array failure – Protection against complete site failure • Non-disruptive Disaster Avoidance – VM workloads vMotion between availability zones – VMs do not need to be restarted = a higher VM availability SLA can be achieved • Operational simplicity – Design, implement, test and forget. Then pray that it will work when needed. – Schedule periodic tests to be sure it really works. – Disadvantages • Single stretched fault zone • Complex clustering techniques that depend heavily on the particular storage vendor • No test plan: it can be tested only by simulating a real failure – Business-critical application owners will not accept real failures. • App start order and dependencies cannot be enforced = negative impact on WRT and MTD • A third site is required for the software arbiter (arbiter, witness, tie-breaker)
  11. 11. Business Continuity / Disaster Recovery
  12. 12. Business Continuity / Disaster Recovery • SRM, the VMware DR technology = human-initiated failover (human arbiter) – Should be implemented between regions but can be implemented between availability zones as well – Only two regions are required because a human arbiter can run recovery from anywhere without split brain – Can be implemented for more regions (N : M) – Independent fault zones: data replication and the L3 network are the only common denominators among sites – Network connectivity should be L3 (routed) to mitigate fault propagation (broadcast storms, unknown unicast flooding, etc.) – All infrastructure services have to be duplicated in each region (NTP, DNS, Active Directory, vCenter, etc.) – DR orchestration: application dependencies (start order) can and should be specified
  13. 13. Business Continuity / Disaster Recovery • DR (VMware SRM) – Advantages • Positive impact on WRT – VMs restart in priority order, respecting application dependencies – Runbook (SRM Recovery Plan) • Independence from failures in other regions • Mitigation of false-positive failures and unnecessary failovers – Human initiation of DR failover; business approval required • DR tests without impact on production – Detailed report of performed DR tests – Disadvantages • Higher RTO – Have to wait for human interaction (business approval before failover) – Storage replication has to be broken and volumes / LUNs have to be mounted to ESXi hosts at the recovery site – All VMs in a single recovery plan are started in parallel, but only 10 recovery plans can be executed concurrently • Operational and business overhead – A BIA must exist – Protection Groups and Recovery Plans have to be defined based on the BIA – Recovery Plans have to be tested – Operational personnel have to be trained
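The idea of restarting VMs in an order that respects application dependencies can be sketched as a topological sort. This is an illustration only and does not use or model the SRM API; the VM names and the `start_batches` helper are hypothetical. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def start_batches(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group VMs into start batches; a VM starts only after everything it depends on."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # VMs whose dependencies are all started
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical business service: each VM maps to the VMs it depends on.
plan = {
    "dns": set(),
    "database": {"dns"},
    "app-server": {"database"},
    "reporting": {"database"},
    "web-frontend": {"app-server"},
}

for number, batch in enumerate(start_batches(plan), start=1):
    print(number, batch)
```

VMs inside one batch have no dependencies on each other and could be started in parallel, which is the property a metro cluster restart cannot guarantee and a runbook can.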
  14. 14. Business Continuity / Disaster Avoidance
  15. 15. Business Continuity / Disaster Avoidance • Disaster Avoidance is a preventive failover to another availability zone to avoid an anticipated disaster • Failover with service disruption – Option 1: SRM failover • Two independent vCenters in two independent SSO domains • Graceful shutdown of VMs • VMs restarted in the correct order in another region / availability zone • Failover without service disruption – Option 1: vSphere Metro Storage Cluster (vMSC) • Stretched LUN / datastore across availability zones (storage vendor specific technology) • VMware VM vMotion (CPU, RAM) – Option 2: vMotion without shared storage • VMware vMotion within a single vCenter or across two vCenters in a single SSO domain • VMware VM vMotion (CPU, RAM) • VMware shared-nothing Storage vMotion (vDisk) – Option 3: SRM cross-vCenter vMotion without shared storage • Two independent vCenters in two different SSO domains • VMware VM vMotion (CPU, RAM) • VMware shared-nothing Storage vMotion (vDisk)
  16. 16. Multisite High Availability (Metro Cluster) or Disaster Recovery? Infrastructure Design Qualities • Availability (← High Availability) • Manageability • Scalability • Performance • Security • Recoverability (← Disaster Recovery) • Cost
  17. 17. Multisite HA (Metro Cluster) or Disaster Recovery? • vSphere Metro Storage Cluster (vMSC) is a High Availability solution, great for – Protection against complete storage system failure – Non-disruptive Disaster Avoidance between availability zones – Protection against complete site failure with low RTO but unpredictable WRT and MTD • but Metro HA (vMSC) is not real Disaster Recovery because of – Unpredictable workload restart order – Single system (fault zone) stretched across sites – Very hard to test – Shorter distance protection (< ~60 km) • The real VMware Disaster Recovery solution is SRM – Predictable recovery plans – Testable recovery plans without impact on production – Longer distance protection (> ~60 km) • So, what technology should I use? – It always depends on business requirements (BIA) and what you want to achieve – Stretched Metro HA Cluster (vMSC) for HA2 and Disaster Avoidance – SRM for Disaster Recovery – Both solutions can be used together: a vSphere Metro Storage Cluster protected by SRM
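The rules of thumb on this slide can be written down as a small helper. This is only a sketch of the slide's reasoning under stated assumptions: the function, its parameters, and the hard 60 km cut-off (the slides say "~60 km") are illustrative, and a real decision has to come from the BIA.

```python
from typing import List

METRO_DISTANCE_LIMIT_KM = 60  # approximate figure quoted on the slides


def suggest_technologies(distance_km: float,
                         third_site_available: bool,
                         need_nondisruptive_avoidance: bool,
                         need_ordered_testable_recovery: bool) -> List[str]:
    """Map a few requirements to the technologies the slide associates with them."""
    suggestions: List[str] = []
    # vMSC needs short distances and a third zone for the arbiter.
    metro_feasible = distance_km < METRO_DISTANCE_LIMIT_KM and third_site_available
    if need_nondisruptive_avoidance and metro_feasible:
        suggestions.append("vMSC")
    # SRM provides predictable, testable recovery plans at any distance.
    if need_ordered_testable_recovery:
        suggestions.append("SRM")
    return suggestions


# Hypothetical scenarios.
print(suggest_technologies(20, True, True, True))
print(suggest_technologies(200, True, True, True))
print(suggest_technologies(20, False, True, False))
```

An empty result means these rules of thumb do not decide the case, for example when non-disruptive avoidance is wanted but no third site is available for the arbiter.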
  18. 18. Questions and Answers Twitter: @david_pasek Blog: http://blog.igics.com
  19. 19. Backup slides
  20. 20. Metro cluster (vMSC) topologies
  21. 21. Multisite vSphere Metro Storage Cluster: Physical Infrastructure Logical Design. Diagram labels: DC A (Storage Array A with Controller A1 and Controller A2, FC SW A1, FC SW A2, ESXi A1, ESXi A2, ETH SW A1, ETH SW A2, Router A); DC B (Storage Array B with Controller B1 and Controller B2, FC SW B1, FC SW B2, ESXi B1, ESXi B2, ETH SW B1, ETH SW B2, Router B); DC C (Arbiter / Witness / Tie-Breaker, Router C). Inter-site links: Ethernet DCI, Fibre Channel DCI.
  22. 22. Multisite vSphere Metro Storage Cluster: vMSC Logical Design, Uniform Mode, Active/Active storage. Diagram labels: DC A (Storage Array A with Controller A1 and Controller A2, ESXi A1, VM A); DC B (Storage Array 02 with Controller B1 and Controller B2, ESXi B1, VM B); DC C (Arbiter / Witness / Tie-Breaker). VMFS Datastore 01 on Distributed Storage Volume 01 with Coherent Cache; LUN Active on Storage A and Passive on Storage B. vSphere Metro Storage Cluster (vMSC) on top of Storage Metro Cluster (Active/Active). Paths are active everywhere (a special multipathing driver is required to identify optimal paths to the storage targets where the LUN is active). Path legend: Active Optimized Local Path, Active Optimized Remote Path.
  23. 23. Multisite vSphere Metro Storage Cluster: vMSC Logical Design, Non-Uniform Mode, Active/Active storage. Diagram labels: DC A (Storage Array A with Controller A1 and Controller A2, ESXi A1, VM A); DC B (Storage Array 02 with Controller B1 and Controller B2, ESXi B1, VM B); DC C (Arbiter / Witness / Tie-Breaker). VMFS Datastore 01 on Distributed Storage Volume 01; LUN Active on Storage A and Passive on Storage B. vSphere Metro Storage Cluster (vMSC) on top of Storage Metro Cluster (Active/Active). LUN paths Active Optimized in DC A; LUN paths Active Optimized in DC B. Path legend: Active Optimized Local Path.
  24. 24. Multisite vSphere Metro Storage Cluster: vMSC Logical Design, Uniform Mode, ALUA storage. Diagram labels: DC A (Storage Array A with Controller A1 and Controller A2, ESXi A1, VM A); DC B (Storage Array 02 with Controller B1 and Controller B2, ESXi B1, VM B); DC C (Arbiter / Witness / Tie-Breaker). VMFS Datastore 01 on Distributed Storage Volume 01; LUN Active on Storage A and Passive on Storage B. vSphere Metro Storage Cluster (vMSC) on top of Storage Metro Cluster (ALUA). LUN paths Active Optimized to DC A and Active Non-Optimized to DC B. Path legend: Active Optimized Local Path, Active Optimized Remote Path, Active Non-Optimized Local Path, Active Non-Optimized Remote Path.
  25. 25. Site Recovery
  26. 26. VMware SRM Terminology • SRM - Site Recovery Manager • Data replication types – HBR – Host Based Replication (async replication with a 15 min delta => RPO) – SBR – Storage Based Replication (sync/async replication; sync => impact on I/O write performance) • SRM constructs – Protection Group = group of VMs to protect as a single business service – Recovery Plan = runbook describing how the VMs in a Protection Group have to be started • Failover and failback process – Failover – Failover test – Re-protect – Failback
  27. 27. SRM Logical Design. Diagram labels: DC1 (ANT), DC2 (BUD). Site A Datacenter: vCenter Server, SRM, Authentication, SRA, vSphere Replication, SRM Plug-in, vSphere Client, workload VMs, hosts esx-01, esx-02 ... esx-X each with a vRA, SAN with LUN01, LUN02 ... LUNX (replicated LUNs and non-replicated LUNs). Site B Datacenter: vCenter Server, SRM, SRA, vSphere Replication, workload VMs, hosts esx-01, esx-02 ... esx-X each with a vRA, SAN with LUN01, LUN02 ... LUNX (replicated LUNs and non-replicated LUNs).
