Implementing Affordable Disaster Recovery with Hyper-V and Multi-Site Clustering

  1. Implementing Affordable Disaster Recovery with Hyper-V and Multi-Site Clustering
     Greg Shields, MVP
     Partner and Principal Technologist
     www.ConcentratedTech.com
  2. This slide deck was used in one of our many conference presentations. We hope you enjoy it, and invite you to use it within your own organization however you like.
     For more information on our company, including information on private classes and upcoming conference appearances, please visit our Web site, www.ConcentratedTech.com.
     For links to newly-posted decks, follow us on Twitter: @concentrateddon or @concentratdgreg
     This work is copyright © Concentrated Technology, LLC
  3. What Makes a Disaster?
     Which of the following would you consider a disaster?
     • A naturally-occurring event, such as a tornado, flood, or hurricane, impacts your datacenter and causes damage. That damage causes all processing in that datacenter to cease.
     • A widespread incident, such as a water leak or a long-term power outage, interrupts the functionality of your datacenter for an extended period of time.
     • A problem with a virtual host creates a "blue screen of death," immediately halting all processing on that server.
     • An administrator installs a piece of code that causes problems with a service, shutting down that service and preventing some action from occurring on the server.
     • An issue with power connections causes a server, or an entire rack of servers, to inadvertently and rapidly power down.
     DISASTER!
     JUST A BAD DAY!
  15. What Makes a Disaster?
      Your decision to "declare a disaster" and move to "disaster ops" is a major one.
      The technologies used for disaster protection are different from those used for high availability:
      • More complex.
      • More expensive.
      Failover and failback processes involve more thought. You might not be able to just "fail back" with the click of a button.
  16. A Disastrous Poll
      Where are we? Who here is…
      • Planning a DR environment?
      • In the process of implementing one?
      • Already enjoying one?
      • What's a "DR environment"???
  18. Multi-Site Hyper-V == Single-Site Hyper-V
      DON'T PANIC: Multi-site Hyper-V looks very much the same as single-site Hyper-V. Microsoft has not done a good job of explaining this fact!
      • Some Hyper-V hosts.
      • Some networking and storage.
      • Virtual machines that Live Migrate around.
      But there are some major differences too…
      • VMs can Live Migrate across sites.
      • Sites typically have different subnet arrangements.
      • Data in the primary site must be replicated to the DR site.
      • Clients need to know where your servers go!
  19. Constructing Site-Proof Hyper-V: Three Things You Need
      At a very high level, Hyper-V disaster recovery is three things:
      1. A storage mechanism
      2. A replication mechanism
      3. A set of target servers and a cluster to receive virtual machines and their data
      Once you have these three things, layering Hyper-V atop them is easy.
  20. Constructing Site-Proof Hyper-V: Three Things You Need
      (Diagram: Storage Device(s), Replication Mechanism, Target Servers)
  22. Thing 1: A Storage Mechanism
      Typically, two SANs in two different locations.
      • Fibre Channel, iSCSI, FCoE, heck, even JBOD.
      • Often a similar model or manufacturer. This similarity can be necessary (although not required) for some replication mechanisms to function properly.
      The backup SAN doesn't necessarily need to be the same size or speed as the primary SAN:
      • Replicated data isn't always the full set of data.
      • You may not need disaster recovery for everything.
      • DR environments: where old SANs go to die.
  24. Thing 2: A Replication Mechanism
      Replication between SANs must occur. There are two commonly-accepted ways to accomplish this…
      Synchronously
      • Changes are made on one node at a time.
      • Subsequent changes on the primary SAN must wait for an ACK from the backup SAN.
      Asynchronously
      • Changes on the backup SAN will eventually be written.
      • Changes are queued at the primary SAN to be transferred at intervals.
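The synchronous/asynchronous trade-off above can be sketched in a few lines of code. This is a minimal toy model, not any vendor's replication engine; all names and the latency figures are illustrative assumptions. It shows the two costs the deck describes: synchronous replication charges every write a round trip to the backup site, while asynchronous replication acknowledges instantly but leaves a queue of un-shipped writes that would be lost in a failure.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicatedSAN:
    """Toy model of primary-to-backup SAN replication (hypothetical names)."""
    mode: str                      # "sync" or "async"
    link_latency_ms: float = 20.0  # assumed one-way latency to the backup site
    backup: list = field(default_factory=list)  # blocks already on the backup SAN
    queue: list = field(default_factory=list)   # async: blocks not yet shipped

    def write(self, block):
        """Returns the extra latency the application observes for this write."""
        if self.mode == "sync":
            # Primary must wait for the backup's ACK: a round trip on every write.
            self.backup.append(block)
            return 2 * self.link_latency_ms
        # Async: acknowledge locally now, ship the block at the next interval.
        self.queue.append(block)
        return 0.0

    def flush(self):
        """Async interval transfer: drain the queue to the backup SAN."""
        self.backup.extend(self.queue)
        self.queue.clear()

    def rpo_loss_on_failure(self):
        """Blocks lost if the primary site dies right now."""
        return list(self.queue)

sync_san = ReplicatedSAN(mode="sync")
async_san = ReplicatedSAN(mode="async")
for b in ["b1", "b2", "b3"]:
    sync_san.write(b)
    async_san.write(b)

print(sync_san.rpo_loss_on_failure())   # [] -> no data loss, but every write paid the round trip
print(async_san.rpo_loss_on_failure())  # ['b1', 'b2', 'b3'] -> un-shipped writes are the RPO exposure
```

The gap between a flush and a failure is exactly the Recovery Point Objective question the next slide raises.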
  27. Class Discussion
      Which would you choose? Why?
      Synchronous
      • Assures no loss of data.
      • Requires a high-bandwidth, low-latency connection.
      • Write and acknowledgement latencies impact performance.
      • Requires shorter distances between storage devices.
      Asynchronous
      • Potential for loss of data during a failure.
      • Leverages smaller-bandwidth connections; more tolerant of latency.
      • No performance impact.
      • Potential to stretch across longer distances.
      Your Recovery Point Objective makes this decision…
  39. Thing 2½: Replication Processing Location
      There are also two locations for replication processing…
      Storage Layer
      • Replication processing is handled by the SAN itself.
      • Agents are often installed on virtual hosts or machines to ensure crash consistency.
      • Easier to set up, fewer moving parts. More scalable.
      • Concerns about crash consistency.
      OS / Application Layer
      • Replication processing is handled by software in the VM OS. This software also operates as the agent.
      • More challenging to set up, more moving parts. More installations to manage and monitor. Scalability and cost are linear.
      • Fewer concerns about crash consistency.
  40. Thing 3: Target Servers and a Cluster
      Finally, you need target servers and a cluster in the backup site.
  44. Clustering's Sordid History
      Windows NT 4.0
      • Microsoft Cluster Service, "Wolfpack."
      • "As the corporate expert in Windows clustering, I recommend you don't use Windows clustering."
      Windows 2000
      • Greater availability and scalability. Still painful.
      Windows 2003
      • Added iSCSI storage to traditional Fibre Channel.
      • SCSI Resets still used as a method of last resort (painful).
      Windows 2008
      • Eliminated use of SCSI Resets.
      • Eliminated the full-solution HCL requirement.
      • Added the Cluster Validation Wizard and pre-cluster tests.
      • Clusters can now span subnets (ta-da!).
      Windows 2008 R2
      • Improvements to the Cluster Validation Wizard and Migration Wizard.
      • Additional cluster services.
      • Cluster Shared Volumes (!) and Live Migration (!)
  45. So, What IS a Cluster?
  46. So, What IS a Cluster?
      (Diagram: quorum drive and storage for Hyper-V VMs, shared by the cluster nodes)
  47. So, What IS a Multi-Site Cluster?
  50. Quorum: Windows Clustering's Most Confusing Configuration
      Ever been to a Kiwanis meeting…?
      A cluster "exists" because it has quorum among its members. That quorum is achieved through a voting process.
      • Different Kiwanis clubs have different rules for quorum.
      • Different clusters have different rules for quorum.
      If a cluster "loses quorum," the entire cluster shuts down and ceases to exist until quorum is regained.
      • This is much different from a resource failover, which is the reason clusters are implemented in the first place.
      • Multiple quorum models exist.
  51. Four Options for Quorum
      • Node and Disk Majority
      • Node Majority
      • Node and File Share Majority
      • No Majority: Disk Only
  55. Quorum in Multi-Site Clusters
      Microsoft recommends the Node and File Share Majority model for multi-site clusters.
      • This model provides the best protection against a full-site outage.
      • Surviving a full-site outage requires a file share witness in a third geographic location.
  56. Quorum in Multi-Site Clusters
      Use the Node and File Share Majority quorum.
      • Prevents an entire-site outage from breaking quorum.
      • Enables creation of multiple clusters if necessary.
      (Diagram: third site hosting the witness server)
  57. I Need a Third Site? Seriously?
      Here's where Microsoft's ridiculous quorum notion gets unnecessarily complicated…
      What happens if you put the quorum's file share in the primary site?
      • The secondary site might not automatically come online after a primary site failure.
      • Votes in the secondary site < votes in the primary site.
      Let's count on our fingers…
  58. I Need a Third Site? Seriously?
      What happens if you put the quorum's file share in the secondary site?
      • A failure in the secondary site could cause the primary site to go down.
      • Votes in the secondary site > votes in the primary site.
      More fingers…
      This problem gets even weirder as time passes and the number of servers in each site changes.
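The finger-counting above is just majority arithmetic, and it can be checked in code. The sketch below is a simplified model of Node and File Share Majority voting (every node gets one vote, the witness share gets one); the function names and the two-nodes-per-site layout are illustrative assumptions, not any Microsoft API. It reproduces both bad placements from the slides and shows why the witness belongs in a third site.

```python
def has_quorum(surviving_votes, total_votes):
    """Majority quorum: strictly more than half of all votes must survive."""
    return surviving_votes > total_votes // 2

def cluster_survives(site_nodes, witness_site, failed_site):
    """Node and File Share Majority, simplified.

    site_nodes: dict mapping site name -> node count (one vote per node).
    The file share witness adds one vote, counted only if its site is up.
    Returns True if the nodes outside the failed site retain quorum."""
    total = sum(site_nodes.values()) + 1  # +1 vote for the file share witness
    surviving = sum(n for site, n in site_nodes.items() if site != failed_site)
    if witness_site != failed_site:
        surviving += 1
    return has_quorum(surviving, total)

sites = {"primary": 2, "secondary": 2}  # assumed layout: two nodes per site

# Witness in the primary site: lose the primary, and the secondary holds 2 of 5 votes.
print(cluster_survives(sites, witness_site="primary", failed_site="primary"))   # False

# Witness in the secondary site: lose the secondary, and the primary holds 2 of 5 votes.
print(cluster_survives(sites, witness_site="secondary", failed_site="secondary"))  # False

# Witness in a third site: whichever site fails, the survivors hold 3 of 5 votes.
print(cluster_survives(sites, witness_site="third", failed_site="primary"))     # True
```

With the witness in either datacenter, a failure of that datacenter strands the survivors below a majority; only the third-site witness lets either site survive the loss of the other.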
  59. I Need a Third Site? Seriously?
      (Diagram: third site for the witness server)
  60. DEMO: Multi-Site Clustering
  61. Multi-Site Cluster Tips/Tricks
      Install servers so that your primary site always contains more servers than your backup sites.
      • Eliminates some problems with quorum during a site outage.
  64. Multi-Site Cluster Tips/Tricks
      Manage Preferred Owners & Persistent Mode options.
      • Make sure your virtual machines fail over to servers in the same site first.
      • But also make sure they have the option of failing over elsewhere.
      Consider carefully the effects of Failback.
      • Failback is a great solution for resetting after a failure.
      • But Failback can be a massive problem-causer as well. Its effects are particularly pronounced in multi-site clusters.
      • Recommendation: turn it off (until you're ready).
  68. Multi-Site Cluster Tips/Tricks
      Resist creating clusters that support other services.
      • A Hyper-V cluster is a Hyper-V cluster is a Hyper-V cluster.
      Use disk "dependencies" as Affinity/Anti-Affinity rules.
      • Hyper-V all by itself doesn't have an elegant way to affinitize.
      • Setting disk dependencies against each other is a work-around.
      Add servers in pairs.
      • Ensures that the loss of a server won't cause a split brain between sites.
      • This is less of a problem with the File Share Witness configuration.
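Since Hyper-V of this era has no native anti-affinity switch, the disk-dependency trick above is really enforcing one invariant: certain VM pairs must never land on the same host. A placement check like the following can express that invariant. This is a generic sketch with hypothetical VM and host names, not the cluster's actual mechanism; it only shows what the dependency work-around is trying to guarantee.

```python
def placement_violations(vm_host, anti_affinity_pairs):
    """Return the anti-affinity pairs that ended up co-located.

    vm_host: dict mapping VM name -> host currently running it.
    anti_affinity_pairs: iterable of (vm_a, vm_b) that must not share a host,
    e.g. two domain controllers, or both nodes of a guest cluster."""
    return [(a, b) for a, b in anti_affinity_pairs
            if vm_host.get(a) is not None and vm_host.get(a) == vm_host.get(b)]

# Hypothetical placement after a failover squeezed VMs onto fewer hosts:
placement = {"DC01": "HOST1", "DC02": "HOST1", "SQL01": "HOST2"}
print(placement_violations(placement, [("DC01", "DC02")]))  # [('DC01', 'DC02')]
```

The disk-dependency trick makes the cluster itself refuse such placements; a check like this is useful for auditing that the trick actually held after a site failover.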
  69. Multi-Site Cluster Tips/Tricks
      Segregate traffic!!!
  71. Most Important!
      Ensure that networking remains available when VMs migrate from the primary site to the backup site.
      Clustering can span subnets! This is good, but only if you plan for it…
      • Remember that crossing subnets also means changing the IP address, subnet mask, gateway, etc., at the new site. This can be done automatically using DHCP and dynamic DNS, or it must be updated manually.
      • DNS replication is also a problem. Clients will require time to update their local cache. Consider reducing the DNS TTL or clearing the client cache.
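The TTL advice above is simple arithmetic worth making explicit: a client can keep resolving the dead site's address for up to a full TTL after the updated record is visible, on top of failure detection and DNS replication delays. The sketch below is a back-of-the-envelope upper bound with illustrative numbers, not a measurement of any real deployment.

```python
def worst_case_reconnect_s(dns_ttl_s, dns_replication_s=0, detection_s=0):
    """Upper bound on how long a client may keep chasing the old site's address.

    detection_s:       time to notice the failure and update the record.
    dns_replication_s: time for the new record to reach the client's DNS server.
    dns_ttl_s:         a stale cached record can survive a full TTL beyond that.
    """
    return detection_s + dns_replication_s + dns_ttl_s

# Assumed numbers: 20-minute TTL vs a 5-minute TTL trimmed ahead of DR,
# with 60 s of DNS replication and 30 s of failure detection in both cases.
print(worst_case_reconnect_s(1200, dns_replication_s=60, detection_s=30))  # 1290
print(worst_case_reconnect_s(300, dns_replication_s=60, detection_s=30))   # 390
```

Trimming the TTL before a planned site move shrinks the client-visible outage window; clearing client caches removes the TTL term entirely for the machines you control.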
