4. Disaster Recovery Continues To Be Challenging
[Slide diagram: three DR pain points]
Complex: recovering an application means coordinating apps, hosts, storage and network.
Expensive: software, hosts, storage and facilities add up to over $10K per app.
Unreliable: DR is typically tested only once a year.
5. Traditional Disaster Recovery
Infrastructure Challenges: Compute, Networking and Storage
[Diagram: identical Web/App/DB stacks at the protected and recovery sites, each reached over WAN/Internet]
Compute – deployment/recovery: automated and reliable
Storage – deployment/recovery: automated and reliable
Network – deployment/recovery: manual, complex, error prone
6. Traditional Disaster Recovery
Infrastructure Challenges: Site Connectivity
[Diagram: protected and recovery sites, each hosting VMs on subnets 10.1.1.0/24, 10.1.2.0/24 and 10.1.3.0/24 in their own network fabric, joined by a Data Center Interconnect (VPLS, Overlay Transport, L2 extensions) alongside the WAN/Internet uplinks]
Complex and expensive DCI at the WAN edge.
At the recovery site you must recreate L2 (re-IP or preserve the IP space), recreate L3, and recreate FW and LB policies.
7. Traditional Disaster Recovery
[Diagram: organizational silos around the same Web/App/DB stack. The network admin owns networking and security policies; the compute/virtualization admin owns compute and storage recovery policies. Each silo has its own scripts/APIs/tools and its own recovery plan – so who owns recovery management?]
9. vSphere Site Recovery Manager (SRM) Components
[Diagram: Site Recovery Manager alongside vCenter Server, on top of VMware vSphere, servers and storage, protecting the virtual machines]
Site Recovery Manager
• Manages recovery plans
• Automates failovers and failbacks
• Tightly integrated with vCenter and replication
Replication Options (required at both the protected and recovery sites)
• Storage-based replication (3rd party): provided by the replication vendor; integrated via replication adapters created, certified and supported by that vendor
• vSphere Replication: part of the vSphere platform; replicates virtual machines between vSphere clusters
SRM covers compute and storage – but what about networking?
10. Applying Benefits of Network Virtualization to DR
1. Decouple: build the recovery site independent of the protected site (equipment and topology) – virtual decoupled from physical.
2. Reproduce: recreate application networking and security decoupled from the underlying infrastructure.
3. Automate: drive the entire recovery process with APIs, vRO and other tools, turning network operations into cloud operations (a minimal API sketch follows below).
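To make step 3 concrete, here is a minimal Python sketch of API-driven provisioning: creating a universal logical switch over the primary NSX Manager's REST API. The hostname, credentials and tenant ID are hypothetical, and the virtual-wire endpoint under the universal transport scope is modeled on the NSX-v 6.2 API style – verify the exact path and payload against the NSX API guide for your build.

import requests

# Hedged sketch: create a universal logical switch via the NSX REST API.
# Hostname/credentials are placeholders; the endpoint follows the NSX-v
# virtual-wire API under the universal scope - verify it for your build.
NSX_MGR = "https://nsx-mgr-primary.example.com"  # hypothetical primary NSX Manager
AUTH = ("admin", "changeme")                     # use vault-backed credentials

payload = """<virtualWireCreateSpec>
  <name>Web-ULS</name>
  <description>Universal logical switch for the Web tier</description>
  <tenantId>dr-demo</tenantId>
</virtualWireCreateSpec>"""

resp = requests.post(
    NSX_MGR + "/api/2.0/vdn/scopes/universalvdnscope/virtualwires",  # assumed path
    data=payload,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,  # lab only; keep TLS verification on in production
)
resp.raise_for_status()
print("Created universal logical switch, id:", resp.text)

Because the object is universal, the Universal Synchronization Service replicates it to the secondary NSX Managers – the recovery-site "placeholder network" exists before any failover happens.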
11. Disaster Recovery with SRM + NSX
[Diagram: paired sites. The protected and recovery sites each run vSphere with local storage, a vCenter Server, SRM and an NSX Manager; the SRM instances and NSX Managers are paired across sites]
12. Disaster Recovery with NSX + SRM
Active-Standby Application Pair
[Diagram: a Web/App/DB application active at the protected site with a replicated standby copy at the recovery site. VC+SRM pairs at both sites, Cross-vC NSX spanning them, active N-S at the protected site and standby N-S at the recovery site, each with site-local N-S connectivity to the WAN/Internet]
13. A Mixed Deployment
Active-Active, DR and Stretched Application Deployment
[Diagram: vCenter-A and vCenter-B sites (<150ms apart), each with local storage, SRM and its own N-S connectivity to the WAN/Internet, joined by Cross-vC NSX logical networks with L2/L3/DFW. The mix includes an active-active Web/App/DB pair, a full-failover application, a partial failover (Web and App only) and a stretched deployment]
15. L3 Network/IP Fabric
NSX Logical Networks (Pre NSX 6.2)
[Diagram: three separate NSX domains. vCenter A, B and C each pair a vC-with-NSX-Manager with its own NSX Controller cluster; logical switches and a distributed logical router exist only within each local VC inventory. A single NSX domain can span more than one site, but logical networks cannot cross vCenters]
16. L3 Network/IP Fabric
Cross vCenter NSX (6.2)
[Diagram: one primary NSX Manager (with vCenter A) and secondaries (B through H). Universal objects are configured on the primary (NSX UI & API) and pushed to the secondaries by the Universal Synchronization Service (USS). A shared Universal Controller Cluster spans the local VC inventories, enabling universal logical switches, a Universal Distributed Logical Router and universal DFW]
17. Cross vC NSX Egress Optimized Routing: Local Egress with Locale-ID
[Diagram: Site A and Site B connected by an L3 network, each with its own vCenter Server. VM1, VM2 and VM3 sit on Universal Logical Switch A behind the Universal Distributed Logical Router. Each site has a control VM with Local Egress enabled, its own NSX Edge GW and uplink network; route updates carry a Locale ID (NSX-A or NSX-B), so hosts receive only their local site's egress routes]
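Because Locale-ID is just an attribute on a UDLR, cluster or host, flipping egress amounts to one REST call. A hedged Python sketch follows: the cluster moref, credentials and the locale-ID resource path are assumptions modeled on the NSX-v API style (confirm the exact endpoint in the NSX 6.2 API guide), and the UUID is a placeholder for the recovery-site NSX Manager's ID.

import requests

# Hedged sketch: point a cluster's hosts at the recovery site's egress by
# setting their Locale-ID to the recovery NSX Manager's UUID. After this,
# the Universal Controller Cluster sends those hosts only the routes
# learned by the control VM with the matching Locale-ID.
NSX_MGR = "https://nsx-mgr-primary.example.com"           # hypothetical
CLUSTER_ID = "domain-c101"                                # hypothetical cluster moref
RECOVERY_LOCALE = "2d8f1e6a-0000-0000-0000-000000000002"  # placeholder UUID

resp = requests.put(
    f"{NSX_MGR}/api/2.0/nwfabric/localeid/{CLUSTER_ID}",  # assumed path
    data=f"<localeId>{RECOVERY_LOCALE}</localeId>",
    headers={"Content-Type": "application/xml"},
    auth=("admin", "changeme"),
    verify=False,  # lab only
)
resp.raise_for_status()
print("Locale-ID updated; cluster now egresses via the recovery site")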
19. Design Consideration
• Partial failover (VM1 fails)
– Keep using the existing N-S from the protected site
• Cost of cross-site link traversal is low
• No change needed to physical infrastructure/connectivity
• No rerouting or specific-route advertisement needed
• Zero-touch partial failover
• Full failover (VM1 and VM2 fail)
– Use the recovery site N-S
• Egress localization managed through NSX Locale-ID
• Ingress managed through GSLB and route advertisement
[Diagram: myapp.com on 10.1.22.10/29 spans a U-LS behind the U-DLR across both sites; VM1 has failed over to the recovery site while VM2 remains protected, and each site fronts the U-DLR with its own NSX ESG]
20. DR with NSX+SRM: Initial Set-up
[Diagram: initial set-up. Web, App and DB VMs on Web/App/DB U-LSs behind a Universal DLR with U-DFW, SRM at both sites. Locale-ID is set to the protected site everywhere; the protected-site NSX ESG (ECMP) carries the active N-S path to the site-local router while the recovery-site ESG is standby. The protected-site Universal Control VM allows the prefix list 10.1.1.0/24, 10.1.2.0/24, 10.1.3.0/24; the recovery-site Control VM denies the same prefixes]
21. DR with NSX+SRM: Planned Migration/Partial Failure
[Diagram: planned migration/partial failure. Web and App have failed over to the recovery site while DB stays at the protected site; E-W traffic continues across the Universal DLR, U-LSs and U-DFW. N-S remains active at the protected site, whose Control VM still allows the prefix list 10.1.1.0/24, 10.1.2.0/24, 10.1.3.0/24 while the recovery site still denies it]
22. DR with NSX+SRM: Complete Application Failure
[Diagram: complete application failure. Web, App and DB all run at the recovery site. Locale-ID is set to the recovery site, app reachability is no longer advertised from the protected site, and the recovery-site ESG provides the restored, now-active N-S path: the protected-site Control VM denies the prefix list 10.1.1.0/24, 10.1.2.0/24, 10.1.3.0/24 while the recovery-site Control VM allows it. A sketch of this flip follows below]
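The set-up, partial-failure and complete-failure slides above reduce to one invariant: the application prefixes are advertised from exactly one site at a time, and Locale-ID decides which egress the hosts use. A small self-contained Python model of that state machine – purely illustrative, no NSX API calls:

# Conceptual model of the N-S flip in slides 20-22. The prefix lists on
# the two UDLR control VMs are mirror images; a complete failover swaps
# them and moves the Locale-ID, so the application prefixes are only ever
# advertised from one site. A partial failover (slide 21) changes nothing
# here: the same site keeps advertising.
APP_PREFIXES = ["10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]

state = {
    "protected": {"prefix_filter": "allow", "ns_role": "active"},
    "recovery":  {"prefix_filter": "deny",  "ns_role": "standby"},
    "locale_id": "protected",
}

def complete_failover(state):
    """Swap advertisement and egress to the recovery site (slide 22)."""
    state["protected"].update(prefix_filter="deny", ns_role="down")
    state["recovery"].update(prefix_filter="allow", ns_role="active")
    state["locale_id"] = "recovery"

def advertised_from(state):
    return [site for site in ("protected", "recovery")
            if state[site]["prefix_filter"] == "allow"]

assert advertised_from(state) == ["protected"]   # steady state (slide 20)
complete_failover(state)
assert advertised_from(state) == ["recovery"]    # post-failover (slide 22)
print("Prefixes", APP_PREFIXES, "now advertised from:", advertised_from(state))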
23. DR with NSX+SRM with GSLB
[Diagram: a GSLB answers DNS queries for www.myweb.com in front of both sites. The Web VM (IP 10.1.3.1) runs at the protected site with an SRM placeholder at the recovery site. Protected-site Web-VIP 20.1.1.1 (SNAT to 10.1.3.1) is primary; recovery-site Web-VIP 30.1.1.1 (SNAT to 10.1.3.1) takes over post-DR. Each site's NSX ESG joins a transit U-LS to a transit VLAN, and the primary/secondary NSX Managers serve the 10.1.1.0/24, 10.1.2.0/24, 10.1.3.0/24 application networks behind the Universal DLR]
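The GSLB's job in this picture is a health-checked DNS decision: hand out 20.1.1.1 while the protected site answers, 30.1.1.1 after failover. A minimal Python stand-in for that monitor logic (the VIPs come from the slide; a real GSLB uses its own health monitors and authoritative DNS):

import socket

PRIMARY_VIP = "20.1.1.1"   # protected-site Web-VIP (from the slide)
DR_VIP = "30.1.1.1"        # recovery-site Web-VIP (from the slide)

def vip_is_healthy(vip, port=443, timeout=2.0):
    """TCP-connect probe standing in for a GSLB health monitor."""
    try:
        with socket.create_connection((vip, port), timeout=timeout):
            return True
    except OSError:
        return False

def gslb_answer():
    """Return the A record the GSLB would serve for www.myweb.com."""
    return PRIMARY_VIP if vip_is_healthy(PRIMARY_VIP) else DR_VIP

print("www.myweb.com ->", gslb_answer())

Because both VIPs SNAT to the same server IP (10.1.3.1), the web VM never re-IPs – ingress moves at the DNS layer while NSX preserves the addressing underneath.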
24. Site Failure and Recovery Steps
[Diagram: recovery of the NSX management plane after losing the primary site]
1. Primary NSX Manager or Universal Controller Cluster suffers an extended outage – no data plane impact*
2. Promote the existing Secondary NSX Manager to Primary
3. Deploy a new Universal Controller Cluster
4. Universal Controller Cluster config is pushed to the ESXi hosts managed by the former Secondary
5. Recover the old Primary NSX Manager as a Secondary
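Step 2, promoting the surviving secondary, is an API call against that secondary NSX Manager. The sketch below shows only the shape of the operation: the universal-sync role endpoint and its action parameter are assumptions, so take the exact path from the NSX 6.2 API guide for your build.

import requests

# Hedged sketch of step 2: promote the surviving secondary NSX Manager to
# primary after the original primary site is lost. ASSUMED endpoint - the
# real universal-sync role API path/action must be confirmed in the NSX
# 6.2 API guide before use.
SECONDARY_MGR = "https://nsx-mgr-recovery.example.com"  # hypothetical

resp = requests.post(
    SECONDARY_MGR + "/api/2.0/universalsync/configuration/role",  # assumed path
    params={"action": "set-as-primary"},                          # assumed action
    auth=("admin", "changeme"),
    verify=False,  # lab only
)
resp.raise_for_status()
print("Secondary promoted; now deploy a new Universal Controller Cluster (step 3)")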
26. DR Automation
• Cross vCenter NSX synchronizes logical networks – no automation needed
– L2, L3, firewall/security
• What about the N-S Edge Services Gateway?
– Not synchronized between the primary and secondary NSX Managers
– N-S components are site specific (physical connectivity, etc.)
– Can be automated if needed
• NSX components recovery (complete site failure)
– Manual, or via APIs and vRO workflows (a runbook skeleton follows below)
– Doesn't impact the RTO
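Pulling the pieces together, a vRO workflow or script simply sequences the calls sketched earlier. Here is an illustrative Python runbook skeleton with each step as a stub function; the step names and their ordering are illustrative (in practice each step gates on the previous one's status, and the SRM recovery plan is triggered from SRM itself):

# Illustrative DR runbook skeleton: the NSX side of a complete-site
# failover. Each step would wrap one of the API calls sketched earlier;
# here they are logging stubs so the sequencing itself is runnable.
import datetime

def log(msg):
    print(f"{datetime.datetime.now().isoformat()}  {msg}")

def deny_prefixes_at_protected():
    log("1. Deny app prefix list on protected-site UDLR control VM")

def allow_prefixes_at_recovery():
    log("2. Allow app prefix list on recovery-site UDLR control VM")

def set_locale_id_to_recovery():
    log("3. Set Locale-ID on recovery clusters (egress flip)")

def enable_recovery_esg():
    log("4. Bring up recovery-site ESG / verify routing adjacency")

def run_srm_recovery_plan():
    log("5. Trigger the SRM recovery plan (placeholder VMs -> live VMs)")

RUNBOOK = [
    deny_prefixes_at_protected,
    allow_prefixes_at_recovery,
    set_locale_id_to_recovery,
    enable_recovery_esg,
    run_srm_recovery_plan,
]

for step in RUNBOOK:
    step()  # a real workflow would check status and stop on failure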
27. Summary: NSX vs. Traditional DR Solutions
Traditional DR Solutions                             | NSX
-----------------------------------------------------|------------------------------------------------------
Tied to physical infrastructure (primary/DR)         | Decoupled from physical infrastructure
Address preservation in the physical infrastructure  | Logical networks preserve IP addressing
Requires L2 extension to preserve IP addressing      | Overlay networks with distributed L2, L3 and firewall
Manual DR set-up and run book                        | API-based DR and run book automation
No integration with compute/storage                  | Integration/validation with VMware SRM
29. DR with NSX+SRM: Initial Set-up
[Diagram: demo build-up of the initial set-up from slide 20 – Locale-ID set to the protected site at both sites, U-DFW on the Web/App/DB U-LSs, active N-S at the protected site and standby N-S at recovery. At this stage the recovery-site U-DLR LIF is disconnected]
34. DR with NSX+SRM: Set-up Configured
[Diagram: set-up configured – same topology as above, with the protected-site NSX Edge now advertising reachability for the application networks over the active N-S path; the recovery-site NSX ESG remains standby]
39. DR with NSX+SRM: Set-up Configured
[Diagram: the same configured set-up, highlighting the recovery-site side – its NSX ESG stays standby while the protected-site edge keeps advertising reachability]
42. Physical Design – Initial Flow
[Diagram: Web, App and DB VMs at the protected site on the Universal Web, Application and Database Logical Switches (UDLR interfaces 10.114.220.1, .9 and .17 on /29s; VM addresses 10.114.220.2, .10 and .18) behind the Universal Distributed Logical Router. UDLR Control VM1 (mgmt IP 10.114.9.203, protocol address 10.114.220.27/29) and Control VM2 (mgmt IP 10.114.8.17, protocol address 10.114.220.35/29) peer over OSPF normal area 51 with the Protected and Recovery Perimeter ESGs via transit /29 links (10.114.220.25/.26 and 10.114.220.33/.34); the perimeter ESGs run OSPF backbone area 0 to ToR-Protected and ToR-Recovery over VLANs 1010 and 1020 (uplinks in 10.114.208.82/28, 10.114.208.86/28, 10.114.212.198/28 and 10.114.220.193/28) across the physical network]
43. Physical Design – Stretched Application
[Diagram: the same physical design with the application stretched across both sites (tiers split between protected and recovery). OSPF peering and addressing are unchanged, so traffic still enters and exits via the protected site]
44. Physical Design – Complete Application Failure
[Diagram: Web, App and DB have all failed over to the recovery site. Locale-ID is set to the recovery site and OSPF comes up on the Recovery Perimeter ESG, so the application prefixes are now advertised from the recovery side; addressing on the universal logical switches is unchanged]
45. Takeaways
• NSX simplifies disaster recovery
• NSX and SRM integration – an SDDC approach to DR
• Cross-vC networking and security with NSX 6.2
• No need for expensive WAN-edge connectivity to preserve application addressing
• DR runbook automation with vRO
Editor's Notes
We all know that disaster recovery is an expensive and complex task. There are too many moving pieces: storage, network and compute must all be put together to recover an application and the infrastructure stack it runs on. This is not only expensive and complex but extremely hard to do reliably.
Then of course there is the challenge of ensuring correctness and reliability – how do you ensure that in the event of a disaster, things will work as designed?
Most customers continue to look for ways to streamline this and have only been able to build DR around a fraction of their tier-1 applications. In this session we will focus on how you can reduce the networking complexity of DR and create a single unified architecture for compute, storage and recovery.
Let's look at each of these components of disaster recovery and how they are recovered.
Compute is the easiest one – the infrastructure is completely virtualized, so you can have Dell on one side and HP on the other and the app doesn't care. You can recover with one click using tools like SRM. Very automated and very reliable.
Storage isn't quite as simple as compute – you still need matching infrastructure if you use array-based replication. Nevertheless, it is still straightforward and automated, also with SRM.
Now the networking. First you have the data center fabric that determines the L2/L3 boundary of your application, and then you have all the L4-L7 services such as firewalls and load balancers. If you were to build the same application network on the other side, you would have to preserve the application's IP addresses – and since FW/LB policies are written against IP addresses, those have to be preserved as well. All of this is typically done via scripts/CLI – almost impossible to get right when you rebuild similar infrastructure at the recovery site.
Does that sound like an accurate description of your networking and security challenges?
To add to the networking and security challenges –
You not only have to retain the application topology and policies between the data centers but also ensure that:
the application IP space is preserved, and
the application can connect to its other components to support partial failover (or planned migration) of some components – a stretched subnet.
The technologies available to ensure IP address preservation and connectivity, such as VPLS, OTV or L2-over-L3, are designed for networking specialists.
Not to mention these are typically built on top of WAN edge routers – creating cost and complexity that is outside the realm of a DR/virtualization admin.
In addition to the technology challenges there are organizational challenges that complicate the matter. You generally have virtualization/compute admins responsible for the compute/storage piece of the infrastructure. They may be using SRM with a specific recovery plan.
On the other hand, for the network piece – since there are several moving parts – there is a network admin responsible for IP configuration, security policies and WAN edge connectivity.
This means you end up with a siloed recovery management process, leading to more errors and recovery failures.
Let me take a pause here – do these problems look like your DR environment?
Now apply the benefits of network virtualization to DR –
Click:
First, you can create your recovery site independent of the protected site. There is no need to reproduce exactly the same infrastructure at both sites – believe me, I have seen deployments doing exactly that.
Click
Since you can create the entire logical topology decoupled from the underlying infrastructure, you can build the same logical networking and security across the primary and secondary sites.
Click
API-driven consumption allows runbook and DR automation with tools like vRO.
1. Now you are applying the benefits of both compute and network virtualization to create a single, simplified solution integrating all the pieces – network, compute and storage – into one DR plan.
2. Just as you have a vC and SRM pair across the protected and recovery sites, you have NSX paired as well. I will explain in more detail how this all works together.
The biggest difference with this approach: just like placeholder VMs with SRM, you now have a placeholder network ready for connectivity when the application fails over. You are no longer tied to the physical infrastructure for recovery.
Let's take a more detailed look at how this integration is achieved.
1. At a very high level you have two sites and an application that is active and serving users. Of course you have basic storage replication (array-based or vSphere). Each site can send N-S traffic independently to avoid a single point of failure.
Click
2. Now deploy SRM: you have VC-SRM pairs at each site, creating a set of placeholder VMs ready to take over if the protected VMs fail.
Click
3. Let's add NSX to this picture in Cross-vC mode – it creates the L2, L3 and security for the application at both the primary and secondary sites. Cross-vCenter NSX is a new feature introduced in NSX 6.2 and I will briefly explain what it does later.
Click
4. When the application fails over, the entire logical topology and security policy already exist. SRM maps the application to this placeholder network, the application starts using the recovery site to serve users, and with the right routing constructs you flip the traffic to that site.
What I showed you earlier is what I call the full failover scenario, where the entire application fails over with all its components – Web, App and DB in this case.
Click.
Most DR scenarios are generally not that simple – many customers we spoke to actually use what we call partial failover, to better utilize capacity or for other reasons. In this scenario Web and App are now running at the recovery site while DB is still running at the protected site. This requires not only the network to exist at the DR site but also cross-site connectivity, so the application can talk to the DB while fully preserving the security policies. With NSX 6.2 you can do this as long as you have an L3 fabric connecting the two sites.
Click
Lastly, one more scenario for completeness – the case where an application and all its components are active on BOTH sites. A lot of new applications are fully capable of operating like this, and there is no reason why you can't do it with NSX.
In summary, whether it is full failover, partial failover or active-active – you can build that DR design with NSX.
On top of the simple DR scenario there are additional nuances – a customer may throw in words like active-active or stretched application along with DR, and things suddenly start looking complicated in terms of what to position. Is this DR, or is this active-active, or both?
There are three scenarios in the mix here:
1. Active-standby: classic DR.
2. Active-active: two completely independent instances of the application behind two active GSLB VIPs. These instances run independently and are not paired, so if one instance fails the other takes up the load (like the bottommost application pair shown above).
3. Stretched networks: in many instances a customer will ask for some kind of cross-site connectivity so an application can run components on two sites. This is a stretched deployment (sometimes accomplished with stretched L2, etc.) – the blue scenario above.
Then of course there is DR itself, with active and standby instances, as seen in the green scenario above.
This quickly adds complexity to the discussion:
- now you have stretched networks,
- DR,
- and active-active with GSLB,
not to mention the challenges around failure scenarios, N-S route advertisement, and local egress and ingress optimization.
We will not go through all of this, but during the course of this presentation we will cover some of these and how to position NSX in these deployments.
The important thing to keep in mind: with overlays, NSX, programmability and automation you can solve a lot of these problems elegantly without the "stretched L2" challenges. That's where our strengths are – we don't want to lead these discussions as a DCI replacement.
Before I go into the details of how you build each of these pieces, let me go over NSX 6.2 and the new features that enable all the goodness I talked about earlier.
If you are familiar with NSX you will clearly understand what this is about. This was the world before NSX 6.2:
Before NSX 6.2, the NSX Manager within each vCenter could only create logical networks across the hosts in the local vC inventory.
Logical networks within each of these vCs were islands. If you needed a logical network across VMs on hosts in two different vCs, there was no way to do it.
With NSX 6.2 that's changing...
Application boundaries were constrained by vC. The one-to-one relationship between NSX and vC meant logical networks could not span vCs.
This really constrained deployments because it created silos between vCs – true for a single site as well as multisite. Of course you could solve this by having a single vC across all your deployments across all sites, however that is often not possible.
So why do customers deploy multiple vCs?
Scale limitations of a single vC.
Multiple sites, with one vC per site.
NSX multi-vC breaks these application silos and allows an application to span vCs, in one or multiple locations.
In NSX 6.2:
NSX Manager and vC still have a 1:1 relationship.
However, there are now new roles for NSX Managers – primary and secondary.
[Universal Controller] The controller becomes a shared object available to every secondary – every vC's hosts now see a unified control plane via the Universal Controller. This allows the establishment of universal logical switches, routers and DFW rules.
There is namespace separation between universal multi-vC entities and any local entities each NSX Manager may have.
This doesn't mean you can't have local entities – those continue to work as before, but now local and universal entities coexist.
[Universal Sync Service] All universal configuration is performed at the primary and is then synchronized to the secondary NSX Managers. This provides good redundancy and fault tolerance (we will see how later): if the primary fails, one of the secondary NSX Managers can be designated as the new primary without any loss of configuration. This functionality plays a critical role in recovery from a complete site failure.
We are not changing the 1:1 relationship between vC and NSX – however, we are creating universal entities: universal logical switches, logical routers and universal DFW.
These entities cross the vCenter boundary.
So how does this work? You have a Universal Controller Cluster that facilitates these global entities, and local managers that continue to be 1:1 related to vC.
NSX 6.2 introduces the concept of Locale ID, which allows route selection based on the physical location of the host. This feature controls your N-S traffic coming in and out of a site!
Click
2. You enable Local Egress and assign each host a Locale-ID, say NSX-A and NSX-B – which can be done per host, per cluster or per DLR.
Click
3. [Locale matching] Each site has its own universal control VM, which learns local routes from the local ESG and sends them to the controller; the controller then sends routes only to ESXi hosts with a matching locale – so hosts in Site A receive only the routes learned by the control VM in Site A, and so on.
This allows all the VMs in Site A to egress via the ESG in Site A and the VMs in Site B to egress via the ESG in Site B.
From an E-W perspective each site sees a single view of routing via the DLR, but from an N-S perspective the VMs at each site select the route specific to that site via that site's ESG.
So for a single Universal DLR you now have an egress ESG per site, while E-W traffic between the VMs continues to work just like a regular DLR.
We will see later how this capability helps with selecting a specific egress on failover and recovery.
It is important that you understand these concepts, since the entire design is based on them.
When Local Egress is enabled, the NSX Controller only sends routes to ESXi hosts with a matching Locale ID.
Using a site-specific uplink, each site can have a local routing configuration. This allows NSX 6.2 to support up to 8 sites with local egress.
Locale ID can also be set per UDLR, per cluster or per host if the same NSX Manager is used across multiple sites.
How is this relevant to disaster recovery?
For the same DLR, you can select the egress based on physical location.
You can have a single egress or multiple egresses.
Lastly, you can flip the egress by updating the Locale-ID with an API call – per host, per DLR or even per route.
Now that you understand the big picture and the nuts and bolts of the underlying technology, let's see how this all works together in a DR scenario!
This is a multi-vC illustration of DR –
As explained earlier, the logical network is replicated and exists at both sites:
L2, L3 and DFW span the sites, yet each subnet is only ever advertised from the primary site (or any single site).
Web, App and DB are on a single site, and customers ask: how do I do granular DR, failing over one VM at a time?
That's where multi-vC comes into the picture – you fail over "App" but you still have E-W connectivity across App, DB and Web.
The tricky part is when you fail over Web.
What do you do to automate N-S? The answer: you continue with the same egress, keeping the design simple.
Of course you can do fancier local N-S, but that would require routing gimmicks and a complicated design.
Now the interesting part – how do you address the "site down" scenario? That's where you leverage the standby edge and re-program the Locale-ID so routing comes from the secondary site.
This allows granular failover without re-advertising routes from the protected site:
- E-W continues to function without tromboning or centralized routing.
- N-S/stateful services remain at a single site – so regardless of where a VM resides before and after failover, N-S routing works without re-advertising routes.
This accomplishes the following:
Active-active for E-W and active-standby for N-S – simplified, yet it addresses most customer concerns.
DR with granular failover.
Single-site subnet advertisement for stretched subnets.
This is the same multi-vC DR setup – Abhishek has already gone through the details of what is happening here, so I will not repeat them.
This is the setup we are using for the demo.
The key features we will touch on in this demo are:
SRM
NSX Cross-vC
Universal distributed logical routing and switching
Local Egress, and
Universal application of distributed firewall rules
Let's look at the high-level configuration workflow to show how this three-tier application behaves during different failure and migration scenarios.
We will be building the config from the ground up: hosts, switching, routing, firewall, and then SRM.
Let's start with the base install. Keep in mind that we are not going to show every config detail; we'll keep it at a high level to show the building blocks.
The versions used in this demo are:
vSphere 6.0
NSX 6.2.0
SRM 6.1
Click #1:
The first thing to do after setting up the vSphere environment is to get NSX and SRM installed.
We will be using vSphere Replication for the datastores.
Click #2:
As you can see, we have two NSX Managers installed, at the protected and recovery sites, under the respective vCenter Servers.
This is a new feature with NSX 6.2, which allows up to 8 sites.
The primary NSX Manager maintains the read/writeable config. The secondary NSX Manager does hold some site-local config for the Universal DLR, which we will see shortly.
Click #1:
We have an Edge and a Compute cluster at each site.
Both have been prepared for NSX running 6.2.
Now let's look at the logical switching config.
Click #1:
We have a few Universal Logical Switches created, as you can see here.
There is one Universal Logical Switch for each tier of the application, so we have Web, App and DB Universal Logical Switches.
We also have site-local transit networks on Universal Logical Switches.
Click #1:
When a Logical Switch is created, it shows up as a portgroup on the selected distributed switch.
With the universal construct, you will see the exact same portgroups created at all sites. Note that the distributed switch itself is site local.
So let's review what we have configured so far:
Click #1: NSX Managers at both sites
Click #2: Hosts at both sites with Locale ID set to the protected site, so that all traffic egresses out of the protected site
Click #3: Application and transit Universal Logical Switches
Click #4: Universal Distributed Logical Router for intra- and cross-site east-west routing using OSPF
Click #5: Perimeter Edge routers running OSPF with the U-DLR and the ToR
Now let's look at the Edge appliances deployed, including the perimeter edges for both sites and the Universal Distributed Logical Router.
Click #1:
First we deployed the Universal Distributed Logical Router.
There is a lot that could be said about the UDLR, but as that is covered in another session I will keep it brief here.
The U-DLR is deployed at the primary site and can be seen at the secondary NSX Manager as well.
All universal config is performed at the primary site; only site-local routing config is done at the secondary sites.
Click #2:
Now we deploy the site-local perimeter edges.
Click #1:
You then deploy the site-local U-DLR control VMs.
These site-local VMs learn and distribute the routes local to their site.
The logical router control VMs peer with the edge devices using OSPF.
The edges are configured to run OSPF with the ToR leaf switches.
Now let's look at the Universal Distributed Logical Router's site-local configs.
Click #1:
Just to make it clear, you can see that each site-local control VM is connected to a different logical switch for OSPF advertisement, and
they have different IP addresses as well.
Click #2:
As Abhishek mentioned, Locale ID is a new construct in 6.2 for site-local routes.
Click #3:
The Locale ID can be changed at the control VM level, or at the cluster or host level.
When done at the cluster/host level, the Controller Cluster will send routes to hosts only from the Universal Distributed Logical Router control VM with the matching Locale ID.
Now let's have a quick look at the DFW rules in play.
Click #1: With 6.2, you can create rules marked for universal synchronization across all NSX Managers.
Click #2: Universal rules can only use universal objects such as IP sets, MAC sets and security groups.
Click #3: Sample rules that only allow specific traffic between the tiers. These rules are pushed to hosts at all sites.
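For reference, universal DFW rules can also be driven through the API. A hedged sketch of creating one follows: the section path, XML shape and security-group IDs are illustrative, modeled on the NSX-v DFW config API style – take the exact universal-section endpoint and schema (including any required If-Match/ETag handling) from the NSX 6.2 API guide.

import requests

# Hedged sketch: add a universal DFW rule allowing Web->App on TCP 8443.
# Universal rules may only reference universal objects (IP sets, MAC sets,
# security groups). Endpoint and XML are illustrative - confirm both in
# the NSX 6.2 API guide.
NSX_MGR = "https://nsx-mgr-primary.example.com"  # hypothetical

rule_xml = """<rule disabled="false" logged="true">
  <name>Allow Web to App</name>
  <action>allow</action>
  <sources excluded="false">
    <source><type>SecurityGroup</type><value>universalsecuritygroup-1</value></source>
  </sources>
  <destinations excluded="false">
    <destination><type>SecurityGroup</type><value>universalsecuritygroup-2</value></destination>
  </destinations>
  <services>
    <service><protocol>6</protocol><destinationPort>8443</destinationPort></service>
  </services>
</rule>"""

resp = requests.post(
    NSX_MGR + "/api/4.0/firewall/globalroot-0/config/layer3sections/"
    "SECTION-ID/rules",  # assumed path; SECTION-ID = the universal section
    data=rule_xml,
    headers={"Content-Type": "application/xml"},
    auth=("admin", "changeme"),
    verify=False,  # lab only
)
resp.raise_for_status()
print("Universal DFW rule created and synchronized to all NSX Managers")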
Let's now look at the SRM configuration.
Click #1:
You can see we have two sites that are already paired.
Click #2:
If you look at the network mapping, with the new auto-mapping feature the networks are easily mapped between sites.
Click #1:
We have two protection groups:
one containing the DB server,
another with the Web and App VMs, with the App VM set to boot first.
Click #2:
If we look at the details of the DB priority group we can see the recovery resource pool, recovery host and recovery network settings. This makes the setup less error prone as well.
Click #3:
You can verify and step through the recovery steps.
The last thing we will show is what the traffic looks like when the complete application fails over.
Click #1:
To demonstrate complete site failure, here we will take out the ESG at the protected site. There are other ways to demonstrate failure, one of which is complete site failure including all infrastructure; from an NSX perspective that would involve moving infrastructure components as well. That scenario is covered in much more implementation-level detail in the "Turning Disaster Recovery into a Reality with NSX" presentation. Other than that, what we show here is the same.
Click #2:
Then we enable the recovery-site ESG along with setting the Locale ID on the appropriate clusters, forcing traffic to ingress and egress via the recovery site.