VMworld 2013: Implementing a Holistic BC/DR Strategy with VMware - Part Two

Jeff Hunter, VMware
Ken Werneburg, VMware
BCO5162
#BCO5162

4
What’s the Difference?
Disaster Avoidance
Disaster Recovery
Planned vs. Unplanned

5
Disaster Recovery vs. Business Continuity
Example: Tuesday, August 23, 2011 at 1:51 PM EDT - Magnitude 5.8 earthquake near Mineral, Virginia
Disaster recovery required? No
Interruption to business continuance? YES!

6
Fault Tolerance vs. High Availability
▪ Fault tolerance
• Ability to recover from component loss
• Example: Hard drive failure
▪ High availability
Uptime percentage in one year    Downtime in one year
99                               3.65 days
99.9                             8.76 hours
99.99                            52 minutes
99.999 (“five nines”)            5 minutes
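The downtime figures above follow directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from the session):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


for pct in (99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

The slide rounds the last two rows down: 99.99% works out to roughly 52.6 minutes and 99.999% to roughly 5.3 minutes.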

7
RTO, RPO, and MTD
▪ Recovery Time Objective (RTO)
• How long it should take to recover
▪ Recovery Point Objective (RPO)
• Amount of data loss that can be incurred
▪ Maximum Tolerable Downtime (MTD)
• Downtime that can occur before significant loss is incurred
• Examples: Financial, reputation
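One way to make these three targets concrete is to record them per application and check a measured or estimated recovery against them. The sketch below is illustrative only: the class, the field names, and the sample "payroll" numbers are hypothetical, and the rule that RTO should not exceed MTD is a common planning assumption rather than something stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    """Business targets for one application, all in minutes."""
    rto: float  # Recovery Time Objective: how long recovery should take
    rpo: float  # Recovery Point Objective: how much data loss is acceptable
    mtd: float  # Maximum Tolerable Downtime: beyond this, significant loss occurs

    def is_consistent(self) -> bool:
        # Assumption: recovery has to finish before the MTD is reached.
        return self.rto <= self.mtd


def meets_objectives(obj: Objectives, recovery_minutes: float,
                     data_loss_minutes: float) -> bool:
    """True if a measured or estimated recovery satisfies both RTO and RPO."""
    return recovery_minutes <= obj.rto and data_loss_minutes <= obj.rpo


payroll = Objectives(rto=240, rpo=60, mtd=480)  # hypothetical application
print(payroll.is_consistent())
print(meets_objectives(payroll, recovery_minutes=30, data_loss_minutes=15))
print(meets_objectives(payroll, recovery_minutes=30, data_loss_minutes=24 * 60))
```

The last call models a nightly backup: recovery is quick, but up to a day of data could be lost, so a 60-minute RPO is not met.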

8
Making an Application Service Highly Available
▪ vSphere HA
▪ NEW: vSphere App HA

9
VMware vFabric™ tc Server
vSphere App HA (New)
Policy-based
Protect off-the-shelf apps

10
vSphere App HA (New)
[Diagram: a vSphere HA cluster of four vSphere hosts, with a vFabric Hyperic virtual appliance, a vSphere App HA virtual appliance, Hyperic agents running in VMs, and vCenter Server]

12
vSphere HA – Keep In Mind…
▪ RTO – measured in minutes (not seconds)
▪ Requires shared storage
▪ Best practices
• Use admission control – percentage policy
• Test post-failure performance with host maintenance mode
• Isolation response – leave powered on
• Network and storage redundancy
• Also see BCO5047
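For the percentage-based admission control policy, a rough starting point is the share of cluster capacity that the hosts you want to tolerate losing represent. The sketch below assumes equally sized hosts and uses the same figure for CPU and memory; the function name is illustrative, and real clusters with mixed host sizes need a more careful calculation.

```python
def reserved_capacity_percent(total_hosts: int, host_failures: int) -> int:
    """Percentage of cluster capacity to reserve so that `host_failures`
    hosts can fail, assuming all hosts are the same size (rounded up)."""
    if total_hosts < 2:
        raise ValueError("an HA cluster needs at least two hosts")
    if not 0 < host_failures < total_hosts:
        raise ValueError("host_failures must be between 1 and total_hosts - 1")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-100 * host_failures // total_hosts)


print(reserved_capacity_percent(4, 1))   # 4 hosts, tolerate 1 failure
print(reserved_capacity_percent(8, 1))   # 8 hosts, tolerate 1 failure
print(reserved_capacity_percent(10, 2))  # 10 hosts, tolerate 2 failures
```

Placing a host in maintenance mode, as the slide suggests, is a practical way to check that the remaining hosts actually perform acceptably with the reserved headroom.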

13
vSphere Fault Tolerance (FT)
▪ Zero recovery time, zero data loss
• Host hardware failure only
• Does not protect against OS and application failure
▪ Works fine with HA, App HA
▪ Why not FT?
• Resource requirements – does workload really need it?
• Not supported if the VM has multiple vCPUs – see BCO5065
• No VM snapshots – backups require agent

14
Data Protection (Backup and Restore)
▪ Agents? No Agents? – Both!
• No agents for majority of workloads – keep it simple
• Agents for certain apps
▪ vSphere Data Protection (VDP) Advanced
• Backup and recovery for VMware, from VMware
• Based on proven, mature EMC Avamar™
• Agent-less VM backup and restore
• Agents for granular tier-1 application protection

15
vSphere Data Protection (New)

16
VDP Advanced – Keep In Mind…
▪ Engineered for SMB environments
▪ Uses VADP – VM snapshots, CBT
▪ Utilizes Windows VSS in VMware Tools
▪ Works fine with HA, not with FT
▪ RDM – virtual yes, physical no
▪ Is it DR?
• Maybe – depends on RTO, RPO
• Needs replication offsite, right? – see BCO5041

17
VDP Advanced – Keep In Mind…
▪ Best Practices
• Prepopulate DNS, always use FQDN
• Manage VM snapshots
• Avoid deploying to slow storage
• Do not power off the appliance; always shut it down gracefully
• Do not schedule backups during the maintenance window
• Also see BCO4756 and BCO5041

18
vCenter Availability
▪ Run vCenter Server application in a VM
▪ Run vCenter Server database in a VM
▪ Run both in same VM?
▪ Protect with vSphere HA
• vCenter and DB VM restart priority set to High
• Enable guest OS and App monitoring
▪ App HA can protect SQL Server database

19
vCenter Availability
▪ Back up vCenter Server VM and database
• Image-level backup for vCenter Server VM
• App-level backup using agent for database backup
▪ Why not FT for vCenter Server?
• vCenter Server requires minimum of 2 vCPUs
• FT does not protect against application failure
▪ Replicate vCenter Server, database VMs?

20
vCenter Availability – vCenter Server Heartbeat
▪ Pros
• Better RTO and RPO – typically ~5 minutes
• Protects against host and guest OS failure
• Checks network connectivity
• Monitors application services and performance
▪ Cons
• Complexity
• Requires double the resources
• Licensing cost

21
vSphere Replication – DR
▪ Native tool built into the platform
▪ Per-VM hypervisor replication, managed in VC
▪ Selectable RPO from 15 min up to 24 hours
▪ Selectable destination datastore (disk-type agnostic)
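When RPO values are collected from application owners before replication is configured, a small check against the selectable range quoted above (15 minutes to 24 hours) can catch requirements that vSphere Replication cannot meet. This is a planning-side sketch only; the constant and function names are hypothetical and it does not call any VMware API.

```python
MIN_RPO_MINUTES = 15        # lower bound quoted on the slide
MAX_RPO_MINUTES = 24 * 60   # upper bound quoted on the slide (24 hours)


def validate_rpo(minutes: int) -> int:
    """Return the RPO unchanged if it falls inside the selectable range."""
    if not MIN_RPO_MINUTES <= minutes <= MAX_RPO_MINUTES:
        raise ValueError(
            f"RPO of {minutes} min is outside the "
            f"{MIN_RPO_MINUTES}-{MAX_RPO_MINUTES} min range"
        )
    return minutes


requested = {"web": 240, "db": 5}  # hypothetical per-VM requirements, in minutes
for vm, rpo in requested.items():
    try:
        validate_rpo(rpo)
        print(f"{vm}: {rpo} min is selectable")
    except ValueError as err:
        print(f"{vm}: {err}")
```

A workload that needs an RPO below the lower bound, such as the 5-minute "db" example, would have to be protected some other way.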

22
Replication Across Sites
[Diagram: two sites, each with a vCenter Server, a VR Appliance, three ESXi hosts (NFC, VRA), and storage; VMDK1 is replicated from storage at one site to storage at the other]

23
Four Steps for Full Recovery
1. Right-click, select “Recover”
2. Select a target folder
3. Select a target resource
4. Click Finish
Your choices are validated as you go

24
New Feature – Retain Historical Replicas
Retention of multiple points in time allows reversion to earlier known good states
After recovery, use the snapshot manager to revert to earlier points
[Diagram labels: vSphere, VR Agent]

25
MPIT Presented as VM Snapshots after Failover
Use the snapshot manager to revert to earlier points, an interface all administrators have been comfortable with for many years.

26
vSphere Replication – Interoperability
▪ Fault tolerance – doesn’t work with VR
• FT conflicts at the vSCSI disk filter level.
▪ VDP
• Mostly no problem!
• If using VSS… ensure you are using 5.5!!
▪ HA, vMotion, DRS
▪ Storage vMotion and Storage DRS
• Now supported!

27
vSphere Replication – Best Practices
▪ RPO
• Only what is necessary!
• Just because you can…
▪ RTO
• Don’t set one! No testing, no automation, manual process.
▪ VSS – Only if necessary!
▪ What about bandwidth?
• Very hard to determine. Do a local loopback first.
▪ RDMs?
• Don’t use them. If you must, use virtual compatibility mode.
▪ Don’t mix ABR and VR!
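As the bandwidth bullet says, the requirement is hard to predict up front. Once a test replication has shown how much data actually changes within one RPO window, a rough lower bound can be worked out as below. This is an illustrative sketch, not a VMware sizing formula: the function name and sample figures are hypothetical, and it ignores compression, protocol overhead, bursts, and the initial full sync.

```python
def required_mbps(changed_gb_per_window: float, rpo_minutes: float) -> float:
    """Average link speed (megabits per second) needed to ship the data
    changed in one RPO window before the next window closes."""
    if changed_gb_per_window < 0 or rpo_minutes <= 0:
        raise ValueError("changed data must be >= 0 and RPO must be > 0")
    megabits = changed_gb_per_window * 8 * 1000  # decimal GB -> megabits
    return megabits / (rpo_minutes * 60)


# Hypothetical measurements from a test replication
print(f"{required_mbps(10, 60):.1f} Mbps")  # 10 GB changed per 60-minute window
print(f"{required_mbps(1, 15):.1f} Mbps")   # 1 GB changed per 15-minute window
```

Treat the result as a floor and add headroom; a shorter RPO usually raises the requirement because repeated writes to the same blocks are coalesced less often.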

28
SRM
What is it?
• A Disaster Recovery engine
• A tool that uses externally replicated data (VR or array based) to speed the RTO of a BCP
• A product that allows for DR to be tested, automated, planned, repeatable and customizable
What is it not?
• A replication engine
• A tool for systems that need near-instant RPO
• A disaster avoidance stretched cluster

29
Key Components of SRM
▪ One vCenter Server (Windows or VCVA) per site, same versions
▪ One SRM Server per site, same versions
▪ vSphere hosts, recommend same versions per site (pre vSphere 5.x only if using array replication)
vSphere Essentials Plus and higher editions supported
[Diagram labels: Replication, vCenter Server, SRM Server]

30
SRM Replication Options
▪ SRM can utilize BOTH array based AND vSphere Replication
▪ SRM will “see” existing standalone vSphere Replication protected VMs
▪ SRM can install vSphere Replication from scratch if needed
[Diagram: multi-tier apps (Web, App, DB) protected by vSphere Replication and by storage-based replication of LUN 1 and LUN 2; also labeled “Hub”]

31
Recovery Workflows
Failover Automation
• User defined recovery plan
• Minimize errors
Non-disruptive Failover Testing
• Isolated test environment
• Increase confidence in DR process
Planned Migration
• Zero data loss
• Operational migration
Failback Automation
• Re-protect VMs, migrate back

32
SRM Interoperability
▪ Works with VR and ABR
▪ Backups, VADP or other, are fine
▪ HA is no problem at all
▪ vMotion and DRS are fine
▪ Storage vMotion and Storage DRS – Sort of…
• Replication dependent
▪ FT is “yellow”
• Array replicated only and the FT status is not recovered
▪ Web vs vSphere Client

33
SRM – A Few Best Practices
Not exhaustive. How long is VMworld?
Big ones:
• Storage Layout
• Test Network Configuration
• Test often!
• Size vCenter correctly
Biggest one: Do a Business Impact Analysis
• RPO, RTO, cost of downtime, interdependencies, criticality of applications, priorities, units of failover, overlooked externalities, executive buy-in, …

34
SRM Further Detail at VMworld
• BCO5733 - vCenter Site Recovery Manager – Solution Overview and Lessons from a Fortune 500 Health Care Company Implementation
• BCO5129 - Protection for All - vSphere Replication & SRM Technical Update
• BCO5170 - DR to The Cloud with VMware Site Recovery Manager and Rackspace Disaster Recovery Planning Services
• BCO5652 - Three Quirky Ways to Simplify DR with Site Recovery Manager
• BCO4905 - Disaster Recovery Solution with Oracle Data Guard and Site Recovery Manager

35
Protection Groups (PGs)
▪ More PGs = more granular testing/failover
• DR testing is easier – fewer resource requirements
• Fail over only what is needed
• More configuration/complexity
▪ Fewer protection groups = less complex
• Fewer LUNs, PGs, recovery plans
• Less flexibility
▪ Find a good balance between flexibility and simplicity
[Diagram: spectrum from fewer LUNs/PGs (less complexity, less flexibility) to more LUNs/PGs (more complexity, more flexibility); the right combination of complexity and flexibility varies by customer]
Majority of outages are partial (not entire data center) – design accordingly

36
Test Network
• Use VLAN or isolated network for test environment
• Default “Auto” setting does not allow VM communication between hosts
• Different vSwitch can be specified in SRM for test versus run
• Specified in Recovery Plan

37
vSphere Infrastructure Navigator

38
VMware – Multiple Levels of Protection
[Diagram, built up over three slides: a SQL VM at Site A protected by vSphere HA/FT; then backed up with VDPA; then replicated with VR/SRM to a SQL VM at Site B]

45
Other VMware Activities Related to This Session
▪ HOL: HOL-SDC-1305 – Business Continuity and Disaster Recovery In Action
▪ VMworld Session: BCO-5160 – Implementing a Holistic BC/DR Strategy – Part 1

