Jonathan Frappier – Challenge 2 Design Solution

Virtual Design Master After the Outbreak
Project: HA, Backup, and Disaster Recovery Solutions
Focus Area: Provide HA, backup and disaster recovery solutions VMware
vSphere, Active Directory, SQL, Network, Storage, Remote Access and other
managed applications
Prepared By: Jonathan Frappier @jfrappier www.virtxpert.com
Project Quality Plan Version Control
Version  Date     Author             Change Description
.5       8/19/13  Jonathan Frappier  Draft
.75      8/19/13  Phoummala Schmitt  Technical Consultant for Exchange HA Solution (verified solution is valid)
1.0      8/22/13  Jonathan Frappier  Final

TABLE OF CONTENTS
1 EXECUTIVE SUMMARY
1.1 DEFINITIONS
1.2 SCOPE
2 SITE INFRASTRUCTURE UPDATES
2.1 EQUIPMENT SPECIFICATIONS
2.2 LAN & WAN
2.3 PHYSICAL SERVERS
2.4 STORAGE
2.5 SLA DEFINITIONS
2.6 ADDITIONAL EQUIPMENT REQUIRED
2.7 UPDATED SITE DIAGRAM
3 REQUIREMENTS, ASSUMPTIONS, CONSTRAINTS & RISKS
3.1 REQUIREMENTS
3.2 ASSUMPTIONS
3.3 CONSTRAINTS
3.4 RISKS
4 APPLICATION BACKUP, HA, RPO AND RTO
4.1 RPO AND RTO CLASSIFICATIONS
4.2 SOFTWARE SOLUTIONS FOR BACKUP AND REPLICATION
4.3 IDENTIFIED APPLICATIONS & FAILOVER PROCESS BRIEFS
5 IP ALLOCATION
6 APPENDICES
6.1 HARDWARE MANIFEST
6.2 SOFTWARE MANIFEST
6.3 REFERENCE
6.4 VMWARE CONFIGURATION MAXIMUMS
6.5 ORIGINAL SITE DESIGN
6.6 EXCHANGE 2010 SIZING CALCULATOR

1 EXECUTIVE SUMMARY
The world is in disarray after a virus outbreak that turned many into zombies. You’ve been recruited to
build an infrastructure for a wealthy philanthropist to use in order to put the world back together.
During phase 2 of the build out, appropriate High Availability, backup and disaster recovery solutions
must be designed, documented and implemented.
The disaster recovery solution will fail over all virtual server workloads from the primary site to the
secondary site, and all virtual desktop workloads to the tertiary site. The solution will use several
methods for failover and recovery, chosen according to the RPO and RTO of each application. This will
provide the most cost-effective way to fail over all applications.
1.1 Definitions
Backup: The process of making a copy of important information so that it can be restored if the original
is lost.
Business Continuity Plan (BCP): A plan for how the business will continue to operate in the event of a
disaster. Information in a BCP may include escalation procedures, designation as to who may declare an
emergency, contact and emergency contact information for key personnel, and directions on how to access
systems after a DRP has been implemented.
Disaster Recovery Plan (DRP): A disaster recovery plan is the process for restoring access to
infrastructure and applications in the event of a disaster. Disasters may range from small/isolated
incidents, such as a server failure that affects access to key systems, to a major/catastrophic incident
that causes major damage or prevents access to an entire site.
High Availability (HA): High Availability is the process of enabling application resiliency such that
applications remain available, or are quickly restored, in the event of an outage. An example of HA may be
a virtual server being restarted if it has become unresponsive, quickly and automatically restoring access
to the application with little to no human intervention.
HA By Design (HABD): HA By Design is the concept of building HA into your normal application design
such that during a disaster, connectivity remains unaffected, or requires very little intervention to be
accessed (for example using an alternate URL for access).
Recovery Point Objective (RPO): Recovery Point Objective defines an acceptable amount of data loss
(or lag) in the event of a disaster. For example, a system with an RPO of one (1) hour must be
recoverable to a point no more than one (1) hour before the outage or disaster.
Recovery Time Objective (RTO): Recovery Time Objective defines how long a system may remain
offline after a disaster. The RTO and RPO do not need to be closely related. For example, a system may
have an RPO of only five (5) or ten (10) minutes but during a disaster may not need to be brought back
online for several hours, or even days; an RTO of five (5) days may be acceptable even with a low
RPO. Conversely, a system with a low RTO may contain static data, such as a web server, and thus
have a very high RPO because the data does not change (often).
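To make the distinction between the two objectives concrete, the following minimal Python sketch checks a single recovery event against an RPO and an RTO. The function name and the timestamps are hypothetical and are used only for illustration.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, outage_start, service_restored, rpo, rto):
    """Return (rpo_met, rto_met) for one recovery event.

    Data loss is the gap between the last recoverable copy and the outage.
    Downtime is the gap between the outage and the restoration of service.
    """
    data_loss = outage_start - last_recovery_point
    downtime = service_restored - outage_start
    return data_loss <= rpo, downtime <= rto


# Hypothetical event: last replica taken 4 minutes before the outage,
# service restored 3 hours after the outage.
outage = datetime(2013, 8, 22, 12, 0)
result = meets_objectives(
    last_recovery_point=outage - timedelta(minutes=4),
    outage_start=outage,
    service_restored=outage + timedelta(hours=3),
    rpo=timedelta(minutes=5),
    rto=timedelta(hours=5),
)
print(result)  # (True, True)
```

The two checks are independent, which mirrors the point above that a system can have a tight RPO and a relaxed RTO, or the reverse.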
1.2 Scope
Mr. Billionaire needs you to build him an infrastructure so he can continue his efforts across the globe.
There are 3 locations that must be used because no single location has enough power to host
all of the equipment. The primary site supports up to 5000 virtual servers, the secondary site 1000 and
the tertiary site 500. The primary site also hosts 3000 virtual desktops available for
full desktop access and mobile application delivery for at least 1500 devices.
A plan for appropriate High Availability solutions, backup and Disaster Recovery must be implemented to
support a major failure in the primary site. Business Continuity, communication and escalation
procedures are outside the scope of this document.

2 Site Infrastructure Updates
2.1 Equipment Specifications
All systems used to support disaster recovery efforts will be the same make, model and configuration as
the original installation. Additionally, this document lists only the required changes to each site;
full system details can be found in the original site design (Appendix 6.5).
2.2 LAN & WAN
• LAN
An additional Cisco 6513 will be added to the secondary site to ensure support for up to 50
additional hosts if required.
• WAN
A 100Mbps internet link has been added to the secondary and tertiary sites. To support this, a
Cisco 7600 router and Cisco ASA 5540 will be added to each site.
2.3 Physical Servers
The secondary site will need an additional 50 physical hosts to support full recovery of the virtual
servers in the primary site. The tertiary site will require an additional 10 physical hosts to support the full
recovery of the virtual desktops from the primary site.
2.4 Storage
During the original design, an assumption was made at the tertiary site to install an EMC Celerra NS-480
with 32TB of storage to support failover from the secondary site. Because the requirements for the DRP call
for failover of the primary site only, the spare NS-480 in the tertiary site will be upgraded to 64TB to
match the NS-480 in the primary site supporting the VDI workloads.
2.5 SLA Definitions
No service levels were defined; 99.9% availability will be used as the standard.
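As a rough illustration of what a 99.9% target allows, the sketch below converts an availability percentage into permitted downtime. The function name is hypothetical, and the 365-day year and 30-day month are simplifying assumptions.

```python
def allowed_downtime_hours(availability_pct, period_hours=365 * 24):
    """Hours of downtime permitted over the period at the given availability."""
    return period_hours * (1 - availability_pct / 100)


yearly = allowed_downtime_hours(99.9)
monthly = allowed_downtime_hours(99.9, period_hours=30 * 24)
print(f"{yearly:.2f} hours per year")                   # 8.76 hours per year
print(f"{monthly * 60:.1f} minutes per 30-day month")   # 43.2 minutes per 30-day month
```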
2.6 Additional Equipment Required
Device Type    Manufacturer  Model                Quantity
Server         HP            DL580 G5             60 (50 to secondary, 10 to tertiary)
Add-On NIC     Broadcom      5709 Based           120 (2 cards per server)
Add-On HBA     EMC           Qlogic QLE2462-E-SP  240 (4 cards per server)
Add-On HD      OCZ           32GB SSD             1560 (16 drives per server)
Storage Array  EMC           146GB FC 15K         232 (fill remaining DAEs in tertiary site 2nd Celerra)
Load Balancer  F5            BigIP 6800           6 (4 primary site, 2 secondary site, 2 tertiary site)
2.7 Updated Site Diagram

3 Requirements, Assumptions, Constraints & Risks
3.1 Requirements
The purpose of this project is to define, document and create a highly available infrastructure capable of
surviving a disaster at the primary site.
3.2 Assumptions
• Secondary and tertiary sites have been upgraded to support the failover of select capacity from the
primary site, as listed in Section 2.6.
• 50 physical hosts will be required to meet the assumed server consolidation ratio in the primary
datacenter.
• 10 physical hosts will be required to meet the assumed server consolidation ratio in the
secondary datacenter.
• 5 physical hosts will be required to meet the assumed server consolidation ratio in the tertiary
datacenter.
• Three hundred (300) desktop VMs to a single physical host will be an assumed average
consolidation ratio (300-to-1).
• 10 physical hosts will be required to meet the assumed desktop consolidation ratio.
• The 100Mbps link will provide sufficient bandwidth for normal internal traffic (AD replication,
vCenter management of hosts and system monitoring) as well as replication.
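The host counts above follow from simple consolidation arithmetic, sketched below in Python. The function names are hypothetical; the 100-to-1 server ratio is only what the stated figures for the primary site (5000 virtual servers, 50 hosts) imply.

```python
import math


def hosts_required(vm_count, vms_per_host):
    """Round up: a partially filled host is still a whole host."""
    return math.ceil(vm_count / vms_per_host)


def implied_ratio(vm_count, host_count):
    """VMs per host implied by a given VM count and host count."""
    return vm_count / host_count


# Desktop figures from this document: 3000 desktops at 300-to-1.
print(hosts_required(3000, 300))   # 10
# Server figures from this document: 5000 virtual servers on 50 hosts.
print(implied_ratio(5000, 50))     # 100.0
```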

3.3 Constraints
• Hardware is limited to stock on hand at a discovered warehouse; components are believed to be
from 2008.
• Power and cooling at each location are limited.
• Each site has a 100Mbps link for connectivity to the other sites. This may impact RPO for select
systems.
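To show why the link speed can constrain RPO, the sketch below estimates the most changed data that can cross a link inside one RPO window. The function name and the `usable_fraction` parameter are hypothetical; real throughput will be lower once protocol overhead and other traffic are counted.

```python
def max_change_per_window_mb(link_mbps, window_seconds, usable_fraction=1.0):
    """Upper bound on changed data (in megabytes) that fits through the link
    in one RPO window. usable_fraction is a hypothetical allowance for other
    traffic and protocol overhead."""
    megabits = link_mbps * window_seconds * usable_fraction
    return megabits / 8


# 100Mbps link, 5 minute (300 s) window, whole link assumed available.
print(max_change_per_window_mb(100, 300))        # 3750.0
# Same window with only half of the link assumed available for replication.
print(max_change_per_window_mb(100, 300, 0.5))   # 1875.0
```

If the protected systems change more data than this within the window, replication falls behind and the achieved RPO grows.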
3.4 Risks
• Each site has a 100Mbps link for connectivity to the other sites. This may impact RPO for select
systems.
• The requirements call for a DRP only for the primary site; failures of the secondary or tertiary
sites are not accounted for.
• Hardware available is believed to be reliable and in working order.
• The number of hosts required to meet the assumed consolidation ratio is above the maximum
supported for a single cluster; multiple clusters will have to be used.
4 Application Backup, HA, RPO and RTO
4.1 RPO and RTO Classifications
There will be five (5) categories each classifying RPO and RTO for the systems deployed in the primary data center.
RPO Classifications
Class     RPO            Example Applications
Platinum  < 5 minutes    Active Directory, Exchange
Gold      5 minutes      Web app DB tier
Silver    10 minutes     Custom department applications (HR, Financial Planning)
Bronze    30-60 minutes  File servers, Web app front end tier
Static    1 day          Basic utility systems, monitoring systems, VDI
RTO Classifications
Class      RTO          Example Applications
Critical   < 5 minutes  Active Directory, Exchange
Priority   30 minutes   Minimum primary web app & DB tier to restore access
Important  5 hours      File servers, Basic utility systems, monitoring systems, VDI
Redundant  12-24 hours  Redundant systems to ensure HA
Standby    10 days      Custom department applications (HR, Financial Planning)
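The RPO classes can be read as a lookup from the data loss an application tolerates to the least strict class that still protects it. The Python sketch below is one way to express that. Treating each listed value as an inclusive upper bound, and using 60 minutes for Bronze, are assumptions of this sketch, as is the function name.

```python
# (class name, upper bound in minutes) taken from the RPO table; treating each
# listed value as an inclusive upper bound is an assumption of this sketch.
RPO_CLASSES = [
    ("Gold", 5),
    ("Silver", 10),
    ("Bronze", 60),
    ("Static", 24 * 60),
]


def rpo_class(tolerable_loss_minutes):
    """Return the least strict RPO class that still meets the tolerance."""
    if tolerable_loss_minutes < 5:
        return "Platinum"
    chosen = None
    for name, bound in RPO_CLASSES:
        # Keep the last class whose guaranteed bound fits within the tolerance.
        if bound <= tolerable_loss_minutes:
            chosen = name
    return chosen


print(rpo_class(2))    # Platinum
print(rpo_class(45))   # Silver
print(rpo_class(60))   # Bronze
```

A 45 minute tolerance maps to Silver rather than Bronze because Bronze only guarantees recovery to within 60 minutes.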
4.2 Software Solutions for Backup and Replication
We will use different approaches for backup, replication and recovery for various systems, based on
their RPO/RTO, to be as cost-effective as possible.

Product                         Use Cases                                     Expected Cost (MSRP/Published)
VMware HA                       Automatic restart of VMs on a failed host     Included with vSphere Enterprise Plus
                                within the same site.
VMware FT                       Hot spare kept in lock step with the          Included with vSphere Enterprise Plus
                                primary VM; limited use case due to the
                                single vCPU requirement but useful for
                                some web services.
VMware PowerCLI / vMA           Scripting configuration and setup of hosts    Included with vSphere Enterprise Plus
                                and VMs, to be used with lower tier
                                RTO/RPO classifications.
VMware vCenter Heartbeat        Used to deliver HA for vCenter Server         $9995
                                required components.
Zerto Virtual Replication v3    Used for select upper tier RTO/RPO            $745 / protected VM
                                classifications to automate the replication
                                and reconfiguration of systems at the
                                secondary site.
Unitrends Backup and Recovery   Will be used for backup of mid-tier RTO/RPO   $nnn / protected ??
                                classifications such as specialized
                                applications that can easily be
                                scripted/installed.
EMC Celerra Replication         The Celerra will replicate select             Included with NS-480
                                LUNs/Storage Groups to the appropriate DR
                                site.
4.3 Identified Applications & Failover Process Briefs
Below is a list of applications, their required RPO and RTO, and a brief overview of how those will be
achieved.
• Active Directory: All Domain Controllers will have a system state backup performed in
Unitrends [14]. Because Domain Controllers will be created in all data centers, there will be no
downtime and next to no data loss (if any) in the event of a site failure. Each site will contain at
least one (1) Global Catalog Server and the FSMO roles will be separated based on Microsoft best
practice [13]. In this configuration, I believe that the published best practice for FSMO role
placement will provide the necessary resources for our domain.
RTO: Critical
RPO: Platinum
Achieved in application design
• Windows 2008 R2 Exchange 2010 Client Access Server (CAS) [1]: Hot stand-by CAS servers
will be configured in the secondary site to ensure immediate access is available in the event of a
disaster. The number of CAS servers ready will be ¼ (25%) of the production CAS servers in use
(with a minimum of one running at all times). The CAS server will be configured to respond to an
alternate DNS A record while the primary DNS records are changed and replicated. TTL for the
CAS server DNS records will be set to 5 minutes (300s). CAS servers in the primary site will be
protected by Zerto.
RTO: Redundant
RPO: Bronze
Achieved by placing hot stand-by servers in the secondary site capable of handling workload
during a disaster.

• Windows 2008 R2 Exchange 2010 Transport Server [1]: Hot stand-by Transport servers will be
configured in the secondary site. Transport servers will have log files backed up by Unitrends,
but will not need to be restored in the event of a disaster.
RTO: Redundant
RPO: Static
Achieved by placing hot stand-by servers in the secondary site; no data is required from the
primary site servers.
• Windows 2008 R2 Exchange 2010 Mailbox Server (BE) [1]: Because of the organization's reliance
on messaging, hot stand-by mailbox servers will be configured in the secondary site as part of a
Database Availability Group (DAG) so that mail can be replicated in near real time from the
primary site to the secondary site. Based on the Exchange Server 2010 Role Requirement Calculator
(Appendix 6.6), an estimated 43Mbps of the 100Mbps site link between the data centers is
required for this replication to be maintained at a near-zero RPO.
RTO: Critical
RPO: Platinum
Achieved by application design, placing hot spare mailbox servers in the secondary data center
and configuring them in a DAG.
Email flow with primary site online. When failed over to the secondary site, an alternate URL will
be available for immediate use while production DNS records are changed and replicated.

Mailbox Server / DAG Design
• Windows 2008 R2 Standard Application Servers: There will be three (3) levels of application
server classification. The first will include the minimum components required to bring the
application back online, capable of supporting all necessary traffic. For example, a single web
server, application server and database server may be able to handle the load for a given
application, but due to its uptime requirements the application may have several other redundant
servers to support it. In this scenario only 1 web, application and database server would be brought
online initially, and the redundant systems would be brought back online once all primary systems
were restored. Some application servers may only be providing utility level services.
RTO: Priority
RPO: Gold
Achieved by continuous replication using Zerto Virtual Replication v3 which will allow the desired
RPO level and assist with RTO by automating the changing of IP addresses in the secondary
data center.
RTO: Redundant
RPO: Bronze
Achieved by backing up systems with Unitrends and replicating Unitrends to the secondary site.
Data will be refreshed from the primary systems within an application cluster. In some scenarios
where tiers are stateless (i.e. no data is stored on the system), templates and scripts to restore
the redundant systems to a working state will be used rather than backing up with Unitrends.

RTO: Standby
RPO: Static
Achieved by automating the installation of specific applications such as system monitoring
(OpsView). For systems that require data retention, systems will have the specific data sets
backed up by Unitrends (i.e. not using a full VM backup, rather just individual directories or
databases).
• VMware vCenter Server and SQL 2008 R2 Database Server (for vCenter) [2]: VMware vCenter
Server, the required services that will also be installed on the vCenter server
(SSO, Inventory Service), and the vCenter database server will be protected using VMware
vCenter Heartbeat. Given that vCenter is crucial to the operation of the environment, I feel it is
critical to provide a solid solution which is dedicated to this task. At $9995, the cost of the product
should be justified given the size and scope of the environment [15].
RTO: Critical
RPO: Platinum
Achieved by leveraging vCenter Heartbeat, which monitors vCenter and provides the ability to fail
over and fail back. vCenter Heartbeat is capable of operating over a WAN environment, and with
100Mbps dedicated links between the data centers I feel this is an ideal solution.

Graphic from www.vmware.com vCenter Heartbeat product page
• VMware View Infrastructure: Because many of our users rely on the VMware View infrastructure
for remote access to systems and applications, we have decided to stand up a warm VMware
View infrastructure in the tertiary data center. This will allow remote staff, some of whom may be
required to access systems during a disaster, an alternate method for access. In order to
support users in this fashion, VMware View HTML Access will be configured. There is no data
saved in the View environment; rather, it is saved in applications and file servers, so there is no
RPO required. VM templates will be replicated from the primary data center.
RTO: Priority
RPO: N/A
Achieved by building a warm stand-by View infrastructure in the tertiary data center.
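Two of the Exchange figures in the list above lend themselves to a quick check: the number of hot stand-by CAS servers (25% of production, minimum of one) and the site link bandwidth left after DAG replication (43Mbps of 100Mbps). The Python sketch below is illustrative only. The function names are hypothetical, the production CAS counts are made up because the design does not state them, and rounding up fractional servers is an assumption.

```python
import math


def standby_cas_count(production_cas_count, fraction=0.25):
    """Hot stand-by CAS servers: a quarter of production, never fewer than one.
    Rounding fractional servers up is an assumption of this sketch."""
    return max(1, math.ceil(production_cas_count * fraction))


def link_headroom_mbps(link_mbps, dag_replication_mbps):
    """Bandwidth left on the site link after DAG replication."""
    return link_mbps - dag_replication_mbps


# Production CAS counts below are hypothetical; the design does not state them.
print(standby_cas_count(2))    # 1
print(standby_cas_count(8))    # 2
print(standby_cas_count(10))   # 3
# 43Mbps of DAG replication on the 100Mbps link.
print(link_headroom_mbps(100, 43))   # 57
```

The remaining 57Mbps has to carry AD replication, management traffic and all other replication between the sites.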

5 IP Allocation
The following IP allocation will be used across all data centers. Each site will have a class B range,
subnetted for traffic segmentation purposes. Routing and ACLs, where appropriate, will be handled by the
switch.
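As a sketch of how a per-site range might be carved into segments, the Python example below uses the standard `ipaddress` module. The site ranges, segment names and /24 segment size are all hypothetical placeholders, not the allocation used in this design.

```python
import ipaddress

# Hypothetical per-site ranges; the actual allocation is not reproduced here.
SITE_RANGES = {
    "primary": "172.16.0.0/16",
    "secondary": "172.17.0.0/16",
    "tertiary": "172.18.0.0/16",
}
# Hypothetical traffic segments, one /24 each.
SEGMENTS = ["management", "vmotion", "storage", "servers"]


def segment_subnets(site_cidr, segments):
    """Assign consecutive /24 subnets of the site range to the named segments."""
    subnets = ipaddress.ip_network(site_cidr).subnets(new_prefix=24)
    return dict(zip(segments, subnets))


plan = {site: segment_subnets(cidr, SEGMENTS) for site, cidr in SITE_RANGES.items()}
print(plan["primary"]["management"])   # 172.16.0.0/24
print(plan["secondary"]["storage"])    # 172.17.2.0/24
```

Keeping the same segment order at every site means the third octet identifies the traffic type regardless of location, which can simplify ACLs.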
6 APPENDICES
6.1 Hardware Manifest
Device Type     Manufacturer  Model
Router          Cisco         7600
Firewall        Cisco         ASA 5540
Network Switch  Cisco         Catalyst 6500
Storage Switch  Cisco         MDS 9513
Server          HP            DL580 G5
Add-On NIC      Broadcom      5709 Based
Add-On HBA      EMC           Qlogic QLE2462-E-SP
Add-On HD       OCZ           32GB SSD
Storage Array   EMC           Celerra NS480
Storage Array   EMC           146GB FC 15K
Load Balancer   F5            BigIP 6800
6.2 Software Manifest
Vendor       Software
Microsoft    Windows 2008 R2
Microsoft    Windows 7 64-bit
Microsoft    Office 2010
Microsoft    SQL 2008 R2
VMware       vSphere Enterprise Plus
VMware       Horizon View
VMware       Replication
VMware       vSphere Data Protection
VMware       Log Insight Manager
VMware       vMA
VMware       vSphere Support Assistant
VMware       vShield Endpoint
VMware       vCenter Server Heartbeat
Trend Micro  Deep Security
Opsview      Opsview Enterprise
Unitrends    Enterprise Backup
Indeni       Dynamic Knowledge Base
Zerto        Zerto Virtual Replication v3
6.3 Reference
1 - http://goo.gl/7ohe4 - Microsoft Exchange 2010 on VMware Best Practices
2 - http://goo.gl/F2B4w - Installing vCenter Server 5.1 Best Practices
3 - http://goo.gl/ToZfWc - How old is my server
4 - http://goo.gl/PtlKT4 - Ark.intel.com
5 - http://goo.gl/LFBys - wmarow.com IOPS calculator
6 - http://goo.gl/xcF0h - RAIDcalc
7 - http://goo.gl/zRdhqT - Cisco 5500 Series Release Notes
8 - http://communities.vmware.com/docs/DOC-22981 - vSphere 5.1 Hardening Guide
9 - http://goo.gl/WIC7Hb - vSphere 5.1 Documentation / Authentication
10 - http://goo.gl/KGJ7tK - vSphere 5.1 Host Conditions and Trigger States
11 - http://blogs.vmware.com/kb/2012/07/leveraging-multiple-nic-vmotion.html
12 - http://goo.gl/SEQfwH - MDS900 3.0(1) Release notes
13 - http://technet.microsoft.com/en-us/library/cc816945(v=ws.10).aspx - Managing Operations Master Roles
14 - http://support.unitrends.com/ikm/questions.php?questionid=891 - AD Restore
15 - http://www.vmware.com/products/vcenter-server-heartbeat/features.html - vCenter Heartbeat

6.4 VMware Configuration Maximums
vsphere-51-configuration-maximums.pdf
6.5 Original Site Design
jfrappier - challenge1.pdf
6.6 Exchange 2010 Sizing Calculator
vDM-ExchangeWorkBook.xlsm
