HA & DR System Design - Concepts and Solution

Continuity and Resilience (CORE)
ISO 22301 BCM Consulting Firm
Presentations by our partners and
extended team of industry experts
Our Contact Details:
INDIA
Continuity and Resilience
Level 15, Eros Corporate Tower
Nehru Place, New Delhi-110019
Tel: +91 11 41055534 / +91 11 41613033
Fax: +91 11 41055535
Email: neha@continuityandresilience.com
UAE
P. O. Box 127557
Abu Dhabi, United Arab Emirates
Mobile: +971 50 8460530
Tel: +971 2 8152831
Fax: +971 2 8152888
Email: info@continuityandresilience.com

HA & DR Design Concepts
S Seshadri
Head – IT DR & Service Management
10th Feb, 2014
Dubai

Outage Categorization
• Service failures that should/need not be known to end users
need ‘fault protection’ – the operation of such services will be
continuous despite failure scenarios
• Short interruptions (within a few hours) are referred to as
‘minor outages’
• Longer interruptions, when end users’ business services get
delayed for longer durations, are termed as disaster situations
or ‘major outages’
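The three categories above can be expressed as a small decision rule. This is a minimal sketch only: the function name and the 4-hour boundary between minor and major outages are assumptions made for illustration; the slide says only "within a few hours", and the real limit should come from the business.

```python
def classify_outage(user_visible: bool, duration_hours: float,
                    minor_limit_hours: float = 4.0) -> str:
    """Classify a service failure using the three categories on the slide.

    minor_limit_hours is a hypothetical threshold, not a value from the deck.
    """
    if not user_visible:
        # Failures that end users never notice are covered by fault protection.
        return "fault-protected"
    if duration_hours <= minor_limit_hours:
        return "minor outage"
    return "major outage"


print(classify_outage(user_visible=True, duration_hours=1.5))
print(classify_outage(user_visible=True, duration_hours=30))
```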

Key Questions
1. Which systems should ‘never’ fail – we may need Fault Tolerant
systems in their place
2. What failures should be handled transparently, where an outage
must not occur? Against such failures we need fault protection.
3. How long may a short-term interruption be that happens once a day,
once a week, or once a month? Such interruptions are called minor
outages.
4. How long may a long-term interruption be that happens very seldom
and is related to serious damage to the IT system? For instance, when
will this cause a big business impact, also called a major outage or
disaster?
5. How much data may be lost during a major outage? And in which state
– persistent or ephemeral…
6. What failures are deemed so improbable that they will not be
handled, or what failures are beyond the scope of a project?

Business Issues & Cost of IT Outage
• IT Fault Protection has to be driven by business
considerations
• Business Continuity is the overall goal
• Business imperatives manifest through BIA/RA and
MTPoD/RTO/RPO
• IT Outage is not the real issue, but the business
consequences are
• IT Outage affects revenues & costs adversely
• Direct Costs – repairs, penalties, lost revenue
• Indirect Costs – lost & additional work hours

Cost Vs Benefit
• IT Recovery has extensive cost implications – both in terms of
Capex and Opex
• Strategies developed should be cost effective
• ‘Technology for the sake of Technology’ approach should be
completely avoided
• Strategies should, as far as possible, be able to address
disruptions and impacts collectively
• Organizational objectives and risk appetite should direct
recovery strategies
• Legal, contractual and regulatory aspects play a major role
(SOX, SAS 70, BASEL II/III…..)

IT Service Outage
• Importance of IT Services depends on
– Business relevance
– Revenues
– Functionality that they enable
– Amount of damage due to the outage
– Any regulatory aspect that demands the service
• Outage Categorization is dictated by the importance of the
service and hence the significance of its failure

High Availability
• High availability is the characteristic of a system to protect
against or recover from minor outages in a short time frame
with largely automated means.
• HA has 3 essential features
– Outage categorization is ‘minor’- we need to envisage
potential failure scenarios for the service and the minor
outage requirements for them - robustness
– System category should involve Mission Critical & Business
Important and Business Foundation processes which need
to be recovered within a very short time – RTO/RPO
– Component (SPoF) level protection which will facilitate
automatic recovery – redundancy
• HA features are normally built within the primary data center
and data replication is synchronous

Continuous Availability
• Continuous Availability is the highest point of High Availability,
wherein, every component failure is protected against, and no ‘after
failure recovery’ takes place
• These are known as Fault Tolerant systems, that provide automatic,
high-speed ‘failover’ in the case of h/w or s/w failures
• They have ‘internal multi-computer systems architecture’ that have
no shared central components, including memory
• Tandem’s ‘non-stop’ systems and Stratus’s fault tolerant computers
are examples of this
• These are used by the leading stock exchanges globally (NSE in India
uses Stratus and BSE, Tandem), and by banks for their ATM related
transaction processing
• These systems scale extremely well to the largest commercial
workloads
• These systems were introduced originally by Airbus for their A320
planes for on-board flight controls in their long duration flights

HA Components
Essential ingredients of High Availability are:
• Availability
• Reliability
• Serviceability
We will discuss the above three in the following
slides.

Availability & Metrics
• Availability – How long a service or system component is
available for use and the features that help the system to stay
operational despite occurrence of failures, eg. NIC, Mirrored
Disks, Redundant Power Supply
• Availability = uptime/(uptime+downtime)
• Downtime will include scheduled downtime also
• Elapsed time can be measured as wall clock time
• Availability can be expressed in absolute numbers (79 hrs out
of 80 hrs) or as a percentage (98.75%)
• Availability = MTBF/(MTBF+MTTR)
– MTBF: Mean Time Between Failures
– MTTR: Mean Time To Repair
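The two availability formulas above can be checked with a few lines of code. This is a minimal sketch: the function names are illustrative, and the 8-hour MTTR in the second call is an assumed figure, not one taken from the slides.

```python
def availability_from_times(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total elapsed time must be positive")
    return uptime_hours / total


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return availability_from_times(mtbf_hours, mttr_hours)


# 79 hours of uptime in an 80-hour window, as in the example on the slide.
print(f"{availability_from_times(79, 1):.2%}")

# A component with an MTBF of 300,000 h and an assumed MTTR of 8 h.
print(f"{availability_from_mtbf(300_000, 8):.5%}")
```

Downtime here includes scheduled downtime, so a planned maintenance window lowers the result just as an unplanned failure does.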

Reliability & Metrics
• Reliability is a measure of ‘fault avoidance’
• Refers to the ‘probability that a system will be available over a
time interval T’
• MTBF is a measure of Reliability
• Annual Failure Rate (AFR) is the inverse of MTBF expressed in years
• Reliability features help to ‘prevent’ and ‘detect’ failures
• H/w reliability has tremendously improved over the last 30
years and they are highly resilient nowadays
Component      MTBF (Hours)   MTBF (Years)   AFR (per year)
Disk Drive     300,000        34             0.0292
Power Supply   150,000        17             0.0584
Fan            250,000        28             0.0350
NIC            200,000        23             0.0438
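The AFR column follows from the MTBF in hours once a year is taken as 8,760 hours (365 x 24, ignoring leap years). The sketch below recomputes the table; the constant and function names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours, leap years ignored


def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF given in hours to years."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours: float) -> float:
    """AFR is the inverse of the MTBF expressed in years."""
    return HOURS_PER_YEAR / mtbf_hours


# MTBF figures in hours, taken from the table above.
COMPONENT_MTBF_HOURS = {
    "Disk Drive": 300_000,
    "Power Supply": 150_000,
    "Fan": 250_000,
    "NIC": 200_000,
}

for name, hours in COMPONENT_MTBF_HOURS.items():
    print(f"{name:<13} {mtbf_years(hours):5.1f} years  AFR {annual_failure_rate(hours):.4f}")
```

An AFR of 0.0292 means that, out of 1,000 such disk drives, roughly 29 would be expected to fail in a year.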

Serviceability
• Measurement that expresses how easily and quickly
a system is serviced and repaired
• The lower the planned service time, the higher is the
availability
• Planned serviceability goes into the architecture as a
design objective
• Actual service time should not exceed the planned
service time
• These clauses have to be carefully built into the
Service Level Agreements with IT vendors
• Murphy’s Law: Anything that can possibly go wrong,
does

HA/DR Strategy - Aspects
• Data – what is the architecture concerned with
• Function – how is the data worked with
• Location – where is the data worked with
• People – who works with the data and achieve the
functionality
• Time – when is the data processed
Each of the above aspects is run through 3 levels of abstraction
• Objectives – What will this achieve vis a vis org objectives
• Conceptual Model – Realization of the objectives on a
business process level
• System Model – Logical data model and the application
functions that must be implemented to realize the business
concepts

HA/DR Framework (Zachman)
Data (What)
– Objectives: Business Continuity / IT Service Continuity
– Conceptual Model: Availability of mission-critical and important business services
– System Model: ICT categories, dependency diagrams
Function (How)
– Objectives: Map biz processes to IT services, RTO, RPO, SLA
– Conceptual Model: ITIL processes, IT processes, projects
– System Model: Design patterns – RAS, redundancy, backup, replication, virtualization
Location (Where)
– Objectives: Internal (IT), Outsourced
– Conceptual Model: Data Center, Disaster Recovery Center
– System Model: All systems, all categories
People (Who)
– Objectives: Biz process owner
– Conceptual Model: CIO/IT dept
– System Model: IT PM, Architect, System Engineers, System Administrators
Time (When)
– Objectives: Implementation Plan
– Conceptual Model: Outage scenarios, categories
– System Model: Failure/Change/Incident/Problem/Disaster

HA/DR System Design
• System Model discussed earlier is the core of this activity
• ‘What’ and ‘How’ of the System Model will lay the foundation
for HA/DR System Design
• Protection against outages of computers, systems and
databases is in scope for HA
• Protection against infrastructure/building/city outages and
user/administrative errors is in scope for DR
• Sound processes, solid architecture, careful engineering and
an eye for detail are the hallmarks of a good HA/DR system
design

HA/DR Touch Points
• User Environment
• Administration Environment
• Application
• Middleware
• Network Infrastructure
• Operating System
• Hardware (Servers, Storage, Backups etc)
• Physical Environment (Power, Fire, Floods etc)

HA/DR Scoping
• Take into account regulatory aspects (SOX, SAS 70, Basel II)
• Identify the key applications (from business BIAs)
• Check out the various ICT environments required by these
applications (IT BIA)
• Identify the dependencies
• Carefully identify and document the component categories
that are not required – scope exclusions
• Prepare preliminary system scope – list of component
categories required for HA/DR
• Identify failure scenarios for each of these component
categories
• Document the failure scenarios that are outside the scope
• The component categories and the failure scenarios will
constitute the scope of HA/DR

Redundancy & Replication
• Redundancy is the ability to continue operations in the case of
component failures
• Recovery is done through ‘managed component repetition’
• Eliminating ‘single points of failure’ is the goal
• Just adding a second component is not enough
• Replicated component has to be ‘managed’ to take over in
case the original component fails (failover)
• This ‘management’ can be automated or manual
• Replication of the ‘state’ of the component is crucial
• Replication may be a duplicate part, an alternate system (HA)
or an alternate location (DR)
• 100% redundancy through replication is very expensive and
difficult to achieve

Data Replication
• Redundancy for Disk Drives means ‘data replication’ and is hence
crucial
• Redundant disks provide multiple storage of data and/or OS
• Data disks carry one of the highest risks
• OS disks usually house the root file system and swap space
• Data Replication can be ‘synchronous’ or ‘asynchronous’
• RPO considerations should dictate data replication approach
• For very low or nil RPO, the lag of asynchronous replication cannot be
tolerated, so synchronous replication is required
• Bandwidth considerations also impact replication
• Data Deduplication technology in recent times along with data
compression has reduced much of the headaches involved with
data replication
• Two main types of data replication
– Host based/Storage based
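The link between replication mode and RPO can be illustrated with a toy model. This is a sketch under simplifying assumptions: the class and method names are hypothetical, the replica is an in-memory list, and network latency is not modelled. In synchronous mode a write completes only once the replica holds it; in asynchronous mode writes queue up and are shipped later, and whatever is still queued when the primary fails is lost.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary/replica pair used to compare sync and async replication."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self._pending = deque()  # writes not yet shipped (async mode only)

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # The write is acknowledged only after the replica has the record.
            self.replica.append(record)
        else:
            # The write is acknowledged immediately; shipping happens later.
            self._pending.append(record)

    def ship(self, max_records=None):
        """Transfer queued writes to the replica; returns how many were shipped."""
        count = len(self._pending)
        if max_records is not None:
            count = min(count, max_records)
        for _ in range(count):
            self.replica.append(self._pending.popleft())
        return count

    def lost_on_primary_failure(self):
        """Records that would be missing on the replica if the primary failed now."""
        return list(self._pending)


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)
for record in ("tx1", "tx2", "tx3"):
    sync_store.write(record)
    async_store.write(record)
async_store.ship(max_records=1)  # bandwidth allowed only one record through

print("sync loss: ", sync_store.lost_on_primary_failure())
print("async loss:", async_store.lost_on_primary_failure())
```

The size of the pending queue is what the RPO bounds: the more bandwidth is available for shipping, the smaller the potential loss in asynchronous mode.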

Virtualization
• Virtualization, as a concept, was demonstrated in the 1960s,
when IBM’s Thomas J Watson Research Center simulated
‘multiple pseudo machines’ on a single IBM 7044 mainframe (the M44/44X project)
• Virtualization allows multiple operating system (OS) instances
to run concurrently on a single computer.
• It is a means of separating hardware from a single OS, by
“inserting an abstraction layer” into the software stack.
• Each ‘Guest’ OS is managed by a Virtual Machine Monitor.
• Virtualization Software can also collect a number of separate
resources and “pool” them, even if the devices or resources
remain in separate physical locations.
• The end goal is sharing the resources and capabilities flexibly,
under software control.
• The part of the virtualization package that enables interaction
with and control of the VMs is referred to as the Virtual Machine
Monitor (VMM) or Hypervisor software.

Virtualization of Resources
• Virtualized resources are supplied to application programs in logical units,
freeing them from reliance on specific hardware
• Virtualization of Servers allows a business to consolidate the workloads
running on multiple servers onto just a few
• Storage Virtualization hides the physical storage from applications on host
systems and presents a simplified (logical) view that lets them reference
the storage resource by its common name, whereas the actual storage could
be on a complex, multilayered, multipath storage network.
• RAID is an early example of storage virtualization.
• Virtual CPU is one of the oldest concepts, which has enabled
multiprocessing capability, handled by OS
• Virtual Memory is as old as Virtual CPU – again handled by the OS as part
of Virtual Memory Management
• Working within a virtualized environment may add some options and new
flexibility to your HA and DR plans.

Storage Virtualization
• With regard to storage, the objective is to bring together multiple
storage devices under unified command, whether they are from the
same manufacturer or not, and without regard for their physical
locations.
• Once accomplished, the now-unified band of storage systems can
be treated as a single, huge storage capacity that can be
provisioned, managed, backed up to tape, and even replicated to
offsite disaster recovery (DR) or high availability (HA) sites, with
greater visibility, synchronized automation, and reduced
management labour.
• Even archiving, multi-level storage, and information lifecycle
management (ILM) efforts can be made simpler, with older, slower,
or cheaper storage units provisioned to handle the near-line or
archival storage while newer, faster devices handle the current
production processes.

Host Clustering
• Increasing availability through redundancy on the host level
by taking several hosts and using them to supply a bunch of
services, where each service is not strictly associated with a
specific computer
• Host Clustering addresses
– Hardware errors
– OS errors
– Application errors
• Failover clusters allow a service to migrate from one
host to another in the case of an error. They are the most
widely used technology for high availability.
• Load-balancing clusters run a service on multiple hosts
from the start and handle outages of a host; they are more relevant
for performance than for HA.
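The core idea of a failover cluster, a service that is not tied to one host and is moved when its host fails, can be sketched as follows. The class, the host names and the "first healthy host" placement policy are all assumptions for illustration; real cluster software adds heartbeats, fencing and state takeover, none of which is modelled here.

```python
class FailoverCluster:
    """Toy failover cluster: services are placed on hosts and moved on host failure."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("a cluster needs at least one host")
        self.healthy = {host: True for host in hosts}  # insertion order is kept
        self.placement = {}  # service name -> host currently running it

    def _pick_healthy_host(self):
        for host, is_healthy in self.healthy.items():
            if is_healthy:
                return host
        # No host left in this cluster: the situation is beyond HA.
        raise RuntimeError("no healthy host available")

    def start(self, service, preferred_host):
        """Start a service on its preferred host, or on any healthy host."""
        if self.healthy.get(preferred_host):
            self.placement[service] = preferred_host
        else:
            self.placement[service] = self._pick_healthy_host()
        return self.placement[service]

    def mark_failed(self, host):
        """Record a host failure and migrate its services; returns the moved services."""
        self.healthy[host] = False
        moved = []
        for service, current in list(self.placement.items()):
            if current == host:
                self.placement[service] = self._pick_healthy_host()
                moved.append(service)
        return moved


cluster = FailoverCluster(["node-a", "node-b"])
cluster.start("database", "node-a")
cluster.start("app-server", "node-b")
print("moved:", cluster.mark_failed("node-a"))
print("placement:", cluster.placement)
```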

Middleware
• Generally considered to be the layer between the OS and the
applications
• They are independent of applications but carry application-specific
configuration and are used by multiple applications
• Database Servers, Web Servers, Application Servers,
Messaging Servers are some examples
• HA for these will include product specific clustering, data
replication, and even session state replication
• A properly configured failover cluster that is sufficiently integrated
with the DB Server provides HA
• Redo log file shipping (asynchronous) with commits delayed
by the RPO will provide the best DR
• HA for Web Servers and Messaging Servers is achieved
mostly through Load-balancing Clusters (stateless)

HA for Applications
• Application HA is the eventual goal
• Application categories – Off the Shelf, Bought & Customized,
In-house Built
• Failover cluster is an approach most commonly adopted for all
categories of applications
• Applications touch the nerve center of all the following
systems:
– Development
– Acceptance/Integration Test
– Staging & Release
– Production
– Disaster Recovery
• Suitable precautions must be taken during the coding/testing
stages to ensure HA

Networks
• Network is the backbone of ICT as it provides the linkages and
ability to communicate between component categories
• Various types of networks are
– LAN, VLAN, MAN, WAN, VPN, Intranet, Extranet, Internet
• And there are n/w components that help build and run the
networks – NIC, switches, routers, hubs, firewalls etc.
• Connectivity is the most important element of networks
• Data management on the network is done through encoding, data
compression & encryption/decryption
• Power supply, Heating, Ventilating & Air Conditioning (HVAC) are
two other important considerations
• It is absolutely essential to provide redundancy at both the
network and the component level for network HA
• Generally, there is no pay-load based state for any of these – hence
two or more devices would ensure HA

Data Back up and Restoration
• A major requisite for HA & DR
• Management of backed up data is equally important
• Restoration of data must work effectively
• Automated mechanisms exist
• System/file/database backups are the key
• Full or incremental backup
• Consistency of the data state is crucial
• Checkpoint functionality is useful in this context
• Storage and handling of backup media is very significant
• Remote (including at the DR site) storage of backups including
Tape Vaulting should be institutionalized
• Testing/recycling and proper maintenance of backup media
• Backup on failover clusters should distinguish between
physical and logical hosts in the cluster
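The relationship between full and incremental backups determines what a restore has to replay. The sketch below assumes that each incremental backup contains the changes since the previous backup of any kind (not a differential scheme); the function name and the tuple layout are illustrative.

```python
def restore_chain(backups):
    """Return the backups needed to restore the latest state, in apply order.

    `backups` is a chronologically ordered list of (label, kind) tuples,
    where kind is "full" or "incremental".
    """
    last_full_index = None
    for index, (_, kind) in enumerate(backups):
        if kind == "full":
            last_full_index = index
    if last_full_index is None:
        raise ValueError("cannot restore without at least one full backup")
    # The most recent full backup, followed by every incremental taken after it.
    return backups[last_full_index:]


week = [
    ("mon", "full"),
    ("tue", "incremental"),
    ("wed", "full"),
    ("thu", "incremental"),
    ("fri", "incremental"),
]
for label, kind in restore_chain(week):
    print(f"apply {kind} backup from {label}")
```

A single unreadable medium anywhere in the chain breaks the restore, which is why testing and proper maintenance of backup media matter as much as taking the backups.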

HA & DR – Positioning
• HA and DR are two sides of the same coin
• Redundancy, Replication and Robustness are the key
characteristics of both HA & DR
• HA focuses on fault protection and is built on mostly
automated recovery techniques for minor outages
• HA is not built for environmental disasters like floods, fire,
earthquake and manmade incidents like terrorist attacks,
human errors of huge magnitude
• The above additional scenarios and major outages lead to the
need for DR, which focuses only on recovery
• DR is also associated with a large part of manual recovery in
terms of Emergency Management and Damage Assessment &
Recovery apart from IT Recovery
• When the primary data center is unavailable, migration to DR
site will be the only option

Disaster Recovery
• Disaster recovery is the ability to continue with services in the
case of major outages, often with reduced capabilities or
performance.
• Disaster recovery handles the disaster when either a single
point of failure is the defect or when many components are
damaged and the whole system is rendered non-functional.
• Operations cannot be resumed on the same system or at the
same site. Instead, a replacement or backup system, usually
located at another place is activated and operations continue
from there.
• Disaster recovery often restores only restricted resources and
thus restricted service levels.
• Continuation of service also does not happen instantly, but
will happen after some outage time.

DR in Context
• IT DR is activated when the likely recovery time exceeds the
lowest RTO and data loss is expected
• IT recovery will be limited only by the agreed levels of service
by the business owners
• IT DR activities will be carried out from the DR site, which should
be fully equipped to handle IT services up to agreed levels
• Scaling up the IT services in due course of time will generally
be outside the purview of DR Planning
• Agreed levels of IT services are resumed at the DR site using
the infrastructure and backup data/tapes there
• The roles of primary and DR sites are interchangeable but not
in the strict sense of HA
• In the above scenario, both primary and DR sites will be
functional, even though they may cater to different business
activities/IT services
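The activation criterion in the first bullet can be written as a simple check. This is a sketch of a literal reading of that bullet only: the function name, the parameter names and the sample RTO values are assumptions, and a real invocation decision also involves damage assessment and management judgement.

```python
def should_activate_dr(estimated_recovery_hours: float,
                       rto_hours_by_service: dict,
                       data_loss_expected: bool) -> bool:
    """True when the likely recovery time exceeds the lowest RTO and data loss is expected."""
    if not rto_hours_by_service:
        raise ValueError("at least one service RTO is required")
    lowest_rto = min(rto_hours_by_service.values())
    return estimated_recovery_hours > lowest_rto and data_loss_expected


# Hypothetical RTOs in hours, as they might come out of a BIA.
rtos = {"payments": 2, "reporting": 24, "intranet": 48}

print(should_activate_dr(1.0, rtos, data_loss_expected=True))
print(should_activate_dr(6.0, rtos, data_loss_expected=True))
```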

DR and the Cloud
• Cloud is the latest buzzword in outsourced business models
• Leveraging cloud model can optimize DR procedures
• Reduces the high cost of maintaining stand-by sites
• Cloud service providers normally have state of the art systems
and infrastructure, huge bandwidth, exacting security setup,
apart from complying with relevant ISO guidelines and
industry best standards.
• According to a recent Aberdeen study, DR is the leading
‘use case’ for cloud
• The key advantages are recovery times, virtualization and
multi-site availability
• Concerns regarding security, identity and compliance with
various regulations do exist as the cloud model matures
• With data volumes growing at the rate of 10 times every 5
years, cloud computing is likely to see a huge growth

DR in the Supply Chain
• Supply Chain is basically a delineation of dependencies
depicting the various actors in the chain of a product or
service from the vendor until it reaches the consumer
• IT DR dependencies are manifold – internal customers, ICT
equipment, external vendors and service providers, IT staff,
etc.
• DR planning should judiciously take into account the inherent
risks in the supply chain and provision suitable mechanisms to
handle them effectively, so that the DR goal is not derailed
• Typically, if Data Center support is outsourced, there is a huge
dependence on the Service Provider – timely availability of
people, spares, replacements etc.
• Supply chain glitches can emerge from as innocuous a thing as
consumables supplies
