2. Contents
1) Types of Business Impacts (Outages) & Their Costs
2) Three Foundational Pillars of Business Continuity
3) Problem Statement
4) High Availability & Sustained Resiliency
5) Our Methodology
6) Value Proposition
7) Appendix
3. Types of Business Impacts (Outages) & Their Costs
* Sources: www.symantec.com; www.informationweek.com; www.businesscomputingworld.co.uk; www.evolven.com; www.quorum.net
What are these Outages? (Yearly Combined)
[Chart: outage causes, yearly combined; surviving label: 77%]
Important! Most outages are caused by Hardware Failure or Human Error; only a very small percentage of all outages (5%) are Natural Disasters.
Corporations implement a Business Continuity Program (BCP) to address these types of outages because they directly impact the bottom line.
What are the Costs? Here are some Data Points:
* Lost Labor: $46,000,000 (per 10,000-person company @ 1.6 hr/wk)
* Lost Revenue: $26,500,000,000 (survey across 200 companies)
* Brand Failure: $ ?? (RIM BlackBerry)
* Losses are estimated at $1,200,000,000,000 ($1.2 trillion) annually
Revenue lost in a single outage can run into the millions of dollars. Outages may also trigger a brand failure (e.g., the RIM BlackBerry outage, ~$100M).
So, what are the key elements of a BCP?
4. Three Foundational Pillars of Business Continuity
Pillars: Resiliency, Recovery, Contingency
Resiliency is a destination
A state where critical business functions and the supporting infrastructure are unaffected by most outages. Resiliency is the ability of a corporation to move its capability and capacity seamlessly around its environment.
Recovery is a journey
Also known as “Disaster Recovery” or DR, it is complex to maintain and difficult to implement. DR treats data loss to a point in time and production downtime as acceptable outcomes.
Contingency is a last resort
Establish a generalized capability and readiness to cope with major incidents and disasters. Not all are known. Contingencies treat data loss and production downtime as acceptable outcomes.
Increasing Resiliency efforts will naturally reduce efforts in Recovery & Contingency.
5. Problem Statement
[Diagram: Business Continuity, comprising Resiliency, Recovery, and Contingency]
Corporations leave a hole in their overall Business Continuity programs by spending considerable dollars on Recovery & Contingency (covering only ~5% of outages) with diminishing returns. This leaves a BCP with built-in downtime and data loss as acceptable outcomes.
Goal: Reset the BCP Balance
* Resize Recovery & Contingency efforts appropriately
* Improve & increase Resiliency efforts
Resiliency is proactive in dealing with significant business survival events (hardware failure, human error, power outages, pandemic, natural disaster, social unrest, etc.).
Resiliency is a superset of Recovery and Contingency, and it leverages the established processes and procedures used in both.
6. High Availability & Sustained Resiliency
Resilience: Critical business functions and the supporting infrastructure are designed and engineered in such a way that they are materially unaffected by most disruptions, for example through the use of redundancy and spare capacity. There are two (2) methods to do this:
* High Availability (HA): Component availability, which can be inter-site or intra-site.
* Sustained Resiliency (SR): Moving capacity & capability seamlessly around the physical environment.
[Diagram: Tier 1 - Business Application, resting on Tier 0 infrastructure: Load Balancer Infrastructure; Physical Server Infrastructure; DB Servers & DB Infrastructure; Directories (LDAP, AD, ...); Storage Infrastructure; Virtualization Infrastructure. Callout: Identified HA/SR Gaps]
A compounding and/or cascading failure can occur when many HA/SR gaps are concentrated. The value is to find those HA/SR gaps and address them.
Benefits of HA/SR
[Diagram: Identifying & Resolving HA/SR Gaps Reduces: Infrastructure Failures; Unplanned Outages; Tangible Loss; Intangible Loss; Provider Confidence; Regulatory Fines]
7. Our Resiliency Methodology
[Diagram: 11-step Resiliency Methodology cycle in three phases]
Assess
* Applications Submitted for Assessment or Review
* Perform Assessment and Onboarding
* Develop Test Requirements and Objectives
Investigate
* Gap Exposure Risk, Value Assessment
* Application Testing Capability
* Gap Remediation
Validate
* Develop Test
* Schedule Test
* Submit Events
* HA/SR Testing
* Feedback & Improvement
All applications (infrastructure, services, applications or utilities) that execute all 11 steps of the Resiliency Methodology would be considered mature in their Resiliency profile and, by extension, would be able to endure business-impacting (outage) events.
8. The Value – A Proven Resiliency Program
[Diagram: Contingency Planning, Disaster Recovery, Resiliency; outcomes: Lowered IT Effort, Meet SLAs (SR), Lower Outages (HA)]
Most Corporations' Current State
* Meets most audit requirements.
* Has great planning, but limited impact on improving the production environment’s ability to sustain outages.
* Putting more effort into Disaster Recovery & Contingency Planning yields diminishing returns.
Implement Resiliency Methodology
* By focusing more on Resiliency, IT teams can reduce efforts and costs as DR and CP goals are met through Resiliency implementation.
* End result is a resilient application infrastructure.
10. RTO, RPO, RTC, TTTR & BTTR (Visually) – Anatomy of a Recovery Event
[Figure: Timeline of Incident/Outage. Milestones: Incident Start; Take Action; Infra Ready; Data Available; Fixed (if applies); Business Resumption. States of an application's data: Good Data; Area of Potential Data Loss (RPO); No Data; Inconsistent Data; Rebuilding Data; Good Data. Intervals: Business Decision Window; Time To Fix; All Hands Fix; Fix It Outage Time; Recovery Outage Time; Technology Time To Recover (TTTR); Recovery Time Objective (RTO); Return To Capacity (RTC); Business Time To Resume (BTTR). Recovery tracks: Infrastructure Recovery; Application Recovery; Business Recovery; People / Facilities Recovery]
Editor's Notes
According to Dun & Bradstreet, 59% of Fortune 500 companies experience a minimum of 1.6 hours of downtime per week. Take an average Fortune 500 company with at least 10,000 employees, each paid an average of $56 per hour including benefits: the labor component of downtime costs for an organization of this size would be $896,000 weekly, translating into more than $46 million per year. (Assessing The Financial Impact Of Downtime).
For 2,000 corporations, lost labor costs would be $46M x 2,000 = $92,000,000,000 ($92 Billion). There are more than 2,000 companies with more than 10,000 employees.
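The lost-labor arithmetic in the two notes above can be reproduced with a minimal sketch. The inputs are the figures quoted in the notes; the 52-week year and all variable and function names are assumptions made for illustration.

```python
# Figures quoted in the notes above.
EMPLOYEES = 10_000
DOWNTIME_HOURS_PER_WEEK = 1.6
LOADED_HOURLY_RATE_USD = 56
WEEKS_PER_YEAR = 52  # assumption: no adjustment for holidays or leave


def lost_labor_cost(employees, hours_per_week, hourly_rate, weeks=WEEKS_PER_YEAR):
    """Return (weekly, annual) labor cost of downtime in USD."""
    weekly = employees * hours_per_week * hourly_rate
    return weekly, weekly * weeks


weekly, annual = lost_labor_cost(
    EMPLOYEES, DOWNTIME_HOURS_PER_WEEK, LOADED_HOURLY_RATE_USD
)

# The notes round the annual figure down to $46M before scaling to 2,000 companies.
ROUNDED_ANNUAL_USD = 46_000_000
extrapolated = ROUNDED_ANNUAL_USD * 2_000

print(f"Weekly lost labor: ${weekly:,.0f}")
print(f"Annual lost labor: ${annual:,.0f}")
print(f"Across 2,000 companies: ${extrapolated:,.0f}")
```

Under these assumptions the weekly figure matches the $896,000 quoted above, and the unrounded annual figure comes out slightly above $46 million, consistent with "more than $46 million per year".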
CA Technologies is the latest to attempt to calculate IT downtime, with a survey of 200 companies across North America and Europe intended to calculate the losses incurred from an IT outage. It found that more than $26.5 billion in revenue is lost each year from IT downtime, which translates to roughly $150,000 lost annually per business.
In September 2010, Virgin Blue's check-in and online booking systems went down. The airline suffered a hardware failure on September 26 and a subsequent outage of its internet booking, reservations, check-in and boarding systems. The outage severely interrupted the Virgin Blue business for a period of 11 days, affecting around 50,000 passengers and 400 flights, and service was restored to normal on October 6. (Virgin Blue IT outage hit profit by up to $20M) - See more at: http://www.evolven.com/blog/downtime-outages-and-failures-understanding-their-true-costs.html#sthash.efet2jHX.dpuf
On average, the businesses surveyed said they suffered 14 hours of IT downtime per year. Half of those said IT outages damage their reputation and 18% described the impact on their reputation as 'very damaging.' Headlines about IT failures certainly don't help. (IT Downtime Costs $26.5 Billion In Lost Revenue)
A study by the Ponemon Institute computed the minimum, median, mean and maximum cost per minute of unplanned outages based on input from 41 data centers. The most expensive unplanned outage cost over $11,000 per minute. On average, the cost of an unplanned outage is likely to exceed $5,000 per minute. (Understanding the Cost of Data Center Downtime: An Analysis of the Financial Impact of Infrastructure Vulnerability)
When an outage occurs, it's a race against time to handle it before it spirals out of control. According to the IT Process Institute, resolution time per outage is around 200 minutes. That is a striking amount of time to spend resolving outages when you consider what is happening to the customer experience and company reputation in the meantime. The average reported incident length was 90 minutes, resulting in an average cost per incident of approximately $505,500. (Unplanned IT Outages Cost More than $5,000 per Minute)
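As a rough cross-check of the per-minute and per-incident figures quoted above, the sketch below derives the implied cost per minute from the 90-minute average incident and its approximate $505,500 cost. The function name and the linear cost model (cost grows in proportion to duration) are assumptions made for illustration, not something the cited studies state.

```python
# Figures quoted in the note above.
AVG_INCIDENT_MINUTES = 90
AVG_COST_PER_INCIDENT_USD = 505_500

# Implied average cost per minute of an unplanned outage.
implied_cost_per_minute = AVG_COST_PER_INCIDENT_USD / AVG_INCIDENT_MINUTES


def outage_cost(minutes, cost_per_minute=implied_cost_per_minute):
    """Estimate outage cost in USD, assuming cost scales linearly with duration."""
    return minutes * cost_per_minute


print(f"Implied cost per minute: ${implied_cost_per_minute:,.2f}")
print(f"Estimated cost of a 90-minute incident: ${outage_cost(90):,.0f}")
```

The implied rate works out to a little over $5,600 per minute, which is consistent with the "more than $5,000 per minute" headline figure.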
Resiliency is a destination where critical business functions and the supporting infrastructure are designed and engineered in such a way that they are materially unaffected by most disruptions, for example through the use of redundancy and spare capacity. Data loss is minimal or strictly within the RPO. Deploying an effective Resiliency Program reduces a corporation’s reliance on Recovery and Contingency.
Recovery is a journey where arrangements are made to recover or restore critical and less critical business functions that fail for some reason. It is known as “Disaster Recovery”. Recovery may involve data loss to a point in time and production downtime as an acceptable outcome. Organizations understand the value of having a good DR plan, but recognize that it is complex and difficult to validate.
Contingency is a last resort where the organization establishes a generalized capability and readiness to cope effectively with whatever major incidents and disasters occur, including unforeseen ones. Contingency preparations are a last-resort response if resilience and recovery arrangements should prove inadequate in practice. Contingencies involve data loss and production downtime as an acceptable outcome.
Corporations do not address Resiliency effectively, leaving a hole in their overall Business Continuity programs. They continue to spend considerable dollars in areas with little to no return. Those areas are not 100% effective during significant business-disrupting events.
Because Resiliency is not well understood and not effectively implemented, the result is over-compensation on Recovery and Contingency. The over-compensation does not give a great return on investment, and Recovery and Contingency do not always work during survival events. It is reactive at best.
Resiliency is a superset of Recovery and Contingency. A mature corporation that has Resiliency has less reliance on Recovery and Contingency. Resiliency ensures that a corporation can move its capability and capacity seamlessly around the environment, without significant impact to production. Resiliency is proactive in dealing with significant business survival events (power outages, failures, pandemic, natural disaster, social unrest, etc.).
DR and CP are necessary for audit and BCP planning. Their value is not realized during actual disaster or business disrupting events.
The current IT spend and effort on Disaster Recovery and Contingency Planning do not return significant value in improving IT Infrastructure/Application environments.
DR and CP continue to be necessary for audit and BCP planning; however, Resiliency is practiced in production and under load, proving the agility of the Infrastructure/Application environment in adapting to actual disasters or business-disrupting events.
By focusing more on Resiliency (a super-set of DR and CP), IT teams can reduce efforts and costs as DR and CP goals are met through Resiliency implementation.
RTO - The Recovery Time Objective is the maximum time a business can be out of service. A Business Impact Analysis is used by the Business to determine its RTO. By extension, any application or infrastructure required by that business to meet its RTO must be recovered within the same timeframe.
RPO - The Recovery Point Objective is the point in time to which systems and data are expected to be recovered once recovery from an outage has been completed (e.g., end of day, end of previous day, or within minutes of the outage), limiting the loss of data to within tolerable levels as determined by the LOB. Restoring applications to the agreed RPO must form part of the RTO.
RTC - The Return To Capacity is the measurement of the capability to recover people (business) or technology (application and infrastructure). The RTC is measured from the point of invocation/declaration. This does not include the assessment before the declaration. It includes the combination of infrastructure recovery time, application recovery time and any activities that need to be performed prior to active usage of the application such as business verification and reconciliation time. This includes time required in order to restore to the defined Recovery Point Objective.
Key components that occur in an actual event are measured to calculate the RTC as follows:
People/Business Recovery - any or all of the following may be components of the recovery and should be assessed as part of the RTC:
* Travel time (to recovery site, to home, etc.)
* Site set-up time, including clearing of existing users, removal/installation of equipment, etc.
* Closedown of non-critical activities for work transfer solutions
* Activation as required of voice diverts, voice recording, printer diverts, etc.
* Where needed, failover and/or of infrastructure such as shared data
For technology recovery the following components should be measured as part of the RTC:
* If applicable, measure time for support staff to commute
* Tape shipment (Iron Mountain, etc.) and/or transfer of any vital records necessary for recovery, if applicable
* Infrastructure recovery time (usually performed by GTI, e.g., Network failover, storage failover, DNS pushes)
* Server recovery time (may be performed by GTI or more often by Business Aligned Infrastructure, e.g., SA recovery, DB recovery)
* Application recovery time (usually performed by AD teams to prepare, finalize and validate the application)
* Business verification/reconciliation time (any activities that need to be performed by the business prior to active usage of the application)
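The RTO, RPO and RTC definitions above can be tied together in a small sketch. It sums measured recovery components into an RTC, then checks a single recovery event against an RTO and an RPO. The class, field names and sample timings are hypothetical. The sketch also assumes that components run one after another, that the outage clock for the RTO starts at the incident, and that the RTC clock starts at declaration, as described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class RecoveryEvent:
    last_good_data: datetime          # latest point to which data can be restored
    incident_start: datetime          # start of the outage
    declared: datetime                # invocation/declaration; RTC clock starts here
    components: Dict[str, timedelta]  # measured duration of each recovery component

    def data_loss_window(self) -> timedelta:
        """Span of data at risk: from the last good data to the incident start."""
        return self.incident_start - self.last_good_data

    def return_to_capacity(self) -> timedelta:
        """RTC as the sum of components (assumes they run sequentially)."""
        return sum(self.components.values(), timedelta())

    def total_outage(self) -> timedelta:
        """Assessment time before declaration plus the RTC."""
        return (self.declared - self.incident_start) + self.return_to_capacity()


def check_objectives(event: RecoveryEvent, rto: timedelta, rpo: timedelta) -> dict:
    """Compare one measured event against its RTO and RPO."""
    return {
        "rtc": event.return_to_capacity(),
        "rto_met": event.total_outage() <= rto,
        "rpo_met": event.data_loss_window() <= rpo,
    }


# Hypothetical event: end-of-previous-day data, incident at 09:00, declared at 09:30.
event = RecoveryEvent(
    last_good_data=datetime(2024, 1, 1, 0, 0),
    incident_start=datetime(2024, 1, 1, 9, 0),
    declared=datetime(2024, 1, 1, 9, 30),
    components={
        "infrastructure": timedelta(hours=1),
        "server": timedelta(minutes=45),
        "application": timedelta(minutes=30),
        "business_verification": timedelta(minutes=15),
    },
)
result = check_objectives(event, rto=timedelta(hours=4), rpo=timedelta(hours=24))
print(result)
```

If recovery components overlap in practice, the simple sum would overstate the RTC, and a critical-path calculation would be a better fit.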