This is the slide deck of the first of a three-part series, "DCIM for High Availability", presented by GreenField Software. It proposes that while Data Centers have been around for more than four decades, DCIM Software is now becoming an important tool for DC Managers because of the need to maintain near-100% Uptime. The Data Center topology has changed as a result of this High Availability Mantra, and new tools are required to manage the Modern Data Center effectively.
DCIM Software charts out relationship maps for assets by identifying the dependencies among them. Threshold-based alerts on critical parameters, combined with impact analysis of Move-Add-Change, mitigate the risks of DC failures.
GreenField Software’s Mission is to help Data Centers control capital expenditures, reduce operating expenses and mitigate the risks of Data Center failures. Besides DCIM Software, GFS offers Data Center Advisory Services in the areas of best practices, capacity planning, energy efficiency and business continuity of data centers.
4. Multiple Classes of Data Centers
• Internet Data Center
used by external clients connecting from the Internet
supports servers and devices required for B2C transaction-based applications (e-commerce).
• Extranet Data Center
provides support and services for external B2B partner transactions.
accessed over secure VPN connections or private WAN links between the partner
network and the enterprise extranet.
• Intranet Data Center
hosts applications and services mostly accessed by internal employees with
connectivity to the internal enterprise network.
• Special Purpose Data Center
For specialized application areas such as Geological & Geophysical for the Oil & Gas Industry
May or may not be inter-connected
5. Common Objective: Business Continuity
• Disaster Recovery Data Center
Each Class may have dedicated or Shared DR Center
Usually located separately from Primary Data Center
• High Availability (HA) Data Center
Each Data Center is provisioned with significant redundancies
DR Center comes into play only when a Disaster strikes.
Component or system failures within any DC should either be self-healing or be covered by redundancies within the DC
• Insurance Against Power & Network Outages
Reliability through multiple service providers
Internal Back-ups
• Securing the Data Center
Against malicious hacking that can bring down the Data Center and impact business continuity
Implementing Firewalls/ Virtual Firewalls
6. Common Complexity: Multitude of Assets
Multitude of Assets
Divided between two worlds: IT & Facilities
Includes Mission Critical Applications
Like a manufacturing operation
Raw Material: Power & Networks
Processing: Data
Output: Information Service
Needs: Asset Management, Resource Optimization, a la Manufacturing
8. Today’s High Availability Data Center
Extreme Redundancies for 99.99% Uptime -> Higher Power Consumption
Huge Population of N+1/N+2 Equipment -> Asset Underutilization & Too complex to manage with spreadsheets & Visio tools
Chain of inter-dependent equipment -> Multiple points of failure
Growing Heat Loads, Carbon Emissions & e-waste -> Sustainability Issues
kW per Rack increases as more processing capacity is added -> Trade-offs: need to support more per rack versus extra space & heat loads.
High Availability is Inversely Proportional to Asset Utilization & Energy Efficiency
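The inverse relationship between High Availability and Asset Utilization can be made concrete with simple arithmetic. The following is a minimal sketch in Python, assuming identical units that share the load evenly and a design that must still carry the full load once the redundant units have failed. The function name is illustrative and does not come from the deck.

```python
def max_utilization(required_units: int, redundant_units: int) -> float:
    """Highest average load per unit, as a fraction of its rating.

    Assumes `required_units` identical units are enough to carry the full
    load and `redundant_units` extra units share that load evenly.
    """
    if required_units <= 0 or redundant_units < 0:
        raise ValueError("required_units must be > 0 and redundant_units >= 0")
    return required_units / (required_units + redundant_units)


if __name__ == "__main__":
    # N = 4 units required: compare N+1, N+2 and fully duplicated (2N) designs.
    for label, spare in (("N+1", 1), ("N+2", 2), ("2N", 4)):
        print(f"{label}: at most {max_utilization(4, spare):.0%} average utilization")
```

Under these assumptions, every added spare lowers the ceiling on how heavily each unit can be loaded, which is the underutilization the slide describes.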
9. When HA Fails - Tale of Two Disasters
RBS
Tech fault at RBS and Natwest freezes millions of UK bank balances
RBS and Natwest have failed to register inbound payments for up to three days, customers have reported, leaving people unable to pay for bills, travel and even food. The banks - both owned by RBS Group - have confirmed that technical glitches have left bank accounts displaying the wrong balances and certain services unavailable. There is no fix date available.
Amazon
Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more
With the critical Amazon outage, which is the second this month, we wouldn’t be surprised if these popular services started looking at other options, including Rackspace, SoftLayer, Microsoft’s Azure, and Google’s just-introduced Compute Engine. Some of Amazon’s biggest EC2 outages occurred in April and August of last year.
Which Will Be The Next One?
10. What’s the High Availability Mantra?
Amazon Data Centers (built to Tier 4 standards and with an expected availability of 99.995%) have had two outages already in 2012, each over 3 hours!
• Tier 3/Tier 4 are defined only by hardware redundancies
• Glaring gaps in operating procedures to prevent fatal human errors
• Lack of purpose-built BCP software to predict failures
• Lack of chain of custody to detect root cause
Availability %               Downtime per year   Downtime per month*   Downtime per week
99% ("two nines")            3.65 days           7.20 hours            1.68 hours
99.5%                        1.83 days           3.60 hours            50.4 minutes
99.8%                        17.52 hours         86.23 minutes         20.16 minutes
99.9% ("three nines")        8.76 hours          43.8 minutes          10.1 minutes
99.95%                       4.38 hours          21.56 minutes         5.04 minutes
99.99% ("four nines")        52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")       5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")       31.5 seconds        2.59 seconds          0.605 seconds
99.99999% ("seven nines")    3.15 seconds        0.259 seconds         0.0605 seconds
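The downtime figures in this table follow from one piece of arithmetic: the unavailable fraction multiplied by the length of the period. The following is a small Python sketch of that calculation. The function and constant names are illustrative, and the 30-day month is an assumption, since the footnote behind "per month*" is not shown on the slide.

```python
# Length of each period in seconds. The 30-day month is an assumption.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the downtime budget in seconds for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * PERIOD_SECONDS[period]


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.995, 99.999):
        minutes = allowed_downtime_seconds(target, "year") / 60
        print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

By the same arithmetic, the 99.995% availability quoted on this slide allows roughly 26 minutes of downtime per year, so two outages of more than 3 hours each exceed that budget many times over.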
11. Delivering the High Availability Promise
Adequate Redundancies
• Are there any points of failure, besides power and external networks, that can impact uptime? (Not everything is N+1)
• What are my redundancy paths?
• Are the relationships & dependencies among critical assets clearly defined?
• Can I do an impact analysis on the outage/downtime of any equipment? Can I predict
the cascading effect of such an outage on other assets/applications in the data center?
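The impact-analysis question above can be pictured as a walk over a dependency map. The following is a minimal Python sketch, assuming the relationships are held as a simple "feeds" dictionary. The asset names and the function are hypothetical, and the walk deliberately ignores redundancy, so it reports everything downstream of an asset even where a second feed would keep it running.

```python
from collections import deque

# Hypothetical dependency map: each key feeds or hosts the items in its list.
FEEDS = {
    "UPS-A": ["PDU-1", "PDU-2"],
    "PDU-1": ["Rack-1"],
    "PDU-2": ["Rack-2"],
    "Rack-1": ["Server-1"],
    "Rack-2": ["Server-2"],
    "Server-1": ["Billing-App"],
    "Server-2": ["Web-App"],
}


def impacted_by(failed_asset, feeds):
    """Breadth-first walk returning everything downstream of failed_asset."""
    seen = set()
    queue = deque([failed_asset])
    while queue:
        current = queue.popleft()
        for dependent in feeds.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("PDU-1", FEEDS)))
```

In this toy map, taking PDU-1 out of service reaches Rack-1, Server-1 and the Billing-App, which is the kind of cascading effect the question asks about.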
Preventing Failures
• Can any failure be predicted so that proactive measures can be taken? Do I get alerts on threshold breaches so that I can take preventive action before a failure happens?
• Is there a history of a Move-Add-Change (MAC) that I should be aware of?
• What is the impact of a MAC on space, power, cooling?
• Where can new devices/servers be best placed? Floor -> Rack -> Cage. How can this be determined from the current infrastructure and other dependencies to avoid a failure?
• How do I prevent a fatal human error?
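Alerts on threshold breaches, as asked about above, can be sketched as a comparison of each reading against a warning level and a critical level. The following Python example is only illustrative. The metric names and limits are hypothetical and would in practice depend on the equipment and on site policy.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warning: float
    critical: float


# Hypothetical limits chosen for illustration only.
THRESHOLDS = {
    "inlet_temp_c": Threshold(warning=27.0, critical=32.0),
    "rack_power_kw": Threshold(warning=4.0, critical=5.0),
}


def check_reading(metric, value, thresholds=THRESHOLDS):
    """Return 'critical', 'warning', or None for a single sensor reading."""
    limit = thresholds[metric]
    if value >= limit.critical:
        return "critical"
    if value >= limit.warning:
        return "warning"
    return None


if __name__ == "__main__":
    for metric, value in (("inlet_temp_c", 28.5), ("rack_power_kw", 5.2)):
        print(metric, value, check_reading(metric, value))
```

The warning level is what makes preventive action possible: it fires while there is still headroom before the critical limit is reached.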
13. The High Availability Challenge
Asset Over Provisioning
• Too many assets; two classes of assets
• Absence of Software Portfolio (even if hardware assets are tracked)
• Move-Add-Change: Decisions not based on simulations, analysis
• Absence of change management
• Absence of workflow approvals
• Unable to predict failures
• No chain of custody
Lack of HA Management Tool
• IT assets tracked by Systems Management Tool
• Facilities assets tracked by BMS
• Two not inter-operable: Unable to determine missing link for HA
• Unable to track redundancy paths
• HA fails if any equipment or software in critical path fails
• HA fails if there’s fatal human error
• Health and history of equipment, or previous MAC impact, not tracked
Need to Predict Failures
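Tracking redundancy paths, one of the gaps listed on this slide, amounts to asking whether any single asset sits on every path between a critical load and its power source. The following is a minimal Python sketch under stated assumptions: the topology is hypothetical, it contains no cycles, and only power feeds are modeled.

```python
# Hypothetical power topology: each asset maps to the assets it draws from.
UPSTREAM = {
    "Server-1": ["PDU-A", "PDU-B"],  # dual-corded server
    "PDU-A": ["UPS-A"],
    "PDU-B": ["UPS-B"],
    "UPS-A": ["Utility"],
    "UPS-B": ["Utility"],
    "Utility": [],
}
SOURCES = {"Utility"}


def is_powered(asset, upstream, sources, failed):
    """True if asset can reach a source without passing through `failed`."""
    if asset == failed:
        return False
    if asset in sources:
        return True
    return any(
        is_powered(parent, upstream, sources, failed)
        for parent in upstream.get(asset, [])
    )


def single_points_of_failure(load, upstream, sources):
    """Assets (other than the load) whose loss alone cuts power to the load."""
    candidates = set(upstream) - {load}
    return {
        asset
        for asset in candidates
        if not is_powered(load, upstream, sources, failed=asset)
    }


if __name__ == "__main__":
    print(single_points_of_failure("Server-1", UPSTREAM, SOURCES))
```

In this toy topology the two PDU and UPS chains back each other up, so the only single point of failure for Server-1 is the shared utility feed.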
14. Beyond HA: Infrastructure & Operational Challenges
Energy Problems
• Higher power consumption & growing power bills
• Not monitoring power use at device levels
• Dissipation of enormous heat
• Creation of hot spots
• Drastic reduction in expected life of computing equipment
• Failing of a data center
• Increase in CO2 emission
Operational Problems
• Low level asset tracking
• Underutilization of many computing resources
• Running of old inefficient equipment
• Decisions not based on analysis
• Cooling not optimized
• Floor & Rack Space: Non-optimal placements of equipment
• Increasing demand for rack space
• Absence of capacity planning
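The placement and capacity-planning problems listed on this slide can be sketched as a filter over racks with free space, power and cooling. The following Python example is a rough illustration. The `Rack` fields, the tightest-power-fit ordering, and the assumption that a device rejects about as much heat as the power it draws are all simplifications introduced here, not the deck's method.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    free_units: int         # free rack units (U)
    free_power_kw: float    # unused power budget
    free_cooling_kw: float  # unused cooling capacity


def candidate_racks(racks, need_units, need_power_kw):
    """Racks that can host the device, tightest power fit first.

    Assumes the heat a device rejects roughly equals the power it draws.
    """
    fits = [
        rack
        for rack in racks
        if rack.free_units >= need_units
        and rack.free_power_kw >= need_power_kw
        and rack.free_cooling_kw >= need_power_kw
    ]
    return sorted(fits, key=lambda rack: rack.free_power_kw - need_power_kw)


if __name__ == "__main__":
    racks = [
        Rack("R1", free_units=10, free_power_kw=3.0, free_cooling_kw=3.0),
        Rack("R2", free_units=4, free_power_kw=1.0, free_cooling_kw=2.0),
        Rack("R3", free_units=1, free_power_kw=5.0, free_cooling_kw=5.0),
    ]
    print([rack.name for rack in candidate_racks(racks, 2, 0.8)])
```

Ordering by tightest fit keeps larger blocks of power free for later devices. A different policy, such as spreading heat load evenly, would only change the sort key.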
16. Solution That Bridges the Gap Between IT & Facilities
Data Center Infrastructure Management (DCIM) Software
IT System Performance Management
Building Management System
Data Center Infrastructure Management
17. Solution That Addresses The High Availability Challenge
DCIM Helps to Predict Failures
Asset Over Provisioning
• Too many assets; two classes of assets
• Absence of Software Portfolio (even if hardware assets are tracked)
• Move-Add-Change: Decisions not based on simulations, analysis
• Absence of change management
• Absence of workflow approvals
• Unable to predict failures
• No chain of custody
Lack of HA Management Tool
• IT assets tracked by Systems Management Tool
• Facilities assets tracked by BMS
• Two not inter-operable: Unable to determine missing link for HA
• Unable to track redundancy paths
• HA fails if any equipment or software in critical path fails
• HA fails if there’s fatal human error
• Health and history of equipment, or previous MAC impact, not tracked
18. Solution That Addresses Infra & Operational Challenges
DCIM Improves Energy & Operational Efficiencies
Energy Problems
• Higher power consumption & growing power bills
• Not monitoring power use at device levels
• Dissipation of enormous heat
• Creation of hot spots
• Drastic reduction in expected life of computing equipment
• Failing of a data center
• Increase in CO2 emission
Operational Problems
• Low level asset tracking
• Underutilization of many computing resources
• Running of old inefficient equipment
• Decisions not based on analysis
• Cooling not optimized
• Floor & Rack Space: Non-optimal placements of equipment
• Increasing demand for rack space
• Absence of capacity planning