The Modern Data Center Topology - The High Availability Mantra

This is the slide deck for the first in a three-part series, "DCIM for High Availability," presented by GreenField Software. It argues that while Data Centers have been around for more than four decades, DCIM Software is now becoming an important tool for DC Managers because of the need to maintain near-100% uptime. The Data Center topology has changed as a result of this High Availability mantra, and new tools are required to manage the Modern Data Center effectively.

DCIM Software charts the relationship maps for assets by identifying the dependencies among them. Threshold-based alerts on critical parameters, combined with impact analysis of every Move-Add-Change, mitigate the risk of DC failures.
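The mechanics behind that claim can be sketched in a few lines: a relationship map is a directed graph of assets, outage impact analysis is a traversal of that graph, and an alert fires when a monitored parameter crosses a threshold. The following is a minimal illustration only; the asset names, the threshold value, and the function names are assumptions made for the example, not GFS Crane's actual data model or API.

```python
from collections import defaultdict

# Relationship map: for each asset, the assets that depend on it.
dependents = defaultdict(list)
for upstream, downstream in [
    ("UPS-1", "PDU-1"), ("PDU-1", "Rack-12"),
    ("Rack-12", "Server-A"), ("Server-A", "Billing-App"),
]:
    dependents[upstream].append(downstream)

def impact_of_outage(asset):
    """Walk the relationship map to find everything an outage cascades to."""
    impacted, stack = set(), [asset]
    while stack:
        for child in dependents[stack.pop()]:
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

# Threshold-based alert on a critical parameter (illustrative limit).
TEMP_LIMIT_C = 27.0

def check_temperature(asset, reading_c):
    if reading_c > TEMP_LIMIT_C:
        print(f"ALERT: {asset} at {reading_c} C;"
              f" potential impact: {sorted(impact_of_outage(asset))}")

check_temperature("PDU-1", 31.5)
# ALERT: PDU-1 at 31.5 C; potential impact: ['Billing-App', 'Rack-12', 'Server-A']
```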

GreenField Software's mission is to help Data Centers control capital expenditures, reduce operating expenses, and mitigate the risks of Data Center failures. Besides DCIM Software, GFS offers Data Center Advisory Services in the areas of best practices, capacity planning, energy efficiency, and business continuity of data centers.



  1. The Modern Data Center Topology: The High Availability Mantra
  2. Topics
     • The Modern Data Center Overview
     • The High Availability (HA) Mantra
     • Operating Challenges
     • A Solution
  3. Modern Data Center Overview
  4. Multiple Classes of Data Centers
     • Internet Data Center
        used by external clients connecting from the Internet
        supports the servers and devices required for B2C transaction-based applications (e-commerce)
     • Extranet Data Center
        provides support and services for external B2B partner transactions
        accessed over secure VPN connections or private WAN links between the partner network and the enterprise extranet
     • Intranet Data Center
        hosts applications and services mostly accessed by internal employees with connectivity to the internal enterprise network
     • Special Purpose Data Center
        for specialized application areas, such as Geological & Geophysical for the Oil & Gas industry
     • These classes may or may not be inter-connected
  5. Common Objective: Business Continuity
     • Disaster Recovery Data Center
        each class may have a dedicated or shared DR Center
        usually located separately from the primary Data Center
     • High Availability (HA) Data Center
        each Data Center is provided with significant redundancies
        the DR Center comes into play only when a disaster strikes
        component or system failures within any DC should either be self-healing, or redundancies within the DC should take over
     • Insurance Against Power & Network Outages
        reliability through multiple service providers
        internal back-ups
     • Securing the Data Center
        against malicious hacking that can bring down the Data Center, impacting business continuity
        implementing firewalls / virtual firewalls
  6. Common Complexity: Multitude of Assets
     • Divided between two worlds: IT & Facilities
     • Includes Mission Critical Applications
     • Like a manufacturing operation
        Raw material: Power & Networks
        Processing: Data
        Output: Information Service
     • Needs: Asset Management and Resource Optimization, a la Manufacturing
  7. The High Availability Mantra
  8. Today's High Availability Data Center
     • Extreme redundancies for 99.99% uptime -> higher power consumption
     • Huge population of N+1/N+2 equipment -> asset under-utilization, and too complex to manage with spreadsheets & Visio tools
     • Chain of inter-dependent equipment -> multiple points of failure
     • Growing heat loads, carbon emissions & e-waste -> sustainability issues
     • kW per rack increases as more processing capacity is added -> trade-off: support more per rack versus extra space & heat loads
     • High availability is inversely proportional to asset utilization & energy efficiency
  9. When HA Fails - A Tale of Two Disasters
     • RBS: "Tech fault at RBS and NatWest freezes millions of UK bank balances." RBS and NatWest failed to register inbound payments for up to three days, customers reported, leaving people unable to pay for bills, travel and even food. The banks, both owned by RBS Group, confirmed that technical glitches left bank accounts displaying the wrong balances and certain services unavailable, with no fix date available.
     • Amazon: "Amazon cloud outage takes down Netflix, Instagram, Pinterest & more." With this critical outage, the second that month, it would be no surprise if these popular services started looking at other options, including Rackspace, SoftLayer, Microsoft's Azure, and Google's just-introduced Compute Engine. Some of Amazon's biggest EC2 outages occurred in April and August of the previous year.
     • Which will be the next one?
  10. What's the High Availability Mantra?
     • Amazon's Data Centers, built to Tier 4 standards with an expected availability of 99.995%, have already had two outages in 2012, each over 3 hours. Why?
        Tier 3/Tier 4 is defined only by hardware redundancies
        Glaring gaps in operating procedures to prevent fatal human errors
        Lack of purpose-built BCP software to predict failures
        Lack of a chain of custody to detect root cause
     • What availability targets mean in downtime:

        Availability %             Downtime/year    Downtime/month*  Downtime/week
        99% ("two nines")          3.65 days        7.20 hours       1.68 hours
        99.5%                      1.83 days        3.60 hours       50.4 minutes
        99.8%                      17.52 hours      86.23 minutes    20.16 minutes
        99.9% ("three nines")      8.76 hours       43.8 minutes     10.1 minutes
        99.95%                     4.38 hours       21.56 minutes    5.04 minutes
        99.99% ("four nines")      52.56 minutes    4.32 minutes     1.01 minutes
        99.999% ("five nines")     5.26 minutes     25.9 seconds     6.05 seconds
        99.9999% ("six nines")     31.5 seconds     2.59 seconds     0.605 seconds
        99.99999% ("seven nines")  3.15 seconds     0.259 seconds    0.0605 seconds
        (* assuming a 30-day month)
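The table above is pure arithmetic: allowed downtime = (1 − availability) × period. A quick sketch reproducing the yearly column for a few rows:

```python
HOURS_PER_YEAR = 365 * 24  # the yearly column assumes a 365-day year

def downtime_per_year_hours(availability_pct):
    """Hours of allowed downtime per year for a given availability %."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_per_year_hours(pct):.4f} hours/year")
# 99.0%   -> 87.6000 hours/year  (3.65 days)
# 99.9%   -> 8.7600 hours/year
# 99.99%  -> 0.8760 hours/year   (52.56 minutes)
# 99.999% -> 0.0876 hours/year   (5.26 minutes)
```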
  11. Delivering the High Availability Promise
     • Adequate Redundancies
        Are there any points of failure, besides power and external networks, that can impact uptime? (Not everything is N+1.)
        What are my redundancy paths? (See the sketch after this slide.)
        Are the relationships & dependencies among critical assets clearly defined?
        Can I do an impact analysis on the outage/downtime of any equipment? Can I predict the cascading effect of such an outage on other assets/applications in the data center?
     • Preventing Failures
        Can any failure be predicted so that proactive measures can be taken? Do I get alerts on threshold breaches so that I can act before a failure happens?
        Is there a history of a Move-Add-Change (MAC) that I should be aware of?
        What is the impact of a MAC on space, power, and cooling?
        Where can new devices/servers best be placed (Floor -> Rack -> Cage), and how can this be determined from the current infrastructure and other dependencies to avoid a failure?
        How do I prevent a fatal human error?
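As a small illustration of the redundancy-path question above: checking for single points of failure reduces to scanning each asset's upstream feeds, where anything with fewer than two independent feeds is not N+1. The topology and names below are assumptions for the example:

```python
# Upstream power feeds per device (illustrative topology).
feeds = {
    "Rack-12": ["PDU-1", "PDU-2"],  # dual-fed: survives one PDU failure
    "Rack-13": ["PDU-1"],           # single-fed: a point of failure
    "PDU-1":   ["UPS-1"],
    "PDU-2":   ["UPS-2"],
}

def single_points_of_failure(feeds):
    """Devices whose loss of a single upstream feed takes them down (not N+1)."""
    return [device for device, sources in feeds.items() if len(sources) < 2]

print(single_points_of_failure(feeds))
# ['Rack-13', 'PDU-1', 'PDU-2']
```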
  12. Operating Challenges
  13. The High Availability Challenge
     • Lack of an HA Management Tool
        IT assets are tracked by a Systems Management Tool; Facilities assets are tracked by a BMS
        The two are not inter-operable: unable to determine the missing link for HA
        Unable to track redundancy paths
        HA fails if any equipment or software in the critical path fails
        HA fails if there's a fatal human error
        Health and history of equipment, and previous MAC impact, not tracked
     • Asset Over-Provisioning
        Too many assets; two classes of assets
        Absence of a Software Portfolio (even if hardware assets are tracked)
        Move-Add-Change decisions not based on simulations or analysis
        Absence of change management
        Absence of workflow approvals
        Unable to predict failures
        No chain of custody
     • Need to Predict Failures
  14. Beyond HA: Infrastructure & Operational Challenges
     • Operational Problems
        Low-level asset tracking
        Under-utilization of many computing resources
        Running of old, inefficient equipment
        Decisions not based on analysis
        Cooling not optimized
        Floor & rack space: non-optimal placement of equipment
        Increasing demand for rack space
        Absence of capacity planning
     • Energy Problems
        Higher power consumption & growing power bills
        No monitoring of power use at the device level
        Dissipation of enormous heat
        Creation of hot spots
        Drastic reduction in the expected life of computing equipment
        Failure of a data center
        Increase in CO2 emissions
  15. A Solution
  16. Solution That Bridges the Gap Between IT & Facilities: Data Center Infrastructure Management (DCIM) Software
     • DCIM sits between IT System Performance Management and the Building Management System, covering both worlds
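As a toy illustration of that bridge, a DCIM record can be thought of as the join of the IT-side view (from the systems management tool) and the facilities-side view (from the BMS) of the same asset. All field names below are invented for the example:

```python
# IT-side view (systems management) and facilities-side view (BMS) of racks.
it_view = {"Rack-12": {"cpu_util_pct": 22, "apps": ["Billing-App"]}}
facilities_view = {"Rack-12": {"power_kw": 6.4, "inlet_temp_c": 24.0}}

# DCIM joins both worlds into one record per asset.
dcim_view = {
    rack: {**it_view.get(rack, {}), **facilities_view.get(rack, {})}
    for rack in set(it_view) | set(facilities_view)
}

print(dcim_view["Rack-12"])
# {'cpu_util_pct': 22, 'apps': ['Billing-App'], 'power_kw': 6.4, 'inlet_temp_c': 24.0}
```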
  17. Solution That Addresses the High Availability Challenge
     • DCIM helps to predict failures, closing the gaps listed on slide 13: the lack of an HA management tool and asset over-provisioning
  18. Solution That Addresses Infra & Operational Challenges
     • DCIM improves energy & operational efficiencies, addressing the energy and operational problems listed on slide 14
  19. Anatomy of a DCIM Software: GFS Crane
  20. Thank You
     http://www.greenfieldsoft.com
     Email: sales@greenfieldsoft.com
     See also on SlideShare: Data Center Infrastructure Management: ERP for the Data Center Manager
