UPS failure, accident or human error, and cooling system failure are the leading causes of data center outages. DCIM tools can help detect problems that lead to downtime through constant monitoring.
Top Three Root Causes of Data Center Outages
The thought of unplanned downtime strikes fear into the heart of every data center operator. The
most recent Ponemon Institute “Cost of Data Center Outages” report, from January 2016, pegged the
average cost of downtime at nearly $8,000 per minute. The costliest outage in the report came to nearly
$2.5 million.
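To put the per-minute figure in perspective, here is a rough sketch of the arithmetic. It assumes that cost scales linearly with outage duration, which is a simplification rather than something the report states, and the function name is purely illustrative.

```python
# Rough estimate only: assumes cost grows linearly with outage duration,
# which is a simplification and not a claim made by the Ponemon report.
COST_PER_MINUTE_USD = 8_000  # approximate per-minute figure cited above


def estimated_outage_cost(minutes: float,
                          cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return a back-of-the-envelope outage cost in US dollars."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Print estimates for a few illustrative outage durations.
    for minutes in (15, 60, 240):
        print(f"{minutes:>4} min -> ${estimated_outage_cost(minutes):,.0f}")
```

Under this assumption, even a one-hour outage lands in the hundreds of thousands of dollars.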
Clearly, minimizing the risk of downtime is a high priority for data center operators. The good news is
that the root cause of most outages can be traced to just a handful of problems. What’s more, human
error and data center infrastructure equipment failure are far more likely to cause outages than IT
equipment downtime.
- UPS failure. One-fourth of all outages are caused by UPS system failure, according to the Ponemon
report. UPS devices are indispensable to data center operations, but they’re often forgotten once
they’ve been installed. Battery failure is the chief cause of UPS problems, and rising data center heat
loads can reduce battery life substantially.
- Accident / human error. Human beings are the root cause of 22 percent of outages, either through
accident or negligence. Automation and artificial intelligence can help by eliminating many repetitive
tasks, but there’s no substitute for training and accountability.
- Cooling system failure. Although the share of outages attributable to cooling system failure has
decreased, from 15 percent in 2010 to 11 percent in 2016, the cost of such outages has increased
more than 20 percent over the same time period. Increasing data center heat loads have made
cooling system failure a more significant threat.
Armed with this knowledge, data center operators are in a better position to develop policies and
procedures that reduce risk. Standard operating procedures (SOPs), methods of procedure (MOPs)
and site configuration policies (SCPs) should focus on the most critical workloads and the most likely
causes of an outage. They should be reviewed and updated regularly, ideally by incorporating them
into day-to-day operations.
Emergency operating procedures (EOPs) should also be developed, tested and practiced. If staff can
respond quickly and appropriately to an incident, they often can prevent it from becoming a full-scale
outage.
Data center infrastructure management (DCIM) tools can help IT teams detect issues that could lead
to downtime by monitoring the health of various systems and presenting the data in easy-to-read
dashboards. Best-in-class DCIM tools also provide asset management, capacity management and
energy management capabilities, and can present a virtual 3-D view of the data center including room
layouts, rack diagrams and cabling. This information can help IT teams assess the impact of changes
and fine-tune policies and procedures accordingly.
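The kind of health monitoring described above can be illustrated with a minimal sketch. This is not how any particular DCIM product works. The sensor names and threshold values below are hypothetical, chosen only to show the idea of comparing readings against limits and flagging the ones that need attention.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    sensor: str   # hypothetical sensor name
    value: float  # latest measured value


# Hypothetical limits for illustration; real limits depend on the site
# and on the equipment vendors' specifications.
THRESHOLDS = {
    "ups_battery_temp_c": 30.0,
    "rack_inlet_temp_c": 27.0,
    "ups_load_pct": 80.0,
}


def find_alerts(readings, thresholds=THRESHOLDS):
    """Return a message for every reading that exceeds its limit."""
    alerts = []
    for reading in readings:
        limit = thresholds.get(reading.sensor)
        # Sensors without a configured limit are skipped.
        if limit is not None and reading.value > limit:
            alerts.append(
                f"{reading.sensor}: {reading.value} exceeds limit {limit}"
            )
    return alerts


if __name__ == "__main__":
    sample = [
        Reading("ups_load_pct", 85.0),
        Reading("rack_inlet_temp_c", 22.0),
    ]
    for message in find_alerts(sample):
        print(message)
```

A real tool would collect these readings continuously and show them on a dashboard, but the core check is the same comparison of a measurement against a policy limit.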
Organizations should also refresh data center infrastructure components regularly, not only to reduce
risk but also to increase efficiency. For example, UPS systems should be replaced every five to eight years to
ensure seamless operation, but organizations may want to upgrade more frequently to take
advantage of today’s compact, energy-efficient and feature-rich units.
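As a minimal sketch of how the five-to-eight-year guideline could be tracked, the snippet below classifies a UPS by its age. The status labels and function name are illustrative assumptions, and the age calculation is approximate.

```python
from datetime import date

# Replacement window from the guideline above, in years.
REFRESH_WINDOW_YEARS = (5, 8)


def refresh_status(installed: date, today: date) -> str:
    """Classify a UPS as 'ok', 'due' or 'overdue' based on its age."""
    # Approximate age in years; 365.25 averages out leap years.
    age_years = (today - installed).days / 365.25
    low, high = REFRESH_WINDOW_YEARS
    if age_years >= high:
        return "overdue"
    if age_years >= low:
        return "due"
    return "ok"


if __name__ == "__main__":
    print(refresh_status(date(2018, 1, 1), date(2024, 1, 1)))
```

An inventory of installation dates run through a check like this gives a simple list of units to schedule for replacement.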
Data center outages take a financial toll through business disruption, lost revenue and reduced
productivity. The trickle-down effects of brand damage and missed opportunities can haunt
organizations for years to come. However, the right policies, procedures and infrastructure
components can help reduce the frequency, duration and cost of downtime. Contact Enconnex for
help in optimizing your data center infrastructure to maximize availability.