© 2014 Uptime Institute
Is your data center on the verge of a crisis?
Julian Kudritzki
Chief Operating Officer
Uptime Institute
What are the symptoms of a poorly managed data center facility? How close are you to an operating failure or catastrophic downtime event? Learn how to spot the warning signs and start improving your facility management program immediately to minimize the risk of downtime, reduce costs, and upgrade your operations.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
246
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
10
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Is your data center on the verge of a crisis?

  1. Is your data center on the verge of a crisis?
     Julian Kudritzki
     Chief Operating Officer
     Uptime Institute
  2. What Defines a Crisis?
  3. Tour of Operational Computer Room
  4. Looking for Clues
  5. Tour of ‘Live’ Critical Spaces
  6. Daily Practices Compromise Uptime, Safety, and Security
  7. What Else Is Going On?
     •  Overtime hours exceeding 10%
     •  Voice mail boxes full
     •  Emails not responded to
     •  Email inbox size limit exceeded
     •  Meetings missed or routinely cancelled
     •  No time for training
     •  Shortage of qualified staff
     •  Personnel performing work outside their competency
     •  Everything is an emergency
     •  Personnel turnover
  8. The Issues Add Up
     •  Break fix budget exceeded
     •  Maintenance budget exceeded
     •  Energy cost estimate exceeded or unknown
     •  Last minute deployment requirements
     •  No organization chart
     •  No responsibilities matrix
     •  No records of maintenance activities
     •  No written policies & procedures
     •  No preventive maintenance schedule
     •  Back of the server looks like a spaghetti pot exploded
  9. The Issues Add Up
     •  Cabling is not labeled or, worse, incorrectly labeled
     •  Equipment is not uniquely labeled
     •  Loads are consistently out of balance
     •  Capacities are not managed or tracked
     •  Deferred maintenance exceeds 10%
     •  Housekeeping: if it looks like a mess, it is a mess
     Maybe you don’t have a crisis, but how do you know how well your data center operation compares to the rest of the industry?
  10. Ask the Tough Questions
      Are you confident in your Facilities team’s capability to manage a technologically advanced and highly efficient design to your 24 x 7 uptime requirements?
      •  Can you easily replace any member of that team?
      •  Are you protected against poor operations practices migrating from older sites to higher criticality data centers?
      •  Do you have sites that operate in isolation, ignoring global corporate standards?
      •  Do you even have corporate global standards?
      •  If you outsource any aspect of your data center operations, how do you avoid losing responsibility and accountability?
      •  Do you manage an outsourcing contract… or direct an expert team?
  11. Path to Data Center Operations Success
      •  Initial review
      •  Gap analysis against industry best practices
         §  Staffing and Organization
         §  Maintenance
         §  Training
         §  Planning, Coordination & Management
         §  Operating Conditions
      •  Roadmap to operational excellence
      •  Plan changes
      •  Implement changes
      •  Monitor & refine
      •  Annual review
  12. Key Elements of Facilities Management
      Staffing and Organization
      •  Staffing
      •  Qualifications
      •  Organization
      Maintenance
      •  Preventive Maintenance (PM) Program
      •  Housekeeping Policies
      •  Maintenance Management System (MMS)
      •  Vendor Support
      •  Deferred Maintenance Program
      •  Predictive Maintenance
      •  Life-Cycle Planning
      •  Failure Analysis Program
  13. Key Elements of Facilities Management
      Training
      •  Data Center Staff
      •  Vendors
      Planning, Coordination, and Management
      •  Site Policies
      •  Financial Management
      •  Reference Library
      •  Computer Room Management
      Operating Conditions
      •  Load Management
      •  Operating Set Points
      •  Alternating Use of Infrastructure Equipment
  14. Genesis of Industry Best Practices
      Over the years, the Uptime Institute has observed that management issues pose the largest risk to the uptime of physical infrastructure:
      •  Inadequate staffing
      •  Ineffective or non-existent maintenance and training programs
      •  Lack of processes and procedures
      •  Resulting in the majority of outages being caused by ‘human error’
      No standard existed to help Owners/Operators determine:
      •  Common language/vocabulary of data center operations
      •  Focus of data center management
      •  Resource allocation
      •  Resource requirements
  15. Value of Industry Best Practices
      Data Center Owners / Operators / End Users
      •  Increased availability and cost savings
      •  Multi-site consistency
      •  Benchmark for continuous monitoring and refinement
      Colocation / Managed Services Sites
      •  All of the above plus…
      •  Customer assurance of consistency
      •  Competitive differentiator (attain & retain certification)
      Industry Benchmark
      •  No need to rely on opinions and anecdotes
  16. Best Practices Reviews
      Uptime Institute has been conducting Operational Sustainability Reviews for approximately 3 years, based upon decades of site operations knowledge and experience:
      •  Operational Sustainability Certifications: Tier + Gold, Silver, or Bronze
      •  Management & Operations (M&O) Stamps of Approval
      See http://uptimeinstitute.com/publications for Tier Standard: Operational Sustainability
  17. Staffing and Organization: Significant Findings
      Staffing
      •  Inadequate staffing
      •  Excessive overtime (over 10%)
      •  No escalation process
      Qualification
      •  No list of required qualifications
      •  No experience with data center specific equipment
      Organization
      •  Roles and Responsibilities not documented
      •  Data center organization not integrated
  18. Maintenance: Significant Findings
      Preventive Maintenance (PM)
      •  No list of required PM activities
      •  PM activities not fully scripted
      •  No quality control process
      Housekeeping
      •  Combustibles in the data center
      •  No documented housekeeping policy
      Maintenance Management System (MMS)
      •  No list of equipment
      •  Missing critical data: warranty info, maintenance history, performance data, etc.
  19. Maintenance: Significant Findings
      Vendor Support
      •  Contracts missing response times, call-in process, detailed SOW, or technician qualifications
      Deferred Maintenance
      •  Unable to produce a deferred maintenance report from the MMS
      Predictive Maintenance
      •  No predictive maintenance program
      •  Not comparing current results with previous results
  20. Maintenance: Significant Findings
      Life-Cycle Planning
      •  No life-cycle plan
      •  Not using MMS data to develop plan
      Failure Analysis
      •  No record of outages or near misses
  21. Training: Significant Findings
      Data Center Staff
      •  Undocumented On-the-Job Training (OJT) programs
      •  No formal qualification program
      •  No list of training required by position
      •  No formal training program with lesson plans, etc.
      Vendors
      •  No briefing for escorted vendors
  22. Operating Conditions: Significant Findings
      Load Management
      •  Alarm settings not documented
      •  Alarms not set on PDUs to ensure maximum loads are not exceeded
      Operating Set Points
      •  Cooling set points are not documented or part of the Change Management Process
      •  Changing of set points is not controlled
  23. Planning, Coordination, and Management: Significant Findings
      Site Policies
      •  Missing Site Policies
      •  Especially a Site Configuration Policy
      Reference Library
      •  No process for keeping documents up-to-date
      Capacity Management
      •  No process for forecasting future space, power, and cooling requirements
      •  No active tracking of cooling capacity
      •  Ineffective management of Cold Aisles/Hot Aisles
      •  Electrical power monitoring (balancing phases)
  24. Typical Data Center Disciplines
      Facilities
      •  Operate and maintain the critical facility infrastructure
      •  Support the installation of IT equipment (space, power, & cooling)
      IT Management
      •  Operate and maintain IT hardware, software, applications, and network connectivity
      •  Manage the installation/de-installation of IT equipment
      Security
      •  Access Control
      •  Physical Security
  25. Past Organizational Structures
      Functionally Separate Organization
      •  Corporate Real Estate (Facilities)
      •  IT
      •  Security
      Communication between organizations was typically poor
      •  Data center activities conducted without coordination
      •  Poor future space, power, and cooling planning
      No individual responsible for all aspects of operating a data center
  26. Evolving Organizational Structure
      Factors driving changes to organizational structure
      •  Rapid changes in technology and speed at which capacity must be brought online
      •  Increased costs associated with IT and Facilities
      •  Business objectives of continuous computing availability
      Legacy organizations could not accommodate quickly evolving business requirements
      •  Slow to respond
      •  Not integrated
  27. Summary
      The value of industry best practices is in the process of continuous improvement
      •  Discovery leads to learning
      •  Learning leads to change
      •  Change leads to improvement
      •  Regular reviews lead to discovery
      •  Crises can be avoided
  28. For more information contact:
      Julian Kudritzki
      jkudritzki@uptimeinstitute.com
      206.706.4143
      Questions?
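Several of the findings above are quantitative: overtime above 10%, deferred maintenance above 10%, PDU alarms that are not set to keep loads below their maximums, and out-of-balance phases. The Python below is a minimal sketch of how a site might screen a periodic snapshot for those warning signs. It is not an Uptime Institute tool.

Only the two 10% thresholds come from the deck. The following are all hypothetical assumptions:

•  the `SiteSnapshot` fields and the PDU names
•  measuring overtime against regular hours rather than total hours
•  the 10% phase-imbalance limit

Phase imbalance uses one common definition: the largest deviation from the average phase current, divided by that average.

```python
from dataclasses import dataclass

# The two 10% limits come from the deck's warning signs; all other names
# and limits in this sketch are hypothetical.
OVERTIME_LIMIT = 0.10              # overtime hours / regular hours (assumed baseline)
DEFERRED_MAINTENANCE_LIMIT = 0.10  # deferred PM tasks / PM tasks due
PHASE_IMBALANCE_LIMIT_PCT = 10.0   # hypothetical; the deck gives no number


@dataclass
class SiteSnapshot:
    """Hypothetical monthly metrics for one site (field names are assumptions)."""
    regular_hours: float
    overtime_hours: float
    pm_tasks_due: int
    pm_tasks_deferred: int
    pdu_loads_amps: dict[str, float]        # measured load per PDU
    pdu_alarm_amps: dict[str, float]        # documented alarm setting per PDU
    phase_amps: tuple[float, float, float]  # per-phase current on one panel


def phase_imbalance_pct(phases: tuple[float, ...]) -> float:
    """Largest deviation from the average phase current, as a percent of the average."""
    avg = sum(phases) / len(phases)
    if avg == 0:
        return 0.0
    return max(abs(p - avg) for p in phases) * 100 / avg


def warning_signs(s: SiteSnapshot) -> list[str]:
    """Return a human-readable flag for each quantitative warning sign found."""
    flags: list[str] = []
    if s.regular_hours > 0 and s.overtime_hours / s.regular_hours > OVERTIME_LIMIT:
        flags.append("overtime exceeds 10%")
    if s.pm_tasks_due > 0 and s.pm_tasks_deferred / s.pm_tasks_due > DEFERRED_MAINTENANCE_LIMIT:
        flags.append("deferred maintenance exceeds 10%")
    # Flag PDUs with no documented alarm, or whose load is already at/above the alarm
    for pdu, load in sorted(s.pdu_loads_amps.items()):
        alarm = s.pdu_alarm_amps.get(pdu)
        if alarm is None:
            flags.append(f"{pdu}: no documented alarm setting")
        elif load >= alarm:
            flags.append(f"{pdu}: load {load} A at or above alarm {alarm} A")
    imbalance = phase_imbalance_pct(s.phase_amps)
    if imbalance > PHASE_IMBALANCE_LIMIT_PCT:
        flags.append(f"phase imbalance {imbalance:.1f}%")
    return flags


site = SiteSnapshot(
    regular_hours=1600, overtime_hours=200,  # 12.5% overtime
    pm_tasks_due=40, pm_tasks_deferred=3,    # 7.5% deferred
    pdu_loads_amps={"PDU-1": 28.0, "PDU-2": 35.0},
    pdu_alarm_amps={"PDU-1": 32.0},          # PDU-2 has no documented alarm
    phase_amps=(30.0, 24.0, 36.0),           # average 30 A, worst deviation 6 A
)
for flag in warning_signs(site):
    print(flag)
# overtime exceeds 10%
# PDU-2: no documented alarm setting
# phase imbalance 20.0%
```

Flags like these are clues, in the deck's sense: they point toward an initial review and gap analysis rather than replacing one.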