Some Rules for Successful Data Center Operations
There are numerous policies and practices that every data center owner or
operator follows, or should be following, but in reality there are only a few rules that
must be adhered to for gaining the best results.
Evaluate what you are doing and why you are doing it.
It is far better to prevent a “Lights Out” occurrence than to sit around a conference
table discussing the forensics of a shutdown.
Years ago, I was doing a walk through in a small data center in the Midwest. I saw
the Emergency Power Off (EPO) on the wall near the door and I asked the facility
manager why it did not have a cover or a sign indicating its purpose. I’ll never forget
his response. He pointed at the familiar red mushroom button and said, “That is what
we call the resume generating switch, if you punch that, you should make sure your
resume is updated because your career is over here.”
We both laughed about that, but it got me thinking that the data center was likely
going to have an unscheduled shutdown one of these days.
Even if the company communicated that message to each employee, there is still a
chance that someone did not get the memo, or that someone heard and understood the
message about the EPO but was disgruntled and looking for a new place to work. Without the
cover or sign, there is always a risk someone could accidentally lean against the wall
and dump the data center.
A practical solution is to determine the necessity of the EPO, based on NEC
updates, and then consider the risks associated with the EPO and how to eliminate or
reduce them. Providing a flip cover and posting a sign is only a portion of the
solution. Make sure every new and existing employee understands the white space
is where their paycheck is printed, and emphasize the importance of practicing
common-sense procedures when working in that environment.
However, caution must be applied here. Following processes out of habit leads to
complacency. Complacency may lead to disaster, which brings us to rule #2.
Test your defenses.
I am not referring to IT security, which should be covered from both the virtual and physical
side, but to understanding the vulnerabilities your data center may suffer.
Are your maintenance practices and performance reviewed periodically? I have
looked through maintenance logs that have been checked and dated, but curiously
the handwriting all looks very similar, almost as if the technician, who is likely either
bored with the paperwork or overwhelmed by other demands, just sat down and filled
out a month or two’s worth of maintenance records. I’m not casting stones at the
technicians; the demands upon their time sometimes require shortcuts, and
paperwork is one of those shortcuts. The operator should review the logs after each
scheduled maintenance visit to look for trends or anomalies, whether they are
generated from a CMMS or kept in a binder. For example, look
over a UPS report and compare it to the previous month’s. Do the recorded voltage
and current (input and output) match the UPS display? Or does the unit need
recalibration? How is the battery health? Are you testing the UPS and generator
together under load conditions? Entire volumes have been written regarding
maintenance of critical infrastructure, but don’t just rely on a completed maintenance
report; do a quality control check and look deeper at each component. A couple of
airlines would agree that a few minutes spent here each month could save your data
center from an unfortunate event that makes headlines and causes company stock
prices to tumble in the short term.
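The month-to-month comparison described above can be roughed out in a few lines of Python. This is only a minimal sketch: the field names, the sample readings and the 5 percent tolerance are all assumptions made for illustration, not values from any UPS vendor or CMMS product.

```python
from dataclasses import dataclass


@dataclass
class UpsReading:
    """One logged set of UPS measurements (hypothetical fields, volts and amps)."""
    input_voltage: float
    output_voltage: float
    input_current: float
    output_current: float


def flag_drift(previous, current, tolerance_pct=5.0):
    """Return the names of fields that changed by more than tolerance_pct."""
    flagged = []
    for name in ("input_voltage", "output_voltage",
                 "input_current", "output_current"):
        before = getattr(previous, name)
        after = getattr(current, name)
        if before == 0:
            # Avoid dividing by zero: any nonzero change from zero is flagged.
            if after != 0:
                flagged.append(name)
            continue
        change_pct = abs(after - before) / abs(before) * 100.0
        if change_pct > tolerance_pct:
            flagged.append(name)
    return flagged


# Illustrative numbers only; real values come from your own reports.
last_month = UpsReading(480.0, 480.0, 100.0, 95.0)
this_month = UpsReading(478.0, 481.0, 100.0, 110.0)
print(flag_drift(last_month, this_month))  # ['output_current']
```

The same function can compare a logged reading against what the UPS display shows at the time of the review. A flagged field is a prompt to go look at the unit, not a diagnosis.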
Looking past the maintenance programs for the critical infrastructure, has the
warranty information been archived? Do you know when the end of life for the
systems will sneak up? One of the more difficult conversations to have with a data
center manager is to tell them their Computer Room Air Conditioning (CRAC) or UPS
is approaching its final days, only to hear the response, “there is no money in the budget
for a new unit.” This exchange is often coupled with the fact that the unit is operating
above design capacity and redundancy. In other words, running to failure.
The additional stress this creates for an operator running on borrowed time could
be reduced by setting a calendar alarm for the end-of-life date minus four to five years, or
whatever is appropriate for your organization’s budget planning. This will allow time
to build the necessary budget for a capital investment without surprising the CFO.
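As a small illustration of that calendar arithmetic, here is a sketch in Python. The function name, the example end-of-life date and the default lead time are hypothetical; use whatever lead time fits your own budget cycle.

```python
from datetime import date


def budget_reminder_date(end_of_life, lead_years=5):
    """Return the date on which to start the capital budget conversation."""
    target_year = end_of_life.year - lead_years
    try:
        return end_of_life.replace(year=target_year)
    except ValueError:
        # February 29 does not exist in the target year; fall back to the 28th.
        return end_of_life.replace(year=target_year, day=28)


# Hypothetical end-of-life date for a CRAC unit.
crac_end_of_life = date(2032, 6, 30)
print(budget_reminder_date(crac_end_of_life, lead_years=4))  # 2028-06-30
```

Running this once for each major system, using the end-of-life dates from the archived warranty and vendor paperwork, gives a simple list of reminder dates to put on the calendar.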
Other defense strategies are reviewing and assessing the disaster recovery (DR) plan
and business continuity plan (BCP). You say you haven’t blown the dust off those
documents since Y2K?
DR and BCP have a far-reaching impact outside the data center, which is why a
comprehensive risk assessment study should not be overlooked.
The data center manager should be concerned with anything that could become a
disruption.
Stacks of boxes and paper in the data center? Very common, but also a potential fire
or trip hazard. At a minimum, non-IT-related stuff, including old servers and
racks, impedes work flow. Underfloor smoke detectors? What is the policy for lifting
floor tiles? How often is the underfloor area cleaned? Lifting a floor tile could present
some nasty surprises that may include setting off a fire alarm.
These are only a few of the many possible scenarios that may impact your IT
operations.
A best practice for testing your defenses is Management by Walking Around (MBWA).
This time-tested custom was popular back in the 1980s but appears to have roots
reaching much further back, to Abraham Lincoln’s review of the Union troops during the Civil
War.
It is to your advantage to get inside the white space and see, smell, touch and hear
what is going on. Are there stains on the ceiling tiles? How long have they been there?
You should probably open a tile and take a peek with a flashlight. Does the air smell
musty? Is there water building up in the condensate pans? Do you hear the belts
squealing on the CRACs? Open up the unit and check to see if they are aligned
properly. Do you smell a whiff of sulfur? The UPS batteries may be telling you
something. What’s the temperature in the space?
Too often we rely on email or texts and completed checklists and don’t take a hands-on
approach to identifying risks in the data center environment.
Challenge Your Assumptions
One of my favorite authors, Rudyard Kipling, wrote,
“I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.”
One of the epiphanies I had early in the data center world was to know what a red
herring was and how to address it. Historically, a red herring was a logical device to
distract an opponent during an argument. Supposedly the term comes from training hunting
dogs by dragging a kipper or herring across the ground, thus throwing them
off the trail.
In my particular case, there were no fish. I had met with a technician in the data
center who was very upset that the UPS failed every time he was in the room. I was
set back on my heels by that statement. While I was collecting data to determine the
cause of the loss of power in the data center I ran into a network administrator
working inside a rack. I asked him about the failures, and his response was that there were
two in the past six months. I countered, “But your colleague just told me that ‘every’
time he was in the data center there was a UPS failure.” The network guy stroked his
beard, looked down for a second and looked back at me. “Well, the other guy you
met works at another location. He is only here a couple of times a year, so that would
make sense.”
That revelation taught me to keep asking questions and challenge my own
assumptions.
The first technician told me the truth. The network administrator also told me the
truth, but he included details that the first observer did not possess. If I had only
spoken to the first technician, I would have wasted a lot of time tracking a
problem that I assumed was an ongoing, everyday occurrence. The network
administrator pointed me in the right direction and I was able to determine the true
cause of the shutdown. (As it turned out, it was not entirely a UPS issue, as previously
understood, but a site wiring fault.)
These are just a few of the water cooler discussions a data center manager should
be having with their team.
RTFM
This acronym is short for Read the Fine Manual or something similar to that.
Getting a budget request approved for new hardware is cause for celebration. Data
center operators often “get by” with outdated infrastructure until the cost of
maintenance exceeds the cost of new equipment. When the truck pulls up to the
loading dock and the field service engineers arrive, it’s a good idea to spend some
time with these people and equipment to get acquainted with your device. I have
been to sites where the end user had boxes of infrastructure equipment that were still
wrapped in plastic and had never been installed. I asked myself why this happens and
the following scenario comes to mind.
The data center manager knows he needs to upgrade his power so he meets with a
sales guy and his applications engineer. He is impressed by the capabilities of the
new equipment, and all the exciting features. He tours the factory to see a witness
test and he makes the decision to purchase the product. He may have even had the
foresight to include commissioning in the purchase to ensure the unit works as
intended. So after all of these events have passed, he finds himself with all of these
extra parts that were originally described as “features”. The engineer providing the
start-up has a narrow scope of work that is limited to only getting the unit online.
The parts and pieces that comprise the bells and whistles are left in the original boxes.
This may work if you purchased a Star Wars limited edition lightsaber, but it is not
practical for your data center. Months go by, and the stuff is still sitting in the box. The
warranty has expired, and the stuff is still sitting in the box. Finally, someone decides
to dig a little deeper before the equipment gets tossed into a dumpster. They discover
these are the monitoring devices that provide information about system loading,
status and other important details. You remember the sales guy and applications
engineer describing these “features,” but your expectation was the unit would be
complete at startup. This reminds me of a furniture advertisement that plays
across New England radio stations. “An informed buyer is our best customer.” The
cure for this heartburn is to research your chosen product down to the minor details
and ask those questions up front.
“Will feature X be configured as part of the startup? If not, can you point me to the
resources to enable it myself?”
“Can you demonstrate feature X and how to integrate it into my new device?” It
may be a good idea to include other people (IT and Facilities) in this discussion as
well.
If you are purchasing a monitoring device or sensor, ensure that it is compatible with your
existing building monitoring system (BMS) or building automation system (BAS)
before adding it to the bill of sale.
The final word is to know your equipment from Day 1 and share that knowledge
across the disciplines that may be called upon to work with the new device.
Next time I’ll add some comment based upon other real-world experience.

