Some Rules for Successful Data Center Operations
There are numerous policies and practices that every data center owner or
operator follows, or should be following, but in reality only a few rules
must be adhered to for the best results.
Evaluate what you are doing and why you are doing it.
It is far better to prevent a “Lights Out” occurrence than to sit around a conference
table discussing the forensics of a shutdown.
Years ago, I was doing a walk-through in a small data center in the Midwest. I saw
the Emergency Power Off (EPO) on the wall near the door and I asked the facility
manager why it did not have a cover or a sign indicating its purpose. I’ll never forget
his response. He pointed at the familiar red mushroom button and said, “That is what
we call the resume generating switch, if you punch that, you should make sure your
resume is updated because your career is over here.”
We both laughed about that, but it got me thinking that the data center was likely
going to have an unscheduled shutdown one of these days.
Even if the company communicated that message to every employee, there is still a
chance that someone did not get the memo, or heard and understood the message
about the EPO but was disgruntled and looking for a new place to work. Without a
cover or sign, there is always a risk that someone could accidentally lean against the
wall and dump the data center.
A practical solution is to determine whether the EPO is necessary, based on updates
to the National Electrical Code (NEC), and then consider the risks associated with the
EPO and how to eliminate or reduce them. Providing a flip cover and posting a sign is
only a portion of the solution. Make sure every new and existing employee understands
that the white space is where their paycheck is printed, and emphasize the importance
of practicing common-sense procedures when working in that environment.
However, caution must be applied here. Following processes out of habit leads to
complacency. Complacency may lead to disaster, which brings us to rule #2.
Test your defenses.
I am not referring to IT security here; that should be covered from both the virtual
and physical sides. Rather, understand the vulnerabilities your data center may suffer.
Are your maintenance practices and performance reviewed periodically? I have
looked through maintenance logs that had been checked and dated, but curiously
the handwriting all looked very similar, almost as if the technician, likely either
bored with the paperwork or overwhelmed by other demands, had just sat down and
filled out a month or two’s worth of maintenance records. I’m not casting stones at the
technicians; the demands on their time sometimes require shortcuts, and
paperwork is one of those shortcuts. The operator should review the logs after each
scheduled maintenance visit to look for trends or anomalies, no matter
whether they are generated from a CMMS or kept in a binder. For example, look
over a UPS report and compare it to the previous month’s. Do the recorded voltage
and current (input and output) match the UPS display? Or does the unit need
recalibration? How is the battery health? Are you testing the UPS and generator
together under load conditions? Entire volumes have been written regarding
maintenance of critical infrastructure, but don’t just rely on a completed maintenance
report. Do a quality control check and look deeper at each component. A couple of
airlines would agree that a few minutes spent here each month could save your data
center from an unfortunate event that makes headlines and causes company stock
prices to tumble in the short term.
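As a rough sketch of the month-over-month UPS comparison described above, the following Python snippet flags any logged value that has moved by more than a chosen percentage since the previous report. The field names, the sample readings and the 5 percent tolerance are assumptions made up for illustration; they do not come from any particular UPS vendor or CMMS export.

```python
from dataclasses import dataclass


@dataclass
class UpsReading:
    """One month's logged UPS values (field names are illustrative)."""
    input_voltage: float
    output_voltage: float
    input_current: float
    output_current: float


def flag_drift(previous: UpsReading, current: UpsReading,
               tolerance_pct: float = 5.0) -> dict:
    """Return the fields that changed by more than tolerance_pct since last month."""
    flagged = {}
    for name in ("input_voltage", "output_voltage",
                 "input_current", "output_current"):
        before = getattr(previous, name)
        after = getattr(current, name)
        if before == 0:
            continue  # a percentage change from zero is undefined
        change_pct = abs(after - before) / abs(before) * 100
        if change_pct > tolerance_pct:
            flagged[name] = round(change_pct, 1)
    return flagged


# Hypothetical readings copied from two consecutive maintenance reports
last_month = UpsReading(480.0, 480.0, 100.0, 95.0)
this_month = UpsReading(479.0, 481.0, 100.0, 110.0)
print(flag_drift(last_month, this_month))  # {'output_current': 15.8}
```

A flagged value is only a prompt to go look at the unit and its display. It does not say whether the load really changed or the log entry was wrong.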
Looking past the maintenance programs for the critical infrastructure, has the
warranty information been archived? Do you know when the end of life for the
systems will sneak up? One of the more difficult conversations to have with a data
center manager is telling them their Computer Room Air Conditioning (CRAC) unit or
UPS is approaching its final days, only to hear the response, “There is no money in the
budget for a new unit.” This exchange is often coupled with the fact that the unit is
operating beyond its design capacity and redundancy. In other words, it is running to
failure.
The additional stress of operating on borrowed time could be reduced by setting a
calendar alarm for the end-of-life date minus four to five years, or whatever lead time
is appropriate for your organization’s budget planning. This will allow time
to build the necessary budget for a capital investment without surprising the CFO.
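The calendar reminder is simple date arithmetic. Here is a minimal Python sketch, assuming you already know the manufacturer's end-of-life date for the unit. The function name, the five-year default and the sample date are hypothetical.

```python
from datetime import date


def budget_reminder(end_of_life: date, lead_years: int = 5) -> date:
    """Return the date to start budget planning, lead_years before end of life."""
    target_year = end_of_life.year - lead_years
    try:
        return end_of_life.replace(year=target_year)
    except ValueError:
        # end_of_life falls on Feb 29 and the target year is not a leap year
        return end_of_life.replace(year=target_year, day=28)


# Hypothetical CRAC unit with an end-of-life date of 30 June 2035
print(budget_reminder(date(2035, 6, 30)))  # 2030-06-30
```

Adjust `lead_years` to match however far ahead your organization plans capital spending.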
Other defense strategies are reviewing and assessing the disaster recovery (DR) plan
and business continuity plan (BCP). You say you haven’t blown the dust off those
documents since Y2K?
DR and BCP have a far-reaching impact outside the data center, so a
comprehensive risk assessment study should not be overlooked.
The data center manager should be concerned with anything that could become a
disruption.
Stacks of boxes and paper in the data center? Very common, but also a potential fire
or trip hazard. At a minimum, non-IT-related stuff, including old servers and
racks, impedes workflow. Underfloor smoke detectors? What is the policy for lifting
floor tiles? How often is the underfloor area cleaned? Lifting a floor tile could present
some nasty surprises that may include setting off a fire alarm.
These are only a few of the many possible scenarios that may impact your IT
operations.
A best practice for testing your defenses is Management by Walking Around (MBWA).
This time-tested custom was popular back in the 1980s but appears to have roots
going much further back, to Abraham Lincoln’s review of the Union troops during the
Civil War.
It is to your advantage to get inside the white space and see, smell, touch and hear
what is going on. Are there stains on the ceiling tiles? How long have they been there?
You should probably open a tile and take a peek with a flashlight. Does the air smell
musty? Is there water building up in the condensate pans? Do you hear the belts
squealing on the CRACs? Open up the unit and check to see if they are aligned
properly. Do you smell a whiff of sulfur? The UPS batteries may be telling you
something. What’s the temperature in the space?
Too often we rely on email or texts and completed checklists and don’t take a
hands-on approach to identifying risks in the data center environment.
Challenge Your Assumptions
One of my favorite authors, Rudyard Kipling, wrote:
“I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.”
One of the epiphanies I had early in the data center world was learning what a red
herring is and how to address it. Historically, a red herring was a device used to
distract an opponent during an argument. Supposedly the term comes from the practice
of dragging a kipper or herring across the ground while training hunting dogs, thus
throwing them off the trail.
In my particular case, there were no fish. I had met with a technician in the data
center who was very upset that the UPS failed every time he was in the room. I was
set back on my heels by that statement. While I was collecting data to determine the
cause of the loss of power in the data center, I ran into a network administrator
working inside a rack. I asked him about the failures, and his response was that there
had been two in the past six months. I countered, “But your colleague just told me that
‘every’ time he was in the data center there was a UPS failure.” The network guy stroked his
beard, looked down for a second and looked back at me. “Well, the other guy you
met works at another location. He is only here a couple of times a year, so that would
make sense.”
That revelation taught me to keep asking questions and challenge my own
assumptions.
The first technician told me the truth. The network administrator also told me the
truth, but he included details that the first observer did not possess. If I had only
spoken to the first technician, I would have wasted a lot of time tracking a
problem that I assumed was an ongoing, everyday occurrence. The network
administrator pointed me in the right direction and I was able to determine the true
cause of the shutdown. (As it turned out, it was not entirely a UPS issue, as previously
understood, but a site wiring fault.)
These are just a few of the water cooler discussions a data center manager should
be having with their team.
RTFM
This acronym is short for Read the Fine Manual or something similar to that.
Getting a budget request approved for new hardware is cause for celebration. Data
center operators often “get by” with outdated infrastructure until the cost of
maintenance exceeds the cost of new equipment. When the truck pulls up to the
loading dock and the field service engineers arrive, it’s a good idea to spend some
time with these people and the equipment to get acquainted with your device. I have
been to sites where the end user had boxes of infrastructure equipment that were still
wrapped in plastic and had never been installed. I asked myself why this happens, and
the following scenario comes to mind.
The data center manager knows he needs to upgrade his power so he meets with a
sales guy and his applications engineer. He is impressed by the capabilities of the
new equipment, and all the exciting features. He tours the factory to see a witness
test and he makes the decision to purchase the product. He may have even had the
foresight to include commissioning in the purchase to ensure the unit works as
intended. So after all of these events have passed, he finds himself with all of these
extra parts that were originally described as “features”. The engineer providing the
start-up has a narrow scope of work that is limited to only getting the unit online.
The parts and pieces that comprise the bells and whistles are left in the original boxes.
This may work if you purchased a Star Wars limited edition lightsaber, but it is not
practical for your data center. Months go by, and the stuff is still sitting in the box. The
warranty expires, and the stuff is still sitting in the box. Finally, someone decides
to dig a little deeper before the equipment gets tossed into a dumpster. They discover
these are the monitoring devices that provide information about system loading,
status and other important details. You remember the sales guy and applications
engineer describing these “features,” but your expectation was that the unit would be
complete at startup. This reminds me of a furniture advertisement that plays
across New England radio stations. “An informed buyer is our best customer.” The
cure for this heartburn is to research your chosen product down to the minor details
and ask those questions up front.
“Will feature X be configured as part of the startup? If not, point me to the
resources to enable it myself.”
“Can you demonstrate feature X and how to integrate it into my new device?” It
may be a good idea to include other people (IT and Facilities) in this discussion as
well.
If you are purchasing a monitoring device or sensor, ensure that it is compatible with
your existing building monitoring system (BMS) or building automation system (BAS)
before adding it to the bill of sale.
The final word is to know your equipment from Day 1 and share that knowledge
across the disciplines that may be called upon to work with the new device.
Next time I’ll add some comments based upon other real-world experience.