Maximizing Uptime Through Predictive Analysis and Data Integration
ZDT Group LLC
5460 Sandstone Ct
Cumming, Ga. 30040
P – 770-886-9555
www.zdtgroup.com
By Joe Soroka Copyright 2009
Maximizing Uptime through Predictive Analysis
Operating mission-critical facilities in today’s environment means continually striving to
meet and exceed our uptime requirements, even as our operating budgets continue to shrink.
Accomplishing this requires a holistic approach to tracking our facility information.
As I have discussed in previous whitepapers and other presentations, uptime is achieved
through a process that I call “RAMPS”. The tier rating of a facility is only the first thing
that affects uptime. You may have heard stories of tier IV facilities that were carefully
designed and built yet have had outage after outage, while a tier I facility has been
operating for over ten years without a single outage. That’s because more than
just the design of the facility affects its uptime, and that is “RAMPS”.
Reliability, Availability, Maintainability, Predictability, and Scalability are the keys to
success. Without paying attention to all facets of the facility, your uptime requirements
cannot be realized.
Reliability Centered Maintenance (RCM) can help in meeting the demands of a
shrinking operations budget. An RCM program that is not properly implemented is not
RCM; it is an “LMP”, a Lack of Maintenance Program. To implement an RCM program
correctly, you need to answer seven questions per SAE JA1011, Evaluation Criteria for
Reliability-Centered Maintenance (RCM) Processes. These questions are as follows:
1. What is the equipment supposed to do and what is the performance standard?
2. In what ways can it fail to meet the performance standard?
3. What are the events that will lead to failure?
4. What happens when the piece of equipment fails?
5. In what way does each of the failure modes matter to system operations?
6. What systematic task can be performed to prevent the failure?
7. What must be done if a suitable preventative task cannot be implemented?
Prior to implementing an RCM program, one must have a comprehensive Predictive
Analysis program in effect. By understanding the past performance and current operating
conditions of the equipment, we can develop predictions on failure modes for the future.
Tracking more items over a longer duration will result in greater accuracy in our
predictions. This is why it is important to start at the beginning of the project with
design, moving to construction, commissioning and then ongoing operation and
maintenance. The information that is gained in each of these phases needs to be
captured in a manner that allows you to look at specific details across the entire life of the
equipment.
Data gathered without purpose is just that, data. It is important to define how you need
to use this data, and how it can assist you in your operation - from increasing reliability to
reducing operating costs. When in the design phase, the requirements that the facility
needs to operate at are explained in the Basis of Design (BoD) and the Sequence of
Operation (SoO). It is these two documents that need to be fully developed in the design
phase and updated during the life of the facility, for they act as the road map on how the
facility shall operate. With these documents a baseline is established for the performance
of your equipment, and when the site is started up and commissioned it is this baseline
that the facility needs to operate at. During the startup and commissioning phases these
baseline documents are updated so that changes that may have occurred during
construction and startup phases are captured in the updated BoD and SoO documents.
These documents need to be living documents and should be reviewed and updated as
often as required.
The information gathered during the commissioning process is rarely integrated properly
into operations. The data discovered during startup and
commissioning is invaluable. Not having this data would be similar to taking all the
photos of your children prior to their high school graduation and locking them up
somewhere. Then years later, when you are sitting down with your son or daughter’s
future spouse, trying to show them your child’s life story, you have to begin with high
school pictures due to lack of data. All of the data collected during birth, startup, and
commissioning is part of the equipment life story. The whole story needs to be looked at
and shared between the parents and the spouse. As the spouse adds new photos to the
album, the story of the person’s life is evolving and you are able to look through the
album and see the entire story.
Predictability is one part of the Uptime RAMPS that often gets ignored. We may have
some trends we look at, and some testing that allows us to do some forecasting, but many
times a fully predictive program is lacking. The data collected
during construction and commissioning should be fully integrated into the operation and
maintenance program. This data becomes invaluable to you as you develop a
predictability program. Imagine your data center is fully loaded with servers and you are
operating at the design load of 10 kW per cabinet. The CFD modeling you did with your
heat load is proving to be correct and everything is operating correctly. You decide to
leave early for the fishing trip up in the mountains that you had planned for some time
now. As you are enjoying your well-deserved trout fishing you realize how cool it is and
you think “wow it must be hot back in town”. You pull out your cellphone to check the
weather in town and you see there is no cell coverage. You don’t worry, you’re not on call
and you have a good staff at the site, so you go back to fishing. The fish are biting, there
is great weather, and you decide to stay late on Sunday.
When you come home Sunday you are shocked by the number of voicemails and emails
you missed. It’s the one from your team that really hits you hard in the stomach: “Boss,
we’re not sure where you are and we hope the fishing is good, but we just lost the site. The
load dropped, all servers are down.” You immediately turn your car to the site and start
calling everyone. When you arrive you spend the next couple of hours explaining where
you were and trying to figure out what happened.
The next day you review the incident reports and start to put together a detailed
explanation of why your chiller plant failed, so you can complete a failure analysis report.
When the failure occurred, both your primary chiller and redundant chiller failed due to
high head pressure caused by high condenser water temperature. The cooling tower fans
were off and would not start. The tower fan VFDs were working, and when your team tried
to put them in hand the tower fans still would not operate. It took your team some time to
figure out the problem, and in the time it took to correct the issue the data center floor
overheated and servers started to fail. The vibration switches mounted
on the fans had tripped. Three of the four tower fans were down and the fourth fan could
not support the load. Once the cause was
discovered, the vibration switches were jumpered out, the fans started, and the chillers were
brought back online, but unfortunately not in time to keep the site up.
You remember that you were the only one on the job when the startup and
commissioning occurred; the rest of the team was hired after the site was turned over.
You remember something about those vibration limit switches during startup 3 years ago.
You look in the startup documents and there is no mention of any issue, so you make
some calls and track down the person who did the startup. You reach this person and ask
him if he remembers what the issue with the vibration switch was, and as luck would
have it he remembers that project. He tells you “Yes, we had a lot of issues with those
switches during the startup; we replaced one and the others were adjusted.” The startup
documents make no mention of the parts that were replaced. Had the replacement been
noted and listed in your system, you would have known you had a failed device and you
would have looked into the issues during startup. Whether a part fails in the factory,
during startup, or during the life of the equipment, tracking those failures and analyzing
the nature of each failure will allow you to predict future issues. During the PM visits since
the site was turned over to Operations, the vibration switches were never part of the
maintenance procedures, so they were not tested or adjusted. This led to the switches
tripping prior to any significant vibration. The Operations team was never trained on the
vibration switches and did not know they were installed or what their function was. Many
times equipment accessories are missed in training and preventative maintenance.
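The idea of tracking every replacement and adjustment across the life of the equipment can be sketched as a simple event log. The following is only an illustration under stated assumptions: the record fields, phase names, event names, and component tags (such as "CT-1 vibration switch") are hypothetical and are not taken from any particular commissioning tool or CMMS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EquipmentEvent:
    """One entry in a hypothetical equipment life-story log."""
    component: str  # e.g. "CT-1 vibration switch" (made-up tag)
    phase: str      # "factory", "startup", "commissioning", or "operation"
    event: str      # "replaced", "adjusted", "tested", ...


def untested_since_turnover(events: List[EquipmentEvent]) -> List[str]:
    """List components that were replaced or adjusted before turnover
    but have no recorded event at all during the operation phase."""
    early_issues = {
        e.component
        for e in events
        if e.phase != "operation" and e.event in ("replaced", "adjusted")
    }
    seen_in_operation = {e.component for e in events if e.phase == "operation"}
    return sorted(early_issues - seen_in_operation)


# Hypothetical log entries for illustration only.
log = [
    EquipmentEvent("CT-1 vibration switch", "startup", "replaced"),
    EquipmentEvent("CT-2 vibration switch", "startup", "adjusted"),
    EquipmentEvent("Chiller 1 oil pump", "startup", "adjusted"),
    EquipmentEvent("Chiller 1 oil pump", "operation", "tested"),
]

print(untested_since_turnover(log))
```

A query like this would have flagged the vibration switches as devices with a startup history and no PM coverage, long before they tripped.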
A week later you are at home. You get that dreaded phone call and rush back to site.
This time it’s the UPS system Module A. It has failed and dropped offline; however, the
site remains protected by the other modules. You call out your service rep and they find
that the AC filter capacitors failed. Again you sit down in your office the next morning to
fill out another failure analysis report, and in the process of reviewing past maintenance
records you notice that the AC filter current that is measured and recorded each visit was
about the same - up to the last PM, when it was 35% higher than the previous readings. It
was three months ago when those AC filter capacitors were telling you it was time to
replace them, but the data was tracked in a manner that couldn’t be analyzed.
That evening at home you start thinking about how much information you were able to
gather from the past documentation that you used to complete your failure analysis
report. Why not use the data to predict what might fail rather than explaining why it did?
Besides, a failure analysis report is just like a predictive analysis report; it’s just after the
fact. So you start to identify your potential issues and figure out a way to capture and
analyze the data.
When performing startup and ongoing maintenance, the testing scripts need to be written
in switch-level detail. The system that you use should be built on a database so you
can analyze the data and set thresholds. For example, do not write a test script that states
“check the oil pressure” and just have a checkbox next to it. First of all, a checkbox is not
sufficient data. You might have checked it but if the oil pressure was 2 PSI, I’m sure that
your engine will not like that for very long. Your test script should have stated “Record
the oil pressure” and it should list a tolerance level. You inspect the oil pressure, it’s 35
PSI, and you record it. Your system should have a min/max tolerance, and if the reading is
outside the acceptable range, you should get a failed condition and an issue log should be
automatically updated. You should also have a percent-change check: the system
averages all of the past readings for the oil pressure and raises a failed condition if the
new reading is more than x% away from that average. Another approach is rate of
change: a failed condition occurs if the reading changes by more than x% from the last reading.
With these three types of automation added to your logging tool, you will start to capture
and analyze items automatically, which will allow you to predict failure rates.
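The three checks described above (min/max tolerance, percent deviation from the historical average, and rate of change from the last reading) can be sketched in a few lines. This is a minimal illustration, not a description of any specific logging tool: the `Limits` fields, the function name, and all threshold and reading values below are assumptions chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Limits:
    """Hypothetical tolerance settings for one recorded reading."""
    minimum: float                 # lowest acceptable value
    maximum: float                 # highest acceptable value
    max_dev_from_avg_pct: float    # allowed % deviation from the average of past readings
    max_rate_of_change_pct: float  # allowed % change from the previous reading


def evaluate_reading(value: float, history: List[float], limits: Limits) -> List[str]:
    """Return the failed conditions for a new reading; an empty list means it passed."""
    failures = []

    # 1. Min/max tolerance check.
    if not (limits.minimum <= value <= limits.maximum):
        failures.append("out of min/max tolerance")

    if history:
        # 2. Percent deviation from the average of all past readings.
        avg = mean(history)
        if avg != 0 and abs(value - avg) / abs(avg) * 100 > limits.max_dev_from_avg_pct:
            failures.append("deviation from average")

        # 3. Rate of change relative to the last recorded reading.
        last = history[-1]
        if last != 0 and abs(value - last) / abs(last) * 100 > limits.max_rate_of_change_pct:
            failures.append("rate of change")

    return failures


# Illustrative oil pressure limits in PSI (assumed values).
oil_limits = Limits(minimum=25.0, maximum=60.0,
                    max_dev_from_avg_pct=10.0, max_rate_of_change_pct=10.0)

steady_history = [35.0, 36.0, 34.0, 35.0]
drifting_history = [40.0, 39.0, 38.0, 37.0, 36.0, 35.0, 34.0]

print(evaluate_reading(35.0, steady_history, oil_limits))
print(evaluate_reading(2.0, steady_history, oil_limits))
print(evaluate_reading(33.0, drifting_history, oil_limits))
```

The last call shows why the average check is worth having alongside the others: each step in the drifting history is small and the reading is still inside the min/max band, so only the deviation-from-average check would raise a failed condition.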
With a database as the backend for your commissioning and operations tool, you can
intelligently analyze critical points over a period of time. A set-point can drift over time
and may not be noticed because of the small changes from week to week, but with the
proper automation tools these points can be analyzed and alarms can be generated
automatically. You could have your building automation system run a script every
morning for the critical points that you are trending, and then write the result to an XML
file or a SOAP message that your automation tool could read as input for a daily PM task.
Now, not only is your team performing inspection and maintenance at the site, but
your Building Automation System is talking directly with your operations automation
tool (or CMMS), and you are gaining valuable data that is being analyzed automatically
and alerting you of potential issues.
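As a rough sketch of the hand-off described above, the operations tool could parse a daily XML trend report and flag any point outside its limits. The XML layout, the point names, and the limit values here are all hypothetical; a real building automation system would export its own schema, and the parsing would need to be adapted to it.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical daily trend report; this layout is an assumption for illustration.
SAMPLE_REPORT = """<trend_report date="2009-06-01">
  <point name="CHW_SUPPLY_TEMP" units="degF" value="44.8" min="42" max="46"/>
  <point name="CW_SUPPLY_TEMP" units="degF" value="91.5" min="65" max="85"/>
  <point name="UPS_A_FILTER_CURRENT" units="A" value="12.1" min="8" max="14"/>
</trend_report>"""


def find_out_of_range_points(xml_text: str) -> List[Tuple[str, float, str]]:
    """Return (name, value, units) for every point outside its min/max limits."""
    root = ET.fromstring(xml_text)
    issues = []
    for point in root.iter("point"):
        value = float(point.get("value"))
        low = float(point.get("min"))
        high = float(point.get("max"))
        if not (low <= value <= high):
            issues.append((point.get("name"), value, point.get("units")))
    return issues


# Each flagged point could become an entry in the daily PM task or issue log.
for name, value, units in find_out_of_range_points(SAMPLE_REPORT):
    print(f"ALERT: {name} = {value} {units} is outside its limits")
```

In this made-up report only the condenser water supply temperature is flagged, which is the kind of early warning that would have helped in the chiller plant story.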
Joe Soroka has been working in the mission-critical field for the past 26 years. He has
worked with many clients over the years with conceptual designs, commissioning,
operation and maintenance and failure analysis. Joe has commissioned over 8 million
square feet of mission-critical facilities. He has worked on developing and reviewing
Operations & Maintenance, Training and Safety programs, including Method Operating
Procedures for mission-critical facilities. If you have any questions you may contact Joe
Soroka at joe@zdtgroup.com or visit www.zdtgroup.com