Risk Assessments and Reliability, What You Need To Know

INFRASTRUCTURE
RELIABILITY AND
RISK
ASSESSMENTS
Steven Shapiro, P.E., ATD
Mission Critical Practice Lead
Morrison Hershfield
Mission Critical

Morrison Hershfield Mission Critical

WHAT YOU NEED TO KNOW
AGENDA

• RISK ASSESSMENT

• INFRASTRUCTURE RELIABILITY
COOLING POWER

Morrison Hershfield Mission Critical – Infrastructure and Risk Assessments

RISK ASSESSMENTS

• WHY

• SITE EVALUATION

• METRICS


Causes of Critical Failures: Lurking Vulnerabilities

• Location
• Design
• Redundancy level
• Construction
• Quality of equipment
• Age
• Operations & Maintenance program
• Personnel training
• Level of operator coverage
• Thoroughness of the commissioning program

WHY


• Equipment failure
• Operator error
• Natural disaster
• Design error
• Installation error
• Commissioning or test deficiency
• Maintenance oversight
• Equipment design



• Root cause not always easy to ascertain
• Combination of factors (Cascading Failures)
• Latent failures
• Most occur during change of state events
• More maintenance does not necessarily mean higher availability
• Non-Fault tolerant systems


Causes of failures (pie chart):

• Equipment Failure: 28%
• System Design: 20%
• Human Error: 18%
• Equipment Design: 13%
• Installation Error: 10%
• Commissioning or Test Deficiency: 4%
• Maintenance Oversight: 4%
• Natural Disaster: 3%


WHY DO A RISK ASSESSMENT

• Aligns the business mission with facility performance expectations

• Quantifies the risk and exposure of the critical facilities to failure

• Identifies vulnerabilities and single points of failure

• First step in creating an action plan for site hardening

• Benchmark against the industry

• Assists in developing business case for capital expenditures


SITE EVALUATION

STEP 1

• Quantify reliability expectations
• Develop resiliency metrics


SITE EVALUATION

STEP 2
• Develop PRA model (Probabilistic Risk Assessment)

• Identify Single Points of Failure within critical systems
• Evaluate redundancy of critical systems
• Capacity and expandability analysis
• Adequacy of Engineered Systems
• Operation and maintenance policies, practices and procedures
• Adequacy of maintenance and testing programs
• Evaluate risks associated with site location
• Overall Risk Analysis
• Evaluate the adequacy of operations and maintenance programs


SITE EVALUATION

STEP 2 cont.
• Harmonics analysis
• EMF studies
• Short circuit & coordination studies
• Airflow modeling (CFD)


SITE EVALUATION

STEP 3
• Perform gap analysis
STEP 4
• Recommendations for upgrade/alteration to optimize facility
performance
• Budget and schedule development
• Assess risk during implementation
• Benchmark findings with industry standards


RISK ASSESSMENT METRICS

• Probability of Failure/Reliability
• Availability
• MTTF
• MTTR
• Susceptibility to natural disasters
• Fault tolerance
• Single Points of Failure
• Maintainability
• Operational readiness
• Maintenance program


INFRASTRUCTURE RELIABILITY

• RELIABILITY / AVAILABILITY

• RELIABILITY MODELING

• RELIABILITY CONSIDERATIONS


RELIABILITY

• “Reliability” is used as an umbrella definition

• May Refer to Availability, Durability, Quality

• Five 9’s ????

• Reliability = Probability of Successful Operation


RELIABILITY AND AVAILABILITY

• Reliability predicts how likely the system is to fail.

• Availability is a measure (or a future prediction) of what percentage
of the time the system will be operating properly.


AVAILABILITY

Five 9’s refers to Availability

Availability (A) = Average fraction of time something is in service
and performing its intended function.

99.999% availability means:
• 5.3 minutes of downtime each year
or
• 1.77 hours of downtime every 20 years

Availability does not specify how often an outage occurs
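
The downtime figures above follow directly from the availability percentage. A minimal sketch, assuming a 365-day year (8,760 hours); the function name is illustrative:

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year, leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes per year for an availability given as a fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


five_nines = downtime_minutes_per_year(0.99999)
print(f"{five_nines:.2f} minutes per year")
print(f"{five_nines * 20 / 60:.2f} hours per 20 years")
```

The unrounded result is about 5.26 minutes per year, or about 1.75 hours over 20 years. The calculation says nothing about whether that downtime arrives as one long outage or as many short ones.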


AVAILABILITY

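Availability (A) = MTTF/(MTTF + MTTR) = MTTF/MTBF

MTTF: Mean Time To Failure
MTBF: Mean Time Between Failures
MTTR: Mean Time to Repair or Downtime
MTBF = MTTF + MTTR

When MTTR is much smaller than MTTF, MTBF is approximately equal to MTTF, and
the commonly quoted form A = MTBF/(MTBF + MTTR) gives nearly the same result.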
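
Taking MTBF = MTTF + MTTR as defined above, availability is the share of each failure cycle spent in service, that is, MTTF divided by MTBF. A minimal sketch of that arithmetic; the input values are hypothetical and not taken from the presentation:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time in service: uptime per cycle over the full cycle length."""
    mtbf_hours = mttf_hours + mttr_hours  # MTBF = MTTF + MTTR
    return mttf_hours / mtbf_hours


# Hypothetical component: fails on average every 99,996 operating hours
# and takes 4 hours to repair.
a = availability(mttf_hours=99_996, mttr_hours=4)
print(f"Availability: {a:.5f}")
```

Halving the repair time improves availability about as much as doubling the time to failure, which is why spare parts and trained operators appear alongside equipment quality in the risk metrics.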


RELIABILITY BATHTUB CURVE

[Figure: failure rate versus time (t) in years, with three regions: early life, useful life period, and wear-out. Axis marks at 0.5, 12 and 14 years.]


RELIABILITY MODELING

• Used to compare system designs and assist in the evaluation of
risk versus the cost to mitigate the risk.

• Failure and Repair data comes from IEEE 493, Recommended
Practice for Design of Reliable Industrial and Commercial Power
Systems (IEEE Gold Book)



Components used for reliability modeling of the electrical system shown
here:

• Utility power
• Generator
• Circuit breakers
• Switchboards
• Cables
• Automatic Transfer Switch
• UPS module
• Battery
• Static Bypass Switch
• Rack Power



Reliability Block Diagram (RBD)
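
A reliability block diagram is evaluated by combining blocks in series (all must work) and in parallel (at least one must work). The sketch below assumes independent failures, and the reliability values are hypothetical placeholders rather than IEEE 493 data:

```python
import math


def series(*reliabilities: float) -> float:
    """All blocks must work: multiply the individual reliabilities."""
    return math.prod(reliabilities)


def parallel(*reliabilities: float) -> float:
    """At least one block must work: one minus the product of the unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical values for a simplified power path:
# (utility OR generator) -> UPS module -> rack power
utility, generator, ups, rack_power = 0.999, 0.98, 0.9995, 0.9999

source = parallel(utility, generator)
path = series(source, ups, rack_power)
print(f"Source reliability: {source:.6f}")
print(f"Path reliability:   {path:.6f}")
```

A real model would also include the transfer switch, breakers, cables, battery and static bypass listed above, and would account for common-cause failures that this independence assumption ignores.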



Shown below are the results of the calculations

[Table: calculation results, two columns in hours]


THE TRADITIONAL CLASSIFICATION SYSTEM
The Uptime Institute
Tier 1 – Basic Non-Redundant Data Center
Single path for power and cooling distribution without redundant
components

Tier 2 – Basic Redundant Data Center
Single path for power and cooling distribution with redundant
components

Tier 3 – Concurrently Maintainable Data Center
Multiple paths for power and cooling distribution with only one path
active and with redundant components

Tier 4 – Fault Tolerant Data Center
Multiple active power and cooling distribution paths with redundant
components and fault tolerant


Tier Definitions

TIER REQUIREMENTS
                             Tier I   Tier II   Tier III              Tier IV
Number of Delivery Paths     1        1         1 Active, 1 Passive   2 Active
Redundancy                   N        N+1       N+1                   2N Minimum
Compartmentalization         No       No        No                    Yes
Concurrent Maintainability   No       No        Yes                   Yes
Fault Tolerance              No       No        No                    Yes
Availability (%)             99.67    99.75     99.982                99.995
Downtime in Hr/Yr            28.8     22        1.6                   0.4


Data Center Cost

From the UI

• Tier I - $10,000 US/kW of Useable UPS Power Output

• Tier II - $11,000 US/kW of Useable UPS Power Output

• Tier III - $20,000 US/kW of Useable UPS Power Output

• Tier IV - $22,000 US/kW of Useable UPS Power Output

• Plus $225 US/SF of Computer Room
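
The figures above can be combined into a rough budget estimate. This sketch assumes the $225 per square foot applies to every tier and is added to the per-kW cost; the facility size in the example is hypothetical:

```python
# Cost figures from the list above (US dollars)
COST_PER_KW = {"I": 10_000, "II": 11_000, "III": 20_000, "IV": 22_000}
COST_PER_SF = 225  # assumption: applies to computer room area for all tiers


def estimated_cost(tier: str, ups_kw: float, computer_room_sf: float) -> float:
    """Rough construction cost: per-kW component plus per-square-foot component."""
    return COST_PER_KW[tier] * ups_kw + COST_PER_SF * computer_room_sf


# Hypothetical facility: 2,000 kW of useable UPS output, 20,000 SF of computer room
for tier in COST_PER_KW:
    print(f"Tier {tier}: ${estimated_cost(tier, 2000, 20_000):,.0f}")
```

For this hypothetical facility, the step from Tier II to Tier III nearly doubles the per-kW component, which is where the business case for the required reliability level matters most.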


HOW MUCH REDUNDANCY IS ENOUGH?


Reliability Considerations

Assumptions

• Various configurations examined for single or dual utility feeders, UPS,
Generators, STS’s, single or dual cords

• Compare Reliability at 2000 KW and 4000 KW Load

• 5 Year Probability of Failure
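
A probability of failure over a fixed period can be approximated from an MTBF if the failure rate is assumed constant (the useful-life portion of the bathtub curve). This is a simplified sketch of that relationship, not the model used for the study; the MTBF value is hypothetical:

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365-day year


def probability_of_failure(mtbf_hours: float, years: float) -> float:
    """P(at least one failure within the period), assuming a constant failure rate."""
    mission_hours = years * HOURS_PER_YEAR
    return 1.0 - math.exp(-mission_hours / mtbf_hours)


# Hypothetical system with an MTBF of 500,000 hours, evaluated over 5 years
pf = probability_of_failure(mtbf_hours=500_000, years=5)
print(f"5-year probability of failure: {pf:.1%}")
```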


Single utility feeder, parallel redundant UPS and
generators, single cord IT equipment

2N UPS, N+1 Generators, ASTSs, Dual Cord Rack

Two Utility Feeders, 2(N+1) UPS, 2(N+1) Generators,
ASTSs, Dual Cord Rack

Distributed Redundant UPS, N+2 Generators, Two
Utility Feeders, ASTSs and Dual Cord Rack

Emergency Diesel Generators

fail to start

fail after ½ hour

fail after 8 hours

fail after 24 hours

Study Performed by Idaho National Engineering Laboratory – February 1996 at Nuclear Power Plants



• 2(N+1) UPS/Generator with dual utility feeders - most reliable
topology
• 2(N+1) UPS > 2N UPS by small margin
• 2N > Distributed Redundant by small margin
• Significant improvement if a second utility feeder
is provided
• N+2 and/or 2N generator systems are more reliable than N+1
• Hybrid configuration in a hybrid facility is sometimes the best solution
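
The benefit of N+2 over N+1 generators can be illustrated with a k-out-of-n calculation. This sketch assumes each generator starts independently with the same probability; the 0.98 start probability and N = 4 are hypothetical, and real plants also face common-cause failures (fuel, controls) that this ignores:

```python
from math import comb


def prob_at_least(n_required: int, n_installed: int, p_start: float) -> float:
    """Probability that at least n_required of n_installed units start (binomial)."""
    return sum(
        comb(n_installed, k) * p_start**k * (1.0 - p_start) ** (n_installed - k)
        for k in range(n_required, n_installed + 1)
    )


N = 4            # hypothetical number of generators needed to carry the load
P_START = 0.98   # hypothetical per-generator probability of a successful start

n_plus_1 = prob_at_least(N, N + 1, P_START)
n_plus_2 = prob_at_least(N, N + 2, P_START)
print(f"N+1: probability of insufficient capacity = {1 - n_plus_1:.5f}")
print(f"N+2: probability of insufficient capacity = {1 - n_plus_2:.5f}")
```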



• Assess the condition of the mechanical plant in conjunction with the
electrical system
• The facility reliability will be driven by the least reliable component
(typically the electrical infrastructure)


System Reliability Block

[Diagram: an "Electrical System" block (the electrical system powering the critical load) alongside "Electrical" and "Mechanical" blocks (the mechanical system supporting the critical load).]

System Reliability Block
                                          MTBF      Availability   Pf (3 years)
Electrical system alone                   330,184   0.99999        8.10%
Mechanical system alone                   178,611   0.999943       11.70%
Electrical system supporting mechanical   108,500   0.999985       21.40%
Overall mechanical system                 70,087    0.999931       29.20%
Combined electrical mechanical system     57,819    0.999922       36.90%

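
The combined row of the table can be approximated by treating the electrical system and the overall mechanical system as two blocks in series. This is a simplified sketch assuming independent blocks with constant failure rates; the presentation's own figures presumably come from a fuller model, so small differences are expected:

```python
import math


def series_mtbf(*mtbfs: float) -> float:
    """Series MTBF: failure rates (1/MTBF) add when any block failure fails the system."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


def series_availability(*availabilities: float) -> float:
    """Series availability: every block must be available at the same time."""
    return math.prod(availabilities)


# Values from the table: electrical system alone, overall mechanical system
combined_mtbf = series_mtbf(330_184, 70_087)
combined_avail = series_availability(0.99999, 0.999931)
print(f"Combined MTBF:         {combined_mtbf:,.0f}")
print(f"Combined availability: {combined_avail:.6f}")
```

Both results land close to the table's combined values (57,819 and 0.999922), and the calculation shows why the less reliable block dominates the combined figure.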


The Cost of Reliability

[Chart: reliability (%) on the vertical axis, with levels 99.0, 99.9, 99.99, 99.999 and 99.9999, versus cost on the horizontal axis, from $ to $$$$$.]


Key Takeaways – Risk Assessment

• What Reliability Level Do you Really Need Based on Your Business
Case?

• Minimize Single Points of Failure

• Concurrent Maintainability?

• Fault Tolerance?

• Ensure Adequacy of Operations, Maintenance and Testing Programs

• How to justify the cost to upgrade from present state?


Key Takeaways – Reliability

• Design objective – find optimum compromise between cost and reliability
• Size matters – larger facilities yield lower reliability
• System architecture and design implementation play a more important role
than equipment selection
• Segregate the system into independent blocks
• Eliminate common source components to minimize fault propagation (e.g.
LBS, hot-tie, manual bus ties)
• Move single points of failure as close to the load as possible
• Always maintain two independent sources of power to the critical load
• Optimize the design of monitoring and controls circuits
• Keep it simple/minimize human intervention/Utilize Automation


QUESTIONS?
Thank you and please feel free to contact me

Steven Shapiro, PE, ATD
SShapiro@MorrisonHershfield.com
914.420.3213
http://www.linkedin.com/in/stevenshapirope
References:
Uptime Institute White Papers:
Tier Myths and Misconceptions
Data Center Site Infrastructure Tier Standard: Topology

Building Areas/Systems Reviewed

• General Construction
• Electrical
• Mechanical
• Plumbing and Fire Protection
• Operation and Maintenance
• Security
• Load Density


Site Reliability
• Is Project Compatible With Zoning
• Natural Environment Issues
  - Seismic Zone
  - Geotechnical Reports
  - Subsurface Conditions
  - Tornado/Hurricane Risk
  - Site Flood Potential
  - Fire Potential
  - Site Topography
  - Weather Extremes
• Man-Made Environment Issues
  - Power/Data and Communication/Water Supply/Sanitary Sewer Availability
  - ISP Connectivity to Mirror and DR Sites
  - Proximity of Hazardous Operational Facilities, e.g. Nuclear Power Plants, Military Bases,
    Chemical Plants, Tank Farms, Water/Sewage Treatment Plants, Dams/Reservoirs, Gas
    Stations, etc.
  - Distance to Airports & Freeways
  - Distance to Emergency Services, e.g. Fire and Police Departments, Hospital


• Building Utilities and Physical Issues
  - General building systems and area characteristics
  - Life safety and environmental
• Electrical Systems
  - Utility feeders
  - Service entry
  - Base building electrical distribution system including busways, step-down
    transformers, switchgear and distribution panels
  - Uninterruptible power supply (UPS) systems
  - Battery systems
  - Power Distribution System including the critical computer rooms
  - Emergency/standby generator and fuel system
  - Normal/standby power transfer switchgear
  - Grounding
  - Emergency Power Off Systems
  - Lightning protection system
  - Fire alarm and smoke detection systems


• Mechanical Systems
  - Critical Systems Chilled Water Plant: Chillers, pumps, piping distribution system,
    controls, etc.
  - Critical Systems Condenser Water System: Cooling towers, pumps, piping, etc.
  - Critical Systems Air Handling Systems
  - Critical Systems Air Distribution
  - Critical Systems Secondary Chilled Water Loop
  - Fuel Oil Systems
  - Boiler Systems
  - Compressed Air Systems
• Plumbing Systems
  - Domestic Water Systems
  - Natural Gas Systems
  - Fire Suppression Systems (Water and Gaseous)
• Operation and Maintenance of the Critical Support Systems
  - Maintenance procedures and programs
  - Normal operating procedures
  - Emergency operating procedures
  - Training programs and methods
  - Spare parts


• Building Automation
  - Building Automation Systems
  - Physical Security Systems
  - Access control
  - Intrusion detection
  - CCTV systems
  - ID badging systems
  - Intercom systems
  - Smoke Purge Systems
• Technology Systems
  - Entrance Facility Feeds
  - Telephone Company Services
• Systems Integration
  - The integration, compatibility and interaction of the above systems with each
    other, as well as with the other building elements, will be reviewed to ensure that
    the systems are compatible and fully integrated.