IT Infrastructure Architecture
Availability Concepts
(chapter 4)
Infrastructure Building Blocks and Concepts
Introduction
• Everyone expects their infrastructure to be available all the time
• A 100% guaranteed availability of an infrastructure is impossible
Calculating availability
• Availability can neither be calculated, nor
guaranteed upfront
– It can only be reported on afterwards, when a system
has run for some years
• Over the years, much knowledge and experience
has been gained on how to design highly available systems
– Failover
– Redundancy
– Structured programming
– Avoiding Single Points of Failure (SPOFs)
– Implementing systems management
Calculating availability
• The availability of a system is usually expressed as
a percentage of uptime in a given time period
– Usually one year or one month
• Example: maximum downtime for a given availability percentage:

Availability %            Downtime per year   Downtime per month   Downtime per week
99.8%                     17.5 hours          86.4 minutes         20.2 minutes
99.9% ("three nines")     8.8 hours           43.2 minutes         10.1 minutes
99.99% ("four nines")     52.6 minutes        4.3 minutes          1.0 minutes
99.999% ("five nines")    5.3 minutes         25.9 seconds         6.0 seconds
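Downtime figures of this kind can be derived from the availability percentage alone. The following is a minimal sketch, assuming a 365-day year, a 30-day month, and a 7-day week; the function name and period lengths are illustrative choices, and results may differ in the last digit from published tables that use other period lengths or rounding.

```python
# Downtime allowed by an availability percentage over a period.
# Assumption: a 365-day year, a 30-day month, and a 7-day week.
PERIOD_HOURS = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def allowed_downtime_minutes(availability_pct: float, period: str) -> float:
    """Return the maximum downtime in minutes for the given period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * PERIOD_HOURS[period] * 60.0


if __name__ == "__main__":
    for pct in (99.8, 99.9, 99.99, 99.999):
        row = [allowed_downtime_minutes(pct, p) for p in ("year", "month", "week")]
        print(f"{pct}%: " + ", ".join(f"{m:.2f} min" for m in row))
```

Working in minutes throughout keeps the units consistent; converting to hours or seconds for display is left to the caller.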
Calculating availability
• Typical requirements used in service level
agreements today are 99.8% or 99.9% availability
per month for a full IT system
• The availability of the infrastructure must be
much higher
– Typically in the range of 99.99% or higher
• 99.999% uptime is also known as carrier grade
availability
– For one component
– Higher availability levels for a complete system are
very uncommon, as they are almost impossible to
reach
Calculating availability
• It is good practice to agree on the maximum
frequency of unavailability
Unavailability (minutes)   Number of events (per year)
0 – 5                      <= 35
5 – 10                     <= 10
10 – 20                    <= 5
20 – 30                    <= 2
> 30                       <= 1
MTBF and MTTR
• Mean Time Between Failures (MTBF)
– The average time that passes between failures
• Mean Time To Repair (MTTR)
– The average time it takes to recover from a failure
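Both figures can be estimated from an incident log. The sketch below assumes one common convention: MTBF is total uptime divided by the number of failures, and MTTR is the mean repair time. The function name and the sample log are hypothetical.

```python
def mtbf_mttr(observed_hours: float, repair_hours: list[float]) -> tuple[float, float]:
    """Estimate MTBF and MTTR (in hours) from one observation period.

    Assumption: MTBF is taken as total uptime divided by the number of
    failures, and MTTR as the mean repair time.
    """
    if not repair_hours:
        raise ValueError("no failures observed; MTBF cannot be estimated")
    downtime = sum(repair_hours)
    mttr = downtime / len(repair_hours)
    mtbf = (observed_hours - downtime) / len(repair_hours)
    return mtbf, mttr


if __name__ == "__main__":
    # Hypothetical log: one year of operation (8760 h) with three failures.
    mtbf, mttr = mtbf_mttr(8760, [2.0, 6.0, 4.0])
    print(f"MTBF = {mtbf:.0f} h, MTTR = {mttr:.1f} h")
```

With only a handful of failures the estimate is rough; the values become meaningful only over long observation periods or large populations of components.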
MTBF and MTTR
• Some components have higher MTBF than others
• Some typical MTBFs:

Component                 MTBF (hours)
Hard disk                 750,000
Power supply              100,000
Fan                       100,000
Ethernet Network Switch   350,000
RAM                       1,000,000
MTTR
• MTTR can be kept low by:
– Having a service contract with the supplier
– Having spare parts on-site
– Automated redundancy and failover
MTTR
• Steps to complete repairs:
– Notification of the fault (time before seeing an alarm
message)
– Processing the alarm
– Finding the root cause of the error
– Looking up repair information
– Getting spare components from storage
– Having a technician come to the datacenter with the
spare component
– Physically repairing the fault
– Restarting and testing the component
Calculation examples
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \times 100\%$$
Component                            MTBF (h)    MTTR (h)   Availability   Availability in %
Power supply                         100,000     8          0.9999200      99.99200
Fan                                  100,000     8          0.9999200      99.99200
System board                         300,000     8          0.9999733      99.99733
Memory                               1,000,000   8          0.9999920      99.99920
CPU                                  500,000     8          0.9999840      99.99840
Network Interface Controller (NIC)   250,000     8          0.9999680      99.99680
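The availability column can be recomputed from the MTBF and MTTR columns. This is a minimal sketch; the function name is an illustrative choice, and the component values are the ones listed in the table above.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction between 0 and 1: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# MTBF values (hours) from the table above; every component has an MTTR of 8 hours.
COMPONENTS = {
    "Power supply": 100_000,
    "Fan": 100_000,
    "System board": 300_000,
    "Memory": 1_000_000,
    "CPU": 500_000,
    "NIC": 250_000,
}

if __name__ == "__main__":
    for name, mtbf in COMPONENTS.items():
        a = availability(mtbf, 8)
        print(f"{name:<14} {a:.7f}  {a * 100:.5f}%")
```

Because MTTR sits in the denominator, halving the repair time has roughly the same effect on unavailability as doubling the MTBF.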
Calculation examples
• Serial components: One defect leads to downtime
• Example: the above system's availability is:
0.9999200 × 0.9999200 × 0.9999733
× 0.9999920 × 0.9999840 × 0.9999680
= 0.99976 = 99.976%
(lower than that of any single component, even though each component's availability is at least 99.99%)
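The serial calculation is a plain product of the component availabilities. The sketch below assumes that component failures are independent and that any single failure takes the whole system down; the function name is illustrative.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Availability of components in series: the product of all availabilities.

    Assumption: failures are independent and any one failure causes downtime.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Component availabilities from the example above.
    parts = [0.9999200, 0.9999200, 0.9999733, 0.9999920, 0.9999840, 0.9999680]
    total = serial_availability(parts)
    print(f"{total:.7f} ({total * 100:.3f}%)")
```

Every extra serial component multiplies in another factor below 1, so the system figure can only go down as components are added.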
Calculation examples
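• Parallel components: One defect: no downtime!
• But beware of SPOFs!
• Calculate availability for n identical parallel components, each with availability A_1:
$$A = 1 - (1 - A_1)^n$$
• Example: two parallel components, each 99% available:
$$A_{\text{total}} = 1 - (1 - 0.99)^2 = 0.9999 = 99.99\%$$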
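The parallel formula can be sketched in a few lines. The sketch assumes n identical components whose failures are independent, with the system staying up as long as at least one component works; the function name is illustrative.

```python
def parallel_availability(a_single: float, n: int) -> float:
    """Availability of n identical components in parallel: 1 - (1 - A1)^n.

    Assumption: failures are independent and one working component is enough.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - a_single) ** n


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"n={n}: {parallel_availability(0.99, n) * 100:.4f}%")
```

The independence assumption is the weak point in practice: a shared power feed or switch is a SPOF that makes the parallel components fail together.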
Sources of unavailability - human errors
• 80% of outages impacting mission-critical
services are caused by people and process issues
• Examples:
• Performing a test in the production environment
• Switching off the wrong component for repair
• Swapping a good working disk in a RAID set instead of the
defective one
• Restoring the wrong backup tape to production
• Accidentally removing files
– Mail folders, configuration files
• Accidentally removing database entries
– Drop table x instead of drop table y
Sources of unavailability - software bugs
• Because of the complexity of most software it
is nearly impossible (and very costly) to create
bug-free software
• Application software bugs can stop an entire
system
• Operating systems are software too
– Operating systems containing bugs can lead to
corrupted file systems, network failures, or other
sources of unavailability
Sources of unavailability - planned maintenance
• Planned maintenance is sometimes needed to perform systems
management tasks:
– Upgrading hardware or software
– Implementing software changes
– Migrating data
– Creation of backups
• Should only be performed on parts of the
infrastructure where other parts keep serving clients
• During planned maintenance the system is more
vulnerable to downtime than under normal
circumstances
– A temporary SPOF could be introduced
– Systems managers could make mistakes
Sources of unavailability - physical defects
• Everything breaks down eventually
• Mechanical parts are most likely to break first
• Examples:
– Fans for cooling equipment usually break because
of dust in the bearings
– Disk drives contain moving parts
– Tapes are very vulnerable to defects as the tape is
spun on and off the reels all the time
– Tape drives contain very sensitive pieces of
mechanics that can break easily
Sources of unavailability - bathtub curve
• A component failure is most likely when the
component is new
• When a component still works after the first month, it
is likely that it will continue working without failure
until the end of its life
Sources of unavailability - environmental issues
• Environmental issues can cause downtime:
– Failing facilities
• Power
• Cooling
– Disasters
• Fire
• Earthquakes
• Flooding
Sources of unavailability - complexity of the infrastructure
• Adding more components to an overall system
design can undermine high availability
– Even if the extra components are implemented to
achieve high availability
• Complex systems
– Have more potential points of failure
– Are more difficult to implement correctly
– Are harder to manage
• Sometimes it is better to just have an extra spare
system in the closet than to use complex
redundant systems
Redundancy
• Redundancy is the duplication of critical
components in a single system, to avoid a
single point of failure (SPOF)
• Examples:
– A single component having two power supplies; if
one fails, the other takes over
– Dual networking interfaces
– Redundant cabling
Failover
• Failover is the (semi)automatic switch-over to
a standby system or component
• Examples:
– Windows Server failover clustering
– VMware High Availability
– Oracle Real Application Cluster (RAC) database
Fallback
• Fallback is the manual switchover to an
identical standby computer system in a
different location
• Typically used for disaster recovery
• Three basic forms of fallback solutions:
– Hot site
– Cold site
– Warm site
Fallback - hot site
• A hot site is a fully configured fallback datacenter
– Fully equipped with power and cooling
– Applications are installed on the servers
– Data is kept up-to-date to fully mirror the production
system
• Requires constant maintenance of the hardware,
software, data, and applications to be sure the
site accurately mirrors the state of the production
site at all times
Fallback - cold site
• A cold site is ready for equipment to be brought in during
an emergency, but no computer hardware is
available at the site
• Applications will need to be installed and
current data fully restored from backups
• If an organization has very little budget for a
fallback site, a cold site may be better than
nothing
Fallback - warm site
• A warm site is a computer facility readily available with
power, cooling, and computers, but the
applications may not be installed or
configured
• A mix between a hot site and cold site
• Applications and data must be restored from
backup media and tested
– This typically takes a day
Business Continuity
• An IT disaster is defined as an irreparable
problem in a datacenter, making the datacenter
unusable
• Natural disasters:
– Floods
– Hurricanes
– Tornadoes
– Earthquakes
• Manmade disasters:
– Hazardous material spills
– Infrastructure failure
– Bio-terrorism
Business Continuity
• In case of a disaster, the infrastructure could
become unavailable, in some cases for a longer
period of time
• Business Continuity Management includes:
– IT
– Managing business processes
– Availability of people and work places in disaster
situations
• Disaster recovery planning (DRP) contains a set of
measures to take in case of a disaster, when
(parts of) the IT infrastructure must be
accommodated in an alternative location
RTO and RPO
• RTO and RPO are objectives in case of a disaster
• Recovery Time Objective (RTO)
– The maximum duration of time within which a
business process must be restored after a disaster, in
order to avoid unacceptable consequences (like
bankruptcy)
RTO and RPO
• Recovery Point Objective (RPO)
– The point in time to which data must be recovered
considering some "acceptable loss" in a disaster
situation
• RTO and RPO are individual objectives
– They are not related
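That the two objectives are checked separately can be illustrated with a short sketch. It assumes periodic backups, so the worst-case data loss is one full backup interval; the class, field names, and sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_hours: float  # hypothetical: time between two backups
    restore_time_hours: float     # hypothetical: measured time to restore service


def meets_objectives(plan: RecoveryPlan, rto_hours: float, rpo_hours: float) -> tuple[bool, bool]:
    """Return (rto_ok, rpo_ok) for the plan.

    Assumption: with periodic backups, the worst-case data loss is one
    full backup interval.
    """
    rto_ok = plan.restore_time_hours <= rto_hours
    rpo_ok = plan.backup_interval_hours <= rpo_hours
    return rto_ok, rpo_ok


if __name__ == "__main__":
    # Nightly backups with a four-hour restore, against RTO = 8 h and RPO = 1 h.
    plan = RecoveryPlan(backup_interval_hours=24, restore_time_hours=4)
    print(meets_objectives(plan, rto_hours=8, rpo_hours=1))
```

In this hypothetical case the plan meets the RTO but misses the RPO: restoring quickly does not reduce how much data is lost, and backing up more often does not shorten the restore.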

More Related Content

What's hot

11. operating-systems-part-1
11. operating-systems-part-111. operating-systems-part-1
11. operating-systems-part-1
Muhammad Ahad
 
11. operating-systems-part-2
11. operating-systems-part-211. operating-systems-part-2
11. operating-systems-part-2
Muhammad Ahad
 
01. 02. introduction (13 slides)
01.   02. introduction (13 slides)01.   02. introduction (13 slides)
01. 02. introduction (13 slides)
Muhammad Ahad
 
09. storage-part-1
09. storage-part-109. storage-part-1
09. storage-part-1
Muhammad Ahad
 
10. compute-part-1
10. compute-part-110. compute-part-1
10. compute-part-1
Muhammad Ahad
 
Troubleshooting complex layer 2 issues ppt 16 bsit098
Troubleshooting complex  layer 2 issues ppt 16 bsit098Troubleshooting complex  layer 2 issues ppt 16 bsit098
Troubleshooting complex layer 2 issues ppt 16 bsit098
Quratulain baloch
 
08. networking
08. networking08. networking
08. networking
Muhammad Ahad
 
IP tables and Filtering
IP tables and FilteringIP tables and Filtering
IP tables and Filtering
Aisha Talat
 
10. compute-part-2
10. compute-part-210. compute-part-2
10. compute-part-2
Muhammad Ahad
 
Introduction to Network and System Administration
Introduction to Network and System AdministrationIntroduction to Network and System Administration
Introduction to Network and System Administration
Duressa Teshome
 
Distributed system architecture
Distributed system architectureDistributed system architecture
Distributed system architecture
Yisal Khan
 
Chapter01
Chapter01Chapter01
Chapter01
Muhammad Ahad
 
Chapter07
Chapter07Chapter07
Chapter07
Muhammad Ahad
 
System and network administration network services
System and network administration network servicesSystem and network administration network services
System and network administration network services
Uc Man
 
Software Architecture
Software ArchitectureSoftware Architecture
Software Architecture
Dharmalingam Ganesan
 
Pressman ch-11-component-level-design
Pressman ch-11-component-level-designPressman ch-11-component-level-design
Pressman ch-11-component-level-designOliver Cheng
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systemsViet-Trung TRAN
 
Chapter08
Chapter08Chapter08
Chapter08
Muhammad Ahad
 
Distributed Operating Systems
Distributed Operating SystemsDistributed Operating Systems
Distributed Operating Systems
Ummiya Mohammedi
 
Network monitoring tools
Network monitoring toolsNetwork monitoring tools
Network monitoring tools
Chathurangi Shyalika
 

What's hot (20)

11. operating-systems-part-1
11. operating-systems-part-111. operating-systems-part-1
11. operating-systems-part-1
 
11. operating-systems-part-2
11. operating-systems-part-211. operating-systems-part-2
11. operating-systems-part-2
 
01. 02. introduction (13 slides)
01.   02. introduction (13 slides)01.   02. introduction (13 slides)
01. 02. introduction (13 slides)
 
09. storage-part-1
09. storage-part-109. storage-part-1
09. storage-part-1
 
10. compute-part-1
10. compute-part-110. compute-part-1
10. compute-part-1
 
Troubleshooting complex layer 2 issues ppt 16 bsit098
Troubleshooting complex  layer 2 issues ppt 16 bsit098Troubleshooting complex  layer 2 issues ppt 16 bsit098
Troubleshooting complex layer 2 issues ppt 16 bsit098
 
08. networking
08. networking08. networking
08. networking
 
IP tables and Filtering
IP tables and FilteringIP tables and Filtering
IP tables and Filtering
 
10. compute-part-2
10. compute-part-210. compute-part-2
10. compute-part-2
 
Introduction to Network and System Administration
Introduction to Network and System AdministrationIntroduction to Network and System Administration
Introduction to Network and System Administration
 
Distributed system architecture
Distributed system architectureDistributed system architecture
Distributed system architecture
 
Chapter01
Chapter01Chapter01
Chapter01
 
Chapter07
Chapter07Chapter07
Chapter07
 
System and network administration network services
System and network administration network servicesSystem and network administration network services
System and network administration network services
 
Software Architecture
Software ArchitectureSoftware Architecture
Software Architecture
 
Pressman ch-11-component-level-design
Pressman ch-11-component-level-designPressman ch-11-component-level-design
Pressman ch-11-component-level-design
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
 
Chapter08
Chapter08Chapter08
Chapter08
 
Distributed Operating Systems
Distributed Operating SystemsDistributed Operating Systems
Distributed Operating Systems
 
Network monitoring tools
Network monitoring toolsNetwork monitoring tools
Network monitoring tools
 

Viewers also liked

Chapter06
Chapter06Chapter06
Chapter06
Muhammad Ahad
 
Chapter03
Chapter03Chapter03
Chapter03
Muhammad Ahad
 
Chapter13
Chapter13Chapter13
Chapter13
Muhammad Ahad
 
Chapter14
Chapter14Chapter14
Chapter14
Muhammad Ahad
 
Chapter02
Chapter02Chapter02
Chapter02
Muhammad Ahad
 
Chapter04
Chapter04Chapter04
Chapter04
Muhammad Ahad
 
Chapter05
Chapter05Chapter05
Chapter05
Muhammad Ahad
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence
Muhammad Ahad
 

Viewers also liked (8)

Chapter06
Chapter06Chapter06
Chapter06
 
Chapter03
Chapter03Chapter03
Chapter03
 
Chapter13
Chapter13Chapter13
Chapter13
 
Chapter14
Chapter14Chapter14
Chapter14
 
Chapter02
Chapter02Chapter02
Chapter02
 
Chapter04
Chapter04Chapter04
Chapter04
 
Chapter05
Chapter05Chapter05
Chapter05
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence
 

Similar to 04. availability-concepts

501 ch 9 implementing controls
501 ch 9 implementing controls501 ch 9 implementing controls
501 ch 9 implementing controls
gocybersec
 
It security for libraries part 3 - disaster recovery
It security for libraries part 3 - disaster recovery It security for libraries part 3 - disaster recovery
It security for libraries part 3 - disaster recovery
Brian Pichman
 
Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)
Peter Tröger
 
KoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBegan
KoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBeganKoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBegan
KoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBegan
Tobias Koprowski
 
Disaster Recover : 10 tips for disaster recovery planning
Disaster Recover : 10 tips for disaster recovery planningDisaster Recover : 10 tips for disaster recovery planning
Disaster Recover : 10 tips for disaster recovery planning
InTechnology Managed Services (part of Redcentric)
 
Real-Time Operating Systems Real-Time Operating Systems RTOS .ppt
Real-Time Operating Systems Real-Time Operating Systems RTOS .pptReal-Time Operating Systems Real-Time Operating Systems RTOS .ppt
Real-Time Operating Systems Real-Time Operating Systems RTOS .ppt
lematadese670
 
KoprowskiT_2AMaDisasterJustBeganAD2018
KoprowskiT_2AMaDisasterJustBeganAD2018KoprowskiT_2AMaDisasterJustBeganAD2018
KoprowskiT_2AMaDisasterJustBeganAD2018
Tobias Koprowski
 
Fdp embedded systems
Fdp embedded systemsFdp embedded systems
Fdp embedded systems
Kavya G
 
MySQL enterprise backup overview
MySQL enterprise backup overviewMySQL enterprise backup overview
MySQL enterprise backup overview郁萍 王
 
High availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication SystemHigh availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication System
Scott Moonen
 
Real-Time Systems Intro.pptx
Real-Time Systems Intro.pptxReal-Time Systems Intro.pptx
Real-Time Systems Intro.pptx
20EUEE018DEEPAKM
 
Hadoop availability
Hadoop availabilityHadoop availability
Hadoop availability
Subhas Kumar Ghosh
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
Amit Kejriwal
 
Weeks [01 02] 20100921
Weeks [01 02] 20100921Weeks [01 02] 20100921
Weeks [01 02] 20100921Mohamed Kamel
 
Mtc learnings from isv & enterprise interaction
Mtc learnings from isv & enterprise  interactionMtc learnings from isv & enterprise  interaction
Mtc learnings from isv & enterprise interaction
Govind Kanshi
 
Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)
Govind Kanshi
 
L11 system maintenance
L11 system maintenanceL11 system maintenance
L11 system maintenanceOMWOMA JACKSON
 
A Backup Today Saves Tomorrow
A Backup Today Saves TomorrowA Backup Today Saves Tomorrow
A Backup Today Saves Tomorrow
Andrew Moore
 
Dr and ha solutions with sql server azure
Dr and ha solutions with sql server azureDr and ha solutions with sql server azure
Dr and ha solutions with sql server azure
MSDEVMTL
 

Similar to 04. availability-concepts (20)

501 ch 9 implementing controls
501 ch 9 implementing controls501 ch 9 implementing controls
501 ch 9 implementing controls
 
It security for libraries part 3 - disaster recovery
It security for libraries part 3 - disaster recovery It security for libraries part 3 - disaster recovery
It security for libraries part 3 - disaster recovery
 
Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)Dependable Systems - Introduction (1/16)
Dependable Systems - Introduction (1/16)
 
KoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBegan
KoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBeganKoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBegan
KoprowskiT_PASSEastMidsFEB16_2AMaDisasterJustBegan
 
Sw2 1
Sw2 1Sw2 1
Sw2 1
 
Disaster Recover : 10 tips for disaster recovery planning
Disaster Recover : 10 tips for disaster recovery planningDisaster Recover : 10 tips for disaster recovery planning
Disaster Recover : 10 tips for disaster recovery planning
 
Real-Time Operating Systems Real-Time Operating Systems RTOS .ppt
Real-Time Operating Systems Real-Time Operating Systems RTOS .pptReal-Time Operating Systems Real-Time Operating Systems RTOS .ppt
Real-Time Operating Systems Real-Time Operating Systems RTOS .ppt
 
KoprowskiT_2AMaDisasterJustBeganAD2018
KoprowskiT_2AMaDisasterJustBeganAD2018KoprowskiT_2AMaDisasterJustBeganAD2018
KoprowskiT_2AMaDisasterJustBeganAD2018
 
Fdp embedded systems
Fdp embedded systemsFdp embedded systems
Fdp embedded systems
 
MySQL enterprise backup overview
MySQL enterprise backup overviewMySQL enterprise backup overview
MySQL enterprise backup overview
 
High availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication SystemHigh availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication System
 
Real-Time Systems Intro.pptx
Real-Time Systems Intro.pptxReal-Time Systems Intro.pptx
Real-Time Systems Intro.pptx
 
Hadoop availability
Hadoop availabilityHadoop availability
Hadoop availability
 
Building data intensive applications
Building data intensive applicationsBuilding data intensive applications
Building data intensive applications
 
Weeks [01 02] 20100921
Weeks [01 02] 20100921Weeks [01 02] 20100921
Weeks [01 02] 20100921
 
Mtc learnings from isv & enterprise interaction
Mtc learnings from isv & enterprise  interactionMtc learnings from isv & enterprise  interaction
Mtc learnings from isv & enterprise interaction
 
Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)
 
L11 system maintenance
L11 system maintenanceL11 system maintenance
L11 system maintenance
 
A Backup Today Saves Tomorrow
A Backup Today Saves TomorrowA Backup Today Saves Tomorrow
A Backup Today Saves Tomorrow
 
Dr and ha solutions with sql server azure
Dr and ha solutions with sql server azureDr and ha solutions with sql server azure
Dr and ha solutions with sql server azure
 

Recently uploaded

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 

Recently uploaded (20)

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 

04. availability-concepts

  • 1. IT Infrastructure Architecture Availability Concepts (chapter 4) Infrastructure Building Blocks and Concepts
  • 2. Introduction • Everyone expects their infrastructure to be available all the time • A 100% guaranteed availability of an infrastructure is impossible
  • 3. Calculating availability • Availability can neither be calculated nor guaranteed upfront – It can only be reported on afterwards, when a system has run for some years • Over the years, much knowledge and experience has been gained on how to design highly available systems – Failover – Redundancy – Structured programming – Avoiding Single Points of Failure (SPOFs) – Implementing systems management
  • 4. Calculating availability • The availability of a system is usually expressed as a percentage of uptime in a given time period – Usually one year or one month • Example of the downtime that corresponds to a given availability percentage:
      Availability %           Downtime per year   Downtime per month   Downtime per week
      99.8%                    17.5 hours          86.4 minutes         20.2 minutes
      99.9% ("three nines")    8.8 hours           43.2 minutes         10.1 minutes
      99.99% ("four nines")    52.6 minutes        4.3 minutes          1.0 minutes
      99.999% ("five nines")   5.3 minutes         25.9 seconds         6.1 seconds
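The downtime figures on this slide follow directly from the availability percentage. A minimal Python sketch, assuming a 365-day year, a 30-day month, and a 7-day week; the function name `downtime_minutes` and the period table are illustrative and not part of the slides:

```python
# Period lengths in minutes. The 30-day month is an assumption.
PERIOD_MINUTES = {
    "year": 365 * 24 * 60,
    "month": 30 * 24 * 60,
    "week": 7 * 24 * 60,
}


def downtime_minutes(availability_pct: float, period: str) -> float:
    """Return the downtime in minutes implied by an availability percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * PERIOD_MINUTES[period]


for pct in (99.8, 99.9, 99.99, 99.999):
    row = {p: round(downtime_minutes(pct, p), 1) for p in PERIOD_MINUTES}
    print(pct, row)
```

Dividing the yearly result by 60 gives hours, and multiplying the weekly result by 60 gives seconds, which is how the mixed units in the table arise.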
  • 5. Calculating availability • Typical requirements used in service level agreements today are 99.8% or 99.9% availability per month for a full IT system • The availability of the infrastructure must be much higher – Typically in the range of 99.99% or higher • 99.999% uptime is also known as carrier grade availability – For one component – Higher availability levels for a complete system are very uncommon, as they are almost impossible to reach
  • 6. Calculating availability • It is good practice to agree on the maximum frequency of unavailability
      Unavailability (minutes)   Number of events (per year)
      0 – 5                      <= 35
      5 – 10                     <= 10
      10 – 20                    <= 5
      20 – 30                    <= 2
      > 30                       <= 1
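An agreed frequency table like this one can be checked against a log of outages. The sketch below is only an illustration: the slide does not say which band a boundary value such as exactly 5 minutes falls into, so the code assumes the lower bound is exclusive and the upper bound inclusive. The `violations` function and the sample outage list are hypothetical.

```python
# (lower bound, upper bound, maximum events per year), durations in minutes.
# Assumption: a duration m falls in a band when lower < m <= upper.
LIMITS = [
    (0, 5, 35),
    (5, 10, 10),
    (10, 20, 5),
    (20, 30, 2),
    (30, float("inf"), 1),
]


def violations(outage_minutes):
    """Return (lower, upper, count, limit) for every band that is exceeded."""
    exceeded = []
    for low, high, max_events in LIMITS:
        count = sum(1 for m in outage_minutes if low < m <= high)
        if count > max_events:
            exceeded.append((low, high, count, max_events))
    return exceeded


# Hypothetical outage durations recorded over one year.
outages = [3, 4, 12, 25, 28, 22, 45]
print(violations(outages))
```

With the sample data, three outages fall in the 20 to 30 minute band, one more than the agreed maximum of two, so that band is reported.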
  • 7. MTBF and MTTR • Mean Time Between Failures (MTBF) – The average time that passes between failures • Mean Time To Repair (MTTR) – The average time it takes to recover from a failure
  • 8. MTBF and MTTR • Some components have a higher MTBF than others • Some typical MTBFs:
      Component                 MTBF (hours)
      Hard disk                 750,000
      Power supply              100,000
      Fan                       100,000
      Ethernet network switch   350,000
      RAM                       1,000,000
  • 9. MTTR • MTTR can be kept low by: – Having a service contract with the supplier – Having spare parts on-site – Automated redundancy and failover
  • 10. MTTR • Steps to complete repairs: – Notification of the fault (time before seeing an alarm message) – Processing the alarm – Finding the root cause of the error – Looking up repair information – Getting spare components from storage – Having technician come to the datacenter with the spare component – Physically repairing the fault – Restarting and testing the component
  • 11. Calculation examples • $\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \times 100\%$
      Component                            MTBF (h)    MTTR (h)   Availability   Availability in %
      Power supply                         100,000     8          0.9999200      99.99200
      Fan                                  100,000     8          0.9999200      99.99200
      System board                         300,000     8          0.9999733      99.99733
      Memory                               1,000,000   8          0.9999920      99.99920
      CPU                                  500,000     8          0.9999840      99.99840
      Network Interface Controller (NIC)   250,000     8          0.9999680      99.99680
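The per-component figures on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function name `availability` and the dictionary layout are illustrative, while the MTBF values and the 8-hour MTTR are taken from the slide.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# MTBF values in hours, as listed on the slide.
COMPONENT_MTBF = {
    "Power supply": 100_000,
    "Fan": 100_000,
    "System board": 300_000,
    "Memory": 1_000_000,
    "CPU": 500_000,
    "NIC": 250_000,
}

MTTR_HOURS = 8  # the slide uses the same repair time for every component

for name, mtbf in COMPONENT_MTBF.items():
    a = availability(mtbf, MTTR_HOURS)
    print(f"{name:<14} {a:.7f}  {a * 100:.5f}%")
```

Because the MTTR sits in the denominator next to a much larger MTBF, halving the repair time roughly halves the unavailability of a component.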
  • 12. Calculation examples • Serial components: one defect leads to downtime • Example: the availability of the system above is $0.9999200 \times 0.9999200 \times 0.9999733 \times 0.9999920 \times 0.9999840 \times 0.9999680 = 0.99976 = 99.976\%$, even though each component's availability is at least 99.99%
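For components in series, the system availability is the product of the component availabilities. A minimal sketch of that calculation, using the rounded values from the slides; the function name `serial_availability` is illustrative:

```python
import math


def serial_availability(availabilities):
    """Availability of a chain in which every component must work."""
    return math.prod(availabilities)


# Component availabilities as fractions, taken from the previous slide.
parts = [0.9999200, 0.9999200, 0.9999733, 0.9999920, 0.9999840, 0.9999680]

total = serial_availability(parts)
print(f"{total:.5f} = {total * 100:.3f}%")
```

The result is always lower than the availability of the weakest component, which is why adding serial components reduces overall availability.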
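  • 13. Calculation examples • Parallel components: one defect causes no downtime • But beware of SPOFs! • Availability of $n$ identical components in parallel: $A = 1 - (1 - A_1)^n$ • Example: two components with 99% availability each give a total availability of $1 - (1 - 0.99)^2 = 0.9999 = 99.99\%$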
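The parallel formula can be sketched the same way. The code assumes identical components that fail independently of each other, which is what the formula itself implies; the function name `parallel_availability` is illustrative.

```python
def parallel_availability(component_availability: float, n: int) -> float:
    """Availability of n identical, independent components in parallel.

    The group is down only when all n components are down at the same time.
    """
    return 1 - (1 - component_availability) ** n


for n in (1, 2, 3):
    a = parallel_availability(0.99, n)
    print(f"{n} component(s): {a * 100:.4f}%")
```

Each extra component multiplies the remaining unavailability by the unavailability of one component, so in this idealised model two 99% components reach 99.99% and three reach 99.9999%. A shared SPOF, such as a common power feed, breaks the independence assumption and the formula no longer holds.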
  • 14. Sources of unavailability - human errors • 80% of outages impacting mission-critical services are caused by people and process issues • Examples: – Performing a test in the production environment – Switching off the wrong component for repair – Swapping a good working disk in a RAID set instead of the defective one – Restoring the wrong backup tape to production – Accidentally removing files (mail folders, configuration files) – Accidentally removing database entries (drop table x instead of drop table y)
  • 15. Sources of unavailability - software bugs • Because of the complexity of most software it is nearly impossible (and very costly) to create bug-free software • Application software bugs can stop an entire system • Operating systems are software too – Operating systems containing bugs can lead to corrupted file systems, network failures, or other sources of unavailability
  • 16. Sources of unavailability - planned maintenance • Sometimes needed to perform systems management tasks: – Upgrading hardware or software – Implementing software changes – Migrating data – Creation of backups • Should only be performed on parts of the infrastructure where other parts keep serving clients • During planned maintenance the system is more vulnerable to downtime than under normal circumstances – A temporary SPOF could be introduced – Systems managers could make mistakes
  • 17. Sources of unavailability - physical defects • Everything breaks down eventually • Mechanical parts are most likely to break first • Examples: – Fans for cooling equipment usually break because of dust in the bearings – Disk drives contain moving parts – Tapes are very vulnerable to defects as the tape is spun on and off the reels all the time – Tape drives contain very sensitive pieces of mechanics that can break easily
  • 18. Sources of unavailability - bathtub curve • A component failure is most likely when the component is new • When a component still works after the first month, it is likely that it will continue working without failure until the end of its life
  • 19. Sources of unavailability - environmental issues • Environmental issues can cause downtime: – Failing facilities • Power • Cooling – Disasters • Fire • Earthquakes • Flooding
  • 20. Sources of unavailability - complexity of the infrastructure • Adding more components to an overall system design can undermine high availability – Even if the extra components are implemented to achieve high availability • Complex systems – Have more potential points of failure – Are more difficult to implement correctly – Are harder to manage • Sometimes it is better to just have an extra spare system in the closet than to use complex redundant systems
  • 21. Redundancy • Redundancy is the duplication of critical components in a single system, to avoid a single point of failure (SPOF) • Examples: – A single component having two power supplies; if one fails, the other takes over – Dual networking interfaces – Redundant cabling
  • 22. Failover • Failover is the (semi)automatic switch-over to a standby system or component • Examples: – Windows Server failover clustering – VMware High Availability – Oracle Real Application Cluster (RAC) database
  • 23. Fallback • Fallback is the manual switchover to an identical standby computer system in a different location • Typically used for disaster recovery • Three basic forms of fallback solutions: – Hot site – Cold site – Warm site
  • 24. Fallback – hot site • A hot site is – A fully configured fallback datacentre – Fully equipped with power and cooling – Applications are installed on the servers – Data is kept up-to-date to fully mirror the production system • Requires constant maintenance of the hardware, software, data, and applications to be sure the site accurately mirrors the state of the production site at all times
  • 25. Fallback - cold site • Is ready for equipment to be brought in during an emergency, but no computer hardware is available at the site • Applications will need to be installed and current data fully restored from backups • If an organization has very little budget for a fallback site, a cold site may be better than nothing
  • 26. Fallback - warm site • A computer facility readily available with power, cooling, and computers, but the applications may not be installed or configured • A mix between a hot site and cold site • Applications and data must be restored from backup media and tested – This typically takes a day
  • 27. Business Continuity • An IT disaster is defined as an irreparable problem in a datacenter, making the datacenter unusable • Natural disasters: – Floods – Hurricanes – Tornadoes – Earthquakes • Manmade disasters: – Hazardous material spills – Infrastructure failure – Bio-terrorism
  • 28. Business Continuity • In case of a disaster, the infrastructure could become unavailable, in some cases for a longer period of time • Business Continuity Management includes: – IT – Managing business processes – Availability of people and work places in disaster situations • Disaster recovery planning (DRP) contains a set of measures to take in case of a disaster, when (parts of) the IT infrastructure must be accommodated in an alternative location
  • 29. RTO and RPO • RTO and RPO are objectives in case of a disaster • Recovery Time Objective (RTO) – The maximum duration of time within which a business process must be restored after a disaster, in order to avoid unacceptable consequences (like bankruptcy)
  • 30. RTO and RPO • Recovery Point Objective (RPO) – The point in time to which data must be recovered considering some "acceptable loss" in a disaster situation • RTO and RPO are individual objectives – They are not related
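That RTO and RPO are separate objectives can be illustrated with a small check. The sketch below is hypothetical and simplified: it assumes the worst-case data loss equals the interval between backups and that the restore duration is a known estimate. The `RecoveryPlan` class and `meets_objectives` function are illustrative names, not from the slides.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_hours: float   # assumed worst-case data loss
    restore_duration_hours: float  # assumed time to restore the process


def meets_objectives(plan: RecoveryPlan, rto_hours: float, rpo_hours: float) -> dict:
    """Check a plan against the RTO and the RPO independently."""
    return {
        "rto_met": plan.restore_duration_hours <= rto_hours,
        "rpo_met": plan.backup_interval_hours <= rpo_hours,
    }


# Hypothetical plan: nightly backups, four hours to restore.
plan = RecoveryPlan(backup_interval_hours=24, restore_duration_hours=4)
print(meets_objectives(plan, rto_hours=8, rpo_hours=1))
```

In this example the plan restores quickly enough to meet an 8-hour RTO but can lose up to a day of data, so it misses a 1-hour RPO. Meeting one objective says nothing about the other.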