Slide 7
What do we call a high-availability, telecom-grade system?
What isn’t HA…
f@%#ed up
Slide 8
What isn’t HA…
“Good Enough for Us”
Slide 9
What isn’t HA…
f@%#ed up
You are fired
Slide 10
Federal Standard
“1037C and MIL-STD-188 define telecommunications
availability as the ratio of the time a module could be
used (had a use request existed) to the total elapsed time.
It is the ratio of uptime to total time.”
What HA is…
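The FS-1037C definition above reduces to a simple ratio of uptime to total time. A minimal sketch (the function name and inputs are illustrative, not from the standard):

```python
# Availability per the FS-1037C definition: uptime over total time.
def availability(uptime_s: float, downtime_s: float) -> float:
    """Return availability as a fraction of total elapsed time."""
    total = uptime_s + downtime_s
    return uptime_s / total

# Example: 5.26 minutes of downtime in a year is roughly "five nines".
year_s = 365 * 24 * 3600
print(round(availability(year_s - 5.26 * 60, 5.26 * 60), 5))  # → 0.99999
```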
Slide 11
Scheduled downtime:
▪ Any event initiated by Operations and Maintenance personnel
Unscheduled downtime:
▪ Software failure
▪ Hardware failure
▪ Environmental anomaly
Types of downtime
Slide 12
Availability Downtime per year
90% ("one nine") 36.5 days
99% ("two nines") 3.65 days
99.9% ("three nines") 8.76 hours
99.99% ("four nines") 52.56 minutes
99.999% ("five nines") 5.26 minutes
99.9999% ("six nines") 31.5 seconds
99.99999% ("seven nines") 3.15 seconds
What HA is…
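Each row of the table follows directly from the availability percentage. A small sketch that reproduces the figures (the constant assumes a non-leap, 365-day year):

```python
# Downtime budget per year implied by an availability percentage.
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year

def downtime_per_year_s(availability_pct: float) -> float:
    """Seconds of allowed downtime per year at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for nines, pct in [("two", 99.0), ("three", 99.9), ("five", 99.999)]:
    print(f"{pct}% ({nines} nines): {downtime_per_year_s(pct):.2f} s/year")
```

For example, 99.9% yields 31,536 seconds, which is the 8.76 hours shown in the table.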
Slide 13
Events to be handled
▪ HW failures
▪ SW failures
▪ On-line reconfigurations
▪ Network connection problems
▪ Extreme load levels
▪ Natural disasters
What an HA system must cope with
Slide 14
Available all the time
▪ Literally no service unavailability
▪ Literally no data loss
▪ billing information
▪ user profiles
Characteristics - part 1
Slide 15
Online upgrade, patching, replacement
▪ Hardware
▪ Operating system
▪ Middleware
▪ Application
Characteristics - part 2
Slide 16
Ability to recover after
▪ SW crashes
▪ HW failures
▪ Overload situations
▪ Network outage
Stability
▪ Until taken out of service
Characteristics – part 3
Slide 20
Design for high-availability
Redundancy
▪ ISP connections:
▪ to its redundant peers
▪ to any surrounding system
▪ Every piece of HW it is built from
▪ Every single SW component
▪ Relevant data
▪ Whole node / entity
What must be redundant?
Slide 22
Active/Active
▪ All entities handle requests
▪ In case of failure, traffic is taken over by the remaining entities
Types of redundancy
Slide 23
Active/Passive
▪ Only one node is online
▪ The standby node is brought online if the primary fails
Types of redundancy
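The active/passive scheme can be sketched as a pair where the standby is promoted when the active node fails a health check. A toy illustration; the class, node names, and failure trigger are all assumptions for demonstration:

```python
# Toy active/passive pair: promote the standby on a failed health check.
class ActivePassivePair:
    def __init__(self) -> None:
        self.active, self.standby = "node-A", "node-B"

    def health_check_failed(self) -> None:
        """Promote the standby; the failed node becomes the new standby."""
        self.active, self.standby = self.standby, self.active

pair = ActivePassivePair()
pair.health_check_failed()
print(pair.active)  # → node-B
```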
Slide 24
N+1
▪ A single extra spare node
▪ Also called a roaming spare
▪ Takes over the role of the failing node
Types of redundancy
Slide 25
N+M
▪ Multiple extra spare nodes
▪ To increase redundancy
Types of redundancy
Slide 26
N-to-1
▪ The standby node becomes active only temporarily
▪ Also called a dedicated spare
▪ It returns to standby once the original node is restored
Types of redundancy
Slide 27
N-to-N
▪ Combination of N+M and Active/Active
▪ Load is redistributed among the remaining active nodes
Types of redundancy
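The N-to-N redistribution idea can be sketched with a toy function: when one active node fails, its share of the load is split across the survivors. Node names and the even-split policy are illustrative assumptions, not a real scheduler:

```python
# Toy N-to-N redistribution: spread a failed node's load evenly
# across the remaining active nodes.
def redistribute(load: dict[str, int], failed: str) -> dict[str, int]:
    survivors = {n: l for n, l in load.items() if n != failed}
    share, rem = divmod(load[failed], len(survivors))
    result = {}
    for i, (name, current) in enumerate(survivors.items()):
        # The first `rem` survivors absorb one extra unit each.
        result[name] = current + share + (1 if i < rem else 0)
    return result

print(redistribute({"a": 100, "b": 100, "c": 100}, "c"))
# → {'a': 150, 'b': 150}
```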
Slide 28
Recovery mechanisms:
▪ Process restart
▪ Processor board restart
▪ Cluster restart
Recovery time:
▪ Short (milliseconds .. seconds .. minutes)
Ability to recover
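The three recovery mechanisms form an escalation ladder: try the cheapest, fastest action first and fall back to heavier ones. A minimal sketch; the action names and retry policy are assumptions for illustration:

```python
# Escalating recovery: process restart, then board restart, then
# cluster restart, stopping at the first level that succeeds.
RECOVERY_LADDER = ["process_restart", "board_restart", "cluster_restart"]

def recover(try_action) -> str:
    """Walk the ladder until one action reports success."""
    for action in RECOVERY_LADDER:
        if try_action(action):
            return action
    raise RuntimeError("all recovery levels exhausted")

# Example: the process restart fails, the board restart succeeds.
print(recover(lambda a: a != "process_restart"))  # → board_restart
```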
Slide 33
Verify and Maintain high-availability
Types of stress
▪ Few-hour overload runs (1.5x engineered load)
▪ One-hour heavy load (4x engineered load)
Load Test and Stability Test – level of stress
Slide 34
Specification
▪ Simulates several million subscribers (5-15 million)
▪ Simulates several thousand call set-ups per second (5000-6000) while handling ongoing sessions
▪ Simulates a large part of the telephony network
▪ Scalable
▪ The test harness is in-house-developed and TTCN-3-based
Load Test and Stability Test - test environment
Slide 35
Maintain and improve availability
▪ Shorter runs every night
▪ Long runs during the weekend
Slide 36
Final Note
Design for High-Availability + Maintain High-Availability = High-Availability System