Slide 7
What do we call a high-availability, telecom-grade system?
What isn’t HA…
f@%#ed up
Slide 8
What isn’t HA…
“Good Enough for Us”
Slide 9
What isn’t HA…
f@%#ed up
You are fired
Slide 10
Federal Standard
“1037C and MIL-STD-188 define telecommunications
availability as the ratio of the time a module could be
used (had a use request existed) to the total elapsed time.
It is the ratio of uptime to total time.”
What HA is…
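The FS-1037C definition above reduces to a simple ratio of uptime to total time. A minimal sketch (the function name and inputs are illustrative, not from the standard):

```python
# Availability per the FS-1037C definition: uptime over total time.
def availability(uptime_s: float, downtime_s: float) -> float:
    """Return availability as a fraction of total elapsed time."""
    total = uptime_s + downtime_s
    return uptime_s / total

# Example: 5.26 minutes of downtime in a year is roughly "five nines".
year_s = 365 * 24 * 3600
print(round(availability(year_s - 5.26 * 60, 5.26 * 60), 5))  # → 0.99999
```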
Slide 11
Scheduled downtime:
▪ Any event initiated by Operations and Maintenance personnel
Unscheduled downtime:
▪ Software failure
▪ Hardware failure
▪ Environmental anomaly
Types of downtime
Slide 12
Availability Downtime per year
90% ("one nine") 36.5 days
99% ("two nines") 3.65 days
99.9% ("three nines") 8.76 hours
99.99% ("four nines") 52.56 minutes
99.999% ("five nines") 5.26 minutes
99.9999% ("six nines") 31.5 seconds
99.99999% ("seven nines") 3.15 seconds
What HA is…
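Each row of the table follows directly from the availability percentage. A small sketch that reproduces the figures (the constant assumes a non-leap, 365-day year):

```python
# Downtime budget per year implied by an availability percentage.
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year

def downtime_per_year_s(availability_pct: float) -> float:
    """Seconds of allowed downtime per year at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for nines, pct in [("two", 99.0), ("three", 99.9), ("five", 99.999)]:
    print(f"{pct}% ({nines} nines): {downtime_per_year_s(pct):.2f} s/year")
```

For example, 99.9% yields 31,536 seconds, which is the 8.76 hours shown in the table.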
Slide 13
Events to be handled
▪ HW failures
▪ SW failures
▪ On-line reconfigurations
▪ Network connection problems
▪ Extreme load levels
▪ Natural disasters
What an HA system must cope with
Slide 14
Available all the time
▪ Literally no service unavailability
▪ Literally no data loss
▪ billing information
▪ user profiles
Characteristics - part 1
Slide 15
Online upgrade, patching, replacement
▪ Hardware
▪ Operating system
▪ Middleware
▪ Application
Characteristics - part 2
Slide 16
Ability to recover after
▪ SW crashes
▪ HW failures
▪ Overload situations
▪ Network outage
Stability
▪ Until taken out of service
Characteristics – part 3
Slide 20
Design for high-availability
Redundancy
▪ ISP connections:
▪ to its redundant peers
▪ to any surrounding system
▪ Every piece of HW it is built from
▪ Every single SW component
▪ Relevant data
▪ Whole node / entity
What must be redundant?
Slide 22
Active/Active
▪ All entities handle requests
▪ In case of failure, traffic is taken over by the remaining entities
Types of redundancy
Slide 23
Active/Passive
▪ Only one node is online
▪ The standby node is brought online if the primary fails
Types of redundancy
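The active/passive scheme can be sketched as a pair where the standby is promoted when the active node fails a health check. A toy illustration; the class, node names, and failure trigger are all assumptions for demonstration:

```python
# Toy active/passive pair: promote the standby on a failed health check.
class ActivePassivePair:
    def __init__(self) -> None:
        self.active, self.standby = "node-A", "node-B"

    def health_check_failed(self) -> None:
        """Promote the standby; the failed node becomes the new standby."""
        self.active, self.standby = self.standby, self.active

pair = ActivePassivePair()
pair.health_check_failed()
print(pair.active)  # → node-B
```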
Slide 24
N+1
▪ A single extra spare node
▪ Also called a roaming spare
▪ Takes over the role of the failing node
Types of redundancy
Slide 25
N+M
▪ Multiple extra spare nodes
▪ To increase redundancy
Types of redundancy
Slide 26
N-to-1
▪ The standby node becomes active only temporarily
▪ Also called a dedicated spare
▪ It returns to standby once the original node is restored
Types of redundancy
Slide 27
N-to-N
▪ Combination of N+M and Active/Active
▪ Load is redistributed among the remaining active nodes
Types of redundancy
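The N-to-N redistribution idea can be sketched with a toy function: when one active node fails, its share of the load is split across the survivors. Node names and the even-split policy are illustrative assumptions, not a real scheduler:

```python
# Toy N-to-N redistribution: spread a failed node's load evenly
# across the remaining active nodes.
def redistribute(load: dict[str, int], failed: str) -> dict[str, int]:
    survivors = {n: l for n, l in load.items() if n != failed}
    share, rem = divmod(load[failed], len(survivors))
    result = {}
    for i, (name, current) in enumerate(survivors.items()):
        # The first `rem` survivors absorb one extra unit each.
        result[name] = current + share + (1 if i < rem else 0)
    return result

print(redistribute({"a": 100, "b": 100, "c": 100}, "c"))
# → {'a': 150, 'b': 150}
```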
Slide 28
Recovery mechanisms:
▪ Process restart
▪ Processor board restart
▪ Cluster restart
Recovery time:
▪ Short (milliseconds .. seconds .. minutes)
Ability to recover
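The three recovery mechanisms form an escalation ladder: try the cheapest, fastest action first and fall back to heavier ones. A minimal sketch; the action names and retry policy are assumptions for illustration:

```python
# Escalating recovery: process restart, then board restart, then
# cluster restart, stopping at the first level that succeeds.
RECOVERY_LADDER = ["process_restart", "board_restart", "cluster_restart"]

def recover(try_action) -> str:
    """Walk the ladder until one action reports success."""
    for action in RECOVERY_LADDER:
        if try_action(action):
            return action
    raise RuntimeError("all recovery levels exhausted")

# Example: the process restart fails, the board restart succeeds.
print(recover(lambda a: a != "process_restart"))  # → board_restart
```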
Slide 33
Verify and Maintain high-availability
Types of stress
▪ Few-hour overload runs (1.5x engineered load)
▪ One-hour heavy load (4x engineered load)
Load Test and Stability Test – level of stress
Slide 34
Specification
▪ Simulates several million subscribers (5-15 million)
▪ Simulates several thousand call set-ups per second (5000-6000) while handling ongoing sessions
▪ Simulates a large part of the telephony network
▪ Scalable
▪ The test harness is in-house-developed and TTCN-3-based
Load Test and Stability Test - test environment
Slide 35
Maintain and improve availability
▪ Shorter runs every night
▪ Long runs during the weekend
Slide 36
Final Note
Design for High-Availability + Maintain High-Availability = High-Availability System