System reliability is one of the most important features of a network service. With the development of virtualized network functions, system reliability and stability become more complex than before. Testing and verifying that the OPNFV infrastructure tolerates faults and failures is a great challenge. To address it, this presentation introduces a test framework that covers the means and types of fault injection as well as the implementation of fault injection tools. Some test cases are shown to validate the framework.
Reliability Testing in OPNFV
1. Reliability Testing In OPNFV
Hao Pang (shamrock.pang@huawei.com)
San Francisco 11/09/2015
2. 2
Agenda
• Overview of reliability
• Reliability research in OPNFV
• Introduction of reliability testing
• Reliability testing framework
• Test case demo
• Work Plan and Related links
3. 3
Overview of Reliability
Reliability is the ability of a system to perform and maintain its functions in routine
circumstances, as well as hostile or unexpected circumstances.
High reliability is often defined as a system availability of 99.999%, which meets
the carrier-level requirement and protects carriers from business and reputation
loss due to system breakdown.
High Reliability
No Single Point of Failure
Overload Control
Automatic Recovery
Fault Isolation
…
4. 4
Reliability Is Critically Important
More and more business services rely on the Internet and the cloud; even a short outage can cause huge economic loss.
When a natural disaster strikes, telecom reliability underpins national security and emergency response.
5. 5
Reliability in NFV and OPNFV
Standard | Title
NFV-REL 001 | Resiliency Requirements
NFV-REL 002 | Report on Scalable Architectures for Reliability Management
NFV-REL 003 | E2E reliability models
NFV-REL 004 | Active monitoring and failure detection
NFV-REL 005 | Quality Accountability Framework
Reliability-related OPNFV projects built on the OPNFV Platform: Availability, Multisite, Doctor, Pinpoint, Escalator, Prediction.
6. 6
The Goal of Reliability Testing
Availability | Number of 9s | Downtime per year (minutes) | Applicable product
99.9% | 3 | 500 | Computer / server
99.99% | 4 | 50 | Enterprise-class device
99.999% | 5 | 5 | Common carrier-class device
99.9999% | 6 | 0.5 | Higher carrier-class device
(A quick calculation reproducing the downtime column follows the lists below.)
Customer requirements:
• A device should ideally never fail.
• If a fault occurs, it should not affect the main services.
• If the main services are affected, the fault should be located and recovered from as soon as possible.
Reliability testing goals:
• Test the product's ability to avoid faults.
• Test the product's ability to recover from faults.
• Test the product's ability to detect, locate and tolerate faults.
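The downtime column is simple arithmetic on the availability target; the following minimal calculation (illustration only, not part of any OPNFV tool) reproduces the table's figures:

    # Yearly downtime implied by an availability target.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    for nines, availability in ((3, 0.999), (4, 0.9999), (5, 0.99999), (6, 0.999999)):
        downtime = (1 - availability) * MINUTES_PER_YEAR
        print(f"{nines} nines ({availability:.4%}): {downtime:.1f} min/year")

    # Prints 525.6, 52.6, 5.3 and 0.5 minutes per year, which the table
    # rounds to 500, 50, 5 and 0.5 respectively.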
7. 7
Type of Reliability Testing
(Quadrant view: fault known/unknown × scenario known/unknown)
• Known fault, known scenario → Fault Injection: simulate the fault directly, or trigger it via an external trigger scenario (see the sketch below).
• Known fault, unknown scenario → Fault Injection: simulate the fault directly, or trigger it via a stress scenario.
• Unknown fault, known scenario → Reliability Prediction: predict faults by modeling.
• Unknown fault, unknown scenario → Stability Testing: long-duration testing under load.
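To make "simulate the fault directly" concrete, here is a minimal sketch that kills a named process over SSH; the host name and process are placeholders, and a real framework would drive this from its test-case configuration:

    import subprocess

    def inject_process_failure(host: str, process: str) -> None:
        # Simulate a software fault by killing a process on a remote host.
        # pkill returns non-zero when no process matched, so don't raise.
        subprocess.run(["ssh", host, f"sudo pkill -9 -f {process}"], check=False)

    # Example: simulate a VIM process failure on a controller node.
    inject_process_failure("controller-1", "nova-api")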
8. 8
Metrics for Service Availability
【General Metrics】
Failure detection time: the interval between the moment the failure happens and the moment it is detected.
Service recovery time: the interval from the occurrence of an abnormal event (e.g. failure, manual interruption of service) until the service finishes its recovery.
Fault repair time: the interval between the failure being detected and the faulty entity being repaired.
Service failover time: the time from the moment of detecting the failure of the instance providing the service until the service is provided again by a new instance.
【Carrier Metrics】
Network access success rate: the proportion of end-user requests that successfully access the network.
Call drop rate: the proportion of successfully connected mobile calls that suffer an abnormal release caused by loss of the radio link.
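All four general metrics are differences between event timestamps; a minimal sketch (field names are illustrative, not a Yardstick API) shows how a monitor could derive them:

    from dataclasses import dataclass

    @dataclass
    class FailureEvent:
        # Timestamps in seconds since a common epoch (illustrative names).
        failed_at: float     # the failure happens
        detected_at: float   # the failure is detected
        recovered_at: float  # the service finishes its recovery
        repaired_at: float   # the faulty entity is repaired

        @property
        def failure_detection_time(self):
            return self.detected_at - self.failed_at

        @property
        def service_recovery_time(self):
            return self.recovered_at - self.failed_at

        @property
        def fault_repair_time(self):
            return self.repaired_at - self.detected_at

        @property
        def service_failover_time(self):
            # From failure detection until a new instance provides the service.
            return self.recovered_at - self.detected_at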
9. 9
Scope of Reliability Testing (1)
Item: Hardware reliability testing (priority: H = high, M = medium, L = low)
Controller node:
• Controller node server failure (H)
• Memory failure / memory condition not OK (M)
• CPU failure / CPU condition not OK (M)
• Management network failure (M)
• Storage network failure (M)
Compute node:
• Compute node server failure (H)
• Memory failure / memory condition not OK (M)
• CPU failure / CPU condition not OK (M)
• Management network failure (M)
• Storage network failure (M)
• Service network failure (M)
Storage node:
• Storage node server failure (M)
• Hard disk failure (M)
Networking:
• HW failure of physical switch/router (M)
• Network interface failure (M)
10. 10
Scope of Reliability Testing (2)
Item: Software reliability testing
Controller node:
• Host OS failure (H)
• VIM important process failure (H)
• VIM important service failure (H)
• Management network (OVS) failure (M)
• Storage network (OVS) failure (M)
Compute node:
• Host OS failure (H)
• VIM important service failure (M)
• VIM important process failure (M)
• Management network (OVS) failure (M)
• Storage network (OVS) failure (M)
• Service network (OVS) failure (M)
• Hypervisor (KVM, libvirt, QEMU) failure (H)
VM:
• VM failure (H)
• Guest OS failure (M)
• VM service failure (L)
• VM process failure (L)
Item: System reliability testing
• Robustness test: random injection of various faults (M)
• Stability testing: long-duration testing under load (L)
12. 12
Reliability Testing Framework: Yardstick
Goal:
A test framework for verifying infrastructure compliance when running VNF
applications.
Scope:
Generic test cases that verify the NFVI from the perspective of a VNF.
Test stimuli that enable infrastructure testing such as upgrade and recovery.
Coverage of various aspects: performance, reliability, security and so on.
Design:
The Yardstick testing framework is flexible enough to support various SUT contexts,
and it is easy to develop additional plugins for fault injection, system monitoring and
result evaluation (a sketch of such plugins follows). It is also convenient to integrate
other existing test frameworks and tools.
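A rough sketch of what such plugins could look like; the class names and hooks below are assumptions for illustration, not Yardstick's actual plugin API:

    import abc

    class Attacker(abc.ABC):
        # Fault-injection plugin: breaks some part of the SUT and undoes it.
        @abc.abstractmethod
        def inject_fault(self): ...
        @abc.abstractmethod
        def recover(self): ...

    class Monitor(abc.ABC):
        # System-monitor plugin: watches a service while the fault is active.
        @abc.abstractmethod
        def start(self): ...
        @abc.abstractmethod
        def stop(self) -> dict: ...  # returns measured metrics, e.g. outage time

    class Evaluator(abc.ABC):
        # Result-evaluation plugin: checks measured metrics against the SLA.
        @abc.abstractmethod
        def evaluate(self, metrics: dict, sla: dict) -> bool: ...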
13. 13
Reliability Testing Workflow
[Workflow diagram] Yardstick takes a test case description (cases.yaml) as input and drives the system under test (SUT). The TaskCommand selects a Context (HeatContext, deployed via Heat/Nova/Neutron, or BMContext for bare metal); a Runner executes the Scenarios, the Attacker injects faults, and the Monitor collects data from the SUT (VMs running on hosts). Results are written to output.json.
Steps: 1. input (cases.yaml); 2. deploy; 3. fault injection; 4. collect data; 5. output (output.json); 6. undeploy.
Note: when a fault injection method breaks the SUT (e.g. kills one controller), an optional step that recovers the broken SUT may be executed before the scenario ends.
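The six steps map naturally onto a small driver loop; the sketch below uses hypothetical interfaces (context, attacker, monitor) just to make the sequence concrete, and is not Yardstick's real code:

    import json

    def run_reliability_case(case, context, attacker, monitor):
        # Hypothetical driver following the workflow steps above.
        sut = context.deploy(case)       # 1. input (cases.yaml) / 2. deploy
        try:
            monitor.start()              # watch the service under test
            attacker.inject_fault()      # 3. fault injection
            metrics = monitor.stop()     # 4. collect data
            attacker.recover()           # optional: restore the broken SUT
            with open("output.json", "w") as f:
                json.dump(metrics, f)    # 5. output
            return metrics
        finally:
            context.undeploy(sut)        # 6. undeploy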
14. 14
Case Sample: OpenStack Service HA
[Architecture] Three controller nodes (#1, #2, #3) each run a nova-api process; Yardstick attaches a Monitor and an Attacker to verify the controllers' HA.
Steps: ① check the environment; ② set up the monitor; ③ monitor the service; ④ set up the attacker; ⑤ inject the fault; ⑥ get the SLA result; ⑦ recover the broken service; ⑧ report the test result.
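The monitoring half of such a case can be approximated as a probe loop: keep calling an OpenStack API while nova-api is down on one controller and measure the outage window. The probe command and SLA threshold below are illustrative; the real test case drives them from its YAML configuration:

    import subprocess
    import time

    def measure_outage(probe_cmd=("openstack", "server", "list"),
                       duration=30.0, interval=1.0):
        # Probe the API periodically; approximate the outage as the number
        # of failed probes times the probe interval. Assumes OpenStack CLI
        # credentials are already set in the environment.
        failed = 0
        for _ in range(int(duration / interval)):
            ok = subprocess.run(probe_cmd, capture_output=True).returncode == 0
            failed += 0 if ok else 1
            time.sleep(interval)
        return failed * interval

    # With three load-balanced controllers, killing nova-api on one node
    # should give (near-)zero outage, e.g. SLA: measure_outage() <= 5.0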
16. 16
Test Case Demo
Before the Fault Injection
1. The nova-api processes are running on Controller Node #3.
2. The nova service is running normally.
17. 17
Test Case Demo
After the Fault Injection
3. The nova-api processes have been killed.
4. The nova service is still working normally.
5. The ServiceHA test case was run with Yardstick, which recorded the test result.
18. 18
Work Plan
TC Name | Testing Purpose | Priority | Status
OpenStack controller service failure | Verify that the services running on the controllers are highly available. | H | Done
Controller node abnormal shutdown | Verify that the controller cluster deployment is highly available. | M | Doing
Management network timeout | Verify that the controller cluster deployment is highly available. | M | TODO
VM abnormally down | Verify that the VM running on the compute node is highly available. | M | TODO
19. 19
Related Links
Wiki of Yardstick project: https://wiki.opnfv.org/yardstick
Requirements of the HA project:
https://etherpad.opnfv.org/p/High_Availabiltiy_Requirement_for_OPNFV
HA Test cases in Yardstick: https://etherpad.opnfv.org/p/yardstick_ha