This white paper makes the case for why Resiliency Management needs to be in the Software Infrastructure. It covers:
- Fault Management and Resiliency Management
- Seamless Protection for Faster and Simpler Devl
- Multiple Levels of Availability
- Speed of Service Restoration & Redundancy Restoration
- State Management
- Demonstrating Carrier Grade Availability and Resiliency
ETSI NFV#13 NFV resiliency presentation - ali kafel - stratus
1. Ali Kafel, VP of Business Development
Ensuring High Availability and Resiliency for NFV
Monday 15th February, 2016,
3.00 - 6.00pm
Croke Park, Dublin 3, Ireland
MOVING IT TO THE FIELD
(CO-LOCATED WITH ETSI NFV#13)
The details of this presentation are covered in this White Paper:
http://www.slideshare.net/akafel/nfv-resiliency-whitepaper-ali-kafel-stratus-technologies
2. Agenda
• Why We Need Resiliency vs High Availability
• Achieving Resiliency Management for NFV
• Proof point – ETSI PoC#35
3. Stratus Technologies
Trusted Name in Fault Tolerant Computing for 35 years
• Hardware Fault Tolerance: ftServer (Intel Platforms) and Proprietary Platforms, 1980 - Present
• Software Fault Tolerance: everRun Enterprise, 12,000+ Installed, 2008 - Present
• Stratus Fault Tolerant Cloud: Resilient Cloud Technologies, based on proven SW infrastructure, 2015 - Present
4. Why We Need Resiliency vs High Availability
Achieving Resiliency Management for NFV
Proof point – ETSI PoC#35
5. Why the need for Resiliency in NFV
• It is no longer about voice services: certain data and video services need HA and Resiliency more than voice
• Even “mature” cloud technologies still lack HA and Resiliency
Downtime per year
Uptime      Hours    Mins     Secs
99.9%       8.76     525.6    31536
99.99%      -        52.56    3154
99.999%     -        5.256    315.4
99.9999%    -        0.526    31.54
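The downtime figures above follow directly from the availability percentage. A minimal Python sketch, assuming a 365-day year; the function name `downtime_per_year` is illustrative and not from the presentation:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_per_year(availability_percent: float) -> dict:
    """Return the yearly downtime allowed by an availability target."""
    fraction_down = 1.0 - availability_percent / 100.0
    secs = fraction_down * SECONDS_PER_YEAR
    return {"hours": secs / 3600, "mins": secs / 60, "secs": secs}


# Print the downtime budget for the four availability levels in the table.
for pct in (99.9, 99.99, 99.999, 99.9999):
    d = downtime_per_year(pct)
    print(f"{pct}%: {d['hours']:.3f} h, {d['mins']:.2f} min, {d['secs']:.1f} s")
```

Each extra "nine" divides the yearly downtime budget by ten.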
6. Defining Reliability, Availability and Resiliency
Reliability
• How long a system performs its intended function
• MTBF = total time in service / number of failures
Availability
• Percentage of time that equipment is in an operable state, i.e. service accessibility and service continuity
• Availability (A) = Uptime / (Uptime + Downtime)
• A = MTBF / (MTBF + MTTR)
Resiliency
• The ability to recover quickly from failures and return to the original form or state, maintaining an operable state plus QoS
• Resiliency (R) = Availability (A) + QoS
What you need is R, not just A, because, for example:
• A 99.999% application that fails once a week for just 1 second and disrupts active services is not resilient and not acceptable
• A 99.9999% application that causes increased latency during a fault is not acceptable
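To make the "R, not just A" point concrete, the sketch below applies A = MTBF / (MTBF + MTTR) to the once-a-week, one-second failure mentioned on this slide. It is an illustration only; the variable names and the choice of seconds as the unit are assumptions:

```python
def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)


# Hypothetical application: fails once a week and is down for 1 second each time.
SECONDS_PER_WEEK = 7 * 24 * 3600
mttr_s = 1.0
mtbf_s = SECONDS_PER_WEEK - mttr_s  # uptime between consecutive failures

a = availability(mtbf_s, mttr_s)
print(f"Availability: {a * 100:.5f}%")
print("Meets five nines:", a >= 0.99999)
```

The application clears the five-nines bar on paper, yet every weekly failure still disrupts active sessions, which is the gap that the resiliency definition is meant to capture.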
7. Resiliency Management cannot be done in the VNFs
... because you cannot manage what you cannot see
• Over 80% of system failure modes are not directly visible to the VNFs
• Infrastructure decoupling hides from the VNFs the information required to take action on faults
[Diagram: what VNFs are exposed to and what they depend on]
• Fault types shown: Performance Faults, Resource Depletion, Fault Impacts, HW Faults, SW Faults, Config Faults, Migrations, Upgrades
• Components shown: Virtualized Resources, External Dependencies, Access Networks, Core Networks, VNFM, VIM, SDNC-OL, SDNC-UL, Shared Storage, Shared Network, NFVI Fabric, Node HW (C/N/S), Node SW (C/N/S), Virtualization SW (vC, vN, vS), Facility Infra, DCIM
8. Resiliency management can be “designed in” in multiple ways, but it’s best done in the Software Infrastructure
The three options are compared on Costs & Resources.
In the Hardware
Stack: Applications / VNFs, Operating Environment, Hardware
Pros:
• Transparent – no code change
• Fast & simple deployment
• No special app software
Cons:
• Very expensive
• Inefficient utilization
• Special hardware
• Rigid
In the Applications
Stack: Applications / VNFs, Middleware, Operating Environment, Hardware
Pros:
• App-specific state can be customized
Cons:
• Can’t detect & manage all infrastructure faults
• Code written for resiliency increased by ~40%
• Most developers don’t have resiliency experience
• More complex & longer time to develop
In the Software Infrastructure
Stack: Applications / VNFs, Operating Environment with Resilience Layer, Hardware
Pros:
• Broader & faster fault detection and correlation
• Faster and simpler application development
• Transparent – no code changes
• Multiple levels of resiliency
Cons:
• Needs to be adaptable to a wide range of application architectures
Benefits:
• Reduces development & verification time
• Lower risks
• Faster time to market
9. Why We Need Resiliency vs High Availability
Achieving Resiliency Management for NFV
Proof point – ETSI PoC#35
10. Resiliency Management
It’s complex, multi-dimensional and more than just Fault Management
Resiliency Management spans Fault Management, Availability Management and Configuration Management
Stages: Detection (Prediction), Localization, Isolation, Remediation (Service restoration), Recovery (Redundancy restoration)
Resiliency depends on multiple factors
• Speed of Service restoration & Redundancy restoration
• State Management: Service continuity
• “Key state” versus “All state”
• Redundancy mode: Resource consumption / cost
• Application performance impact
11. Multi-dimensional aspects of Resiliency
Two Key Elements: Service Restoration and State Protection
State Protection
• Remembering the preceding events in a given sequence of interactions within the application
• All or partial?
Service Restoration (or Failover)
• Ensuring that service is restored either through a fast restart or failover to an active secondary or hot standby
• The speed of Service Restoration depends on the type of application
Some applications need State Protection; most applications need fast Service Restoration
12. Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration
Types of State Protection
• Full state protection
• Key state protection
• No state protection (Stateless)
State Management has implications on
• Transparency
• Performance
• Resources
[Diagram: State Management (State Protection / No State Protection) plotted against Service Restoration Speed]
Slow (mins): “Cold Standby” (re-instantiation after failure, no standby)
• No state protection: start from reset, e.g. “Web server”
• State protection: key state stored on disk, e.g. “OSS, Billing”
13. Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration
Types of State Protection
• Full state protection
• Key state protection
• No state protection (Stateless)
State Management has implications on
• Transparency
• Performance
• Resources
[Diagram: State Management (State Protection / No State Protection) plotted against Service Restoration Speed]
Slow (mins): “Cold Standby” (re-instantiation after failure, no standby)
• No state protection: start from reset, e.g. “Web server”
• State protection: key state stored on disk, e.g. “OSS, Billing”
Medium (secs): “Warm Standby” (pre-instantiated before failure, failover to running standby)
• No state protection: failover, e.g. “vCE Router Forwarder”
• State protection: key state stored in RAM or disk, e.g. “email, SMS”
14. Multi-dimensional aspects of Resiliency
State Protection versus Service Restoration
To do Fast Remediation you need
• Pre-instantiation
• State management
Types of State Protection
• Full state protection
• Key state protection
• No state protection (Stateless)
State Management has implications on
• Transparency
• Performance
• Resources
[Diagram: State Management (State Protection / No State Protection) plotted against Service Restoration Speed; also labelled Service Accessibility and Service Continuity]
Slow (mins): “Cold Standby” (re-instantiation after failure, no standby)
• No state protection: start from reset, e.g. “Web server”
• State protection: key state stored on disk, e.g. “OSS, Billing”
Medium (secs): “Warm Standby” (pre-instantiated before failure, failover to running standby)
• No state protection: failover, e.g. “vCE Router Forwarder”
• State protection: failover + key state reload, key state stored in RAM or disk, e.g. “email, SMS”
Fast (msecs): “Hot Standby or Active-Active” (pre-instantiated before failure, failover to running standby)
• No state protection: failover, e.g. “vPE Router Forwarder”
• State protection: failover with full VM state in RAM, e.g. “Voice control, Router Control”
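The relationship between restoration speed, state protection and standby mode on this slide can be read as a simple decision rule. The sketch below is one hypothetical way to encode it; the 60-second and 1-second thresholds, the `Plan` type and the `choose_plan` function are all assumptions made for illustration, since the slide only gives the orders of magnitude (mins, secs, msecs):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    standby: str         # standby mode named on the slide
    state_handling: str  # how state is (or is not) carried across a failure


def choose_plan(tolerated_outage_s: float, needs_state: bool) -> Plan:
    """Pick a standby mode from the tolerated outage and the need for state."""
    if tolerated_outage_s >= 60:   # assumed boundary for "slow (mins)"
        handling = "key state stored on disk" if needs_state else "start from reset"
        return Plan("cold standby", handling)
    if tolerated_outage_s >= 1:    # assumed boundary for "medium (secs)"
        handling = "failover + key state reload" if needs_state else "failover"
        return Plan("warm standby", handling)
    handling = "failover with full VM state in RAM" if needs_state else "failover"
    return Plan("hot standby or active-active", handling)


# Example: a router control element that tolerates about 50 ms and is stateful.
print(choose_plan(0.05, needs_state=True))
# Example: a stateless web server that tolerates minutes of outage.
print(choose_plan(300, needs_state=False))
```

Only the fast column requires both pre-instantiation and state management, which matches the "To do Fast Remediation you need" callout.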
15. Cold Restart versus Hot Standby or Active-Active
... it’s like surviving a heart attack versus preventing one
Fault Tolerant Systems Provide Service Continuity, Even During Failures
[Diagram: timeline after a failure on a scale of msecs, secs, mins, hours, days. Analogy labels: Immense Pain, Loss of Consciousness, Loss of Bodily Control, Temporary Brain Loss]
Cold Restart (Instant HA)
• Re-instantiation after failure: no standby
• Phases shown: Customer Affecting Application Outage, App Restart, Normal
• All state is lost
Hot Standby or Active-Active (Fault Tolerant)
• Pre-instantiated before failure: failover to running standby
• Phases shown: Fully Protected, Backup Activated - Unprotected, Restored to Fully Protected Redundancy
• All state is preserved
16. State protection
Guaranteeing Globally Consistent State
Different ways to describe StatePointing
• Active-Standby synchronous VM replication
• Also known as checkpointing with I/O barrier, I/O lock-stepping or buffering
What does it guarantee
• Application transparency
• The I/O barrier prevents all external communications from the speculative execution prior to state replication
• Consistent VM memory replica between the active and the hot standby at the confirmed StatePoint
17. We call it StatePointing (VM replication)
Providing Service Continuity with fast Service Restoration
VM instances paired between primary and secondary hosts in the cloud infrastructure
State of primary (active) captured regularly and applied to secondary (HotStandby)
StatePoint™ = VM Checkpoint + I/O StateStepping
• Provides globally consistent state
Fast service restoration from the most recent StatePoint upon primary failover to secondary
Automatic redundancy restoration through third host instantiation
If the primary host fails, it automatically switches to the secondary host
[Diagram: StatePoint timeline across three hosts]
• Active Host: Guest Run Epoch N-1, Epoch N, Epoch N+1, with StatePoints SP N-1, SP N, SP N+1
• Hot Standby Host: SP N-1, SP N, then Guest Run Epoch N+1 and Epoch N+2
• Third Host (created post primary failure): Guest From Image, then SP N+1 to SP N+X
18. Act.-Stby. & Egress Network Traffic
[Diagram: sequence of StatePoints n-1, n and n+1 across four lanes: Guest VM (Active), QEMU (Active), Network Egress Queue, QEMU (Standby) [snapshots]. Labels include QEMU Monitor, Enqueue of egress packets P1 to P5, Insert n-1 / n / n+1 state I/O barrier, n-1 I/O barrier still on / removed, n I/O barrier still on / removed.]
Legend:
• Egress Network Queue Barrier: prevents transmission of queued egress packet(s) until the barrier is removed
• PCR: Pause, Capture, Resume; phases of the StatePoint process when VM execution is suspended
Note: For simplicity, n-2 interactions are not shown.
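The role of the egress barrier can be illustrated with a toy model: output produced since the last StatePoint is held in a queue and released only once the standby holds the matching state. This is a simplified sketch and not the Stratus or QEMU implementation; the `ProtectedVM` class, its method names and the synchronous replication step are all assumptions made for illustration:

```python
from collections import deque


class ProtectedVM:
    """Toy active/standby pair with an egress queue barrier."""

    def __init__(self):
        self.state = 0          # toy guest state: number of processed requests
        self.standby_state = 0  # last state confirmed on the standby
        self.egress = deque()   # packets held behind the barrier
        self.sent = []          # packets actually released to the network

    def handle_request(self, payload: str) -> None:
        # Speculative execution: state changes, but the reply is only queued.
        self.state += 1
        self.egress.append(f"reply-{self.state}:{payload}")

    def statepoint(self) -> None:
        # Pause, capture, resume: copy state to the standby (assumed to be
        # acknowledged synchronously), then remove the barrier.
        self.standby_state = self.state
        while self.egress:
            self.sent.append(self.egress.popleft())

    def failover(self) -> None:
        # Queued output never left the host, so it can be discarded safely;
        # the standby resumes from the last confirmed StatePoint.
        self.egress.clear()
        self.state = self.standby_state


vm = ProtectedVM()
vm.handle_request("a")
vm.handle_request("b")
held_before = len(vm.sent)   # nothing released while the barrier is on
vm.statepoint()              # barrier removed, both replies released
released = list(vm.sent)
vm.handle_request("c")       # speculative work after the StatePoint
vm.failover()                # primary fails before the next StatePoint
print(held_before, released, vm.state)
```

Because no packet reaches the network before its state is replicated, the outside world never observes output that the standby cannot reproduce, which is what the slides call globally consistent state.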
19. Multiple levels of resiliency
Ensures flexibility and resource optimization based on the application
• Deliver Availability as an infrastructure service to virtual and cloud ecosystems
• While every VNF needs Fault Management, not all need state protection
Modes of protection
• Fault Tolerant (includes state protection)
• High Availability (no state protection)
• Unprotected
[Diagram: VNF types mapped to modes of protection]
• Monolithic VNFs: Firewall, MME, IMS, Web Server
• De-composed VNFs (separate control and forwarding elements): three VNF-C Forwarding Elements (stateless fast path forwarding elements) and one VNF-C Control Element (stateful control element)
20. Protection with Application transparency, no code changes
Resiliency functionality in the NFVI nodes, managed in the MANO
The Stratus approach has implemented enhancements in KVM and plug-ins in OpenStack to make resiliency seamless for the VNFs
[Diagram: NFV stack with Stratus components]
• Infrastructure: Commodity High Volume Networking, Virtualization, Commodity Hyper Scale COTS Computing, Commodity High Volume Storage
• NFV workloads on Linux: EPC, PCRF, HSS, IMS, ...
• Virtualized SDN workloads on Linux: Optical Transport Control Plane, L3 Routing Control Plane, L2 Switching Control Plane
• Virtualized OSS/BSS workloads on Linux: Billing, Customer Care, NOC
• Orchestration: MANO, OpenStack environment
• Stratus components: Node Resiliency Services (NRS), Resiliency Management Services (RMS)
21. Benefits of Resiliency Management that includes Fault Management, Availability Management and Configuration Management
SW Infrastructure Resiliency Management
• Fault protection for all applications, no required code changes for most apps
• State Protection, offering globally consistent state
• Multiple levels of Resiliency – Software Defined Availability (SDA): control vs. forwarding element, stateful vs. stateless, etc.
Benefits:
• Reduces Development & Verification time
• Lower Risks
• Faster time to market
22. Why We Need Resiliency vs High Availability
Achieving Resiliency Management for NFV
Proof point – ETSI PoC#35
23. The Stratus led PoC (ETSI PoC#35)
Availability Management with Stateful Fault Tolerance
Participants of PoC#35
• Demonstrated at:
  NFV World Congress, May 6-8, San Jose, CA
  OpenStack Summit, May 2015, Vancouver, Canada
  SDN World Congress, Oct 2015, Dusseldorf, Germany
• Completed 7/31/2015, final report submitted
http://nfvwiki.etsi.org/index.php?title=Availability_Management_with_Stateful_Fault_Tolerance
24. What we proved with PoC#35
• OpenStack based VIM mechanisms alone are insufficient for supporting carrier grade resiliency, but Stratus Cloud Technology solves that and provides stateful failover, enabling service continuity with acceptable QoS
  • Service Restoration in milliseconds
  • Redundancy Restoration in seconds
• Any non-resilient VNF can be made resilient instantly with no code change (as long as it is OpenStack ready and there is no standard way to package VNF)
• Multiple levels of Resiliency can be easily provided using Software Defined Resiliency in the Infrastructure, based on the application's requirements for state and service restoration speed